Business intelligence tools frequently utilize queries with multiple outer joins. This article presents evidence supporting the vulnerability of outer joins to skewing. We will illustrate that, regardless of whether both tables are skewed, there exists a possibility that the plan step in the initial joins may be biased.
Let me carve out the details based on a simple test scenario:
CREATE TABLE table1 (pk INTEGER, a INTEGER) PRIMARY INDEX (pk);
CREATE TABLE table2 (pk INTEGER, b INTEGER, c INTEGER) PRIMARY INDEX (pk);
CREATE TABLE table2 (pk INTEGER, d INTEGER, e INTEGER) PRIMARY INDEX (pk);
SELECT a,b,c,d,e FROM table1 t01 LEFT JOIN table2 t02 ON t01.a = t02.b LEFT JOIN table3 t03 ON t02.c = t03.d;
The Optimizer can generate an execution plan for the aforementioned setup (tables and query) that consists of the subsequent steps:
1. Redistribute table1 on column a, and redistribute table2 on column b (in parallel).
2. Merge join (outer join) the resulting spools (from step 1). The columns of the resulting spool will be padded with NULL values for columns not matching rows.
3. Let’s assume that the resulting spool (i.e., the spool which holds the joined result of table1 and table2) is still not skewed. We will call this spool s(table1,table2) for simplicity.
4. The result spool s(table1,table2) is redistributed on column c.
5. Table3 is redistributed on column d.
The skewing issues will commence as the join of table1 and table2 may result in numerous non-matching rows, causing multiple NULL values to populate column c of the s(table1,table2).
The rowhash value of a NULL value is deterministic and consistent – every NULL value will always hash to the same AMP.
If only a few rows match during the join of table1 and table2, redistributing s(table1,table2) based on column c (necessary for joining with table3) will relocate any unmatched rows to the AMP holding NULL values.
Initially, our situation was ideal with three unskewed tables. However, executing the second outer join resulted in a skewed spool.
Business intelligence tools often generate queries containing many outer joins, which can significantly increase the likelihood of skewing.
Teradata 14.10 employs techniques such as PDPR to prevent join skewing.
Statistics are essential to ensure the availability of PDPR as the Optimizer may not recognize all applicable situations.
When manually coding queries, only use outer joins when business logic dictates.
Avoid using outer joins unnecessarily to prevent losing rows, which has become a prevalent practice.