Business intelligence tools often create queries with many outer joins. This article proves why outer joins are vulnerable to skewing. We will demonstrate that even if both tables are not skewed, it can be that the plan step by the first join step is biased.
Let me carve out the details based on a simple test scenario:
CREATE TABLE table1 (pk INTEGER, a INTEGER) PRIMARY INDEX (pk);
CREATE TABLE table2 (pk INTEGER, b INTEGER, c INTEGER) PRIMARY INDEX (pk);
CREATE TABLE table2 (pk INTEGER, d INTEGER, e INTEGER) PRIMARY INDEX (pk);
SELECT a,b,c,d,e FROM table1 t01 LEFT JOIN table2 t02 ON t01.a = t02.b LEFT JOIN table3 t03 ON t02.c = t03.d;
For the above setup (tables and query), the Optimizer may create an execution plan with the following steps:
1. Redistribute table1 on column a, redistribute table2 on column b (in parallel).
2. Merge join (outer join) the resulting spools (from step 1). The columns of the resulting spool will be padded with NULL values for columns not matching rows.
3. Let’s assume that the resulting spool (i.e., the spool which holds the joined result of table1 and table2) is still not skewed. We will call this spool s(table1,table2) for simplicity.
4. The result spool s(table1,table2) is redistributed on column c.
5. Table3 is redistributed on column d.
Now the skewing problems will start. If many rows do not match during the join of table1 and table2, column c of s(table1,table2) will contain many NULL values.
Like any other value, the rowhash value of the NULL value is deterministic – NULL values always hash to the same AMP!
If only a few rows match during the join process of table1 and table2, the redistribution of s(table1,table2) on column c (required to allow the join with table3) will move all unmatched rows to the AMP holding NULL values.
We started with a perfect situation. We had three unskewed tables. We ended up with a skewed spool after executing the second outer join!
Business intelligence tools tend to create queries with dozens of outer joins. These queries are at high risk for skewing.
Teradata 14.10 is doing its best to avoid join skewing. It uses features like partial duplication / partial redistribution (PDPR).
The Optimizer will not recognize all situations where PDPR is available. Therefore statistics are essential to make sure that PDPR can be used.
If you are hand-coding queries, you should use outer joins only if the business logic requires this:
Don’t use outer joins “just in case” to “avoid losing rows” (a relatively common approach nowadays).