Sometimes natural skew can become a huge problem, causing bad join performance. The Partial Duplication & Partial Redistribution (PDPR) feature on Teradata 14 (and above) helps reduce this issue. Still, it will not always be able to detect all possible applications (for example, if statistics are not revealing the skewed values).
If you are stuck on a version below Teradata 14, you will not have PDPR. A technical trick comes into play: Skew reduction with the RANDOM() function.
Teradata Joins are vulnerable to Skewing.
Let’s carve out the details step by step.
We deal with a join between the two tables below in our example. There are some known facts about these tables:
- Table 1 doesn’t contain any row with ‘A’ in the join column. Furthermore, it’s ensured that there will never be value ‘A’ added to this column
- The join column of Table 2 contains many rows with the value ‘A’. It’s skewed (our example tables only include a few rows, but that’s enough to get the idea)
- Both tables are evenly distributed across all AMPs on their primary index column (called “primary key” in our example)
- There exists a range of join column values which never is available in the regular data of Table 2 (we will explain later why this is a must)
Above listed facts is a requirement for our test setup to work.
When left joining table 1 with table 2 on the join column, rows of both tables have to be rehashed on this column. We assume in our example that Teradata redistributes both tables.
The alert reader knows about the Teradata architecture and will recognize that Teradata will move all rows with the value ‘A’ in the join column to the same AMP. The spool of table 2 will be heavily skewed. On the other hand, will the redistribution of table 1 most likely result in an evenly distributed spool? During the joining step, the AMP holding table 2 rows with value ‘A’ will have to do much more work than the other AMPs.
How can we avoid skewing in this case? The RANDOM() function can help us:
We know that table 1 is not containing any rows with the value ‘A’ in the join column (and never will, by definition). This information allows us to conclude that we will have no matches for any table 2 row containing ‘A’.
We can easily replace ‘A’ in table 2 with a range of values not available as regular data (see requirement four above). Here comes the RANDOM() function into play.
We can generate a range of values, such as TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))), which are replacing the ‘A’ values of Table 2. As these values are non-existent in regular data, this will not have any impact on the result set:
CREATE VOLATILE TABLE NOT_SKEWING_TABLE2
SELECT PrimaryKey, TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS JoinColumn FROM Table2
) PRIMARY INDEX (PrimaryKey);
Instead of joining Table2 with Table1, we will join NOT_SKEWING_TABLE2 with table1. The redistribution of table NOT_SKEWING_TABLE2 will not cause a skewed spool as the randomly generated values will spread evenly across all AMPs.
As a result, we will have an evenly distributed workload across all AMPs; the join will perform much better than without this trick. I improved the run time of a query from 4 hours to 3 minutes by applying the above method.