Teradata Joins – Reduce Skewing with RANDOM()
Sometimes natural skew can become a huge problem, causing bad join performance. The Partial Duplication & Partial Redistribution (PDPR) feature on Teradata 14 (and above) helps to reduce this issue, but will not always be able to detect all possible applications (for example, if statistics are not revealing the skewed values).
If you are stuck on a version below Teradata 14, you will not have PDPR available at all. A technical trick comes into play: Skew reduction with the RANDOM() function.
Teradata Joins are vulnerable to Skewing
Let’s carve out the details step by step.
In our example, we deal with a join between the two tables below. There are some known facts about these tables:
- Table 1 doesn’t contain any row with value ‘A’ in the join column. Furthermore, it’s ensured that there will never be value ‘A’ added to this column
- The join column of Table 2 contains an enormous amount of rows with value ‘A’. It’s skewed (our example tables only include a few rows, but that’s enough to get the idea)
- Both tables are evenly distributed across all AMPs on their primary index column (called “primary key” in our example)
- There exists a range of join column values which never is available in the regular data of Table 2 (we will explain later why this is a must)
Above listed facts is a requirement for our test setup to work.
When left joining table 1 with table 2 on the join column, rows of both tables have to be rehashed on this column. We assume in our example that Teradata redistributes both tables.
The alert reader having some knowledge about the Teradata architecture will recognize that all rows with value ‘A’ in the join column will be moved to the same AMP. The spool of table 2 will be heavily skewed. On the other hand, will the redistribution of table 1 most likely result in an evenly distributed spool. During the joining step, the AMP holding table 2 rows with value ‘A’ will have to do much more work than the other AMPs.
How can we avoid skewing in this case? The RANDOM() function can help us:
We know that table 1 is not containing any rows with value ‘A’ in the join column (and never will by definition). This information allows us to conclude, that we will have no matches for any table 2 row containing ‘A’.
We can easily replace ‘A’ in table 2 with a range of values not available as regular data (see requirement four above). Here comes the RANDOM() function into play.
We can generate a range of values, such as TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))), which are replacing the ‘A’ values of Table 2. As these values are non-existent in regular data, this will not have any impact on the result set:
CREATE VOLATILE TABLE NOT_SKEWING_TABLE2
SELECT PrimaryKey, TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS JoinColumn FROM Table2
) PRIMARY INDEX (PrimaryKey);
Instead of joining Table2 with Table1 we will join NOT_SKEWING_TABLE2 with table1. The redistribution of table NOT_SKEWING_TABLE2 will not cause a skewed spool as the randomly generated values will spread evenly across all AMPs.
As a result, we will have evenly distributed workload across all AMPs; the join will perform much better than without this trick. Just today, I was able to improve the run time of a query from 4 hours to 3 minutes, simply by applying above method.