Natural skew may hinder join performance, but the PDPR capability in Teradata 14 can alleviate this problem. However, PDPR may not detect all potential applications if skewed values are not evident in the statistics.
If your Teradata version is below 14, PDPR is unavailable. To address this, you can utilize a technical method using the RANDOM() function to reduce skew.
Teradata Joins are vulnerable to Skewing.
Let’s delineate the specifics incrementally.
Our example involves a join between the two tables shown below. Certain facts are already established about these tables:
- Table 1 contains no row with ‘A’ in the join column. Furthermore, it’s ensured that there will never be value ‘A’ added to this column
- The join column of Table 2 contains many rows with the value ‘A’. It’s skewed (our example tables only include a few rows, but that’s enough to get the idea)
- Both tables are evenly distributed across all AMPs on their primary index column (called “primary key” in our example)
- There exists a range of join column values which never is available in the regular data of Table 2 (we will explain later why this is a must)
The aforementioned facts are necessary for our testing framework to operate.
By performing a left join of table 1 and table 2 on the common column, both tables’ rows are hashed on this column. Assuming Teradata redistributes both tables, this is exemplified.
The informed reader understands Teradata’s architecture and identifies that all rows containing ‘A’ in the join column will be relocated to a single AMP, which will cause table 2’s spool to be imbalanced. Conversely, the redistribution of table 1 should yield an evenly dispersed spool. When joining, the AMP retaining table 2’s rows with ‘A’ will bear a greater workload than the other AMPs.
We can employ the RANDOM() function to prevent skewing in this scenario.
Table 1 does not contain any rows with the value ‘A’ in the join column, meaning there will be no matches for any row in table 2 containing ‘A’.
We can easily replace ‘A’ in table 2 with a range of values unavailable as regular data (see requirement four above). Here comes the RANDOM() function into play.
We can generate a range of values, such as TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))), which are replacing the ‘A’ values of Table 2. As these values are non-existent in regular data, this will not have any impact on the result set:
CREATE VOLATILE TABLE NOT_SKEWING_TABLE2
SELECT PrimaryKey, TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS JoinColumn FROM Table2
) PRIMARY INDEX (PrimaryKey);
We will join NOT_SKEWING_TABLE2 with table1 instead of Table2. Redistributing NOT_SKEWING_TABLE2 will not result in a skewed spool because the randomly generated values will be evenly distributed across all AMPs.
By implementing this technique, our workload will be uniformly distributed among all AMPs, resulting in improved performance for the join operation. Applying this approach reduced the query’s runtime from 4 hours to just 3 minutes.