Natural skew may hinder join performance, but the PDPR capability in Teradata 14 can alleviate this problem. However, PDPR may not detect all potential applications if skewed values are not evident in the statistics.
If your Teradata version is below 14, PDPR is unavailable. To address this, you can utilize a technical method using the RANDOM() function to reduce skew.
Teradata Joins are vulnerable to Skewing.
Let’s delineate the specifics incrementally.
Our example involves a join between the two tables shown below. Certain facts are already established about these tables:
- Table 1 contains no row with ‘A’ in the join column. Furthermore, it’s ensured that there will never be value ‘A’ added to this column
- The join column of Table 2 contains many rows with the value ‘A’. It’s skewed (our example tables only include a few rows, but that’s enough to get the idea)
- Both tables are evenly distributed across all AMPs on their primary index column (called “primary key” in our example)
- There exists a range of join column values which never is available in the regular data of Table 2 (we will explain later why this is a must)
The aforementioned facts are necessary for our testing framework to operate.
By performing a left join of table 1 and table 2 on the common column, both tables’ rows are hashed on this column. Assuming Teradata redistributes both tables, this is exemplified.
The informed reader understands Teradata’s architecture and identifies that all rows containing ‘A’ in the join column will be relocated to a single AMP, which will cause table 2’s spool to be imbalanced. Conversely, the redistribution of table 1 should yield an evenly dispersed spool. When joining, the AMP retaining table 2’s rows with ‘A’ will bear a greater workload than the other AMPs.
We can employ the RANDOM() function to prevent skewing in this scenario.
Table 1 does not contain any rows with the value ‘A’ in the join column, meaning there will be no matches for any row in table 2 containing ‘A’.
We can easily replace ‘A’ in table 2 with a range of values unavailable as regular data (see requirement four above). Here comes the RANDOM() function into play.
We can generate a range of values, such as TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))), which are replacing the ‘A’ values of Table 2. As these values are non-existent in regular data, this will not have any impact on the result set:
CREATE VOLATILE TABLE NOT_SKEWING_TABLE2
AS
(
SELECT PrimaryKey, TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS JoinColumn FROM Table2
) PRIMARY INDEX (PrimaryKey);
We will join NOT_SKEWING_TABLE2 with table1 instead of Table2. Redistributing NOT_SKEWING_TABLE2 will not result in a skewed spool because the randomly generated values will be evenly distributed across all AMPs.
By implementing this technique, our workload will be uniformly distributed among all AMPs, resulting in improved performance for the join operation. Applying this approach reduced the query’s runtime from 4 hours to just 3 minutes.
Select ……………… from table1
left outer join table2
on
case when table2.Joincolumn = ‘A’ then cast(random(1,100) as varchar(3))
else table1.Joincolumn = table2.Joincolumn end:
Thanks for the detailed explanation. You are absolutely right, a more useful application of the RANDOM() function is your last example. Thanks again. Roland.
Instead of using the Volatile table, the quick way to get rid of skew value is to add a filter condition in the join clause as “Table2.JoinColumn ‘A’ “, since Table 2 is an inner table only matching records will be projected from this table.
This optimization usually applies when the outer table has skewed values in join columns. Example:
Select ……………… from table1 left outer join table2 on
table1.Joincolumn = table2.Joincolumn;
Let’s consider the table1 join column has a particular skewed value like ‘A’ which does not exist in table2. We cannot filter out this value as table1 is an outer table and all matching and non-matching values need to be projected.
We can solve this skew issue using below random() function technique,
Select ……………… from table1 left outer join table2 on
case when table1.Joincolumn = ‘A’ then cast(random(1,100) as varchar(3)) else table1.Joincolumn = table2.Joincolumn;