3

Teradata Joins – Reduce Skewing with RANDOM()

Sometimes natural skew can become a huge problem, causing bad join performance. The Partial Duplication & Partial Redistribution (PDPR)  feature on Teradata 14 (and above) helps to reduce this issue, but will not always be able to detect all possible applications (for example, if statistics are not revealing the skewed values).

If you are stuck on a version below Teradata 14, you will not have PDPR available at all. A technical trick comes into play: Skew reduction with the RANDOM() function.

Teradata Joins are vulnerable to Skewing

Let’s carve out the details step by step.

In our example, we deal with a join between the two tables below. There are some known facts about these tables:

random

  1. Table 1 doesn’t contain any row with value ‘A’ in the join column. Furthermore, it’s ensured that there will never be value ‘A’ added to this column
  2. The join column of Table 2 contains an enormous amount of rows with value ‘A’. It’s skewed (our example tables only include a few rows, but that’s enough to get the idea)
  3. Both tables are evenly distributed across all AMPs on their primary index column (called “primary key” in our example)
  4. There exists a range of join column values which never is available in the regular data of Table 2 (we will explain later why this is a must)

Above listed facts is a requirement for our test setup to work.

When left joining table 1 with table 2 on the join column, rows of both tables have to be rehashed on this column. We assume in our example that Teradata redistributes both tables.

The alert reader having some knowledge about the Teradata architecture will recognize that all rows with value ‘A’ in the join column will be moved to the same AMP. The spool of table 2 will be heavily skewed. On the other hand, will the redistribution of table 1 most likely result in an evenly distributed spool. During the joining step, the AMP holding table 2 rows with value ‘A’ will have to do much more work than the other AMPs.

How can we avoid skewing in this case? The RANDOM() function can help us:

We know that table 1 is not containing any rows with value ‘A’ in the join column (and never will by definition).  This information allows us to conclude, that we will have no matches for any table 2 row containing ‘A’.

We can easily replace ‘A’ in table 2 with a range of values not available as regular data (see requirement four above). Here comes the RANDOM() function into play.

We can generate a range of values, such as TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))), which are replacing the ‘A’ values of Table 2. As these values are non-existent in regular data, this will not have any impact on the result set:

CREATE VOLATILE TABLE NOT_SKEWING_TABLE2
AS
(
SELECT PrimaryKey, TRIM(CAST(RANDOM(1,100) AS VARCHAR(3))) AS JoinColumn FROM Table2
) PRIMARY INDEX (PrimaryKey);

Instead of joining Table2 with Table1 we will join NOT_SKEWING_TABLE2 with table1. The redistribution of table NOT_SKEWING_TABLE2 will not cause a skewed spool as the randomly generated values will spread evenly across all AMPs.

As a result, we will have evenly distributed workload across all AMPs; the join will perform much better than without this trick. Just today, I was able to improve the run time of a query from 4 hours to 3 minutes, simply by applying above method.

Roland Wenzlofsky
 

Roland Wenzlofsky is a graduated computer scientist and Data Warehouse professional working with the Teradata database system for more than 15 years. He is experienced in the fields of banking and telecommunication with a strong focus on performance optimization.

  • Avatar Siddesh Pawar says:

    Instead of using Volatile table, the quick way to get rid of skew value is to add a filter condition in join clause as “Table2.JoinColumn ‘A’ “, since Table 2 is a inner table only matching records will be projected from this table.

    This optimization usually applies when outer table has skewed values in join columns. Example:

    Select ……………… from table1 left outer join table2 on
    table1.Joincolumn = table2.Joincolumn;

    Lets consider table1 joincolumn has a particular skewed value like ‘A’ which does not exists in table2. We cannot filter out this value as table1 is a outer table and all matching and non-matching values need to be projected.

    We can solve this skew issue using below random() function technique,

    Select ……………… from table1 left outer join table2 on
    case when table1.Joincolumn = ‘A’ then cast(random(1,100) as varchar(3)) else table1.Joincolumn = table2.Joincolumn;

  • Thanks for the detailed explanation. You are absolutely right, a more useful application of the RANDOM() function is your last example. Thanks again. Roland.

  • Avatar sobhan bhaduri says:

    Select ……………… from table1
    left outer join table2
    on
    case when table2.Joincolumn = ‘A’ then cast(random(1,100) as varchar(3))
    else table1.Joincolumn = table2.Joincolumn end:

  • >