Skewed Teradata Joins – The Initial Situation
Let’s take the following scenario as an example. One table contains currencies. The other table has our customers’ accounts, including the currency in which the account is managed.
This means that the ISO code of the currency is the foreign key of the account table:
CREATE TABLE Currency
(
CURRENCY_CD VARCHAR(20) NOT NULL,
CURRENCY_NAME VARCHAR(200),
) PRIMARY INDEX (CURRENCY_CD);
CREATE TABLE Customer
(
CUSTOMER_ID INTEGER NOT NULL,
CUSTOMER_NAME VARCHAR(255),
...
CURRENCY_CD CHAR(20)
) PRIMARY INDEX (CUSTOMER_ID);
CUSTOMER TABLE
CUSTOMER_ID | CUSTOMER_NAME | CURRENCY_CD |
1 | Nina Lowery | EUR |
2 | Alexia Neal | USD |
3 | Kyla Chan | NULL |
4 | Alesha Ferrell | NULL |
5 | Cara Adams | NULL |
6 | Abigail Larsen | NULL |
7 | Amie Massey | NULL |
CURRENCY TABLE
CURRENCY_CD | CURRENCY_NAME |
EUR | Euro |
USD | US Dollar |
AUD | Australian Dollar |
HUF | Hungarian Forint |
Join using the Column CURRENCY_CD requires that both tables have this column as their primary index and that the rows of both tables are available on the relevant AMPs.
As you can see from our example table for accounts, most accounts have no currency assigned (NULL). Therefore, the join will be skewed.
Even though current versions of Teradata make appropriate optimizations to reduce this problem (e.g., by re-hashing the rows of well-distributed values but copying rows with a NULL value to all AMPs), this problem persists.
Especially if the optimizer does not have the corresponding statistics about biased values or only has them obsolete.
A Possible Solution To This Problem
One possible solution is to replace the NULL value with a value that is as distinct as possible.
In our case, for example, the CUSTOMER_ID can be used. We could cast it to a CHARACTER column, add a “special string’ in front of it, and write it into the Column CURRENCY_CD:
CUSTOMER_ID | CUSTOMER_NAME | CURRENCY_CD |
1 | Nina Lowery | EUR |
2 | Alexia Neal | USD |
3 | Kyla Chan | #NULL#3 |
4 | Alesha Ferrell | #NULL#4 |
5 | Cara Adams | #NULL#5 |
6 | Abigail Larsen | #NULL#6 |
7 | Amie Massey | #NULL#7 |
However, a disadvantage of this method is that NULL values are no longer recognizable. So you might need some logic in the queries.
Do you want to know everything about joins in Teradata? Then read this article: