Teradata Joins – A Simple Guide to avoid Skew

Skewed Teradata Joins – The initial situation

Let’s take the following scenario as an example. One table contains currencies, the other table contains the accounts of our customers including the currency in which the account is managed.
This means that the ISO code of the currency is the foreign key of the account table:

CREATE TABLE Currency
(
   CURRENCY_CD  VARCHAR(20) NOT NULL,
   CURRENCY_NAME VARCHAR(200),  
) PRIMARY INDEX (CURRENCY_CD);

CREATE TABLE Customer
(
  CUSTOMER_ID  INTEGER NOT NULL,
  CUSTOMER_NAME VARCHAR(255),
  ...
  CURRENCY_CD CHAR(20)
) PRIMARY INDEX (CUSTOMER_ID);

CUSTOMER TABLE

CUSTOMER_IDCUSTOMER_NAMECURRENCY_CD
1 Nina Lowery EUR
2 Alexia Neal USD
3 Kyla Chan NULL
4 Alesha Ferrell NULL
5 Cara Adams NULL
6 Abigail Larsen NULL
7 Amie Massey NULL

CURRENCY TABLE

CURRENCY_CDCURRENCY_NAME
EUREuro
USDUS Dollar
AUDAustralian Dollar
HUFHungarian Forint

Join using the Column CURRENCY_CD requires that both tables have this column as their primary index and that the rows of both tables are available on the relevant AMPs. As you can see immediately from our example table for accounts, most accounts have no currency assigned (NULL). Therefore, the join will be skewed.

Even though current versions of Teradata make appropriate optimizations to reduce this problem (e.g. by re-hashing the rows of well-distributed values, but copying rows with a NULL value to all AMPs), this problem still persists. Especially if the optimizer does not have the corresponding statistics about biased values or only has them obsolete.

A possible solution to this Problem

One possible solution is to replace the NULL value with a value that is as distinct as possible. In our case, for example, the CUSTOMER_ID can be used. We could simply cast it as a CHARACTER column, add a “special string’ in front of it, and write it into the Column CURRENCY_CD:

CUSTOMER_IDCUSTOMER_NAMECURRENCY_CD
1 Nina Lowery EUR
2 Alexia Neal USD
3 Kyla Chan #NULL#3
4 Alesha Ferrell #NULL#4
5 Cara Adams #NULL#5
6 Abigail Larsen #NULL#6
7 Amie Massey #NULL#7

A disadvantage of this method, however, is that NULL values are no longer recognizable as such. So you might need some logic in the queries.

DWH Pro Admin
 

>