How You Can Reduce Teradata Hash Collisions

DWH Pro Admin

February 21, 2020

To understand the problem of hash collisions in Teradata, I will first give a brief overview of how rows are distributed. If you are not familiar with the Teradata architecture or have forgotten the details, I recommend reading an introduction to it first.

As you already know, a table’s rows are distributed to the AMPs using a hashing algorithm.

The Foundations

This hashing algorithm takes one or more column values as input and generates a 32-bit integer called the ROWHASH.

Two characteristics of this hashing function are essential when we talk about hash collisions:

  • The same input values always produce the same ROWHASH
  • Different input values can generate the same ROWHASH

It is the second characteristic that interests us in this blog post. As already mentioned, a ROWHASH is a 32-bit integer.

This means that about 4.3 billion (2^32) different ROWHASH values exist. So, at the latest after 4.3 billion distinct input values, the same ROWHASH must be reused for a different input value; as the birthday paradox tells us, collisions typically occur much earlier. This is called a hash collision.
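Both characteristics can be observed directly with Teradata's HASHROW function, which returns the ROWHASH for any expression. A minimal sketch of the first characteristic (the input value is arbitrary):

SELECT
    HASHROW('DWHPRO') AS hash_1,  -- hashing the same value twice ...
    HASHROW('DWHPRO') AS hash_2;  -- ... always returns the same 4-byte ROWHASH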

Hash collisions are the reason why knowing the ROWHASH alone is not sufficient to locate a row in a data block.

Therefore, every time a row is searched for, both the ROWHASH and the searched value must be passed to the AMP in the parsing engine's message.

In addition, a 32-bit integer, the so-called uniqueness value, is appended to the ROWHASH to distinguish rows that share the same ROWHASH. This 32-bit integer, together with the ROWHASH, forms the 64-bit ROWID.

The first 16 or 20 bits (depending on the Teradata system) of the ROWHASH are the so-called hash bucket. The hash bucket determines which entry in the hash map is responsible for this row. Finally, the hash map entry points to the target AMP.
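The whole chain from input value to AMP can be traced with the built-in functions HASHROW, HASHBUCKET, and HASHAMP. A small sketch, again with an arbitrary input value:

SELECT
    HASHROW('DWHPRO')                      AS row_hash,     -- 32-bit ROWHASH
    HASHBUCKET(HASHROW('DWHPRO'))          AS hash_bucket,  -- first 16 or 20 bits
    HASHAMP(HASHBUCKET(HASHROW('DWHPRO'))) AS target_amp;   -- AMP from the hash map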

A Case Study On Teradata Hash Collisions

Although the Teradata hashing algorithm provides a good distribution and is designed to minimize hash collisions, care should be taken when defining character columns as the primary index. Here is an example.

We first create two tables: one with a primary index of data type VARCHAR(30), the other with a primary index of data type BIGINT:

CREATE TABLE DWHPRO.COLLISION_YES  -- character primary index
(
    PK VARCHAR(30) NOT NULL
) PRIMARY INDEX (PK);

CREATE TABLE DWHPRO.COLLISION_NO   -- numeric primary index
(
    PK BIGINT NOT NULL
) PRIMARY INDEX (PK);

We populate the two tables as shown below, precisely four times, with SAMPLE sizes n of 40 million, 80 million, 120 million, and 160 million rows. After each load, we count the hash collisions:

-- Distinct sequential numbers as BIGINT values
INSERT INTO DWHPRO.COLLISION_NO
SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM DWHPRO.A_HUGE_TABLE
SAMPLE <n>;

-- The same sequential numbers, implicitly cast to character by TRIM()
INSERT INTO DWHPRO.COLLISION_YES
SELECT TRIM(ROW_NUMBER() OVER (ORDER BY 1)) FROM DWHPRO.A_HUGE_TABLE
SAMPLE <n>;

-- Count ROWHASH values shared by more than one (distinct) PK value
SELECT COUNT(*) FROM
(
    SELECT HASHROW(PK) AS HashValue, COUNT(*) X FROM DWHPRO.COLLISION_NO
    GROUP BY 1 HAVING X > 1
) X;

SELECT COUNT(*) FROM
(
    SELECT HASHROW(PK) AS HashValue, COUNT(*) X FROM DWHPRO.COLLISION_YES
    GROUP BY 1 HAVING X > 1
) X;

Here is the result:

Rows          BIGINT    VARCHAR(30)
40 million    0            213,481
80 million    0            371,526
120 million   0            824,160
160 million   0          3,133,779

As you can see, the hashing function is much more stable with integer data types: there is not a single hash collision for BIGINT at any sample size.

The situation is quite different for a primary index of type VARCHAR(30):

(Figure: Teradata Hash Collisions)

Conclusions

As we have seen, with 160 million rows in the table, we are far from the naive expectation of one hash collision per 4 billion rows. In our example, more than 3 million ROWHASH values are shared by several rows!

The problem with hash collisions is that valuable CPU seconds are wasted reading and writing rows. Hash collisions make locating and storing rows in data blocks more expensive. This may not be noticeable if only a few rows are affected, but eventually, the CPU cost becomes significant.

So you should think carefully about how you design your primary index, and I hope this is another incentive to use surrogate keys in your data warehouse.
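As a sketch of what this could look like, here is a hypothetical dimension table whose primary index is a surrogate key generated by a Teradata identity column (table and column names are made up for illustration):

CREATE TABLE DWHPRO.CUSTOMER_DIM  -- hypothetical example table
(
    CUSTOMER_SK BIGINT GENERATED ALWAYS AS IDENTITY
                (START WITH 1 INCREMENT BY 1) NOT NULL,  -- surrogate key
    CUSTOMER_NO VARCHAR(30) NOT NULL                     -- natural key
) PRIMARY INDEX (CUSTOMER_SK);

The surrogate key consists of distinct BIGINT values, which, as the case study suggests, hash far more cleanly than character strings.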

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>