
How You Can Reduce Teradata Hash Collisions

By DWH Pro Admin

February 21, 2020


To understand the problem of hash collisions in Teradata, I will first give a short overview of how rows are distributed across the AMPs. If you are not familiar with the Teradata architecture, or you have forgotten the details, I recommend reading our introductory article on the Teradata architecture first.

As you already know, the rows of a table are distributed to the AMPs using a hashing algorithm.

The Foundations

This hashing algorithm has as input one or more column values and generates as output a 32-bit integer called ROWHASH.

Two characteristics of this hashing function are important when we talk about hash collisions:

  • The same input values always produce the same ROWHASH
  • Different input values can generate the same ROWHASH
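Both properties can be observed directly with the built-in HASHROW function. The literal values below are arbitrary examples:

```sql
-- Property 1: the same input value always produces the same ROWHASH
SELECT
    CASE WHEN HASHROW('DWHPRO') = HASHROW('DWHPRO')
         THEN 'equal' ELSE 'different'
    END AS SameInput;

-- Note: the data type is part of the input. The integer 123456 and
-- the string '123456' do not necessarily hash to the same value,
-- because numeric and character values are hashed differently.
SELECT
    HASHROW(123456)   AS HashOfInteger,
    HASHROW('123456') AS HashOfVarchar;
```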

It is the second characteristic that interests us in this blog post. As already mentioned, a ROWHASH is a 32-bit integer.

This means that about 4.29 billion (2^32) different ROWHASH values exist. Once more than 2^32 distinct input values have been hashed, at least two of them must share a ROWHASH; in practice, because of the birthday effect, collisions appear much earlier. Whenever two different input values produce the same ROWHASH, we speak of a hash collision.
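How early should we expect collisions? If the hash behaved like an ideal random 32-bit function, the birthday approximation gives the expected number of colliding pairs among $n$ distinct inputs:

$$
E \approx \binom{n}{2} \cdot \frac{1}{2^{32}} = \frac{n(n-1)}{2^{33}}
$$

For $n = 160$ million, this is already on the order of 3 million expected collisions — far from a once-per-4-billion-rows event.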

Hash collisions are the reason why it is not sufficient to know the ROWHASH to locate a row in a data block.

Therefore, every time a row is searched, the parsing engine must pass both the ROWHASH and the search value to the AMP in its message.

In addition, a 32-bit integer, the so-called uniqueness value, is appended to the ROWHASH to distinguish rows that share the same ROWHASH. Together, ROWHASH and uniqueness value form the 64-bit ROWID, which uniquely identifies each row.

The first 16 or 20 bits of the ROWHASH (depending on the Teradata system) are the so-called hash bucket. The hash bucket determines which entry in the hash map is responsible for the row, and that entry finally points to the target AMP.
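The whole chain from value to AMP can be traced with the built-in functions HASHROW, HASHBUCKET, and HASHAMP (the literal 'DWHPRO' is just an example value):

```sql
SELECT
    HASHROW('DWHPRO')                      AS RowHash,    -- 32-bit row hash
    HASHBUCKET(HASHROW('DWHPRO'))          AS HashBucket, -- first 16/20 bits
    HASHAMP(HASHBUCKET(HASHROW('DWHPRO'))) AS TargetAMP;  -- hash map lookup
```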

A Case Study On Teradata Hash Collisions

Although the Teradata hashing algorithm generally distributes rows evenly and is designed to minimize hash collisions, care should be taken when choosing columns of type character as the primary index. Here is an example.

We first create two tables: one with a primary index of data type VARCHAR(30), the other with a primary index of data type BIGINT:

CREATE TABLE DWHPRO.COLLISION_YES
(
    PK VARCHAR(30)  NOT NULL
) PRIMARY INDEX (PK);

CREATE TABLE DWHPRO.COLLISION_NO
(
    PK BIGINT  NOT NULL
) PRIMARY INDEX (PK);

We populate the two tables as shown below. We do this four times with SAMPLE sizes n of 40 million, 80 million, 120 million, and 160 million rows. After each load, we check the number of hash collisions:

INSERT INTO  DWHPRO.COLLISION_NO
SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM DWHPRO.A_HUGE_TABLE
SAMPLE <n>;

INSERT INTO  DWHPRO.COLLISION_YES
SELECT TRIM(ROW_NUMBER() OVER (ORDER BY 1)) FROM DWHPRO.A_HUGE_TABLE
SAMPLE <n>;

SELECT COUNT(*) FROM
(
    SELECT HASHROW(PK) AS HashValue, COUNT(*) AS X
    FROM DWHPRO.COLLISION_NO
    GROUP BY 1 HAVING X > 1
) T;

SELECT COUNT(*) FROM
(
    SELECT HASHROW(PK) AS HashValue, COUNT(*) AS X
    FROM DWHPRO.COLLISION_YES
    GROUP BY 1 HAVING X > 1
) T;

Here is the result (number of ROWHASH values shared by more than one row):

Rows       BIGINT    VARCHAR(30)
40 Mio     0         213,481
80 Mio     0         371,526
120 Mio    0         824,160
160 Mio    0         3,133,779

As you can see, the hashing function is much more stable for integer data types: the BIGINT primary index produces no hash collision at any sample size.

The situation is quite different for a primary index of type VARCHAR(30), where the number of collisions grows rapidly with the table size.

Conclusions

As we have seen, the VARCHAR(30) primary index produces collisions long before the 4-billion-value space is exhausted: with 160 million rows, more than 3 million ROWHASH values are already shared by multiple rows, while the BIGINT primary index produced none.

The problem with hash collisions is that valuable CPU seconds are wasted on both reading and writing rows. Hash collisions make both finding the data and storing rows in data blocks difficult. This may not be a problem if there are only a few, but eventually, the effect on the CPU will be noticeable.
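To gauge how severe the synonym chains in a table are, you can look at how many rows share a single ROWHASH. Here is a sketch against the case-study table (adapt the table and column names to your own schema):

```sql
SELECT
    MAX(RowsPerHash) AS LongestSynonymChain,
    AVG(RowsPerHash) AS AvgRowsPerHash
FROM
(
    SELECT HASHROW(PK) AS HashValue, COUNT(*) AS RowsPerHash
    FROM DWHPRO.COLLISION_YES
    GROUP BY 1
) T;
```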

So you should think carefully about how to design your primary index, and I hope this is another incentive to use surrogate keys in your data warehouse.

DWH Pro Admin


Teradata and Data Warehouse Specialist with more than 20 years of experience


