
How You Can Reduce Teradata Hash Collisions

By DWH Pro Admin

February 21, 2020


To understand the problem of hash collisions in Teradata, I will first give a short overview of how rows are distributed to the AMPs. If you are not familiar with the Teradata architecture or have forgotten the details, I recommend reading an introduction to the Teradata architecture first.

As you already know, a table’s rows are distributed to the AMPs using a hashing algorithm.

The Foundations

This hashing algorithm takes one or more column values (the primary index columns) as input and generates a 32-bit integer called the ROWHASH.

Two characteristics of this hashing function matter when we talk about hash collisions (a small example follows the list):

  • The same input values always produce the same ROWHASH
  • Different input values can generate the same ROWHASH
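The first characteristic is easy to verify with the HASHROW function; the second one, an actual collision, is hard to demonstrate with hand-picked literals, which is exactly why we will measure it on a large table later. The following is just an illustrative sketch, and the literal values are arbitrary:

-- The same input value always produces the same rowhash
SELECT HASHROW(123456789)   AS RowHash_1,   -- identical to RowHash_2
       HASHROW(123456789)   AS RowHash_2,
       HASHROW('123456789') AS RowHash_3;   -- a different data type counts as a different input
                                            -- and will typically produce a different rowhash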

It is the second characteristic that interests us in this blog post. As already mentioned, a ROWHASH is a 32-bit integer.

This means that at most 2^32, i.e., roughly 4.3 billion, different ROWHASH values can be generated. So, roughly speaking, for every 4 billion rows we have to expect that a ROWHASH value is reused for a different input value. This is called a hash collision.

Hash collisions are why it is not sufficient to know the ROWHASH to locate a row in a data block.

Therefore, every time a row is looked up, both the ROWHASH and the value being searched for must be passed to the AMP in the parsing engine's message.

In addition, a 32-bit integer (the uniqueness value) is appended to the ROWHASH so that each row can be identified uniquely. The ROWHASH together with this 32-bit integer is called the ROWID.

The first 16 or 20 bits (depending on the Teradata system) of the ROWHASH are the so-called hash bucket. The hash bucket determines which entry in the hash map is responsible for this row, and this hash map entry points to the target AMP.
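This chain from input value to rowhash, hash bucket, and target AMP can be made visible with Teradata's hash functions. Again, this is only a sketch, and the literal value is an arbitrary example:

-- Follow one value from rowhash to hash bucket to the AMP that owns it
SELECT HASHROW(123456789)                      AS RowHash,
       HASHBUCKET(HASHROW(123456789))          AS HashBucket,
       HASHAMP(HASHBUCKET(HASHROW(123456789))) AS TargetAMP;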

A Case Study On Teradata Hash Collisions

Although the Teradata hashing algorithm generally provides a good distribution and is designed to keep hash collisions rare, care should be taken when defining character columns as the primary index. Here is an example.

We first create two tables: one with a primary index of data type VARCHAR(30) and one with a primary index of data type BIGINT:

CREATE TABLE DWHPRO.COLLISION_YES
(
    PK VARCHAR(30)  NOT NULL
) PRIMARY INDEX (PK);

CREATE TABLE DWHPRO.COLLISION_NO
(
    PK BIGINT  NOT NULL
) PRIMARY INDEX (PK);

We populate the two tables as shown below. We do this exactly four times with different SAMPLE sizes n of 40 million, 80 million, 120 million, and 160 million rows. After each load, we check the number of hash collisions:

-- Load the BIGINT primary index table with <n> distinct integer values
INSERT INTO DWHPRO.COLLISION_NO
SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM DWHPRO.A_HUGE_TABLE
SAMPLE <n>;

-- Load the VARCHAR(30) primary index table with the same values, converted to character
INSERT INTO DWHPRO.COLLISION_YES
SELECT TRIM(ROW_NUMBER() OVER (ORDER BY 1)) FROM DWHPRO.A_HUGE_TABLE
SAMPLE <n>;

-- Count the rowhash values shared by more than one row, i.e. the hash collisions
SELECT COUNT(*) FROM
(
    SELECT HASHROW(PK) AS HashValue, COUNT(*) AS X FROM DWHPRO.COLLISION_NO
    GROUP BY 1 HAVING X > 1
) X;

SELECT COUNT(*) FROM
(
    SELECT HASHROW(PK) AS HashValue, COUNT(*) AS X FROM DWHPRO.COLLISION_YES
    GROUP BY 1 HAVING X > 1
) X;
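Along the same lines, a small variation of this query shows the worst case, i.e., how many rows share the most frequently occurring rowhash. This is just an optional sketch following the pattern of the queries above:

-- Worst case: the largest number of rows sharing a single rowhash
SELECT MAX(X) AS MaxRowsPerRowHash FROM
(
    SELECT HASHROW(PK) AS HashValue, COUNT(*) AS X FROM DWHPRO.COLLISION_YES
    GROUP BY 1
) X;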

Here is the result:

Rows loaded    Collisions (BIGINT PI)    Collisions (VARCHAR(30) PI)
40 million     0                         213,481
80 million     0                         371,526
120 million    0                         824,160
160 million    0                         3,133,779

As you can see, the hashing function is much more stable with integer data types. There is no hash collision with any sample size for BIGINT.

The situation is quite different for a primary index of type VARCHAR(30):

(Chart: Teradata hash collisions for the VARCHAR(30) primary index)

Conclusions

As we have seen, with 160 million rows in the table, we are far from the expected one hash collision per 4 billion rows. In our example, there are more than 3 million colliding rowhash values!

The problem with hash collisions is that valuable CPU seconds are wasted on both reading and writing rows: the AMP has to compare the actual column values of every row that shares a rowhash to find or store the right one. A few collisions are harmless, but as they accumulate, the CPU effect becomes noticeable.

So you should think carefully about how to design your primary index, and I hope this is another incentive to use surrogate keys in your data warehouse. A sketch of what this can look like follows below.
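As a sketch only (the table and column names are made up for illustration), a numeric surrogate key used as the primary index could look like this, with the character business key kept as a normal attribute:

CREATE TABLE DWHPRO.CUSTOMER_DIM                 -- hypothetical example table
(
    Customer_Key  BIGINT      NOT NULL,          -- numeric surrogate key, used as primary index
    Customer_BKey VARCHAR(30) NOT NULL           -- original character business key
) PRIMARY INDEX (Customer_Key);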

DWH Pro Admin

Roland Wenzlofsky is an experienced freelance Teradata consultant and performance trainer. Born in Vienna, Austria's capital, he has been building and tuning some of the largest Teradata data warehouses in the European financial and telecommunication sectors for more than 20 years. He has filled all the roles of developer, designer, business analyst, and project manager. He therefore knows like no other the problems and obstacles that make many data warehouse projects fail, as well as the tricks and tips that will help you succeed.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>