To understand the problem of hash collisions in Teradata, I will first give a short overview of how rows are distributed across the AMPs. If you are not familiar with the Teradata architecture, or have forgotten the details, I recommend reading the following article first:
As you already know, the rows of a table are distributed across the AMPs using a hashing algorithm. This algorithm takes one or more column values as input and generates a 32-bit integer, the so-called ROWHASH, as output.
Two characteristics of this hashing function are important when we talk about hash collisions:
- The same input values always produce the same ROWHASH (see the sketch after this list)
- Different input values can generate the same ROWHASH
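To see the first characteristic in action, you can use Teradata's built-in HASHROW function (the same function the check queries later in this post rely on). Here is a minimal sketch with an arbitrary example value; note that the hash is computed over the internal representation of a value, so, to my understanding, the same digits passed as a character string produce a different ROWHASH:

SELECT HASHROW(123456789)   AS Hash_Num_1  -- arbitrary example value
     , HASHROW(123456789)   AS Hash_Num_2  -- always identical to Hash_Num_1
     , HASHROW('123456789') AS Hash_Char;  -- hashed as a character string

This type dependency is exactly what the case study below exploits: the same sequence of digits stored as BIGINT and as VARCHAR leads to very different hashing behavior.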
It is the second characteristic that interests us in this blog post. As already mentioned, a ROWHASH is a 32-bit integer. This means that about 4.3 billion (2^32) different ROWHASH values exist. So we have to expect that, once the number of distinct input values approaches this range, ROWHASH values are reused for different inputs. This is called a hash collision.
Hash collisions are the reason why knowing the ROWHASH alone is not sufficient to locate a row in a data block. Therefore, every time a row is looked up, both the ROWHASH and the searched value must be passed to the AMP in the message generated by the parsing engine.
In addition, a 32-bit integer, the so-called uniqueness value, is appended to the ROWHASH to distinguish rows that share the same ROWHASH. This 32-bit integer together with the ROWHASH is called the ROWID.
The first 16 or 20 bits (depending on the Teradata system) of the ROWHASH are the so-called hash bucket. The hash bucket determines which entry in the hash map is responsible for the row, and that entry finally points to the target AMP.
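The whole chain from value to target AMP can be traced with the built-in functions HASHROW, HASHBUCKET, and HASHAMP. A minimal sketch, again with an arbitrary example value:

SELECT HASHROW(123456789)                      AS RowHash     -- 32-bit row hash
     , HASHBUCKET(HASHROW(123456789))          AS HashBucket  -- leading 16 or 20 bits
     , HASHAMP(HASHBUCKET(HASHROW(123456789))) AS TargetAMP;  -- hash map lookup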
A Case Study On Teradata Hash Collisions
Although the Teradata hashing algorithm generally provides a good distribution and is designed to keep hash collisions rare, care should be taken when defining columns of type character as the primary index. Here is an example.
We first create two tables. The primary index of one table is of data type BIGINT; the second table has a primary index of data type VARCHAR(30):
CREATE TABLE DWHPRO.COLLISION_YES
(
   PK VARCHAR(30) NOT NULL
) PRIMARY INDEX (PK);

CREATE TABLE DWHPRO.COLLISION_NO
(
   PK BIGINT NOT NULL
) PRIMARY INDEX (PK);
We populate the two tables as shown below. We do this exactly four times, with SAMPLE sizes n of 40 million, 80 million, 120 million, and 160 million rows. After each load of the tables, we check the number of hash collisions:
INSERT INTO DWHPRO.COLLISION_NO
SELECT ROW_NUMBER() OVER (ORDER BY 1)
FROM DWHPRO.A_HUGE_TABLE SAMPLE <n>;

INSERT INTO DWHPRO.COLLISION_YES
SELECT TRIM(ROW_NUMBER() OVER (ORDER BY 1))
FROM DWHPRO.A_HUGE_TABLE SAMPLE <n>;
SELECT COUNT(*)
FROM
(
   SELECT HASHROW(PK) AS HashValue
        , COUNT(*)    AS X
   FROM DWHPRO.COLLISION_NO
   GROUP BY 1
   HAVING X > 1
) AS X;

SELECT COUNT(*)
FROM
(
   SELECT HASHROW(PK) AS HashValue
        , COUNT(*)    AS X
   FROM DWHPRO.COLLISION_YES
   GROUP BY 1
   HAVING X > 1
) AS X;
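As a side note, the two queries above count the number of distinct ROWHASH values that occur more than once. If you are interested in how many rows are affected instead, a small variation does the job (a sketch based on the tables defined above):

SELECT COALESCE(SUM(X), 0) AS AffectedRows
FROM
(
   SELECT HASHROW(PK) AS HashValue
        , COUNT(*)    AS X
   FROM DWHPRO.COLLISION_YES
   GROUP BY 1
   HAVING X > 1
) AS T;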
Here is the result of the collision counts:
As you can see, the hashing function is much more robust with integer data types: there is no hash collision for BIGINT at any sample size.
The situation is quite different for a primary index of type VARCHAR(30):
As we have seen, with 120 million rows in the table, we are far beyond the naive expectation of one hash collision per 4 billion rows: in our example, there are more than 3 million!
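For context: because of the birthday effect, even an ideal random 32-bit hash would produce some collisions at this scale; with n rows, roughly n*(n-1)/(2*2^32) colliding pairs are expected. A quick back-of-the-envelope calculation (plain arithmetic, no table access) gives about 1.7 million for 120 million rows; the BIGINT column therefore does far better than a random hash, while the VARCHAR(30) column does noticeably worse:

SELECT CAST(120000000 AS FLOAT) * (120000000 - 1)
       / (2 * CAST(4294967296 AS FLOAT)) AS ExpectedCollidingPairs;  -- approx. 1.68 million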
The problem with hash collisions is that valuable CPU seconds are wasted on both reading and writing rows, because collisions make it harder to locate a row and to store rows in data blocks. A few collisions may not be a problem, but eventually the effect on the CPU becomes noticeable.
So you should think carefully about how to design your Primary Index, and I hope this is another incentive to use surrogate keys in your data warehouse.