The Teradata Primary Index and Hash Collisions
Typically it makes no difference whether we use INTEGER data types or CHARACTER data types in the Teradata Primary Index. The Teradata hashing algorithm is highly efficient, and at most, when joining two tables with primary index columns with different CHARACTER SET, there can be minimal performance losses. But this is insignificant.
Still, there is a problem with CHARACTER columns in the primary index that we want to address in this article, which can negatively impact performance. We often overlook this problem because it is not visible like skew and is completely hidden.
Our test scenario assumes a Teradata system with 160 AMPs with 20-bit hash buckets. We create a table and choose a CHARACTER(8) column as the primary index containing 100 million phone numbers. 20-bit hash buckets correspond in numbers to 1,048,576 different hash buckets. We expect the following from the hashing algorithm:
- Each hash bucket is used approx. 95 times
- Each AMP receives approx. 625,000 rows
- Minimal to no hash collisions as billions of different hash values can be generated
We calculated the respective DSW, AMP, and number of collisions for all possible input values.
The figure below shows the number of different hash buckets (vertical axis) as a function of frequency generation for the string-based primary index column. The ideal distribution is an impulse function centered at value 95, with 1,048,576 counts. The observed distribution in our test is still centered at value 95, but it looks more like a bell curve with a maximum at about 43,000 counts, i.e., 25 times less than expected ones. One bucket is generated only 50 times, another only 51 times, and many more in the range of 120 to 150.
A closer analysis shows that the even distribution of the rows across the AMPs is not the problem. All AMPs get approximately the same number of rows (between 623,000 and 627,000). The problem is that hash collisions are generated on all AMPs, as we can see in the following graph:
The algorithm created approximately 15,000 hash collisions on each of the 160 AMPS. Most of the 15,000 hash collisions are caused by 3 to 4 primary index values being mapped to the same hash value. The fact that hash collisions harm performance should therefore worry us.
Now let’s look at the same scenario again, but this time store the phone numbers in a BIGINT column and use that as the Teradata Primary Index. Again we create 100 million rows and keep them in our table. The resulting graph now looks like what we would have expected for the CHARACTER column:
The chart becomes a sharp impulse function, with values strictly ranging between 93 and 98. Every AMP is assigned 625,000 rows, with a negligible variation of ±280. Even better, there are no duplicate hashes anymore!
As we have seen, using CHARACTER columns in the primary index can lead to many hash collisions. Typically, the number of different contents in a CHARACTER column is limited; therefore, this is often not a problem. However, in extreme cases, as just described, with millions of different values in the primary index column, this can lead to enormous performance problems.
I want to avoid misunderstandings. Using CHARACTER columns in the primary index is not generally harmful. Typically this has no adverse effect on retrieve steps, join steps, or aggregations. The test scenario shown here refers to the particular case where many (millions) of different values are stored in a CHARACTER column. Only here it comes to this massive number of hash collisions. So please don’t redesign the PDM now but be on the lookout for such cases.
In the following article, I will show you what else you need to pay attention to when choosing the Primary Index to achieve optimal performance: