The Teradata Primary Index and Hash Collisions
Using either INTEGER or CHARACTER data types for the Teradata Primary Index does not usually impact performance. The hashing algorithm in Teradata is highly efficient. In the rare case where primary index columns from different tables have different CHARACTER SET, there may be minimal performance losses, but this is insignificant.
This article addresses a performance issue with CHARACTER columns in the primary index. Although not as apparent as skew, this problem is often overlooked and remains hidden.
Assuming a Teradata system with 160 AMPs and 20-bit hash buckets, we create a table with a primary index of a CHARACTER(8) column that contains 100 million phone numbers. The 20-bit hash buckets allow for 1,048,576 unique hash buckets. The hashing algorithm is expected to provide:
- Each hash bucket is used approx. 95 times
- Each AMP receives approx. 625,000 rows
- Minimal to no hash collisions, as billions of different hash values can be generated
We computed the DSW, AMP, and collision count for every conceivable input value.
The chart displays the quantity of hash buckets (y-axis) produced by the frequency generation of the primary index column utilizing strings (x-axis). An ideal distribution would exhibit an impulse function with 1,048,576 counts focused on 95. Our test results show a distribution still centered on 95, but resembling a bell curve with a peak of approximately 43,000 counts, a mere 25% of the expected value. The frequency of certain buckets is significantly lower, with one bucket only occurring 50 times and another 51 times. The majority of buckets fall within the range of 120 to 150 occurrences.
Upon closer examination, it becomes evident that the equitable allocation of rows across the AMPs is not the underlying issue. Each AMP receives an almost identical number of rows, ranging from 623,000 to 627,000. Rather, the problem lies in generating hash collisions on all AMPs, as demonstrated in the subsequent graph.
The algorithm produced around 15,000 hash collisions per 160 AMPS. Most of these collisions result from three or four primary index values being linked to the same hash value. This is concerning, as hash collisions have a negative impact on performance.
Let’s revisit the scenario, but this time, we store the phone numbers in a BIGINT column and employ it as the Teradata Primary Index. We create a table containing 100 million rows. The resulting graph corresponds to our expectations for the CHARACTER column.
The graph is now a precise impulse function, with values ranging exclusively from 93 to 98. Each AMP is allocated 625,000 rows, with a minimum fluctuation of ±280. Most importantly, there are currently no replicated hashes.
Summary
Using CHARACTER columns in the primary index may result in numerous hash collisions. While the contents of such columns are usually limited, this may not be a concern. Nevertheless, significant performance issues may arise in exceptional scenarios where millions of distinct values are present in the primary index column.
To prevent confusion, it is worth noting that implementing CHARACTER columns in the primary index is usually not detrimental. Retrieval, joining, and aggregating are ordinarily not impacted. The situation demonstrated in this test scenario involves storing numerous (millions) diverse values in a CHARACTER column, which only results in an immense amount of hash collisions in this circumstance. Therefore, it is advised not to overhaul the PDM presently but to remain vigilant for these kinds of cases.
This article will illustrate additional factors to consider when selecting the Primary Index for optimal performance.