What is the Teradata Primary Index?
The primary index is a mechanism that decides how each data row is physically located. It may be based on a single column or multiple columns. The values of the primary index columns may be unique or non-unique.
Don’t confuse the primary index with the primary key of a table. The primary key is a concept of logical data modeling; the primary index is used for defining how table rows are stored.
While the primary key is utilized to identify each object stored in a table uniquely, the main idea behind primary index choice is to use the parallel system in the most efficient way.
The Hashing Algorithm
Each row which is inserted into a table has to pass the hashing function. The hashing function calculates the so-called row hash from the primary index columns. The input order of the primary index columns doesn’t matter:
f(a,b) = f(b,a)
The data rows are distributed based on the calculated row hash. The so-called hash map assigns each row hash to one specific AMP. Each AMP has its mass storage assigned and is responsible for storing and retrieving a portion of a table’s data rows.
The same input values to the hashing function will always be assigned to the same AMP (as long as the system configuration is not changed). Sometimes different combinations of primary index columns map to the same row hash value, which is called “hash collision.” Hash collisions can negatively impact performance.
The Primary Index Choice
We have to consider three criteria when selecting the primary index:
- A Good Access Path
- Even Distribution of Rows across all AMPs
- Low Volatility of the Primary Index Columns
A good access path means to achieve optimal retrieve and join performance:
In retrieve steps, the primary index is the most efficient way to pick up data rows. Join steps are fast if the join columns of both tables are the same, making them an excellent primary index candidate.
The second important criteria for primary index choice is even row distribution, to use the parallel architecture in an optimal way.
Often we can’t optimize for both mentioned goals at the same time. We may have to design the primary index for a fast access path, accepting that the data distribution is not ideal. It’s perfectly fine, as long as we are aware of this fact and know how to deal with resulting issues.
Finally, the volatility of the primary index values should be kept to a minimum. When the primary index value of a row is changed, the column values are sent to the hashing function, and the rows are re-distributed to their new AMP. Rehashing can become an expensive operation.
Ideally, we would like to have a non-volatile primary index with an even row distribution and a fast access path (indexed access) – but:
The “Perfect Primary Index” does not exist. Different workloads may require a different primary index for the same table.