What is the Teradata Primary Index?
The Teradata primary index is a mechanism that decides where each data row is physically located. It may be based on a single column or multiple columns. The values of the primary index columns may be unique or non-unique.
Don’t confuse the primary index with the primary key of a table. The primary key is a concept of logical data modeling; the primary index is used for defining how table rows are stored.
While the primary key is to identify each object stored in a table uniquely, the main idea behind the primary index choice is to use the parallel system most efficiently.
The Hashing Algorithm
Each row that is inserted into a table has to pass the hashing function. The hashing function calculates the so-called row hash from the primary index columns. The input order of the primary index columns doesn’t matter:
f(a,b) = f(b,a)
The data rows are distributed based on the calculated row hash. The so-called hash map assigns each row hash to one specific AMP. Each AMP has its mass storage assigned and is responsible for storing and retrieving a portion of a table’s data rows.
The same input values to the hashing function will always be assigned to the same AMP (as long as the system configuration is not changed). Sometimes different combinations of primary index columns map to the same row hash value called “hash collision.” Hash collisions can negatively impact performance.
The Teradata Primary Index Choice
We have to consider three criteria when selecting the primary index:
- A Good Access Path
- Even Distribution of Rows across all AMPs
- Low Volatility of the Primary Index Columns
A right access path means to achieve optimal retrieve and join performance:
In the retrieve steps, the primary index is the most efficient way to pick up data rows. Join steps are fast if the join columns of both tables are the same, making them an excellent primary index candidate.
The second important criterion for primary index choice is even row distribution, to use the parallel architecture optimally.
Often we can’t optimize for both mentioned goals at the same time. We may have to design the primary index for a fast access path, accepting that the data distribution is not ideal. It’s perfectly fine, as long as we are aware of this fact and know how to deal with the resulting issues.
Finally, the volatility of the primary index values should be kept to a minimum. When the rows’ primary index value is changed, the column values are sent to the hashing function, and the rows are re-distributed to their new AMP. Rehashing can become an expensive operation.
The unique primary index (UPI) offers the best performance:
- Only one row inside one data block is accessed.
- No spool file is created.
- Only one AMP is needed.
The NUPI offers still good performance but:
- Several rows are accessed typically in different data blocks.
- Duplicate row checked may occur.
- Several I/Os are needed if not all rows fit into one data block.
- Spool files might be used.
Ideally, we would like to have a non-volatile primary index with an even row distribution and a fast access path (indexed access) – but:
The “Perfect Primary Index” does not exist. Different workloads may require a different primary index for the same table.