The Primary Index Choice


What is the Teradata Primary Index?

The primary index determines where each data row is physically stored. It may consist of a single column or of multiple columns, and the values of the primary index columns may be unique or non-unique.

Don’t confuse the primary index with the primary key of a table. The primary key is a concept of logical data modeling; the primary index is used for defining how table rows are stored.

While the primary key is utilized to identify each object stored in a table uniquely, the main idea behind primary index choice is to use the parallel system in the most efficient way.

The Hashing Algorithm

Every row inserted into a table passes through the hashing function, which calculates the so-called row hash from the values of the primary index columns. The input order of the primary index columns doesn’t matter:

f(a,b) = f(b,a)

The data rows are distributed based on the calculated row hash. The so-called hash map assigns each row hash to one specific AMP. Each AMP has its own mass storage assigned and is responsible for storing and retrieving its portion of a table’s data rows.

The same input values to the hashing function will always be assigned to the same AMP (as long as the system configuration is not changed). Sometimes different combinations of primary index values map to the same row hash, which is called a “hash collision.” Hash collisions can negatively impact performance.
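
The path from primary index values to an AMP can be sketched as a toy model in Python. This is only an illustration under stated assumptions: the hash function (SHA-256 truncated to 32 bits), the addition used to combine columns, and the names NUM_AMPS, NUM_BUCKETS, column_hash, row_hash, HASH_MAP and amp_for are all invented for this sketch and are not Teradata's actual algorithm or API.

```python
import hashlib

NUM_AMPS = 4       # assumed toy system size
NUM_BUCKETS = 16   # assumed; kept small so the mapping is easy to inspect


def column_hash(value) -> int:
    """Deterministic 32-bit hash of one column value (a stand-in, not Teradata's)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def row_hash(*pi_values) -> int:
    """Combine the column hashes by addition, so the column order does not matter."""
    return sum(column_hash(v) for v in pi_values) % 2**32


# Toy hash map: every hash bucket is owned by exactly one AMP.
HASH_MAP = {bucket: bucket % NUM_AMPS for bucket in range(NUM_BUCKETS)}


def amp_for(*pi_values) -> int:
    """Look up the AMP that stores a row with these primary index values."""
    bucket = row_hash(*pi_values) % NUM_BUCKETS
    return HASH_MAP[bucket]


if __name__ == "__main__":
    # Same values in a different column order give the same row hash.
    print(row_hash(1001, "DE") == row_hash("DE", 1001))
    # The same input is always assigned to the same AMP.
    print(amp_for(1001, "DE") == amp_for(1001, "DE"))
    # Two different rows, (1, 2) and (2, 1), share one row hash in this toy model.
    print(row_hash(1, 2) == row_hash(2, 1))
```

In this sketch the last comparison is a simple instance of a hash collision: two distinct value combinations end up with the same row hash and therefore on the same AMP.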

The Primary Index Choice 

We have to consider three criteria when selecting the primary index:

    • A Good Access Path
    • Even Distribution of Rows across all AMPs
    • Low Volatility of the Primary Index Columns

A good access path means optimal retrieve and join performance:
In retrieve steps, access via the primary index is the most efficient way to pick up data rows. Join steps are fast if both tables have the same primary index and are joined on it, because matching rows are then already stored on the same AMP. This makes frequently used join columns excellent primary index candidates.
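
Why a shared primary index helps joins can be shown with a small simulation. It is a sketch, not Teradata's join implementation: the tables customers and orders, the column names, and the helpers amp_for, distribute and amp_local_join are hypothetical, and the hash is a simple stand-in.

```python
import hashlib
from collections import defaultdict

NUM_AMPS = 4  # assumed toy system size


def amp_for(pi_value) -> int:
    """Stand-in for row hash plus hash map lookup (not Teradata's real function)."""
    digest = hashlib.sha256(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


def distribute(rows, pi_column):
    """Place every row on the AMP chosen by its primary index value."""
    amps = defaultdict(list)
    for row in rows:
        amps[amp_for(row[pi_column])].append(row)
    return amps


def amp_local_join(left_amps, right_amps, column):
    """Join on each AMP separately; no row ever leaves its AMP."""
    result = []
    for amp in range(NUM_AMPS):
        for left_row in left_amps.get(amp, []):
            for right_row in right_amps.get(amp, []):
                if left_row[column] == right_row[column]:
                    result.append({**left_row, **right_row})
    return result


# Hypothetical tables, both with customer_id as their primary index.
customers = [{"customer_id": i, "name": f"customer {i}"} for i in range(1, 6)]
orders = [{"customer_id": 1 + i % 5, "order_id": 100 + i} for i in range(10)]

joined = amp_local_join(
    distribute(customers, "customer_id"),
    distribute(orders, "customer_id"),
    "customer_id",
)
print(len(joined))  # every order finds its customer without leaving its AMP
```

Because equal customer_id values always hash to the same AMP, the per-AMP joins together find every matching pair. If orders were distributed by order_id instead, its rows would first have to be moved to the AMPs holding the matching customers.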

The second important criterion for primary index choice is even row distribution, which lets us use the parallel architecture optimally.
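
How much the choice of column affects distribution can be estimated with a toy calculation. The sketch below assumes an 8-AMP system, a stand-in hash function, and two hypothetical primary index candidates (order_ids and status_codes); the skew_ratio measure is one simple choice for illustration, not an official Teradata metric.

```python
import hashlib
from collections import Counter

NUM_AMPS = 8  # assumed toy system size


def amp_for(pi_value) -> int:
    """Stand-in for row hash plus hash map lookup (not Teradata's real function)."""
    digest = hashlib.sha256(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


def skew_ratio(pi_values) -> float:
    """Rows on the fullest AMP divided by the average rows per AMP (1.0 is perfectly even)."""
    rows_per_amp = Counter(amp_for(v) for v in pi_values)
    average = len(pi_values) / NUM_AMPS
    return max(rows_per_amp.values()) / average


# Hypothetical table of 10,000 rows with two primary index candidates.
order_ids = list(range(10_000))             # unique values
status_codes = ["OPEN", "CLOSED"] * 5_000   # only two distinct values

print(f"order_id as PI:    {skew_ratio(order_ids):.2f}")
print(f"status_code as PI: {skew_ratio(status_codes):.2f}")
```

With only two distinct values, at most two of the eight AMPs receive any rows, so the fullest AMP holds at least four times the average while the others sit idle.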

Often we can’t optimize for both mentioned goals at the same time. We may have to design the primary index for a fast access path, accepting that the data distribution is not ideal. It’s perfectly fine, as long as we are aware of this fact and know how to deal with resulting issues.

Finally, the volatility of the primary index values should be kept to a minimum. When the primary index value of a row is changed, the new column values are sent to the hashing function, and the row is re-distributed to its new AMP. Rehashing can become an expensive operation.
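
The cost of a volatile primary index can be made visible with a small simulation. Everything here is an assumption for illustration: the 4-AMP system, the stand-in hash, the hypothetical account_no column, and the helpers insert and update_primary_index, which model AMPs as plain Python lists.

```python
import hashlib

NUM_AMPS = 4  # assumed toy system size


def amp_for(pi_value) -> int:
    """Stand-in for row hash plus hash map lookup (not Teradata's real function)."""
    digest = hashlib.sha256(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


def insert(amps, row, pi_column):
    """Store a row on the AMP selected by its primary index value."""
    amps[amp_for(row[pi_column])].append(row)


def update_primary_index(amps, row, pi_column, new_value) -> bool:
    """Change the PI value of a row; return True if the row moved to another AMP."""
    old_amp = amp_for(row[pi_column])
    amps[old_amp].remove(row)        # take the row off its current AMP
    row[pi_column] = new_value
    new_amp = amp_for(new_value)     # re-hash with the new value
    amps[new_amp].append(row)        # store it on the AMP that now owns it
    return new_amp != old_amp


amps = {amp: [] for amp in range(NUM_AMPS)}
rows = [{"account_no": i, "balance": 0} for i in range(1_000)]
for row in rows:
    insert(amps, row, "account_no")

# Hypothetical renumbering of every account: each update triggers a re-hash.
moved = sum(
    update_primary_index(amps, row, "account_no", row["account_no"] + 1_000_000)
    for row in rows
)
print(f"{moved} of {len(rows)} rows changed AMP")
```

Updating a column that is not part of the primary index would leave every row on its AMP; in this toy model, only a change to the primary index value forces the remove, re-hash and re-insert cycle.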

Ideally, we would like to have a non-volatile primary index with an even row distribution and a fast access path (indexed access) – but:

The “Perfect Primary Index” does not exist. Different workloads may require a different primary index for the same table.

Questions?
If you have any questions about all this, please ask in the comments! I’ll be paying close attention and answering as many as I can. Thank you for reading. Whatever this blog has become, I owe it all to you.
The Primary Index Choice written by Roland Wenzlofsky on February 14, 2017 average rating 4.6/5 - 14 user ratings

9 COMMENTS

  1. Hi Roland,
    Came back here after months and saw the UI changed. Pleasant surprise 🙂
    I’m not sure I understand “Different workloads may require a different primary index for the same table.”
    Could you please help with an example?

    Regards,
    Virendra

  2. Hi,

    Primary key and primary index are two different concepts. Primary keys are used in logical data modeling. A primary key is used to identify an object uniquely. The primary index is utilized in the physical modeling process. It is used to achieve proper data distribution and to give Teradata a fast data access path. Often it’s the case that the primary key is at the same time a good primary index, as it’s unique and therefore ensures even data distribution.

    In my opinion, using a surrogate key for tables such as customers always pays off in the long term. I have seen several times that a new legacy system was introduced (replacing an existing one). If you use surrogate keys, changes on the implementation side are limited to mapping the new natural keys to existing surrogate keys. Without surrogate keys, it can be a challenging task to adjust natural keys in all core databases and data marts (Of course, without a source independent data model, you still will have a hard time)

    Best Regards,
    Roland

      • Hello Paul

        I’d like to add that it is good to think about the primary index (PI) more as a physical distribution key and not as a logical attribute (you may have a primary key on non-PI columns). The primary index is the key to success or failure for your DWH. If a lot of your big tables have the same primary index and partitioning, then it is most likely that you are using the great power of Teradata.

        So try to find common attributes for large tables (especially if you have joins on those attributes) when you build your data model. These common attributes may be a good choice for the PI (in a lot of cases a much better choice than the PK of the tables).

        Regards
        Aleksei.

  3. Hi Roland,
    thanks for the article.
    I have a design question. I have worked for the last 6 years with Microsoft BI in DWH implementations, and I usually used a surrogate key (an auto-incrementing int or bigint) as the primary (clustered) index for every table in the warehouse. This is supposed to give good join performance, reduce fragmentation problems and make the data portable.
    In some Teradata DWH designs I have seen a mix of surrogate keys and codes. Usually transaction tables use surrogate keys, while reference tables, like customer or product type, use the natural keys, always with the “cd” suffix as naming convention, i.e., custumer_cd, product_type_cd.
    In my old world there was a primary (clustered) index on the primary key of the table. Could you help me map my previous concepts to the Teradata world? What would be a best practice for transaction and reference tables? How are the primary key and the primary index related?
    Thanks in advance and kind regards,
    Paul
