fbpx

The well-chosen Teradata Primary Index

By Roland Wenzlofsky

September 12, 2020


What is the Teradata Primary Index?

The Teradata primary index is a mechanism that decides where each data row is physically located. It may be based on a single column or multiple columns. The values of the primary index columns may be unique or non-unique.

Don’t confuse the primary index with the primary key of a table. The primary key is a concept of logical data modeling; the primary index is used for defining how table rows are stored.

While the primary key is to identify each object stored in a table uniquely, the main idea behind the primary index choice is to use the parallel system most efficiently.

The Hashing Algorithm

Each row that is inserted into a table has to pass the hashing function. The hashing function calculates the so-called row hash from the primary index columns. The input order of the primary index columns doesn’t matter:

f(a,b) = f(b,a)

The data rows are distributed based on the calculated row hash. The so-called hash map assigns each row hash to one specific AMP. Each AMP has its mass storage assigned and is responsible for storing and retrieving a portion of a table’s data rows.

The same input values to the hashing function will always be assigned to the same AMP (as long as the system configuration is not changed). Sometimes different combinations of primary index columns map to the same row hash value called “hash collision.” Hash collisions can negatively impact performance.

The Teradata Primary Index Choice 

We have to consider three criteria when selecting the primary index:

    • A Good Access Path
    • Even Distribution of Rows across all AMPs
    • Low Volatility of the Primary Index Columns

A right access path means to achieve optimal retrieve and join performance:
In the retrieve steps, the primary index is the most efficient way to pick up data rows. Join steps are fast if the join columns of both tables are the same, making them an excellent primary index candidate.

The second important criterion for primary index choice is even row distribution, to use the parallel architecture optimally.

Often we can’t optimize for both mentioned goals at the same time. We may have to design the primary index for a fast access path, accepting that the data distribution is not ideal. It’s perfectly fine, as long as we are aware of this fact and know how to deal with the resulting issues.

Finally, the volatility of the primary index values should be kept to a minimum. When the rows’ primary index value is changed, the column values are sent to the hashing function, and the rows are re-distributed to their new AMP. Rehashing can become an expensive operation.

Performance Considerations

The unique primary index (UPI) offers the best performance:

  • Only one row inside one data block is accessed.
  • No spool file is created.
  • Only one AMP is needed.

The NUPI offers still good performance but:

  • Several rows are accessed typically in different data blocks.
  •  Duplicate row checked may occur.
  • Several I/Os are needed if not all rows fit into one data block.
  • Spool files might be used.

Summary

Ideally, we would like to have a non-volatile primary index with an even row distribution and a fast access path (indexed access) – but:

The “Perfect Primary Index” does not exist. Different workloads may require a different primary index for the same table.

Teradata Primary Index Documentation

__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"62516":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"62516":{"val":"var(--tcb-skin-color-0)"}},"gradients":[]},"original":{"colors":{"62516":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"b4fbe":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"b4fbe":{"val":"rgb(241, 99, 52)"}},"gradients":[]},"original":{"colors":{"b4fbe":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
Previous Article
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"b4fbe":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"b4fbe":{"val":"rgb(241, 99, 52)"}},"gradients":[]},"original":{"colors":{"b4fbe":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
Next Article
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"62516":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"62516":{"val":"rgb(255, 0, 0)"}},"gradients":[]}}]}__CONFIG_colors_palette__
GET OUR TERADATA BOOK

Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.

  • Hi Roland,
    Came back here after months and saw the UI changed. Pleasant surprise 🙂
    Unable to articulate on “Different workloads may require a different primary index for the same table.”
    Could you please help with an example.

    Regards,
    Virendra

  • Hi,

    The primary key and primary index are two different concepts. Primary keys are used in logical data modeling. A Primary key is used to identify an object uniquely. The primary index is utilized in the physical modeling process. It is used to achieve proper data distribution and to give Teradata a fast data access path. Often it’s the case that the primary key is at the same time a good primary index, as it’s unique and therefore ensures even data distribution.

    In my opinion, using a surrogate key for tables such as customers always pays off in the long term. I have seen several times that a new legacy system was introduced (replacing an existing one). If you use surrogate keys, changes on the implementation side are limited to mapping the new natural keys to existing surrogate keys. Without surrogate keys, it can be a challenging task to adjust natural keys in all core databases and data marts (Of course, without a source independent data model, you still will have a hard time)

    Best Regards,
    Roland

    • Hi Roland,
      many thanks for the clarification. I will consider the use of surrogate keys for such a tables.

      • Avatar
        Aleksei Svitin says:

        Hello Paul

        I’d like to add that it is good to think about the primary index (PI) more as a physical distribution key and do not think about PI as a logical attribute(you may have a primary key on non-PI columns). The primary index is the key to success or failure for your DWH. If a lot of your big tables have the same primary index and partitioning then it is most likely you use the great power of Teradata.

        So try to find common attributes for large tables(especially if you have ions on those attributes) when you build your data model. These common attributes may be a good choice for PI( in a lot of cases it is much better to choose then PK of the tables).

        Regards
        Aleksei.

  • Hi Roland,
    thanks for the article.
    I have a design question. I have worked last 6 years with Microsoft BI in DWH implementations and I usually used a surrogated key (an auto-incremental int or bigint) as the primary index (clustered) for every table in the warehouse. This supposes to bring a good performance by joining tables, reduce the fragmentation problems and makes the data portable.
    In some Teradata DWH designs, I have seen a mix of surrogated keys and codes. Usually, transaction tables use surrogated keys while reference tables, like customer or product type, use the natural keys, always with the “cd” suffix as naming convention, i.e., custumer_cd, product_type_cd.
    In my old world, there was a primary (clustered) index on the primary key on the table. Could you help me to map my previous concepts to the Teradata World? What could be the best practice for transactions and reference tables? How is the primary key and the primary index related?
    Thanks in advance and kind regards,
    Paul

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >