Understanding the Teradata Primary Index: Distribution, Hashing Algorithm, and Performance Considerations

Roland Wenzlofsky

April 23, 2023

minutes reading time

What is the Teradata Primary Index?

The Teradata Primary Index evenly distributes table rows across AMPs using a hashing algorithm that calculates a value based on one or more table columns.

An AMP is designated to oversee all corresponding operations by utilizing a Primary Index assigned to each table row. This index may be classified as UNIQUE or NON-UNIQUE. The most crucial attribute of the Primary Index is its ability to facilitate parallel execution of all activities by Teradata, provided it is configured correctly.

The Primary Index and Primary Key of a table are not interchangeable concepts, despite their potential for similarity. While the Primary Index is responsible for distributing data, the Primary Key serves to identify an object uniquely.

The Primary Key and Primary Index are often identical, as the former is inherently unique. This means that a table’s rows will be evenly dispersed across all Advanced Microprocessors (AMPs) when employed as a Primary Index.

The Teradata Hashing Algorithm

A hashing function processes every row added to a table by deriving the ROWHASH, represented by a 32-bit INTEGER value, from the values of all primary index columns. Its design allows for the order of column processing to be inconsequential.

hash(a,b) = hash(b,a)

The HASHMAP assigns each ROWHASH to its corresponding AMP. An AMP is a self-contained logical unit that comprises a CPU, memory, and mass storage. Thus, each AMP is responsible for processing a specific table segment.

The ROWHASH alone cannot guarantee the unique identification of a row. To differentiate rows sharing the same Primary Index (which can occur with a non-unique Primary Index), Teradata appends a uniqueness value (a counter) to each row. To identify a row, three components are required: ROWHASH, Uniqueness Value, and the corresponding value. Although uncommon, the hash function may produce identical ROWHASH values for different inputs. The AMP can accurately determine which rows to access by utilizing the actual values.

If the Teradata system is not reconfigured, the assignment of ROWHASH to AMP will remain consistent. The same input values will always correspond to the same ROWHASH. Nevertheless, the hashing algorithm may occasionally lead to hash collisions, where different input values are mapped to the same ROWHASH, thereby diminishing performance when executed in large numbers. Despite this, the likelihood of a hash collision occurring is negligible.

I have been frequently asked about the impact of input column types, specifically character versus integer. It is worth noting that character columns can lead to numerous hash collisions and duplicate hash values. For a more comprehensive understanding of this topic, I recommend reading our book, which provides in-depth explanations.

Selection of the Teradata Primary Index

Three factors influence the selection of the optimal Primary Index.

A Suitable Access Path
Even Distribution of Rows across all AMPs
Low Volatility of the Primary Index Columns

An efficient access path guarantees optimal performance when querying and joining tables. Our aim is to minimize the amount of data blocks read during retrieval. Moreover, we strive to minimize the effort required for preparing joins. Ideally, joined tables’ primary index and row partitioning should be identical to eliminate the need for join preparation.

The second primary index criterion is achieving optimal parallelism through the even distribution of rows.

Sometimes, it is not possible to optimize for both goals simultaneously. We may need to prioritize designing the primary index for swift access, even if the data distribution is suboptimal or vice versa. Awareness of this reality and the ability to manage associated drawbacks is crucial.

To minimize primary index value volatility, avoiding altering the primary index value of rows is crucial. If changes are made, the hashing function redistributes the rows to a different AMP, which can be a costly rehashing process. As a result, it’s advisable to limit UPDATE statements on primary index columns whenever feasible.

When choosing Primary Index Columns, we must consider that certain data types, such as BLOB, CLOB, XML, and JSON, are not permissible.

Different data types can result in distinct hash values, despite appearing identical upon initial inspection. As an illustration:

SELECT HASHROW(‘3’) AS CHARACTER_HASH , HASHROW(3) AS INTEGER_HASH;

CHARACTER_HASH	INTEGER_HASH
9A0F8DF8	6D27DAA6

Different data types create different ROWHASH

Identical columns across tables must share the same data type to enable direct joining through ROWHASH without the need for row redistribution.

Some of Teradata’s built-in data types are hash-compatible, meaning different data types can generate the same hash value for a given input value. This is because the hashing algorithm operates on the internal binary representation of the input value. Whole-number data types, such as BYTEINT, SMALLINT, INTEGER, and BIGINT columns, and fractional-number data types without a fractional part (e.g., DECIMAL(n,0) columns), produce identical hash values. Similarly, the DATE data type is encoded as an integer value in Teradata, resulting in the same hash value as an INTEGER. For instance:

Data Type	Value	ROWHASH
BYTEINT	99	9A0F8DF8
SMALLINT	99	9A0F8DF8
INTEGER	99	9A0F8DF8
BIGINT	99	9A0F8DF8
DATE	99	9A0F8DF8

The CHAR and VARCHAR data types are compatible with hashing, but keeping spaces in mind is important. The hash value of ‘x ‘ with a space will not be the same as that of ‘x’. The CHARACTER SET, whether LATIN or UNICODE, does not affect the hash value.

The values of NULL, zero, and an empty string (represented as “”) all hash to the same value regardless of their data type.

Data Type	ROWHASH
NULL	9A0F8CC8
“	9A0F8CC8
0	9A0F8CC8

Teradata Performance Considerations

The UPI provides optimal performance.

Only one row inside one data block is accessed.
No spool file is created.
Only one AMP is needed.

The NUPI continues to perform well, however:

Several rows are accessed, typically located in different data blocks.
Duplicate row checks may occur.
Several IOs are needed if not all rows are located in one data block.
Teradata might use spool files.

Summary

Selecting a Primary Index requires balancing equitable distribution and expedient access. Even distribution is futile if it results in skew during the joining of rows via columns. Conversely, it is imperative to steer clear of volatility.

There is no such thing as a flawless primary index. A single table can necessitate distinct primary indexes for various workloads. Teradata addresses this issue by providing the Single Table Join Index in conjunction with an alternative primary index.

Teradata Primary Index Documentation

Take a brief survey and receive the book for free Get instant access

Share0

Tweet0

Share0

Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.

Hi Roland,
Came back here after months and saw the UI changed. Pleasant surprise 🙂
Unable to articulate on “Different workloads may require a different primary index for the same table.”
Could you please help with an example.

Regards,
Virendra

Hi,

The primary key and primary index are two different concepts. Primary keys are used in logical data modeling. A Primary key is used to identify an object uniquely. The primary index is utilized in the physical modeling process. It is used to achieve proper data distribution and to give Teradata a fast data access path. Often it’s the case that the primary key is at the same time a good primary index, as it’s unique and therefore ensures even data distribution.

In my opinion, using a surrogate key for tables such as customers always pays off in the long term. I have seen several times that a new legacy system was introduced (replacing an existing one). If you use surrogate keys, changes on the implementation side are limited to mapping the new natural keys to existing surrogate keys. Without surrogate keys, it can be a challenging task to adjust natural keys in all core databases and data marts (Of course, without a source independent data model, you still will have a hard time)

Best Regards,
Roland

Paul Hernandez says:
at
Hi Roland,
many thanks for the clarification. I will consider the use of surrogate keys for such a tables.
Reply
- Aleksei Svitin says:
  at
  Hello Paul
  I’d like to add that it is good to think about the primary index (PI) more as a physical distribution key and do not think about PI as a logical attribute(you may have a primary key on non-PI columns). The primary index is the key to success or failure for your DWH. If a lot of your big tables have the same primary index and partitioning then it is most likely you use the great power of Teradata.
  So try to find common attributes for large tables(especially if you have ions on those attributes) when you build your data model. These common attributes may be a good choice for PI( in a lot of cases it is much better to choose then PK of the tables).
  Regards
  Aleksei.
  Reply
  - DWH Pro Admin says:
    at
    Thanks Aleksei. Very good input.
    Reply

Hi Roland,
thanks for the article.
I have a design question. I have worked last 6 years with Microsoft BI in DWH implementations and I usually used a surrogated key (an auto-incremental int or bigint) as the primary index (clustered) for every table in the warehouse. This supposes to bring a good performance by joining tables, reduce the fragmentation problems and makes the data portable.
In some Teradata DWH designs, I have seen a mix of surrogated keys and codes. Usually, transaction tables use surrogated keys while reference tables, like customer or product type, use the natural keys, always with the “cd” suffix as naming convention, i.e., custumer_cd, product_type_cd.
In my old world, there was a primary (clustered) index on the primary key on the table. Could you help me to map my previous concepts to the Teradata World? What could be the best practice for transactions and reference tables? How is the primary key and the primary index related?
Thanks in advance and kind regards,
Paul

Roland Wenzlofsky says:
at
I just saw that you have a very helpful post about installing Teradata 15.10 Express Edition on your blog. I hope it’s ok for you if I share the link here:
https://hernandezpaul.wordpress.com/2017/01/24/teradata-express-15-10-installation-using-oracle-virtualbox/
Reply

Understanding the Teradata Primary Index: Distribution, Hashing Algorithm, and Performance Considerations

What is the Teradata Primary Index?

The Teradata Hashing Algorithm

Selection of the Teradata Primary Index

Teradata Performance Considerations

Summary

Roland Wenzlofsky

Fast multi-file export of Teradata query results using only Teradata SQL Assistant

The Teradata AMP Worker Task

Boost Your Teradata Performance – The Critical Role of NOT NULL Declarations

Optimizing Teradata SQL Queries by Avoiding Full Table Scans and Utilizing Secondary Indexes