# The Teradata Primary Index explained in detail

Roland Wenzlofsky

June 2, 2022

## What is the Teradata Primary Index?

The Teradata Primary Index distributes the rows of each table evenly across all AMPs. In simple terms, a hashing algorithm calculates a hash value based on the values of one or more table columns.

Using the Primary Index for each row of a table, an AMP is determined which is responsible for this and takes over all activities for this row. The Primary Index can be UNIQUE or NON-UNIQUE. The most important property of the Primary Index is that it ensures that Teradata can execute all activities in parallel if defined correctly.

We must not confuse the Primary Index with the Primary Key of a table, even if they can be the same. The Primary Index distributes the data, and the Primary Key uniquely identifies an object.

The primary Key and Primary Index are often the same because the Primary Key is unique by definition. Therefore, when used as a Primary Index, a table’s rows are distributed evenly across all AMPs.

Each row inserted into a table is processed by a hashing function. This takes the values of all primary index columns and calculates the so-called ROWHASH, which is presented by a 32-bit INTEGER value. The hashing function is designed so that the order in which the columns are processed does not matter.

hash(a,b) = hash(b,a)

The so-called HASHMAP does the assignment of each ROWHASH to an AMP. Each AMP is an independently working logical unit consisting of CPU, memory, and mass storage. I.e., each AMP processes a particular part of a table.

The ROWHASH on its own is not sufficient to identify a row uniquely. Teradata adds the so-called uniqueness value (a counter) to each row to distinguish rows with the same Primary Index (this can happen for a non-unique Primary index). The AMP needs three pieces of information to identify a row: ROWHASH, Uniqueness Value, and the value itself. In rare cases, different input values to the hash function can produce the same ROWHASH. Together with the actual values, the AMP knows which rows to access.

As long as the Teradata system is not reconfigured (e.g., by an upgrade), the assignment of ROWHASH to AMP remains the same. The same combination of input values is always mapped to the same ROWHASH. However, The hashing algorithm may map different input values to the same ROWHASH. This is called a hash collision and usually harms performance when en masse. However, the probability of this happening is minimal.

I have been asked several times if it makes a difference if the input columns are of type character or integer. Yes, it can impact as character columns can cause many duplicate hash values (or hash collisions). If you are interested in the details I suggest reading our book where you can find all the details:

## Selection of the Teradata Primary Index

Three criteria play a role in selecting the best possible Primary Index.

• A Suitable Access Path
• Even Distribution of Rows across all AMPs
• Low Volatility of the Primary Index Columns

A suitable access path ensures good performance for querying tables and joins. We want to read the minimum amount of data blocks for retrieve steps. For joins, we also want to keep the effort for preparing the join as low as possible. Ideally, both primary index and row partitioning of joined tables are identical because any join preparation is omitted.

The second important criterion for primary index choice is even row distribution to exploit the parallelism as much as possible.

Often we can’t optimize for both mentioned goals at the same time. We may have to design the primary Index for a fast access path, accepting that the data distribution is not ideal, or the other way around. It’s magnificent, as long as we know this fact and can deal with the resulting disadvantages.

Finally, we should keep the volatility of the primary index values to a minimum. When the rows’ primary index value is changed, the column values are sent to the hashing function, and the rows are redistributed to their new AMP. Rehashing can become an expensive operation. Therefore, we should prevent UPDATE statements on the primary index columns as far as possible.

When selecting Primary Index Columns, we must consider that not all data types are allowed. These include BLOB, CLOB, XML, and JSON.

Furthermore, we must consider that different data types produce different hash values, even if they seem identical at first sight. Here is an example:

SELECT HASHROW(‘3’) AS CHARACTER_HASH , HASHROW(3) AS INTEGER_HASH;

Therefore, identical columns in different tables must be of the same data type. We lose the possibility of direct joining via the ROWHASH without redistributing rows first.

Teradata offers by design hash-compatible data types, i.e., different data types that generate the same hash value for the same logical input value. The hashing algorithm processes an input value’s internal, binary representation. All whole-number data types (i.e., BYTEINT, SMALLINT, INTEGER, and BIGINT columns) and fractional-number data types without a fractional part (e.g., a DECIMAL(n,0) column) generate the same hash value. The same holds for the DATE data type since it is encoded as an integer value in Teradata. Here is an example:

The NULL value (no matter the data type), number zero, and zero-length string (i.e., an empty string we often represent as”) hash to the same value:

The character data types CHAR and VARCHAR are hash-compatible, but we must consider spaces. ‘x’ will create a different hash value than ‘x’.

The unique primary Index (UPI) offers the best performance:

• Only one row inside one data block is accessed.
• No spool file is created.
• Only one AMP is needed.

The NUPI offers still good performance but:

• Several rows are accessed, located typically in different data blocks.
• Â Duplicate row checks may occur.
• Several IOs are needed if not all rows are located in one data block.
• Teradata might use spool files.

## Summary

As we have seen, the choice of Primary Index is always a trade-off between a good distribution (avoiding skew) and efficient access (preventing the movement of a massive amount of data from disk to memory). The best distribution is of no use if the rows of both tables are then joined via columns which causes skew. Volatility, on the other hand, should always be avoided.

The perfect primary Index does not exist. Different workloads may require a different primary index for the same table. Teradata also offers a solution: the Single Table Join Index with another Primary Index.

• Virendra says:

Hi Roland,
Came back here after months and saw the UI changed. Pleasant surprise ðŸ™‚
Unable to articulate on “Different workloads may require a different primary index for the same table.”

Regards,
Virendra

• Hi,

The primary key and primary index are two different concepts. Primary keys are used in logical data modeling. A Primary key is used to identify an object uniquely. The primary index is utilized in the physical modeling process. It is used to achieve proper data distribution and to give Teradata a fast data access path. Often it’s the case that the primary key is at the same time a good primary index, as it’s unique and therefore ensures even data distribution.

In my opinion, using a surrogate key for tables such as customers always pays off in the long term. I have seen several times that a new legacy system was introduced (replacing an existing one). If you use surrogate keys, changes on the implementation side are limited to mapping the new natural keys to existing surrogate keys. Without surrogate keys, it can be a challenging task to adjust natural keys in all core databases and data marts (Of course, without a source independent data model, you still will have a hard time)

Best Regards,
Roland

• Hi Roland,
many thanks for the clarification. I will consider the use of surrogate keys for such a tables.

• Aleksei Svitin says:

Hello Paul

I’d like to add that it is good to think about the primary index (PI) more as a physical distribution key and do not think about PI as a logical attribute(you may have a primary key on non-PI columns). The primary index is the key to success or failure for your DWH. If a lot of your big tables have the same primary index and partitioning then it is most likely you use the great power of Teradata.

So try to find common attributes for large tables(especially if you have ions on those attributes) when you build your data model. These common attributes may be a good choice for PI( in a lot of cases it is much better to choose then PK of the tables).

Regards
Aleksei.

• Hi Roland,
thanks for the article.
I have a design question. I have worked last 6 years with Microsoft BI in DWH implementations and I usually used a surrogated key (an auto-incremental int or bigint) as the primary index (clustered) for every table in the warehouse. This supposes to bring a good performance by joining tables, reduce the fragmentation problems and makes the data portable.
In some Teradata DWH designs, I have seen a mix of surrogated keys and codes. Usually, transaction tables use surrogated keys while reference tables, like customer or product type, use the natural keys, always with the “cd” suffix as naming convention, i.e., custumer_cd, product_type_cd.
In my old world, there was a primary (clustered) index on the primary key on the table. Could you help me to map my previous concepts to the Teradata World? What could be the best practice for transactions and reference tables? How is the primary key and the primary index related?
Thanks in advance and kind regards,
Paul