What is the Teradata Primary Index?
The Teradata Primary Index distributes the rows of each table evenly across all AMPs. In simple terms, a hashing algorithm calculates a hash value based on the values of one or more table columns.
Using the Primary Index for each row of a table, an AMP is determined which is responsible for this and takes over all activities for this row. The Primary Index can be UNIQUE or NON-UNIQUE. The most important property of the Primary Index is that it ensures that Teradata can execute all activities in parallel if defined correctly.
We must not confuse the Primary Index with the Primary Key of a table, even if they can be the same. The Primary Index distributes the data, and the Primary Key uniquely identifies an object.
The primary Key and Primary Index are often the same because the Primary Key is unique by definition. Therefore, when used as a Primary Index, a table’s rows are distributed evenly across all AMPs.
The Teradata Hashing Algorithm
Each row inserted into a table is processed by a hashing function. This takes the values of all primary index columns and calculates the so-called ROWHASH, which is presented by a 32-bit INTEGER value. The hashing function is designed so that the order in which the columns are processed does not matter.
hash(a,b) = hash(b,a)
The so-called HASHMAP does the assignment of each ROWHASH to an AMP. Each AMP is an independently working logical unit consisting of CPU, memory, and mass storage. I.e., each AMP processes a particular part of a table.
The ROWHASH on its own is not sufficient to identify a row uniquely. Teradata adds the so-called uniqueness value (a counter) to each row to distinguish rows with the same Primary Index (this can happen for a non-unique Primary index). The AMP needs three pieces of information to identify a row: ROWHASH, Uniqueness Value, and the value itself. In rare cases, different input values to the hash function can produce the same ROWHASH. Together with the actual values, the AMP knows which rows to access.
As long as the Teradata system is not reconfigured (e.g., by an upgrade), the assignment of ROWHASH to AMP remains the same. The same combination of input values is always mapped to the same ROWHASH. However, The hashing algorithm may map different input values to the same ROWHASH. This is called a hash collision and usually harms performance when en masse. However, the probability of this happening is minimal.
I have been asked several times if it makes a difference if the input columns are of type character or integer. Yes, it can impact as character columns can cause many duplicate hash values (or hash collisions). If you are interested in the details I suggest reading our book where you can find all the details:
Selection of the Teradata Primary Index
Three criteria play a role in selecting the best possible Primary Index.
- A Suitable Access Path
- Even Distribution of Rows across all AMPs
- Low Volatility of the Primary Index Columns
A suitable access path ensures good performance for querying tables and joins. We want to read the minimum amount of data blocks for retrieve steps. For joins, we also want to keep the effort for preparing the join as low as possible. Ideally, both primary index and row partitioning of joined tables are identical because any join preparation is omitted.
The second important criterion for primary index choice is even row distribution to exploit the parallelism as much as possible.
Often we can’t optimize for both mentioned goals at the same time. We may have to design the primary Index for a fast access path, accepting that the data distribution is not ideal, or the other way around. It’s magnificent, as long as we know this fact and can deal with the resulting disadvantages.
Finally, we should keep the volatility of the primary index values to a minimum. When the rows’ primary index value is changed, the column values are sent to the hashing function, and the rows are redistributed to their new AMP. Rehashing can become an expensive operation. Therefore, we should prevent UPDATE statements on the primary index columns as far as possible.
When selecting Primary Index Columns, we must consider that not all data types are allowed. These include BLOB, CLOB, XML, and JSON.
Furthermore, we must consider that different data types produce different hash values, even if they seem identical at first sight. Here is an example:
SELECT HASHROW(‘3’) AS CHARACTER_HASH , HASHROW(3) AS INTEGER_HASH;
Therefore, identical columns in different tables must be of the same data type. We lose the possibility of direct joining via the ROWHASH without redistributing rows first.
Teradata offers by design hash-compatible data types, i.e., different data types that generate the same hash value for the same logical input value. The hashing algorithm processes an input value’s internal, binary representation. All whole-number data types (i.e., BYTEINT, SMALLINT, INTEGER, and BIGINT columns) and fractional-number data types without a fractional part (e.g., a DECIMAL(n,0) column) generate the same hash value. The same holds for the DATE data type since it is encoded as an integer value in Teradata. Here is an example:
The character data types CHAR and VARCHAR are hash-compatible, but we must consider spaces. ‘x’ will create a different hash value than ‘x’. It makes no difference whether CHARACTER SET LATIN or UNICODE is used. The hash is the same.
The NULL value (no matter the data type), number zero, and zero-length string (i.e., an empty string we often represent as”) hash to the same value:
Teradata Performance Considerations
The unique primary Index (UPI) offers the best performance:
- Only one row inside one data block is accessed.
- No spool file is created.
- Only one AMP is needed.
The NUPI offers still good performance but:
- Several rows are accessed, located typically in different data blocks.
- Duplicate row checks may occur.
- Several IOs are needed if not all rows are located in one data block.
- Teradata might use spool files.
As we have seen, the choice of Primary Index is always a trade-off between a good distribution (avoiding skew) and efficient access (preventing the movement of a massive amount of data from disk to memory). The best distribution is of no use if the rows of both tables are then joined via columns which causes skew. Volatility, on the other hand, should always be avoided.
The perfect primary Index does not exist. Different workloads may require a different primary index for the same table. Teradata also offers a solution: the Single Table Join Index with another Primary Index.