What is the Teradata Primary Index?
The Teradata Primary Index evenly distributes table rows across AMPs using a hashing algorithm that calculates a value based on one or more table columns.
Each table row is assigned to an AMP based on the hash of its Primary Index, and that AMP handles all operations on the row. The Primary Index can be UNIQUE (UPI) or NON-UNIQUE (NUPI). Its most important property is that, when chosen correctly, it allows Teradata to execute all activities in parallel.
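As a minimal sketch of the two variants (database, table, and column names are made up for illustration), the Primary Index is declared when the table is created:

CREATE TABLE sales.customer
( customer_id   INTEGER NOT NULL
, customer_name VARCHAR(100)
) UNIQUE PRIMARY INDEX (customer_id);          -- UPI: at most one row per index value

CREATE TABLE sales.sales_transaction
( transaction_id INTEGER NOT NULL
, customer_id    INTEGER NOT NULL
, sales_amount   DECIMAL(18,2)
) PRIMARY INDEX (customer_id);                 -- NUPI: many rows may share a value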
The Primary Index and Primary Key of a table are not interchangeable concepts, despite their potential for similarity. While the Primary Index is responsible for distributing data, the Primary Key serves to identify an object uniquely.
The Primary Key and the Primary Index are often identical, because the Primary Key is unique by definition. Used as a Primary Index, it disperses a table’s rows evenly across all AMPs (Access Module Processors).
The Teradata Hashing Algorithm
A hashing function processes every row added to a table by deriving the ROWHASH, a 32-bit INTEGER value, from the values of all primary index columns. The algorithm is designed so that the order in which the columns are processed is inconsequential:
hash(a,b) = hash(b,a)
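If you want to check this property on your own system, HASHROW accepts a list of expressions; swapping the argument order (the values below are arbitrary examples) should return the same hash if the order really is inconsequential:

SELECT HASHROW(1, 'abc') AS hash_ab
     , HASHROW('abc', 1) AS hash_ba;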
The HASHMAP assigns each ROWHASH to its corresponding AMP. An AMP is a self-contained logical unit with its own CPU, memory, and mass storage. Each AMP is therefore responsible for its own portion of every table.
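Teradata exposes this mapping through built-in functions: HASHROW calculates the row hash, HASHBUCKET extracts the hash bucket, and HASHAMP maps the bucket to the owning AMP. A small sketch, reusing the assumed customer table from above:

SELECT customer_id
     , HASHROW(customer_id)                      AS row_hash
     , HASHBUCKET(HASHROW(customer_id))          AS hash_bucket
     , HASHAMP(HASHBUCKET(HASHROW(customer_id))) AS owning_amp
FROM sales.customer;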
The ROWHASH alone cannot guarantee the unique identification of a row. To differentiate rows sharing the same Primary Index value (which can occur with a non-unique Primary Index), Teradata appends a uniqueness value (a running counter) to the ROWHASH; together they form the ROWID. In addition, although uncommon, the hash function may produce the same ROWHASH for different input values. In that case, the AMP compares the actual Primary Index values to determine exactly which rows qualify.
As long as the Teradata system is not reconfigured, the assignment of ROWHASH to AMP remains stable, and the same input values always produce the same ROWHASH. Hash collisions, where different input values are mapped to the same ROWHASH, diminish performance when they occur in large numbers, but the likelihood of a collision is negligible.
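One way to gauge how many hash collisions a candidate column actually produces is to count the distinct values per ROWHASH; a hedged sketch, again using the assumed customer table:

SELECT HASHROW(customer_name)        AS row_hash
     , COUNT(DISTINCT customer_name) AS distinct_values
FROM sales.customer
GROUP BY 1
HAVING COUNT(DISTINCT customer_name) > 1;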
I am frequently asked about the impact of the input column data type, specifically character versus integer. Character columns can lead to numerous hash collisions and duplicate hash values. For a more comprehensive understanding of this topic, I recommend reading our book, which provides in-depth explanations.
Selection of the Teradata Primary Index
Three factors influence the selection of the optimal Primary Index.
- A Suitable Access Path
- Even Distribution of Rows across all AMPs
- Low Volatility of the Primary Index Columns
An efficient access path guarantees optimal performance when querying and joining tables. Our aim is to minimize the number of data blocks read during retrieval and the effort required to prepare joins. Ideally, the primary index and row partitioning of joined tables are identical, which eliminates the need for join preparation.
The second primary index criterion is achieving optimal parallelism through the even distribution of rows.
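A common way to check how evenly a candidate primary index spreads the rows is to group by the owning AMP; this is a sketch using the assumed transaction table from above, and a noticeably larger row count on a single AMP indicates skew:

SELECT HASHAMP(HASHBUCKET(HASHROW(customer_id))) AS owning_amp
     , COUNT(*)                                  AS row_count
FROM sales.sales_transaction
GROUP BY 1
ORDER BY 2 DESC;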
Sometimes, it is not possible to optimize for both goals simultaneously. We may need to prioritize designing the primary index for swift access, even if the data distribution is suboptimal or vice versa. Awareness of this reality and the ability to manage associated drawbacks is crucial.
The third criterion is low volatility: the primary index value of a row should change rarely, if ever. Whenever it changes, the hashing function assigns the row to a different AMP, and this rehashing and redistribution is costly. It is therefore advisable to avoid UPDATE statements on primary index columns whenever feasible.
When choosing Primary Index Columns, we must consider that certain data types, such as BLOB, CLOB, XML, and JSON, are not permissible.
Values that appear identical at first glance can produce distinct hash values if their data types differ. As an illustration:
SELECT HASHROW('3') AS CHARACTER_HASH, HASHROW(3) AS INTEGER_HASH;

CHARACTER_HASH | INTEGER_HASH
9A0F8DF8       | 6D27DAA6
Different data types create a different ROWHASH.
Identical columns across tables must share the same data type to enable direct joining through ROWHASH without the need for row redistribution.
Some of Teradata’s built-in data types are hash-compatible, meaning different data types can generate the same hash value for a given input value. This is because the hashing algorithm operates on the internal binary representation of the input value. Whole-number data types, such as BYTEINT, SMALLINT, INTEGER, and BIGINT columns, and fractional-number data types without a fractional part (e.g., DECIMAL(n,0) columns), produce identical hash values. Similarly, the DATE data type is encoded as an integer value in Teradata, resulting in the same hash value as an INTEGER. For instance:
Data Type | Value | ROWHASH
BYTEINT   | 99    | 9A0F8DF8
SMALLINT  | 99    | 9A0F8DF8
INTEGER   | 99    | 9A0F8DF8
BIGINT    | 99    | 9A0F8DF8
DATE      | 99    | 9A0F8DF8
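These values can be reproduced with HASHROW and explicit casts (a quick check you can run yourself); the whole-number types below all hash the internal value 99 identically, and DATE behaves the same because it is stored as an integer internally:

SELECT HASHROW(CAST(99 AS BYTEINT))       AS byteint_hash
     , HASHROW(CAST(99 AS SMALLINT))      AS smallint_hash
     , HASHROW(CAST(99 AS INTEGER))       AS integer_hash
     , HASHROW(CAST(99 AS BIGINT))        AS bigint_hash
     , HASHROW(CAST(99 AS DECIMAL(10,0))) AS decimal_hash;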
The CHAR and VARCHAR data types are hash-compatible, but trailing spaces have to be kept in mind: the hash value of 'x ' (with a trailing space) is not the same as that of 'x'. The CHARACTER SET, whether LATIN or UNICODE, does not affect the hash value.
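A quick check you can run yourself:

SELECT HASHROW('x')  AS hash_without_space
     , HASHROW('x ') AS hash_with_trailing_space;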
NULL, zero, and the empty string ('') all hash to the same value, regardless of their data type.
Value | ROWHASH
NULL  | 9A0F8CC8
''    | 9A0F8CC8
0     | 9A0F8CC8
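Again, this can be verified directly with HASHROW:

SELECT HASHROW(NULL) AS null_hash
     , HASHROW(0)    AS zero_hash
     , HASHROW('')   AS empty_string_hash;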
Teradata Performance Considerations
The UPI provides optimal performance:
- Only one row inside one data block is accessed.
- No spool file is created.
- Only one AMP is needed.
The NUPI still performs well, however (see the EXPLAIN sketch after this list):
- Several rows are accessed, typically located in different data blocks.
- Duplicate row checks may occur.
- Several IOs are needed if not all rows are located in one data block.
- Teradata might use spool files.
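EXPLAIN shows which access path the optimizer chooses. In this sketch (reusing the assumed customer table and an arbitrary value), a UPI or NUPI retrieval should appear as a single-AMP step "by way of the primary index", whereas filtering on a non-index column forces an all-AMP full-table scan:

EXPLAIN
SELECT *
FROM sales.customer
WHERE customer_id = 12345;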
Summary
Selecting a Primary Index requires balancing even distribution against fast access. Even distribution is of little use if the rows become skewed when they are redistributed for joins on other columns. Volatile primary index columns must likewise be avoided.
There is no such thing as a flawless primary index. A single table may need different primary indexes for different workloads. Teradata addresses this with the Single Table Join Index, which provides an alternative primary index for the same data.
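A minimal sketch of such a single-table join index (names reuse the assumed transaction table from above): it stores a copy of selected columns under an alternative primary index, so queries that access the data by transaction_id stay AMP-local even though the base table is distributed by customer_id:

CREATE JOIN INDEX sales.sales_transaction_ji AS
SELECT transaction_id
     , customer_id
     , sales_amount
FROM sales.sales_transaction
PRIMARY INDEX (transaction_id);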
Hi Roland,
Came back here after months and saw the UI changed. Pleasant surprise 🙂
I could not quite follow "Different workloads may require a different primary index for the same table."
Could you please help with an example?
Regards,
Virendra
Hi,
The primary key and primary index are two different concepts. Primary keys are used in logical data modeling. A Primary key is used to identify an object uniquely. The primary index is utilized in the physical modeling process. It is used to achieve proper data distribution and to give Teradata a fast data access path. Often it’s the case that the primary key is at the same time a good primary index, as it’s unique and therefore ensures even data distribution.
In my opinion, using a surrogate key for tables such as customers always pays off in the long term. I have seen several times that an existing legacy system was replaced with a new one. If you use surrogate keys, changes on the implementation side are limited to mapping the new natural keys to the existing surrogate keys. Without surrogate keys, it can be a challenging task to adjust the natural keys in all core databases and data marts. (Of course, without a source-independent data model, you will still have a hard time.)
Best Regards,
Roland
Hi Roland,
many thanks for the clarification. I will consider the use of surrogate keys for such tables.
Hello Paul
I’d like to add that it is good to think about the primary index (PI) as a physical distribution key rather than as a logical attribute (you may have a primary key on non-PI columns). The primary index is the key to success or failure for your DWH. If many of your big tables share the same primary index and partitioning, you are most likely using the great power of Teradata.
So try to find common attributes for large tables (especially if you have joins on those attributes) when you build your data model. These common attributes may be a good choice for the PI (in many cases a much better choice than the PK of the tables).
Regards
Aleksei.
Thanks Aleksei. Very good input.
Hi Roland,
thanks for the article.
I have a design question. I have worked for the last 6 years with Microsoft BI in DWH implementations, and I usually used a surrogate key (an auto-incrementing INT or BIGINT) as the primary (clustered) index for every table in the warehouse. This is supposed to bring good performance when joining tables, reduce fragmentation problems, and make the data portable.
In some Teradata DWH designs, I have seen a mix of surrogate keys and codes. Usually, transaction tables use surrogate keys while reference tables, like customer or product type, use the natural keys, always with the "cd" suffix as a naming convention, e.g., customer_cd, product_type_cd.
In my old world, there was a primary (clustered) index on the primary key of the table. Could you help me map my previous concepts to the Teradata world? What would be the best practice for transaction and reference tables? How are the primary key and the primary index related?
Thanks in advance and kind regards,
Paul
I just saw that you have a very helpful post about installing Teradata 15.10 Express Edition on your blog. I hope it’s ok for you if I share the link here:
https://hernandezpaul.wordpress.com/2017/01/24/teradata-express-15-10-installation-using-oracle-virtualbox/