Doing Teradata Hashing The Right Way

Roland Wenzlofsky

April 21, 2023


This article assumes you are familiar with the basic architecture of a Teradata system.

As you know, the AMPs (Access Module Processors) do the work on a Teradata system.

Each AMP manages the rows stored on its own virtual disk. The number of AMPs depends on the size of the system; there may be hundreds of them.

As noted in my article about the Teradata High-Level Architecture, the primary goal is to distribute rows evenly among all AMPs in order to achieve parallelism.

How is even data distribution achieved on Teradata?

It depends on the Primary Index.

The Primary Index and the Primary Key are distinct concepts. The Primary Key belongs to data modeling, whereas the Primary Index is a physical design concept in Teradata that determines how rows are distributed.

Please note that the following paragraph is a simplified example but should provide an adequate understanding of the general concepts.

The Primary Index comprises table columns utilized as input for a highly efficient hashing algorithm. The algorithm’s output designates a responsible AMP that assumes all tasks pertaining to the corresponding row. The designated AMP stores the row on its associated virtual disk and becomes solely accountable for handling it thereafter.
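As a rough illustration of the mapping just described, the following Python sketch hashes a Primary Index value and derives the responsible AMP from the result. Everything in it is an assumption made for the example: `NUM_AMPS`, `row_hash` and `amp_for_row` are hypothetical names, MD5 merely stands in for Teradata's own hashing algorithm, and the modulo step is a simplification of how a hash is assigned to an AMP.

```python
import hashlib

NUM_AMPS = 4  # hypothetical system size, chosen only for this example


def row_hash(pi_value):
    """Stand-in hash of a Primary Index value (NOT Teradata's algorithm)."""
    data = repr(pi_value).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")


def amp_for_row(pi_value):
    """Derive the responsible AMP from the hash (simplified to a modulo)."""
    return row_hash(pi_value) % NUM_AMPS


# 1,000 hypothetical rows whose Primary Index is the first column.
rows = [(customer_id, f"name_{customer_id}") for customer_id in range(1, 1001)]

# Each AMP stores only the rows whose hash points to it.
amps = {amp: [] for amp in range(NUM_AMPS)}
for row in rows:
    amps[amp_for_row(row[0])].append(row)

for amp, stored in amps.items():
    print(f"AMP {amp}: {len(stored)} rows")
```

Because the same Primary Index value always produces the same hash, the sketch can also locate a row later without searching every AMP: `amp_for_row(42)` names the only AMP that can hold the row with the value 42.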

The order in which the columns are passed to the hashing algorithm does not matter. Data types do matter, however: only compatible data types produce the same hash for the same value, while dissimilar data types yield different hashes.
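The two properties can be mimicked in a small Python sketch. It is not Teradata's algorithm: `column_hash` and `row_hash` are hypothetical helpers, MD5 is a stand-in, and combining the column hashes by addition is simply one easy way to make the result independent of column order. The sketch also treats every Python type as incompatible with every other, which is cruder than the data type compatibility mentioned above.

```python
import hashlib


def column_hash(value):
    """Hash one column value together with its type name (illustrative only)."""
    tagged = f"{type(value).__name__}:{value!r}".encode("utf-8")
    return int.from_bytes(hashlib.md5(tagged).digest()[:4], "big")


def row_hash(pi_values):
    """Combine the column hashes by addition, which ignores column order."""
    total = 0
    for value in pi_values:
        total = (total + column_hash(value)) & 0xFFFFFFFF
    return total


# Same values in a different column order: the combined hash is identical.
same_order_result = row_hash((42, "DE")) == row_hash(("DE", 42))

# Same value stored as an integer and as a character string: the hashes differ.
same_type_result = row_hash((42,)) == row_hash(("42",))

print(same_order_result, same_type_result)
```

In this sketch the type tag is what makes the integer 42 and the string "42" hash differently, so rows keyed on one would not land next to rows keyed on the other.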

The data distribution strategy is simple and effective. Each AMP is assigned a portion of the data, and which rows it receives is determined solely by the Primary Index. Every table is spread across the AMPs of the system, and each AMP stores and maintains its own share of each table.

The AMP with the most rows to handle determines the overall response time of any DML statement, so keep this in mind. Distributing rows evenly across all AMPs is the basis for linear scalability and should be the primary focus when designing the physical data model.
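A little arithmetic makes the effect of uneven distribution concrete. The sketch below is a toy model under stated assumptions: every AMP processes rows at the same hypothetical rate (`rows_per_second`), all AMPs work in parallel, and the row counts are invented for illustration. The function names `response_time` and `skew_factor` are not Teradata terms or APIs.

```python
def response_time(rows_per_amp, rows_per_second=1000):
    """All AMPs work in parallel, so the busiest AMP finishes last."""
    return max(rows_per_amp) / rows_per_second


def skew_factor(rows_per_amp):
    """Busiest AMP relative to the average AMP (1.0 means perfectly even)."""
    average = sum(rows_per_amp) / len(rows_per_amp)
    return max(rows_per_amp) / average


# One million rows on four AMPs, distributed evenly and unevenly.
even = [250_000, 250_000, 250_000, 250_000]
skewed = [700_000, 100_000, 100_000, 100_000]

for name, distribution in (("even", even), ("skewed", skewed)):
    seconds = response_time(distribution)
    factor = skew_factor(distribution)
    print(f"{name}: {seconds:.0f} s, skew factor {factor:.1f}")
```

Both tables hold the same number of rows, yet in this model the skewed one takes 2.8 times as long, because three AMPs sit idle while the fourth works through 700,000 rows.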

Teradata’s Primary Index concept allows direct access to data based on a hash value, as in other hash-based lookup schemes. Access via the Primary Index is the fastest way to retrieve rows from disk.

Changing the values of the Primary Index columns of a row changes its hash, so the row usually has to be handed over to a different AMP, which is expensive. It is therefore advisable to avoid updating Primary Index columns.

Primary Indexes are either UNIQUE or non-UNIQUE. Rows that share the same Primary Index values are assigned identical hash values. To tell such rows apart, additional information, a uniqueness value, is added to the hash. All of these rows are still managed by the same AMP.
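The idea of a uniqueness value can be sketched in a few lines of Python. This is a simplification under assumptions: `row_hash` uses MD5 as a stand-in for Teradata's hashing algorithm, `assign_row_ids` is a hypothetical helper, and the uniqueness value is modeled as a plain counter per hash value.

```python
import hashlib
from collections import defaultdict


def row_hash(pi_value):
    """Stand-in hash of a Primary Index value (NOT Teradata's algorithm)."""
    data = repr(pi_value).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")


def assign_row_ids(pi_values):
    """Return (hash, uniqueness) pairs; the counter runs separately per hash."""
    next_uniqueness = defaultdict(int)
    row_ids = []
    for value in pi_values:
        hashed = row_hash(value)
        next_uniqueness[hashed] += 1
        row_ids.append((hashed, next_uniqueness[hashed]))
    return row_ids


# A non-unique Primary Index on a last name: "Smith" occurs three times.
ids = assign_row_ids(["Smith", "Jones", "Smith", "Smith"])
for row_id in ids:
    print(row_id)
```

The three "Smith" rows share one hash, and therefore one AMP, but each pair of hash and uniqueness value identifies exactly one row.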

While the data distribution process involves additional details, the preceding description should suffice for your everyday use of Teradata.

  • Roland Wenzlofsky says:

    Hi. I assume the Primary Index you want to use is not distributing the rows evenly across all AMPs, and you are running out of space on a single AMP or a few AMPs.

  • material.study says:

    Hi,

    Can I have Primary Key columns that differ from the Primary Index? I tried that, and my ETL process fails, complaining about space issues. However, I have ample space, and when I make the Primary Key columns the same as the Primary Index, it works just fine.

    I will appreciate your response or any input.

    Regards
    Nirav
