# Doing Teradata Hashing The Right Way

Roland Wenzlofsky

March 12, 2014

As you already know, the AMPs are responsible for doing the work on a Teradata System.

Depending on the size of the system, you will have hundreds of AMPs, each in charge of handling the rows stored on its related virtual disk.

As I pointed out in an article about the Teradata High-Level Architecture, the primary goal is an even distribution of rows across all AMPs. Even distribution ensures parallelism.

How is an even data distribution on Teradata achieved?

It is dependent on the Primary Index.

Please donâ€™t confuse the Primary Index with the term Primary Key! While Primary Key is a term used in data modeling, Primary Index is a scientific concept on Teradata.

Over the following paragraph, be aware that this is a simplified example, but enough for understanding the general concepts.

The Primary Index is a set of table columns used as input to a very efficient hashing algorithm. The output of applying the algorithm to the Primary Index value is assigning aÂ  responsible AMP which will take over all work related to the observed row. The receiving AMP will write the row to its related virtual disk, and so it becomes the only one responsible for handling this row ever after.

As a side note, the order of the columns passed to the hashing algorithm does not matter. Still, data types will possibly matter. There are compatible data types and such which are considered different.

This data distribution strategy is straightforward and efficient. Each AMP holds a portion of the data; the Primary Index exclusively determines the chosen row distribution. Each AMP will maintain records of all tables available on the Teradata System.

Keep in mind that the slowest AMP determines the overall answer time of any DML statement. It will be the one with the most rows to handle. Linear scalability is ensured with an even distribution of rows across all AMPs. Linear scalability is the primary goal in designing the physical data model.

Like hashing algorithms in general, the Primary Index concept on Teradata allows direct access to the data related to a particular hash value. Primary Index access is the fastest way of retrieving rows from the disks.

Please be aware that any change of the Primary Index columns on a row will trigger a handover of this row to another responsible AMP. Avoid changing the PI as it is a costly process.

Primary Indexes can be defined as UNIQUE or non-UNIQUE. Two records with the same content will be assigned the same hash value. To be able to distinguish them, more information will be added, i.e., a uniqueness value. Still, both records are handled by the same AMP.

Although more details are involved in the data distribution process, the above description should be enough for your daily work with Teradata.

• Roland Wenzlofsky says:

HI. I assume the primary index you want to use is not distrubuting the rows evenly across all AMPs and you run out of space on a single AMP or a few AMPs

• material.study says:

Hi,

Can I have Primary key columns different than the Primary Index? I tried doing that and my ETL process is failing to complain about the space issues. But, I have ample space and when I make Primary Key columns the same as Primary Index, it works just fine.

I will appreciate your response or any input.

Regards
Nirav