Doing Teradata Hashing The Right Way
This article assumes that you are already familiar with the general architecture of a Teradata System.
As you should already know, the AMPs are responsible for doing the work on a Teradata System.
Depending on the size of the system, you may have hundreds of AMPs, each one in charge of handling the rows stored on its own virtual disk.
As I already pointed out in an article about the Teradata High-Level Architecture, the primary goal is an even distribution of rows across all AMPs. Even distribution ensures parallelism.
How is an even data distribution on Teradata achieved?
It is dependent on the Primary Index.
Please don’t confuse the Primary Index with the term Primary Key! While Primary Key is a term used in logical data modeling, the Primary Index is a physical design concept on Teradata.
Be aware that the following paragraph is a simplified description, but it is enough for understanding the general concepts.
The Primary Index is a set of table columns whose values are used as input to a very efficient hashing algorithm. The output of applying the algorithm to the Primary Index value determines the AMP responsible for the row. The receiving AMP writes the row to its virtual disk and from then on is the only AMP that handles this row.
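To make the idea more tangible, here is a minimal sketch of hash-based row placement. It is only an illustration: the MD5-based `row_hash` function, the modulo mapping in `amp_for`, and the system size `NUM_AMPS` are my own assumptions and not Teradata's actual hashing algorithm or internal mapping.

```python
import hashlib

NUM_AMPS = 4  # hypothetical system size, chosen for illustration


def row_hash(pi_value):
    """Stand-in 32-bit hash (MD5-based); NOT Teradata's actual algorithm."""
    digest = hashlib.md5(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def amp_for(pi_value, num_amps=NUM_AMPS):
    """Map a Primary Index value to a toy AMP number via modulo."""
    return row_hash(pi_value) % num_amps


# The same Primary Index value always lands on the same AMP.
for customer_id in (1001, 1002, 1003, 1001):
    print(customer_id, "-> AMP", amp_for(customer_id))
```

The important property is determinism: the value 1001 is sent to the same toy AMP both times, so the system always knows where to find the row again.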
As a side note, the order of the columns passed to the hashing algorithm does not matter. Data types, however, can matter: some data types are compatible and hash equal values to the same result, while others are treated as different.
This data distribution strategy is straightforward and efficient. Each AMP holds a portion of the data, and the chosen Primary Index exclusively determines how the rows are distributed. Each AMP will hold rows of all tables available on the Teradata System.
Keep in mind that the slowest AMP determines the overall response time of any DML statement. Usually, it will be the one with the most rows to handle. Linear scalability is ensured with an even distribution of rows across all AMPs, and linear scalability is the primary goal in designing the physical data model.
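The effect of the Primary Index choice on distribution can be sketched with a small simulation. Again, this is a toy model under my own assumptions: the MD5-based hash, the 8 AMPs, the 10,000 rows, and the two example Primary Indexes (a unique number versus a two-valued gender code) are hypothetical and only serve to show the principle.

```python
import hashlib
from collections import Counter

NUM_AMPS = 8  # hypothetical system size


def amp_for(pi_value, num_amps=NUM_AMPS):
    """Toy placement: MD5-based stand-in hash, NOT Teradata's algorithm."""
    digest = hashlib.md5(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def rows_per_amp(pi_values):
    """Count how many rows each toy AMP receives."""
    return Counter(amp_for(v) for v in pi_values)


def skew_ratio(counts, num_amps=NUM_AMPS):
    """Rows on the busiest AMP divided by the average rows per AMP."""
    average = sum(counts.values()) / num_amps
    return max(counts.values()) / average


unique_pi = range(10_000)                                   # one value per row
gender_pi = ["M" if i % 2 else "F" for i in range(10_000)]  # two values only

even = rows_per_amp(unique_pi)
skewed = rows_per_amp(gender_pi)

print("unique PI: AMPs used =", len(even), "ratio =", round(skew_ratio(even), 2))
print("gender PI: AMPs used =", len(skewed), "ratio =", round(skew_ratio(skewed), 2))
```

With only two distinct Primary Index values, at most two toy AMPs receive any rows, so the busiest AMP carries several times the average load while the others sit idle. A ratio close to 1 is what you are aiming for.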
Like hashing algorithms in general, the Primary Index concept on Teradata allows direct access to the data related to a particular hash value. Primary Index access is the fastest way of retrieving rows from the disks.
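The difference between Primary Index access and access via any other column can be sketched as follows. This is a simplified model under my own assumptions: each toy AMP is a Python dictionary, the hash is an MD5-based stand-in, and the table with `cust_id` and `city` is hypothetical.

```python
import hashlib

NUM_AMPS = 4  # hypothetical system size


def amp_for(pi_value):
    """Toy placement: MD5-based stand-in hash, NOT Teradata's algorithm."""
    digest = hashlib.md5(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AMPS


# Each toy AMP stores its rows in a dict keyed by the Primary Index value.
amps = [dict() for _ in range(NUM_AMPS)]
for cust_id in range(1, 101):
    city = "Vienna" if cust_id % 2 else "Graz"
    amps[amp_for(cust_id)][cust_id] = {"cust_id": cust_id, "city": city}


def lookup_by_pi(cust_id):
    """Hash the value, go to exactly one AMP, fetch the row."""
    amp_no = amp_for(cust_id)
    return amps[amp_no].get(cust_id), 1  # (row, number of AMPs touched)


def lookup_by_city(city):
    """No hash available for this column: every AMP must scan its rows."""
    hits = [row for amp in amps for row in amp.values() if row["city"] == city]
    return hits, len(amps)  # (rows, number of AMPs touched)


row, amps_touched = lookup_by_pi(42)
print(row, "AMPs touched:", amps_touched)

rows, amps_touched = lookup_by_city("Graz")
print(len(rows), "rows, AMPs touched:", amps_touched)
```

In this sketch, the lookup by Primary Index value touches a single toy AMP, while the lookup on a non-indexed column has to visit all of them.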
Please be aware that any change to the Primary Index column values of a row changes its hash value and will usually trigger a handover of this row to another responsible AMP. Avoid updating Primary Index columns, as this is a costly process.
Primary Indexes can be defined as UNIQUE or non-UNIQUE. With a non-unique Primary Index, two rows with the same Primary Index value will be assigned the same hash value. To be able to distinguish them, more information is added to the hash value, namely a uniqueness value. Still, both rows are handled by the same AMP.
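How a uniqueness value keeps rows with the same hash value apart can be sketched like this. The `ToyAmp` class, the MD5-based hash, and the simple per-hash counter are hypothetical simplifications of my own and do not describe Teradata's internal storage format.

```python
import hashlib
from collections import defaultdict


def row_hash(pi_value):
    """Stand-in 32-bit hash (MD5-based); NOT Teradata's actual algorithm."""
    digest = hashlib.md5(repr(pi_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class ToyAmp:
    """Toy AMP that identifies each row by (row hash, uniqueness value)."""

    def __init__(self):
        self.last_uniqueness = defaultdict(int)  # per row hash counter
        self.rows = {}                           # row id -> row

    def insert(self, pi_value, row):
        h = row_hash(pi_value)
        self.last_uniqueness[h] += 1             # next free uniqueness value
        row_id = (h, self.last_uniqueness[h])
        self.rows[row_id] = row
        return row_id


amp = ToyAmp()
id1 = amp.insert("Smith", {"name": "Smith", "city": "Vienna"})
id2 = amp.insert("Smith", {"name": "Smith", "city": "Graz"})

print(id1[0] == id2[0])  # same row hash
print(id1[1], id2[1])    # different uniqueness values
```

Both rows share the same hash part because they have the same Primary Index value, yet each one gets its own identifier through the counter.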
Although there are more details involved in the data distribution process, the above description should be enough for your daily work with Teradata.