Teradata Join Strategies

When analyzing the execution plan of your queries, you will mainly see one of the join strategies described below. This post shows how each of them works and outlines the advantages and disadvantages of each join strategy.

Data demographics heavily influence the execution plan and the selected join strategy.

 

1. Teradata Merge Join – The Swiss Army Knife

Requirements:

  • The rows to be joined have to be on a common AMP
  • Both spools must be sorted by the ROWHASH calculated over the join column(s)

Process:

  • The ROWHASH of each qualifying row in the left spool is used to look up matching rows with an identical ROWHASH in the right spool (with a binary search, as both spools are sorted by ROWHASH)

Possible Join Preparations required:

  • Re-distribution of one or both spools by ROWHASH, or
  • Duplication of the smaller spool to all AMPs
  • Sorting of one or both spools by ROWHASH

[Figure: Teradata merge join]

The common AMP for the rows of two spools being joined is defined by the join columns. This leaves us with three data distribution scenarios:

1. The Primary Indexes (or other suitable indexes) of both tables match the join columns: No join preparation is needed, as the rows to be joined are already on the common AMP.

2. Only the Primary Index (or another suitable index) of one table matches the join columns: The rows of the second table have to be relocated to the common AMP.

3. Neither table's Primary Index (nor any other suitable index) matches the join columns: The rows of both tables have to be relocated to the common AMP.

Relocation of rows to the common AMP can be done by redistributing the rows by the ROWHASH of the join column(s) or by copying the smaller table to all AMPs.
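The merge join steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not Teradata's actual implementation: both "spools" are sorted by a hash of the join column (join preparation), then merged with two cursors. All table names and data are made-up examples.

```python
# Sketch of a sort-merge join on a hash of the join column.
# Illustrative only; not Teradata's internal code.

def row_hash(value):
    # Stand-in for Teradata's hashing of the join column(s)
    return hash(value)

def merge_join(left, right, key):
    # Join preparation: sort both "spools" by the join-column hash
    left_sorted = sorted(left, key=lambda r: row_hash(r[key]))
    right_sorted = sorted(right, key=lambda r: row_hash(r[key]))

    result, i, j = [], 0, 0
    while i < len(left_sorted) and j < len(right_sorted):
        hl = row_hash(left_sorted[i][key])
        hr = row_hash(right_sorted[j][key])
        if hl < hr:
            i += 1
        elif hl > hr:
            j += 1
        else:
            # Matching hash: combine with every right row sharing it.
            # Real column values are re-checked to guard against collisions.
            j2 = j
            while j2 < len(right_sorted) and row_hash(right_sorted[j2][key]) == hl:
                if left_sorted[i][key] == right_sorted[j2][key]:
                    result.append({**left_sorted[i], **right_sorted[j2]})
                j2 += 1
            i += 1
    return result

customers = [{"cust_id": 1, "name": "A"}, {"cust_id": 2, "name": "B"}]
orders = [{"cust_id": 2, "amount": 10}, {"cust_id": 3, "amount": 20}]
print(merge_join(customers, orders, "cust_id"))
```

Because both inputs are in hash order, each spool is read only once, which is what makes the merge join so broadly useful.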

2. Teradata Hash Join – The Sprinter, but only if executed in FSG Cache

Requirements:

  • The rows to be joined have to be on a common AMP
  • The smaller spool is sorted by the ROWHASH calculated over the join column(s) and kept in the FSG cache
  • The bigger spool stays unsorted

Process:

  • The bigger spool is full-table scanned row by row
  • The ROWHASH of each row in the bigger spool is used to probe the smaller spool (with a binary search)

Possible Join Preparations required:

  • Re-Distribution of the smaller spool by ROWHASH or
  • Duplication of the smaller spool to all AMPs
  • Sorting of the smaller spool

[Figure: Teradata hash join]
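The hash join can be sketched as a build phase over the smaller row set followed by a probe phase over the bigger one. This is a simplified illustration (a Python dict plays the role of the in-memory structure kept in the FSG cache); data and names are made up.

```python
# Sketch of a hash join: build a lookup table over the smaller row set,
# then scan the bigger row set once and probe. Illustrative only.

from collections import defaultdict

def hash_join(small, big, key):
    # Build phase: index the smaller "spool" by join-column hash
    # (in Teradata, this structure is kept in the FSG cache)
    table = defaultdict(list)
    for row in small:
        table[hash(row[key])].append(row)

    # Probe phase: full scan of the bigger spool, one lookup per row
    result = []
    for row in big:
        for match in table.get(hash(row[key]), []):
            if match[key] == row[key]:  # guard against hash collisions
                result.append({**match, **row})
    return result

customers = [{"cust_id": 1, "name": "A"}, {"cust_id": 2, "name": "B"}]
orders = [{"cust_id": 2, "amount": 10}, {"cust_id": 3, "amount": 20}]
print(hash_join(customers, orders, "cust_id"))
```

The sketch shows why the strategy is a "sprinter" only while the lookup table fits in memory: each probe is a cheap in-memory lookup, and the bigger spool is scanned exactly once.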

3. Teradata Nested Join – The Fastest, but scarce

Requirements:

  • Spool 1 allows unique ROWHASH access (a unique index is defined)
  • Spool 2 allows any kind of ROWHASH access (a unique or non-unique index is defined)

Process:

  • The qualifying row of spool 1 is accessed by using the unique index
  • The row is relocated to the AMP owning the matching rows of spool 2
  • The matching rows of spool 2 are accessed by ROWHASH, and each of them is combined with the single row from spool 1

Possible Join Preparations required:

  • None
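The nested join reduces to two index lookups, which a short sketch makes clear. The dicts below stand in for the unique index on spool 1 and the (unique or non-unique) index on spool 2; all names and data are hypothetical.

```python
# Sketch of a nested join: one row is fetched via a unique index,
# then its matches are fetched via an index on the second table.
# Illustrative only; the dicts stand in for Teradata index structures.

def nested_join(unique_index, table2_index, key_value):
    # Unique-index access: at most one qualifying row in "spool 1"
    row1 = unique_index.get(key_value)
    if row1 is None:
        return []
    # Index access on "spool 2": only rows with the matching key are read,
    # so no full table scan is needed anywhere
    return [{**row1, **row2} for row2 in table2_index.get(key_value, [])]

unique_idx = {7: {"cust_id": 7, "name": "C"}}          # unique index on spool 1
orders_idx = {7: [{"cust_id": 7, "amount": 5},          # non-unique index on spool 2
                  {"cust_id": 7, "amount": 8}]}
print(nested_join(unique_idx, orders_idx, 7))
```

Only the rows that actually qualify are ever touched, which is why this is the fastest strategy, and also why it applies so rarely: both indexes must already exist.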

4. Teradata Product Join – The disliked Guest

Requirements:

  • The rows to be joined have to be on a common AMP
  • No spool needs to be sorted!

Process:

  • A full table scan is done on the smaller spool
  • Each qualifying row of spool 1 is compared against each row of spool 2

Possible Join Preparations required:

  • Re-Distribution of one or both spools by ROWHASH or
  • Duplication of the smaller spool

[Figure: Teradata product join]
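The product join is simply two nested loops, as the following sketch shows. Because no sorting or hashing is required, it can evaluate any join predicate, not just equality, which is the one situation where this "disliked guest" is genuinely needed. Data and names are made up.

```python
# Sketch of a product join: every qualifying row of the smaller spool
# is compared against every row of the bigger spool. Illustrative only.

def product_join(small, big, predicate):
    result = []
    for r1 in small:            # full table scan of the smaller spool
        for r2 in big:          # each row compared against every other row
            if predicate(r1, r2):   # any predicate works, e.g. a range condition
                result.append({**r1, **r2})
    return result

bands = [{"band": "low", "lo": 0, "hi": 9},
         {"band": "high", "lo": 10, "hi": 99}]
measures = [{"val": 5}, {"val": 42}]

# A non-equality (range) join that neither merge nor hash join could handle
print(product_join(bands, measures, lambda b, m: b["lo"] <= m["val"] <= b["hi"]))
```

The cost is the row count of one spool multiplied by the other, which is why the optimizer duplicates the smaller spool and why large product joins are usually a warning sign in an execution plan.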


6 COMMENTS

  1. Hello Roland,

I like your articles. They always describe the subject very well, and you draw good parallels with real life.

    I’d like to add something to the article:
    1) One statement contains an inaccuracy:
    >> “The smaller spool is sorted by the ROWHASH calculated over the join column(s) and kept in the FSG cache”
    The smaller row set (“spool” is a slightly incorrect term here) is not actually sorted; instead, a hash table is built by hashing the smaller row set (or part of it, if the hash table does not fit in memory).

    2) >>The bigger spool is full table scanned row by row
    Usually the bigger row set is full-table scanned via cylinder reads and probed row by row.

    3) >>Sorting of the smaller spools
    Sorting is wrong; a hash table is built over the smaller row set.

    4) It would be good to add to the requirements for the hash join that it can only happen on an equality join.

    5) It may be added that if the hash table does not fit in the FSG cache, the smaller table is divided into smaller parts, called Fanouts; a hash table is built for every Fanout, and the larger row set is probed for every Fanout. This multiplies the number of full table scans by the Fanout count, which in turn may dramatically reduce performance.

    6) It would be good to add to the requirements for nested loops that they can only happen on an equality join.

    7) I would also suggest changing the requirements for the nested join according to the documentation:
    “There is a join on a column of the row specified by the first table to any primary index or
    USI of the second table. In rare cases, the index on the second table can be a NUSI. ”

    8) A small notice that a merge join may contain a product join inside.
    I made an example in my comment on the https://www.dwhpro.com/teradata-merge-join-vs-product-join/ article.

    Thank you!

    Best regards,
    Aleksei Svitin.

  2. Roland,

    I really like the way you have described this very important topic. It is one of the best explanations I have seen. Thank you for this brilliant explanation. You have covered the most important topic and all who learn should really appreciate your guidance.

    Thank you again,

    Tom Coffing
