Forum

The 4 Principal Ter...
 

The 4 Principal Teradata Join Strategies  

  RSS

Roland Wenzlofsky
(@rwenzlofsky)
Estimable Member
Joined: 6 years ago
Posts: 203
19/11/2014 8:45 pm  

Teradata has various join methods to choose from. Each of these methods places different demands on data distribution. But they all have one thing in common: the rows of both tables must be available for joining on a common AMP.

It is therefore usually necessary to copy the rows of one or both tables. The join method together with the selected type of copying the data is called a Teradata join strategy.

When analyzing the execution plan of your queries, you will mainly see one of the join methods described below.

This article analyzes how every join method works and highlights the advantages and disadvantages.

Data demography has a great influence on the execution plan and the chosen join strategy.

Requirements:

  • The rows to be joined have to be on a common AMP
  • Both spools must be sorted by the ROWID calculated over the join column(s)

Process:

  • The ROWHASH of each qualifying row in the left spool is used to lookup matching rows with identical ROWHASH in the right spool (with a binary search as both spools are sorted by ROWID)

Possible Join Preparations required:

  • Re-Distribution of one or both spools by ROWHASH or
  • Duplication of the smaller spool to all AMPs
  • Sorting of one or both spools by the ROWID

1. Teradata Merge Join - The Swiss Army Knife

teradata merge join

The common AMP of rows from two spools being joined is defined by the join columns. This leaves us with 3 data distribution scenarios:

  1. The Primary Indexes (or any other suitable index) of both tables equals the join columns: No join preparation is needed as the rows to be joined are already on the common AMP

2. Only the Primary Index (or any other suitable index)  of one table matches the join columns: The rows of the second table have to be relocated to the common AMP

3. Neither the Primary Index of the first table (or any other suitable index) nor the Primary Index (or any other suitable index)  of the second table matches the join columns: The rows of both tables have to be relocated to the common AMP

Relocation of rows to the common AMP can be done by redistribution of the rows by the join column(s) ROWHASH or by copying the smaller table to all AMPs.

2. Teradata Hash Join - The Sprinter, but only if executed in FSG Cache

Requirements:

  • The rows to be joined have to be on a common AMP
  • The smaller spool is sorted by the ROWHASH calculated over the join column(s) and kept in the FSG cache
  • The bigger spool stays unsorted

Process:

  • The bigger spool is full table scanned row by row
  • Each ROWID from the bigger spools is searched in the smaller spool (with a binary search)

Possible Join Preparations required:

  • Re-Distribution of the smaller spool by ROWHASH or
  • Duplication of the smaller spool to all AMPs
  • Sorting of the smaller spools

teradata hash join

3. Teradata Nested Join - The Fastest, but scarce

Requirements:

  • Spool 1 allows a unique ROWHASH access (a unique index is defined)
  • Spool 2 allows any kind of ROWHASH access (a unique or not unique is index defined)
  • Nested joins must have an equality join condition

Process:

  • The qualifying row of spool 1 is accessed by the usage of any unique index (UPI, USI). Rarely it can be a NUPI if the optimizer knows that only one row is returned.
  • The row is relocated to the AMPs owning these rows of spool 2.
    A Nested Join is using a few AMPs if the index of Spool 2 is a UPI, NUPI, or USI. It used all-AMPs if the index of Spool 2 is a NUSI.
  • Spool 2 is full table scanned and each row is combined with the one row from Spool 1

Possible Join Preparations required:

  • None

4. Teradata Product Join - The disliked Guest

Requirements:

  • The rows to be joined have to be on the AMP
  • No spool needs to be sorted!

Process:

  • A full table scan is done on the smaller spool and
  • Each qualifying row of spool 1 is compared against each row of spool 2

Possible Join Preparations required:

  • Re-Distribution of one or both spools by ROWHASH or
  • Duplication of the smaller spool

Teradata Quiz
Teradata Quiz
Developer: Roland Wenzlofsky
Price: Free

This topic was modified 2 weeks ago 11 times by DWH Pro Admin
This topic was modified 7 days ago by DWH Pro Admin
This topic was modified 6 days ago by DWH Pro Admin

Quote
Tom Coffing
 Tom Coffing
(@Tom Coffing)
Guest
Joined: 5 years ago
Posts: 3
22/11/2014 3:37 pm  

Roland,

I really like the way you have described this very important topic. It is one of the best explanations I have seen. Thank you for this brilliant explanation. You have covered the most important topic and all who learn should really appreciate your guidance.

Thank you again,

Tom Coffing


ReplyQuote
Roland Wenzlofsky
(@rwenzlofsky)
Estimable Member
Joined: 6 years ago
Posts: 203
22/11/2014 9:21 pm  

Thank you very much


ReplyQuote
Eman Dawod
 Eman Dawod
(@Eman Dawod)
Guest
Joined: 5 years ago
Posts: 1
24/11/2014 11:57 am  

Really it is very useful to me, you have covered the most important topics i want to know,Thank U very much


ReplyQuote
Pallavi Sharda
 Pallavi Sharda
(@Pallavi Sharda)
Guest
Joined: 3 years ago
Posts: 1
19/03/2017 12:01 pm  

what an explanation !! WOW


ReplyQuote
Goutham
 Goutham
(@Goutham)
Guest
Joined: 3 years ago
Posts: 1
20/03/2017 1:11 am  

Really Very useful to me.


ReplyQuote
Aleksei Svitin
(@aleksei-svitin)
New Member
Joined: 3 years ago
Posts: 3
04/04/2017 4:50 pm  

Hello Roland,

I like your articles. They are always described the subject very well and you provide a good parallels with real life.

I'd like to add something to the article:
1) What statement contains some inaccuracy
>> "The smaller spool is sorted by the ROWHASH calculated over the join column(s) and kept in the FSG cache"
Smaller rowset(spool is a bit incorrect term here) is not actually sorted, but a hash table is built based on hashing of smaller row set( of some part of smaller row set if the hash table does not fit in memory).

2) >>The bigger spool is full table scanned row by row
Usually the bigger rowset is full table scanned by cylinder reads and probes row by row

3) >>Sorting of the smaller spools
Sorting is wrong. Building a hash table under a smaller row set

4) It will be good to add into the requirements for hash, what it can be happen on equality join.

5)It may be added what if the hash table does not fit in FSG cache then the smaller table will be divided into some smaller parts, named Fanout, and the hash table will be built for every Fanout and the larger row set will be probed for every Fanout. That fact may multiply count of full table scans depends on Fanout count, what in turn may dramatically reduce performance.

6) It will be good to add into the requirements for nested loops, what it can be happen on equality join.

7) I would also suggest to change requirements for nested join according documentation:
"There is a join on a column of the row specified by the first table to any primary index or
USI of the second table. In rare cases, the index on the second table can be a NUSI. "

8) Small notice that a merge join may contain a product join inside.
I make an example in my comment to https://www.dwhpro.com/teradata-merge-join-vs-product-join/ article.

Thank you!

Best regards,
Aleksei Svitin.


ReplyQuote
Sri Ramanujam
 Sri Ramanujam
(@Sri Ramanujam)
Guest
Joined: 2 years ago
Posts: 1
13/07/2017 4:48 pm  

The 4 Principal Teradata Join Strategies
Hi Roland - Greetings! These concepts are useful. I would appreciate if you could please post or send me few sample queries. I am a Data Scientist/Analyst and I do not get involved at the AMP Level. This is kool. Need your help. Thank you. Regards, Sri


ReplyQuote
Share:
>

Please Login or Register