The Teradata Traditional Merge Join
- The rows to be joined have to be located on the same AMP
- Both spools have to be sorted by the ROWID calculated over the join column(s)
Possible Join Preparations required:
- Re-Distribution of one or both spools by ROWHASH or
- Duplication of the smaller spool to all AMPs
- Sorting of one or both spools by the ROWID
Two different algorithms can be used for the Merge Join:
1. The Fast Path Algorithm
The comparison takes place alternating from both sides starting wit the left table. The algorithm tries to join rows with matching rowhash. If there is no match, the pointer is positioned on the row with the next highest rowhash value, and the comparison continues until all rows have been compared.
This method is used if the left and right table are full table scanned:
2. The Slow Path Algorithm
This algorithm reads each row from the left table and tries to match it against rows with the same rowhash from the right table.
This method is used if the left table is accessed via an index:
Traditional Merge Joins have a significant advantage over other join types which don't require both tables to be sorted by rowhash of the join columns:
Each data block of both tables is accessed exactly once (the algorithm slides down on each of the rowhash sorted tables), and it is, therefore, less sensitive to the size of available FSG cache.
The Teradata Sliding Window Merge Join
The sliding window merge join is an advancement of the traditional merge join. After introducing the feature of row partitioning, it was required to find an algorithm which allows joining a row partitioned table (PPI table) with
- Another PPI table having different partition characteristics
- A Non-partitioned table
The optimizer has the possibility to change a PPI table into an NPPI table and vice versa. A traditional merge join can follow this step. Nevertheless, to avoid this join preparation step, a sliding window merge join can be executed without the requirement of the restructuring of tables.
As opposed to non-partitioned tables (NPPI tables), which are only sorted by rowhash, PPI tables are sorted on each AMP by two levels:
1. Each row is placed into its assigned row partition.
2. Within each row partition, the rows are sorted by rowhash (the same way the rows of an NPPI table are stored).
The Sliding Window Merge Join was designed to be able to join directly
- An NPPI table with a PPI table or
- Two PPI tables with different partitioning
Directly means without changing a PPI table into an NPPI table or the other way around.
To understand the sliding window join process, one has to remember how PPI table rows are stored:Data rows are sorted by rowhash within the data blocks, but several row partitions can hold the same rowhash value!
The easiest way of merge joining an NPPI table against a PPI table seems by joining the NPPI table against each PPI table partition. This approach would be a reasonable solution because the NPPI table, and each partition of the PPI table, are sorted by rowhash allowing to execute a binary search within the data blocks of each partition.
Nevertheless, for performance reasons, Teradata implements a slightly different, but faster, algorithm:
The AMP reads the first data block from the NPPI table and one data block per partition from the PPI table. The rows are joined (binary search), and the algorithm moves down both tables by reading data block after data block from both tables.
The data block of the NPPI tables stays in FSG cache as long as data blocks from any partition of the PPI tables can be matched. If no more rows can be matched, the next data block of the NPPI tables is moved into the FSG cache and above described process is repeated, until the last data blocks of each table and partition are reached.
This process requires each data block of each table to be touched exactly once (similar to a traditional merge join).
This process could theoretically result in a similar join performance, as we can reach with a traditional merge join, but there is one restriction degrading performance: The available FSG cache memory.
If the FSG cache is not big enough to hold at least one data block from each PPI table partition, the optimizer has to split the join process into so-called windows.
Assume for example, that the PPI tables consist of 4 partitions, but there is only space in FSG cache for the first data block of 2 partitions. In this case, the process would define two windows, each one consisting of 2 partitions:
- Join the NPPI table against the first two partitions
- Join the NPPI table against the remaining two partitions
As a result, the NPPI table has to be read twice, and the join becomes costly. The more windows are needed, the more expensive the join becomes. If the NPPI table is small enough, caching effects could decrease the negative performance impact of having several windows.
The sliding window merge join has a similar performance pattern like a traditional merge join if most of the partitions can be eliminated before the join takes place. The best case scenario is when there is sufficient FSG cache available to join all partitions at once. This kind of setup is called single window merge join.
The join of two PPI tables with different partitioning is implemented similarly to the join between an NPPI and a PPI table:
The left table and the right table rows are split into windows (each window containing a part of the row partitions), and each window from the left table is joined against each window from the right table.
Here is an example: If the join needs to create five windows from the left PPI table and two windows from the right PPI table, this will result in a product join of 5*2 windows: