Teradata Join Strategy: The Benefits And Costs Of Merge And Product Join Methods

Teradata employs various join methods and techniques to merge the rows of two tables onto a single AMP, which is essential for joining. The combination of join technique and data geography is called the join strategy, and its goal is to reduce resource consumption (CPU seconds, I/Os). Some join methods, like the Teradata Merge Join, have highly efficient algorithms, but preparing them can be expensive, particularly when merging tables with different primary indexes or partitioning. Other join methods are less efficient algorithmically but do not incur any preparation costs, such as those related to join geography.

We should consider the following costs in the join strategy:

The number of required comparisons between the rows of both tables.
How often must each data block of both tables be loaded from disk into main memory during the join process?
The amount of data that must be copied by duplicating or redistributing between AMPs.
The number of rows of both tables that need to be sorted (if any).

The merge join is highly efficient, with minimal comparisons required after sorting the rows on the common AMP. This applies only when both tables have identical primary indices and partitions; otherwise, redistributing and sorting one or both tables may be necessary depending on the specific situation. For row-partitioned tables, it may be necessary to convert one or both to non-partitioned tables, or to partition non-partitioned tables. These preparation steps for the join can be quite expensive.

The product join is less costly to prepare than other types of joins. However, it requires many comparisons because every row in one table must be compared to every row in the other. Generally, the only preparation required for a product join is duplicating the smaller of the two tables to all AMPs. Rows do not need to be sorted and can be compared immediately.

The cost of the product join increases when a small table is erroneously identified as a large table due to absent or outdated statistics. This may cause the smaller table to exceed the available memory, requiring Teradata to repeatedly retrieve data blocks from disk into main memory.

Recapping the cost characteristics of the product join:

High costs for comparison
Low costs for join preparation

To summarize the cost characteristics of the merge join:

No costs (if the primary index and partitioning of both tables are the same) to very high costs for join preparation
Low costs for comparison

Let’s proceed to our test scenario to demonstrate that the product join is not always the wrong choice. When an execution plan involves a product join, it’s essential to consider the following factors:

Is it an unwanted product join, e.g., because of a wrong alias or a forgotten join condition?
Is at least one of the two tables very small?

If one of the two tables is small, a product join may perform better than a merge join. Remember that Teradata can also apply dynamic partition elimination to product joins.

Table of Contents

The Teradata Merge Join

Here is a query that can be resolved with a merge join:

SELECT Customer.ISO, Product.Desc
FROM Customer INNER JOIN Product ON Customer.ProductCd = Product.ProductCd;

4)We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with no residual conditions into Spool 2, redistributed by hash code to all AMPs. Then, we do a SORT to order Spool 2 by row hash. The result spool file will not be cached in memory. The size of Spool 2 is estimated with low confidence to be 10,000,100 rows. The estimated time for this step is 30 minutes and 12 seconds.
5) We do an all-AMPs JOIN step from Customer by way of a RowHash match scan with no residual conditions, which is joined to Spool 2 (Last Use). Product and Spool 2 are joined using a merge join, with a join condition of ("Product.ProductCd=ProductCd"). The result goes into Spool 1, which is built locally on the AMPs. The result spool file will not be cached in memory. The size of Spool 1 is estimated with low confidence to be 10,000,000 rows. The estimated time for this step is 11 minutes and 20 seconds.

The execution plan shows that join preparation will take around 30 minutes, followed by an estimated 11 minutes for the join operation itself. These estimates align with the query’s actual runtime. It is important to note that redistributing and sorting all table rows is necessary to ensure the primary index of each table aligns, so the rows end up on a common AMP.

The Teradata Product Join

We improve the query by adding a WHERE condition, prompting the optimizer to use a product join.

SELECT Customer.ISO, Product.Desc
FROM Customer INNER JOIN Product ON Customer.ProductCd = Product.ProductCd
WHERE Product.Price <= 1000;

4) We do an all-AMPs RETRIEVE step from Product by way of an all-rows scan with a condition of ("Product.Price <= 1000") into Spool 2, which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 100 rows. The estimated time for this step is 0.15 seconds.

5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to Customer. Spool 2 and Customer are joined using a product join, with a join condition of ("ProductCd = Customer.ProductCd"). The input table Customer will not be cached in memory, but it is eligible for synchronized scanning. The result goes into Spool 1, which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 800,000 rows. The estimated time for this step is 18 minutes and 17 seconds.

The updated execution plan projects a sub-one-second duration for copying the small table to all AMPs. The join is estimated to take approximately 18 minutes. These projections align closely with the actual runtimes.

The Product Join performs significantly better than the Merge Join.

The WHERE condition facilitates a product join in this example. However, the optimizer may instead choose to duplicate or redistribute the smaller table across all available AMPs, sort it, and execute a merge join via a rowhash match scan. This method incurs greater expense than joining the unsorted table. If Teradata were to join the tables through columns that lack a primary index, the optimizer would be forced to sort both tables, which is exceedingly costly for large tables.

Summary

While seeing a product join in an execution plan may be alarming, it should not automatically be deemed unfavorable. It is worth noting that this join type is universally applicable, unlike others. In certain cases, such as when OR conditions are present in the join, the optimizer may have no option but to use a product join.

If you’re interested in join techniques and tactics, I suggest reading this article as well:

Introduction to Teradata Nested Join: The Cheapest Join Method Using Indexes

To learn more about product joins, we suggest referring to the official Teradata documentation.

Official Teradata Docs: The Product Join

Related Services

⚡ Need Help Optimizing Your Data Platform?

We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.

Explore Our Services →

📋 Considering a Move From Teradata?

Get a personalized migration roadmap in 2 minutes. We have migrated billions of rows from Teradata to Snowflake, Databricks, and more.

Free Migration Assessment →

3 thoughts on “Teradata Join Strategy: The Benefits and Costs of Merge and Product Join Methods”

Roland Wenzlofsky

03/28/2017 at 6:43 pm

Thank you very much, Aleksei.

This was an excellent hint: Sometimes, caused by biased data demographics, a product join is hidden behind a merge join. This was a great example.
- Radnor
  
  11/01/2021 at 6:04 am
  
  hi sir, can you please let me know the best way to self join same table multiple times based on different filter in the join clause. Ex: joining to country table multiple times on country_id (just for example sake provided)
  
  on 1st two joins doing “We do an All-AMPs JOIN step from Spool 434 by way of an all-rows scan, which is joined to table abc. Spool xyz and table abc are left outer joined using a merge join . The result goes into Spool xyz1, which is redistributed by hash code to all AMPs. This step ends a parallel block of steps.”
  but
  its just doing “We do an All-AMPs RETRIEVE step from table abc by way of an all-rows scan into Spool 123, which is duplicated on all AMPs.” on the 3rd join and this duplication is taking long time
Aleksei Svitin

03/27/2017 at 8:07 pm

Hello.

Thank you for the article.

I have some small addition: a merge join may contain a product join inside.
Let’s assume that we have the same non-UniquePrimary Index( Store_id and good_id) for 2 different tables.
Each good may be sold many times in some stores. Let’s assume that it is sold 1000 times in some store but on different days(which is not a part of PI). All 1000 records will have the same rowhash, stored on the same APM, and will have a different uniques value in its rowid.

If we try to “make” a merge join of two tables then Teradata do real merge join algorithm between different row hashes(for example if some store is presented in one table but does not present in second then the rowhashes is not matched and that store will be skipped) but it has to make the product join between rows shared the same rowhash value(so for some particular pair of Store_id and Good_id all rows of the first table will be product joined to the corresponding rows from the second table).

This is the reason why the performance of merge join may be decreased in some cases.
Another popular reason is a different partitioning between joined tables but it not a theme of this article.

Best regards,
Aleksei Svitin.

Teradata Join Strategy: The Benefits and Costs of Merge and Product Join Methods

The Teradata Merge Join

The Teradata Product Join

Summary

📊 Data Platform Migration Survey

Stay Ahead in Data Warehousing

3 thoughts on “Teradata Join Strategy: The Benefits and Costs of Merge and Product Join Methods”

Leave a Comment Cancel reply