Teradata employs various join methods and techniques to merge the rows of two tables into a single AMP, which is vital for merging. The join technique and geography combination is the join strategy to reduce resource consumption (CPU seconds, IOs). Some join methods, like the Teradata Merge Join, have highly efficient algorithms, but preparing them can be expensive, particularly when merging tables with distinct primary indexes or partitioning. Other join methods are less effective algorithmically but do not incur any costs for join preparation, such as join geography.
We should consider the following costs in the Join strategy:
- The number of required comparisons between the rows of both tables.
- How often must each data block of both tables be loaded from the disk into the main memory during the join process?
- The amount of data must be copied by duplicating or redistributing between AMPs.
- The number of rows of both tables that need to be sorted (if any).
The merge join is highly efficient, with minimal comparisons required after sorting the rows on the common AMP. This applies only when both tables have identical primary indices and partitions; otherwise, redistributing and sorting one or both tables may be necessary based on the specific situation. For row partitioned tables, it may be necessary to convert one or both to non-partitioned tables or partition non-partitioned tables. These preparation steps for the join can be quite expensive.
The product join is less costly to prepare than other types of joins. However, it requires many comparisons because every row in one table must be compared to another. Generally, the only preparation required for a product join is duplicating the smaller two tables to all AMPs. Rows do not need to be sorted and can be compared without delay.
The cost of the product increases when a small table is erroneously identified as a large table because of absent or outdated statistics. This may cause the diminutive table to exceed the primary memory capacity, requiring Teradata to repeatedly retrieve data blocks from the disk into the primary memory.
Recapping the cost features of the product join:
- High costs for comparison
- Low costs for join preparation
To summarize the cost characteristics of the merge join:
- No costs (if the primary index and partitioning of both tables are the same) to enormous costs for join preparation
- Low costs for comparison
Summarizing once again the cost characteristics of the merge join:
Let’s proceed to our test scenario to demonstrate that the vilified product join isn’t always incorrect. When an execution plan involves a product join, it’s essential to consider the following factors:
- Is it an unwanted product join, e.g., because of a wrong alias or a forgotten join condition?
- Is at least one of the two tables tiny?
If one of the two tables is small, a product join may perform better than a merge join. Remember that Teradata can also apply dynamic partition elimination to product joins.
The Teradata Merge Join
Here is a query that can be resolved with a merge join:
SELECT Customer.ISO, Product.Desc
FROM Customer INNER JOIN Product ON Customer.ProductCd = Product.ProductCd;
4)We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with no residual conditions into Spool 2, redistributed by hash code to all AMPs. Then, we do a SORT to order Spool 2 by row hash. The result spool file will not be cached in memory. The size of Spool 2 is estimated with low confidence to be 10,000,100 rows. The estimated time for this step is 30 minutes and 12 seconds.
5) We do an all-AMPs JOIN step from Customer by way of a RowHash match scan with no residual conditions, which is joined to Spool 2 (Last Use). Product and Spool 2 are joined using a merge join, with a join condition of ("Product.ProductCd=ProductCd"). The result goes into Spool 1, which is built locally on the AMPs. The result spool file will not be cached in memory. The size of Spool 1 is estimated with low confidence to be 10,000,000 rows. The estimated time for this step is 11 minutes and 20 seconds.
The Execution Plan shows that the join preparation will take around 30 minutes, followed by an estimated 11 minutes for the join operation. These estimations align with the query’s actual runtime. It’s important to note that redistributing and sorting all table rows is necessary to ensure the primary index of each table aligns, or the rows are on a common AMP.
The Teradata Product Join
We improve the query by adding a WHERE condition, prompting the optimizer to utilize a product join.
SELECT Customer.ISO, Product.Desc
FROM Customer INNER JOIN Product ON Customer.ProductCd = Product.ProductCd
WHERE Product.Price <= 1000;
4) We do an all-AMPs RETRIEVE step from Product by way of an all-rows scan with a condition of ("Product.Price <= 1000") into Spool 2, which is duplicated on all AMPs. The size of Spool 2 is estimated with high confidence to be 100 rows. The estimated time for this step is 0.15 seconds.
5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an all-rows scan, which is joined to Customer. Spool 2 and Customer are joined using a product join, with a join condition of ("ProductCd = Customer.ProductCd"). The input table Customer will not be cached in memory, but it is eligible for synchronized scanning. The result goes into Spool 1, which is built locally on the AMPs. The size of Spool 1 is estimated with low confidence to be 800,000 rows. The estimated time for this step is 18 minutes and 17 seconds.
The revised text would be: The updated Execution Plan projects a sub-one-second duration for copying the small table to all AMPs. The join is estimated to take approximately 18 minutes. These projections align closely with the actual runtimes.
The Product Join performs significantly better than the Merge Join.
The WHERE condition facilitates a product join in our intriguing example. However, the optimizer may choose to duplicate or disseminate the smaller table across all available AMPs, sort it, and execute a merge join via a rowhash match scan. Nonetheless, this method incurs greater expense in comparison to joining the unsorted table. If Teradata were to join the tables through columns that lack primary indexing, the optimizer would be compelled to sort both tables, which proves exceedingly costly for large tables.
Summary
While the sight of a product join may be alarming for some, it should not be automatically deemed unfavorable. It is worth noting that this join type is universally applicable, unlike others. In certain cases, such as when OR conditions are present in the join, the optimizer may have no option but to utilize a product join.
If you’re interested in joining techniques and tactics, I suggest reading this article as well:
To learn more about product joins, we suggest referring to the official Teradata documentation.
Thank you very much, Aleksei.
This was an excellent hint: Sometimes, caused by biased data demographics, a product join is hidden behind a merge join. This was a great example.
hi sir, can you please let me know the best way to self join same table multiple times based on different filter in the join clause. Ex: joining to country table multiple times on country_id (just for example sake provided)
on 1st two joins doing “We do an All-AMPs JOIN step from Spool 434 by way of an all-rows scan, which is joined to table abc. Spool xyz and table abc are left outer joined using a merge join . The result goes into Spool xyz1, which is redistributed by hash code to all AMPs. This step ends a parallel block of steps.”
but
its just doing “We do an All-AMPs RETRIEVE step from table abc by way of an all-rows scan into Spool 123, which is duplicated on all AMPs.” on the 3rd join and this duplication is taking long time
Hello.
Thank you for the article.
I have some small addition: a merge join may contain a product join inside.
Let’s assume that we have the same non-UniquePrimary Index( Store_id and good_id) for 2 different tables.
Each good may be sold many times in some stores. Let’s assume that it is sold 1000 times in some store but on different days(which is not a part of PI). All 1000 records will have the same rowhash, stored on the same APM, and will have a different uniques value in its rowid.
If we try to “make” a merge join of two tables then Teradata do real merge join algorithm between different row hashes(for example if some store is presented in one table but does not present in second then the rowhashes is not matched and that store will be skipped) but it has to make the product join between rows shared the same rowhash value(so for some particular pair of Store_id and Good_id all rows of the first table will be product joined to the corresponding rows from the second table).
This is the reason why the performance of merge join may be decreased in some cases.
Another popular reason is a different partitioning between joined tables but it not a theme of this article.
Best regards,
Aleksei Svitin.