How to Join Partitioned Tables with an Unusual Trick for Improved Performance

Joining partitioned tables can be resource-intensive, particularly if the primary indexes or partitioning differ. This article presents an unconventional technique for achieving significant performance gains. While not a common practice, it showcases the benefits of creative problem-solving. The method demonstrated is effective for tables with matching primary indexes and partitioning. Experimentation is encouraged to determine its suitability for other configurations.

The test setup includes two tables (each with approximately 50K rows), as depicted below:

CREATE MULTISET TABLE DWHPro.Sales2 , NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
SalesId INTEGER NOT NULL,
Quantity DECIMAL(18,2),
SalesDate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( SalesId )
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-31' EACH INTERVAL '1' DAY , NO RANGE);
CREATE MULTISET TABLE DWHPro.Sales3, NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
SalesId INTEGER NOT NULL,
Quantity DECIMAL(18,2),
SalesDate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( SalesId )
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-31' EACH INTERVAL '1' DAY , NO RANGE);

For our initial query, we retrieve all partitions within the table. Removing the WHERE condition would yield a nearly identical execution plan.

SELECT
t01.SalesId,
t02.SalesId
FROM Sales2 t01
INNER JOIN
Sales3 t02
ON
t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate
WHERE t01.SALESDATE BETWEEN DATE'1900-01-01' AND DATE'2016-12-31';

The plan indicates that both tables are redistributed and hashed based on the join columns. As a merge join is utilized, sorting by ROWHASH is required.

1) First, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.t02.
2) Next, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.t01.
3) We lock DWHPRO.t02 for read, and we lock DWHPRO.t01 for read.
4) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from 42734 partitions of
DWHPRO.t02 with a condition of ("(DWHPRO.t02.SalesDate >=
DATE '1900-01-01') AND (DWHPRO.t02.SalesDate <= DATE
'2016-12-31')") into Spool 2 (all_amps), which is
redistributed by the hash code of (DWHPRO.t02.SalesId,
DWHPRO.t02.SalesDate) to all AMPs. Then we do a SORT to
order Spool 2 by row hash. The size of Spool 2 is estimated
with high confidence to be 42,734 rows (897,414 bytes). The
estimated time for this step is 0.34 seconds.
2) We do an all-AMPs RETRIEVE step from 42734 partitions of
DWHPRO.t01 with a condition of ("(DWHPRO.t01.SalesDate <=
DATE '2016-12-31') AND (DWHPRO.t01.SalesDate >= DATE
'1900-01-01')") into Spool 3 (all_amps), which is
redistributed by the hash code of (DWHPRO.t01.SalesId,
DWHPRO.t01.SalesDate) to all AMPs. Then we do a SORT to
order Spool 3 by row hash. The size of Spool 3 is estimated
with high confidence to be 42,734 rows (897,414 bytes). The
estimated time for this step is 0.34 seconds.
5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
RowHash match scan, which is joined to Spool 3 (Last Use) by way
of a RowHash match scan. Spool 2 and Spool 3 are joined using a
merge join, with a join condition of ("(SalesId = SalesId) AND
(SalesDate = SalesDate)"). The result goes into Spool 1
(group_amps), which is built locally on the AMPs. The size of
Spool 1 is estimated with low confidence to be 42,734 rows (
1,239,286 bytes). The estimated time for this step is 0.31
seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.66 seconds.

Now for the trick: we’ll remove the last date from the WHERE condition, specifically ‘2016-12-31’.

SELECT
t01.SalesId,
t02.SalesId
FROM Sales2 t01
INNER JOIN
Sales3 t02
ON
t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate
WHERE t01.SALESDATE BETWEEN DATE'1900-01-01' AND DATE'2016-12-30';

The EXPLAIN PLAN shows that the Optimizer utilizes a rowkey-based merge join without incurring any expensive row redistribution or sorting.

1) First, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.t02.
2) Next, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.t01.
3) We lock DWHPRO.t02 for read, and we lock DWHPRO.t01 for read.
4) We do an all-AMPs JOIN step from 42733 partitions of DWHPRO.t02 by
way of a RowHash match scan with a condition of (
"(DWHPRO.t02.SalesDate >= DATE '1900-01-01') AND
(DWHPRO.t02.SalesDate <= DATE '2016-12-30')"), which is joined to
42733 partitions of DWHPRO.t01 by way of a RowHash match scan with
a condition of ("(DWHPRO.t01.SalesDate <= DATE '2016-12-30') AND
(DWHPRO.t01.SalesDate >= DATE '1900-01-01')"). DWHPRO.t02 and
DWHPRO.t01 are joined using a rowkey-based merge join, with a join
condition of ("(DWHPRO.t01.SalesId = DWHPRO.t02.SalesId) AND
(DWHPRO.t01.SalesDate = DWHPRO.t02.SalesDate)"). The result goes
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 42,734
rows (1,239,286 bytes). The estimated time for this step is 0.28
seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.28 seconds.

This would have to be accompanied by enormous savings in resources. But since we want to select all days, we need to extend our query. I added the missing day as a test using the UNION ALL statement. Again we compare the query that causes a merge join with our query using a rowkey-based merge join and added day using UNION ALL:

SELECT -- Standard merge join
t01.SalesId,
t02.SalesId
FROM Sales2 t01
INNER JOIN
Sales3 t02
ON
t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate
WHERE t01.SALESDATE BETWEEN DATE'1900-01-01' AND DATE'2016-12-31';
SELECT -- Forcing the Optimizer into the rowkey-based merge join
t01.SalesId,
t02.SalesId
FROM Sales2 t01
INNER JOIN
Sales3 t02
ON
t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate
WHERE t01.SALESDATE BETWEEN DATE'1900-01-01' AND DATE'2016-12-30'
UNION ALL
SELECT
t01.SalesId,
t02.SalesId
FROM Sales2 t01
INNER JOIN
Sales3 t02
ON
t01.SalesId = t02.SalesId AND t01.SalesDate = t02.SalesDate
WHERE t01.SALESDATE = DATE'2016-12-31';

As expected, the rowkey-based merge join is much cheaper and has a much better run time:

 Logical IOsCPU SecondsSpool (GB)
RowKey Based Merge Join91,000,020,10
Traditional Merge Join272,000,241,80
Reduction in Resources67%92%94%

Conclusion:

This trick is useful for retrieving all partitions from two PPI tables that have been joined together. By omitting a WHERE condition that covers the partitions, we can prompt the Optimizer to perform a more cost-effective rowkey-based merge join instead of a conventional one.

Further testing revealed that the Optimizer does not use a rowkey-based merge join for very small tables. Equally, this technique may prove unnecessary for very large tables, as the Optimizer will automatically opt for the rowkey-based merge join. It is worth noting that altering the SQL statement can result in substantial performance improvements, highlighting the fact that optimization is an inherently creative process.

Related Services

⚡ Need Help Optimizing Your Data Platform?

We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.

Explore Our Services →

📋 Considering a Move From Teradata?

Get a personalized migration roadmap in 2 minutes. We have migrated billions of rows from Teradata to Snowflake, Databricks, and more.

Free Migration Assessment →

📊 Data Platform Migration Survey

Help us map where the industry is heading. Results are public — see what others chose.

1. What is your current data platform?

2. Where are you migrating to (or evaluating)?

Migrating FROM
Migrating TO

Thanks for voting! Share this with your network.

Follow me on LinkedIn for daily insights on data warehousing and platform migrations.

Stay Ahead in Data Warehousing

Get expert insights on Teradata, Snowflake, BigQuery, Databricks, Microsoft Fabric, and modern data architecture — delivered to your inbox.

3 thoughts on “How to Join Partitioned Tables with an Unusual Trick for Improved Performance”

Leave a Comment

DWHPro

Expert network for enterprise data platforms. Senior consultants, project teams built for your challenge — across Teradata, Snowflake, Databricks, and more.

📍Vienna, Austria & Jacksonville, Florida

Quick Links
Services Team Teradata Book Blog Contact Us
Connect
LinkedIn → [email protected]
Newsletter

Join 4,000+ data professionals.
Weekly insights on Teradata, Snowflake & data architecture.