92% less CPU Usage with a Simple Trick – Teradata PPI Tables

Joining partitioned tables can be quite expensive, especially if the partitioning doesn't match or one of the tables is not partitioned. Today I discovered an interesting trick that reduces resource usage when joining two tables with matching Primary Indexes and partitioning.

The test setup consists of the following two tables (each containing about 50,000 rows):

CREATE MULTISET TABLE DWHPro.Sales2, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
(
    SalesId INTEGER NOT NULL,
    Quantity DECIMAL(18,2),
    SalesDate DATE FORMAT 'YYYY-MM-DD'
)
PRIMARY INDEX (SalesId)
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-31' EACH INTERVAL '1' DAY, NO RANGE);

CREATE MULTISET TABLE DWHPro.Sales3, NO FALLBACK,
     NO BEFORE JOURNAL,
     NO AFTER JOURNAL,
     CHECKSUM = DEFAULT,
     DEFAULT MERGEBLOCKRATIO
(
    SalesId INTEGER NOT NULL,
    Quantity DECIMAL(18,2),
    SalesDate DATE FORMAT 'YYYY-MM-DD'
)
PRIMARY INDEX (SalesId)
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-31' EACH INTERVAL '1' DAY, NO RANGE);
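The article does not show how the tables were loaded. Here is a minimal, hypothetical load sketch (my own assumption, not the original script) that produces roughly one row per calendar day, which matches the row counts in the plans below; the SalesId and Quantity values are generated arbitrarily:

-- Hypothetical load (assumption, not from the article): one row per day,
-- generated from the Teradata system calendar.
INSERT INTO DWHPro.Sales2
SELECT
    ROW_NUMBER() OVER (ORDER BY calendar_date)   -- assumed surrogate key
  , CAST(RANDOM(1, 1000) AS DECIMAL(18,2))       -- assumed dummy quantity
  , calendar_date
FROM sys_calendar.CALENDAR
WHERE calendar_date BETWEEN DATE '1900-01-01' AND DATE '2016-12-31';

-- The second table gets identical rows so that the join finds matches
INSERT INTO DWHPro.Sales3
SELECT * FROM DWHPro.Sales2;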

First, we execute the following query, whose WHERE condition covers all partitions (see the DDLs above). We could also leave out the WHERE condition and would get the same result:

SELECT
    t01.SalesId,
    t02.SalesId
FROM Sales2 t01
INNER JOIN Sales3 t02
    ON t01.SalesId = t02.SalesId
   AND t01.SalesDate = t02.SalesDate
WHERE t01.SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-31';

The execution plan below shows that both tables are redistributed by the row hash of (SalesId, SalesDate), and a subsequent merge join is done:

1) First, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.t02.
2) Next, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.t01.
3) We lock DWHPRO.t02 for read, and we lock DWHPRO.t01 for read.
4) We execute the following steps in parallel.
1) We do an all-AMPs RETRIEVE step from 42734 partitions of
DWHPRO.t02 with a condition of ("(DWHPRO.t02.SalesDate >=
DATE '1900-01-01') AND (DWHPRO.t02.SalesDate <= DATE
'2016-12-31')") into Spool 2 (all_amps), which is
redistributed by the hash code of (DWHPRO.t02.SalesId,
DWHPRO.t02.SalesDate) to all AMPs. Then we do a SORT to
order Spool 2 by row hash. The size of Spool 2 is estimated
with high confidence to be 42,734 rows (897,414 bytes). The
estimated time for this step is 0.34 seconds.
2) We do an all-AMPs RETRIEVE step from 42734 partitions of
DWHPRO.t01 with a condition of ("(DWHPRO.t01.SalesDate <=
DATE '2016-12-31') AND (DWHPRO.t01.SalesDate >= DATE
'1900-01-01')") into Spool 3 (all_amps), which is
redistributed by the hash code of (DWHPRO.t01.SalesId,
DWHPRO.t01.SalesDate) to all AMPs. Then we do a SORT to
order Spool 3 by row hash. The size of Spool 3 is estimated
with high confidence to be 42,734 rows (897,414 bytes). The
estimated time for this step is 0.34 seconds.
5) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
RowHash match scan, which is joined to Spool 3 (Last Use) by way
of a RowHash match scan. Spool 2 and Spool 3 are joined using a
merge join, with a join condition of ("(SalesId = SalesId) AND
(SalesDate = SalesDate)"). The result goes into Spool 1
(group_amps), which is built locally on the AMPs. The size of
Spool 1 is estimated with low confidence to be 42,734 rows (
1,239,286 bytes). The estimated time for this step is 0.31
seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.66 seconds.

In the next step, we change the query by moving the upper bound of the WHERE condition back by one day (from '2016-12-31' to '2016-12-30'), so that the last partition is no longer covered:

SELECT
    t01.SalesId,
    t02.SalesId
FROM Sales2 t01
INNER JOIN Sales3 t02
    ON t01.SalesId = t02.SalesId
   AND t01.SalesDate = t02.SalesDate
WHERE t01.SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-30';

The resulting explain plan reveals that the Optimizer switches to a rowkey-based merge join, avoiding the costly row redistribution:

1) First, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.t02.
2) Next, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.t01.
3) We lock DWHPRO.t02 for read, and we lock DWHPRO.t01 for read.
4) We do an all-AMPs JOIN step from 42733 partitions of DWHPRO.t02 by
way of a RowHash match scan with a condition of (
"(DWHPRO.t02.SalesDate >= DATE '1900-01-01') AND
(DWHPRO.t02.SalesDate <= DATE '2016-12-30')"), which is joined to
42733 partitions of DWHPRO.t01 by way of a RowHash match scan with
a condition of ("(DWHPRO.t01.SalesDate <= DATE '2016-12-30') AND
(DWHPRO.t01.SalesDate >= DATE '1900-01-01')"). DWHPRO.t02 and
DWHPRO.t01 are joined using a rowkey-based merge join, with a join
condition of ("(DWHPRO.t01.SalesId = DWHPRO.t02.SalesId) AND
(DWHPRO.t01.SalesDate = DWHPRO.t02.SalesDate)"). The result goes
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 42,734
rows (1,239,286 bytes). The estimated time for this step is 0.28
seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.28 seconds.

Afterwards, I compared the resource usage of the following two statements. Both deliver the same result, but the first uses the standard merge join, while the second is designed to "force" the Optimizer into the rowkey-based merge join:

SELECT -- Standard merge join
    t01.SalesId,
    t02.SalesId
FROM Sales2 t01
INNER JOIN Sales3 t02
    ON t01.SalesId = t02.SalesId
   AND t01.SalesDate = t02.SalesDate
WHERE t01.SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-31';
SELECT -- Forcing the Optimizer into the rowkey-based merge join
    t01.SalesId,
    t02.SalesId
FROM Sales2 t01
INNER JOIN Sales3 t02
    ON t01.SalesId = t02.SalesId
   AND t01.SalesDate = t02.SalesDate
WHERE t01.SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-30'

UNION ALL

SELECT
    t01.SalesId,
    t02.SalesId
FROM Sales2 t01
INNER JOIN Sales3 t02
    ON t01.SalesId = t02.SalesId
   AND t01.SalesDate = t02.SalesDate
WHERE t01.SalesDate = DATE '2016-12-31';
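The article doesn't state how the resource figures below were collected. Assuming DBQL query logging is enabled on the system, a comparison along these lines would deliver the numbers (a sketch only, using the standard DBC.QryLogV view):

-- Sketch (assumption): read the DBQL figures for the two test statements
SELECT
    QueryText
  , TotalIOCount                                     -- logical I/Os
  , AMPCPUTime                                       -- CPU seconds
  , SpoolUsage / (1024 * 1024 * 1024.00) AS SpoolGB  -- spool in GB
FROM DBC.QryLogV
WHERE UserName = USER
  AND QueryText LIKE '%Sales2%'
ORDER BY StartTime DESC;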

As expected, the rowkey-based merge join is much cheaper and has a much better runtime:

                            Logical IOs   CPU Seconds   Spool (GB)
RowKey-Based Merge Join           91.00          0.02         0.10
Traditional Merge Join           272.00          0.24         1.80
Reduction in Resources              67%           92%          94%
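For reference, the reduction percentages are computed as (traditional - rowkey) / traditional; for CPU this is (0.24 - 0.02) / 0.24, roughly 92%, which is the figure in the title.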

Conclusion:

This can be a great trick whenever you want to retrieve all partitions of two PPI tables being joined together. If the WHERE condition covers all partitions (or is left out entirely), you may end up with a traditional merge join. The rewrite shown above lets you force the Optimizer into the much cheaper rowkey-based merge join.

While I don't understand why the Optimizer chooses a standard merge join when we select all partitions, this article shows that there is a lot of potential for optimizing joins by manually rewriting your queries!

UPDATE: It looks like the Optimizer avoids the rowkey-based merge join in some cases if the tables are very small. There seems to be an Optimizer rule along the lines of "if the tables are small and all partitions are retrieved, go for a traditional merge join." This trick is probably not needed for big tables, as the Optimizer will choose the rowkey-based merge join by default anyway.
