Query rewriting is a powerful approach for improving performance through row partition elimination. The test setup below is based on a real optimization I carried out recently.
We are dealing with a straightforward query: we only want to retrieve the rows of a single row partition of our test table. Nevertheless, significant improvements were possible by rewriting the query.
Our table for testing comprises approximately 400 million rows and is partitioned according to the following DDL statement:
CREATE MULTISET TABLE TheDatabase.TheTable
(
PK BIGINT NOT NULL,
COL1 CHAR(03) NOT NULL,
…
COLn
) PRIMARY INDEX (PK)
PARTITION BY (
CASE_N( COL1 = '100', COL1 = '200', COL1 = '300', COL1 = '400',
COL1 = '500', COL1 = '600', COL1 = '700', COL1 = '800', COL1 = '900', NO CASE, UNKNOWN)
);
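To see how rows map to the CASE_N partitions, Teradata's built-in PARTITION pseudo-column can be queried directly. A minimal sketch, using the article's placeholder table name:

```sql
-- Sketch: count rows per CASE_N partition.
-- PARTITION is Teradata's pseudo-column holding each row's
-- partition number (NO CASE and UNKNOWN get their own numbers).
SELECT PARTITION AS partition_no, COUNT(*) AS row_count
FROM TheDatabase.TheTable
GROUP BY 1
ORDER BY 1;
```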
The task is to retrieve rows where the value in the COL1 column is ‘100’. We’ll start by writing the most straightforward query:
SELECT * FROM TheDatabase.TheTable WHERE COL1 = '100';
As anticipated, the execution plan comprises a single retrieve step.
The optimizer can apply static partition elimination because the literal value '100' is known when the execution plan is created.
Important note: The required statistics are readily available (“high confidence”) and up-to-date in our tests.
1) First, we lock a distinct TheDatabase."pseudo table" for read on a RowHash
to prevent global deadlock for TheDatabase.TheTable.
2) Next, we lock TheDatabase.TheTable for read.
3) We do an all-AMPs RETRIEVE step from a single partition of
TheDatabase.TheTable with a condition of (
"TheDatabase.TheTable.COL1 = '100'") into Spool 1
(group_amps), which is built locally on the AMPs. The input table
will not be cached in memory, but it is eligible for synchronized
scanning. The result spool file will not be cached in memory.
The size of Spool 1 is estimated with high confidence to be
108,121,098 rows (23,354,157,168 bytes). The estimated time for
this step is 18.85 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 18.85 seconds.
The performance diagram below shows us the optimal situation and represents our benchmark:
Rows | Disk IOs | CPU Seconds | Spool Space |
108.121.087,00 | 294.425,00 | 489,64 | 20.831.922.176,00 |
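As noted above, up-to-date statistics are a prerequisite for the "high confidence" estimates. A sketch of how they might be collected (the exact set of statistics to maintain depends on the workload):

```sql
-- Sketch: statistics on the partitioning column and on the
-- PARTITION pseudo-column support accurate partition-related
-- estimates; adjust to the actual workload.
COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;
COLLECT STATISTICS COLUMN (PARTITION) ON TheDatabase.TheTable;
```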
Assuming a literal cannot be used for COL1, we must dynamically pass the value ‘100’ from another table into the query.
There are several ways of rewriting this query. Below is our first attempt:
CREATE VOLATILE TABLE TMP_COL1
(
COL1 CHAR(03) NOT NULL
) PRIMARY INDEX (COL1) ON COMMIT PRESERVE ROWS;
INSERT INTO TMP_COL1 VALUES ('100');
SELECT * FROM TheDatabase.TheTable WHERE COL1 IN (SELECT COL1 FROM TMP_COL1);
This is the resulting execution plan:
1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for
TheDatabase.TheTable.
2) Next, we lock TheDatabase.TheTable for read.
3) We do an all-AMPs RETRIEVE step from TheUser.TMP_COL1 by way
of an all-rows scan with no residual conditions into Spool 3
(all_amps), which is built locally on the AMPs. Then we do a SORT
to order Spool 3 by the sort key in spool field1 (
TheUser.TMP_COL1.COL1) eliminating duplicate rows. The
size of Spool 3 is estimated with high confidence to be 1 row (20
bytes). The estimated time for this step is 0.02 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs. The size of Spool 2 is
estimated with high confidence to be 90 rows (1,800 bytes).
5) We do an all-AMPs JOIN step from TheDatabase.TheTable by way
of an all-rows scan with no residual conditions, which is joined
to Spool 2 (Last Use) by way of an all-rows scan.
TheDatabase.TheTable and Spool 2 are joined using a inclusion
dynamic hash join, with a join condition of (
"TheDatabase.TheTable.COL1 = COL1"). The input table
TheDatabase.TheTable will not be cached in memory. The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs. The size of Spool 1 is estimated with low confidence to
be 27,071,447 rows (5,847,432,552 bytes). The estimated time for
this step is 24.66 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
In this plan, the solitary row of TMP_COL1 is first deduplicated and then duplicated to all AMPs as a spool.
An inclusion dynamic hash join is performed, without any partition elimination.
The query from our first attempt creates about twice the disk IOs (197% of our benchmark) and consumes 145% of the CPU seconds of our benchmark query (spool usage is the same):
Rows | Disk IOs | CPU Seconds | Spool Space |
108.121.087,00 | 580.757,00 | 711,81 | 20.831.922.176,00 |
We need to rewrite the query to trigger partition elimination.
Here is another rewrite of the benchmark query, this time using a correlated subquery with EXISTS instead of the IN (SELECT …) construct:
SELECT * FROM TheDatabase.TheTable t01
WHERE EXISTS (SELECT COL1 FROM TMP_COL1 t02 WHERE t01.COL1 = t02.COL1);
Inspecting the execution plan, it is evident that there is no improvement:
Explain SELECT * FROM TheDatabase.TheTable t01
WHERE EXISTS (SELECT COL1 FROM TMP_COL1 t02 WHERE t01.COL1 = t02.COL1)
1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for TheDatabase.t01.
2) Next, we lock TheDatabase.t01 for read.
3) We do an all-AMPs RETRIEVE step from TheUser.t02 by way of an
all-rows scan with no residual conditions into Spool 3 (all_amps),
which is built locally on the AMPs. Then we do a SORT to order
Spool 3 by the sort key in spool field1 (TheUser.t02.COL1)
eliminating duplicate rows. The size of Spool 3 is estimated with
high confidence to be 1 row (20 bytes). The estimated time for
this step is 0.02 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs. The size of Spool 2 is
estimated with high confidence to be 90 rows (1,800 bytes).
5) We do an all-AMPs JOIN step from TheDatabase.t01 by way of an
all-rows scan with no residual conditions, which is joined to
Spool 2 (Last Use) by way of an all-rows scan. TheDatabase.t01
and Spool 2 are joined using a inclusion dynamic hash join, with a
join condition of ("TheDatabase.t01.COL1 = COL1"). The
input table TheDatabase.t01 will not be cached in memory. The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs. The size of Spool 1 is estimated with low confidence to
be 27,071,447 rows (5,847,432,552 bytes). The estimated time for
this step is 24.66 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.
The following performance diagram indicates that the performance metrics remain consistent with those of the previous query:
Rows | Disk IOs | CPU Seconds | Spool Space |
108.121.087,00 | 585.999,00 | 743,45 | 20.831.922.176,00 |
On the third attempt, I rewrote the query as a straightforward INNER JOIN:
SELECT * FROM TheDatabase.TheTable t01
INNER JOIN
TMP_COL1 t02
ON
t01.COL1 = t02.COL1;
Please find below the updated execution plan:
Explain SELECT * FROM TheDatabase.TheTable t01
INNER JOIN
TMP_COL1 t02
ON T01.COL1 = T02.COL1
1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for TheDatabase.t01.
2) Next, we lock TheDatabase.t01 for read.
3) We do an all-AMPs RETRIEVE step from TheUser.t02 by way of an
all-rows scan with no residual conditions into Spool 2 (all_amps)
(compressed columns allowed), which is duplicated on all AMPs.
Then we do a SORT to partition by rowkey. The size of Spool 2 is
estimated with high confidence to be 180 rows (2,880 bytes). The
estimated time for this step is 0.01 seconds.
4) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an
all-rows scan, which is joined to TheDatabase.t01 by way of an
all-rows scan with no residual conditions. Spool 2 and
TheDatabase.t01 are joined using a product join, with a join
condition of ("TheDatabase.t01.COL1 = COL1") enhanced by
dynamic partition elimination. The input table TheDatabase.t01
will not be cached in memory, but it is eligible for synchronized
scanning. The result goes into Spool 1 (group_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
low confidence to be 54,142,893 rows (11,857,293,567 bytes). The
estimated time for this step is 26.79 seconds.
5) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 26.80 seconds.
The execution plan for this rewrite looks encouraging:
the join step applies dynamic partition elimination in a product join.
Since the left table (TMP_COL1) contains only one row, we can expect better performance than with the previous two queries.
The FSG cache holds the single TMP_COL1 row, which is compared against the matching partition where COL1 = '100'.
Hence, the product join's overhead should be negligible, while dynamic partition elimination should improve query performance notably. Indeed, the performance diagram below shows that we achieve nearly the same results (with only a slight increase in CPU seconds) as the benchmark query that used static partition elimination.
Rows | Disk IOs | CPU Seconds | Spool Space |
108.121.087,00 | 295.099,00 | 596,86 | 21.214.113.280,00 |
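The winning rewrite can be summarized as a reusable pattern, sketched here with the article's placeholder names: drive the partition column from a small one-row table and join with a plain INNER JOIN, so the optimizer can apply dynamic partition elimination.

```sql
-- Sketch: pass the partition value via a one-row volatile table.
CREATE VOLATILE TABLE TMP_COL1
(
    COL1 CHAR(03) NOT NULL
) PRIMARY INDEX (COL1) ON COMMIT PRESERVE ROWS;

INSERT INTO TMP_COL1 VALUES ('100');

-- A plain INNER JOIN (rather than IN or EXISTS) allowed the
-- optimizer to use dynamic partition elimination in our tests.
SELECT t01.*
FROM TheDatabase.TheTable t01
INNER JOIN TMP_COL1 t02
    ON t01.COL1 = t02.COL1;
```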
Conclusion:
Although the optimizer generally performs well, analyzing the execution plan and looking for potential improvements is worthwhile. In this case, a minor query rewrite cut disk IOs roughly in half compared with the IN and EXISTS rewrites, bringing them back to the level of the static partition elimination benchmark.