Teradata Query Rewriting & Partition Elimination

When it comes to row partition elimination, query rewriting can be a very powerful method for improving performance. The test setup below is based on a real-world optimization I did recently.

We are dealing with a very simple query: All we want to do is to retrieve the rows of one row-partition of our test table. Nevertheless, vast improvements were possible by rewriting the query.

Our test table contains about 400 million rows and is partitioned as shown in the DDL statement below:

CREATE MULTISET TABLE TheDatabase.TheTable
(
PK BIGINT NOT NULL,
COL1 CHAR(03) NOT NULL, -- type assumed; it matches the TMP_COL1 definition used later
...
COLn
) PRIMARY INDEX (PK)
PARTITION BY (
CASE_N( COL1 = '100', COL1 = '200', COL1 = '300', COL1 = '400',
COL1 = '500', COL1 = '600', COL1 = '700', COL1 = '800', COL1 = '900', NO CASE, UNKNOWN)
);
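
As a quick sanity check, the mapping of rows to row partitions can be inspected via Teradata's built-in PARTITION column (with CASE_N, the first matching condition, COL1 = '100', maps to partition 1). A minimal sketch against the test table:

-- Row counts per row partition of the test table
SELECT PARTITION, COUNT(*) AS cnt
FROM TheDatabase.TheTable
GROUP BY 1
ORDER BY 1;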

The query task is to retrieve the rows where the column value of COL1 is ‘100’. We will begin with the simplest way of writing the query:

SELECT * FROM TheDatabase.TheTable WHERE COL1 = '100';

As expected, the execution plan consists of one retrieve step.

Static partition elimination is used, as the optimizer can substitute the literal ‘100’ during the creation of the execution plan.

One important remark: in our tests, all needed statistics are available (“high confidence”) and fresh.
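
For completeness, statistics like these can be collected with the statements below (a sketch; it also includes statistics on the system column PARTITION, which are generally recommended for row-partitioned tables):

COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;
COLLECT STATISTICS COLUMN (PARTITION) ON TheDatabase.TheTable;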

1) First, we lock a distinct TheDatabase."pseudo table" for read on a RowHash
to prevent global deadlock for TheDatabase.TheTable.
2) Next, we lock TheDatabase.TheTable for read.
3) We do an all-AMPs RETRIEVE step from a single partition of
TheDatabase.TheTable with a condition of (
"TheDatabase.TheTable.COL1 = '100'") into Spool 1
(group_amps), which is built locally on the AMPs.  The input table
will not be cached in memory, but it is eligible for synchronized
scanning.  The result spool file will not be cached in memory.
The size of Spool 1 is estimated with high confidence to be
108,121,098 rows (23,354,157,168 bytes).  The estimated time for
this step is 18.85 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.  The total estimated time is 18.85 seconds.

The performance diagram below shows us the optimal situation and represents our benchmark:

Rows          Disk IOs   CPU Seconds   Spool Space
108,121,087   294,425    489.64        20,831,922,176
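
Figures like these can be pulled from the DBQL tables, assuming query logging is enabled on the system; a minimal sketch using the DBC.QryLogV view:

-- IO, CPU, and spool consumption of the current user's recently logged queries
SELECT QueryText, TotalIOCount, AMPCPUTime, SpoolUsage
FROM DBC.QryLogV
WHERE UserName = USER
ORDER BY StartTime DESC;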


Now let’s assume that we can’t use a literal for COL1, but we need to pass the value ‘100’ dynamically from another table into the query.

There are several ways of rewriting this query. Below is our first attempt:

-- a volatile table is assumed here (ON COMMIT PRESERVE ROWS; the plans reference TheUser.TMP_COL1)
CREATE VOLATILE TABLE TMP_COL1
(
COL1 CHAR(03) NOT NULL
) PRIMARY INDEX (COL1) ON COMMIT PRESERVE ROWS;

INSERT INTO TMP_COL1 VALUES ('100');

SELECT * FROM TheDatabase.TheTable WHERE COL1 IN (SELECT COL1 FROM TMP_COL1);

This is the resulting execution plan:
1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for
TheDatabase.TheTable.
2) Next, we lock TheDatabase.TheTable for read.
3) We do an all-AMPs RETRIEVE step from TheUser.TMP_COL1 by way
of an all-rows scan with no residual conditions into Spool 3
(all_amps), which is built locally on the AMPs.  Then we do a SORT
to order Spool 3 by the sort key in spool field1 (
TheUser.TMP_COL1.COL1) eliminating duplicate rows.  The
size of Spool 3 is estimated with high confidence to be 1 row (20
bytes).  The estimated time for this step is 0.02 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs.  The size of Spool 2 is
estimated with high confidence to be 90 rows (1,800 bytes).
5) We do an all-AMPs JOIN step from TheDatabase.TheTable by way
of an all-rows scan with no residual conditions, which is joined
to Spool 2 (Last Use) by way of an all-rows scan.
TheDatabase.TheTable and Spool 2 are joined using a inclusion
dynamic hash join, with a join condition of (
"TheDatabase.TheTable.COL1 = COL1").  The input table
TheDatabase.TheTable will not be cached in memory.  The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs.  The size of Spool 1 is estimated with low confidence to
be 27,071,447 rows (5,847,432,552 bytes).  The estimated time for
this step is 24.66 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.

In a first step, the temporary table TMP_COL1, which holds the single row with the value for COL1, is spooled and deduplicated, and the resulting row is duplicated to all AMPs.

Subsequently, an inclusion dynamic hash join is done.
The important insight is that no partition elimination takes place.

The query from our first attempt creates 97% more IOs (197% of the benchmark) and consumes 45% more CPU seconds (145% of the benchmark) when compared against our benchmark query; the spool usage is the same:

Rows          Disk IOs   CPU Seconds   Spool Space
108,121,087   580,757    711.81        20,831,922,176

That’s bad. Let’s see if we can rewrite the query in a way that triggers partition elimination.

Here is another revised version of the benchmarking query. This time, we are using a correlated subquery instead of the “IN (SELECT … FROM …)” construct:

SELECT * FROM TheDatabase.TheTable t01
WHERE EXISTS (SELECT COL1 FROM TMP_COL1 t02 WHERE t01.COL1 = t02.COL1);

A quick check of the execution plan shows that it doesn’t improve:

Explain SELECT * FROM TheDatabase.TheTable t01
WHERE EXISTS (SELECT COL1 FROM TMP_COL1 t02 WHERE t01.COL1 = t02.COL1)

 

1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for TheDatabase.t01.
2) Next, we lock TheDatabase.t01 for read.
3) We do an all-AMPs RETRIEVE step from TheUser.t02 by way of an
all-rows scan with no residual conditions into Spool 3 (all_amps),
which is built locally on the AMPs.  Then we do a SORT to order
Spool 3 by the sort key in spool field1 (TheUser.t02.COL1)
eliminating duplicate rows.  The size of Spool 3 is estimated with
high confidence to be 1 row (20 bytes).  The estimated time for
this step is 0.02 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs.  The size of Spool 2 is
estimated with high confidence to be 90 rows (1,800 bytes).
5) We do an all-AMPs JOIN step from TheDatabase.t01 by way of an
all-rows scan with no residual conditions, which is joined to
Spool 2 (Last Use) by way of an all-rows scan.  TheDatabase.t01
and Spool 2 are joined using a inclusion dynamic hash join, with a
join condition of ("TheDatabase.t01.COL1 = COL1").  The
input table TheDatabase.t01 will not be cached in memory.  The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs.  The size of Spool 1 is estimated with low confidence to
be 27,071,447 rows (5,847,432,552 bytes).  The estimated time for
this step is 24.66 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.

The performance diagram below shows that the performance metrics are almost the same as in the previous query:

Rows          Disk IOs   CPU Seconds   Spool Space
108,121,087   585,999    743.45        20,831,922,176

In a third attempt, I rewrote the query into a plain INNER JOIN:

SELECT * FROM TheDatabase.TheTable t01
INNER JOIN
TMP_COL1 t02
ON
t01.COL1 = t02.COL1;

Below you can see the new execution plan:

Explain SELECT * FROM TheDatabase.TheTable t01
INNER JOIN
TMP_COL1 t02
ON t01.COL1 = t02.COL1

1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for TheDatabase.t01.
2) Next, we lock TheDatabase.t01 for read.
3) We do an all-AMPs RETRIEVE step from TheUser.t02 by way of an
all-rows scan with no residual conditions into Spool 2 (all_amps)
(compressed columns allowed), which is duplicated on all AMPs.
Then we do a SORT to partition by rowkey.  The size of Spool 2 is
estimated with high confidence to be 180 rows (2,880 bytes).  The
estimated time for this step is 0.01 seconds.
4) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an
all-rows scan, which is joined to TheDatabase.t01 by way of an
all-rows scan with no residual conditions.  Spool 2 and
TheDatabase.t01 are joined using a product join, with a join
condition of ("TheDatabase.t01.COL1 = COL1") enhanced by
dynamic partition elimination.  The input table TheDatabase.t01
will not be cached in memory, but it is eligible for synchronized
scanning.  The result goes into Spool 1 (group_amps), which is
built locally on the AMPs.  The size of Spool 1 is estimated with
low confidence to be 54,142,893 rows (11,857,293,567 bytes).  The
estimated time for this step is 26.79 seconds.
5) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.  The total estimated time is 26.80 seconds.

The execution plan for this version of our benchmark query looks promising.

The join step is now a product join but with dynamic partition elimination.

As we know that the left table (TMP_COL1) contains only one row, we can expect better performance than in our previous two queries:

The single row from TMP_COL1 is kept in memory (FSG cache) and compared only against the matching partitions (i.e., the one partition where COL1 = ‘100’).

The impact of the product join should therefore be small, while dynamic partition elimination should improve query performance a lot.
Indeed, the performance diagram below shows that we reach almost the same performance characteristics as in the benchmarking query with static partition elimination (only the CPU seconds are a little higher):

Rows          Disk IOs   CPU Seconds   Spool Space
108,121,087   295,099    596.86        21,214,113,280
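
To sum up, here are all four variants side by side (numbers taken from the performance diagrams above):

Query variant                                  Disk IOs   CPU Seconds   Spool Space
Literal (static partition elimination)         294,425    489.64        20,831,922,176
IN subquery (no partition elimination)         580,757    711.81        20,831,922,176
Correlated EXISTS (no partition elimination)   585,999    743.45        20,831,922,176
INNER JOIN (dynamic partition elimination)     295,099    596.86        21,214,113,280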

Conclusion:

While the optimizer usually does a great job, it is always worth analyzing the execution plan and looking for possible improvements. In our example, we were able to reduce disk IOs by about 50% (from 580,757 to 295,099) with a simple query rewrite.
