Teradata Query Rewriting & Partition Elimination

When it comes to row partition elimination, query rewriting can be a very powerful method for improving performance. The test setup below is based on a real-world optimization I did recently.

We are dealing with a very simple query: All we want to do is to retrieve the rows of one row-partition of our test table. Nevertheless, vast improvements were possible by rewriting the query.

Our test table contains about 400 million rows and is partitioned as shown in the DDL statement below:

CREATE MULTISET TABLE TheDatabase.TheTable
(
PK BIGINT NOT NULL,
COL1 CHAR(03) NOT NULL, -- type assumed; it matches the TMP_COL1 definition used later
...
COLn
) PRIMARY INDEX (PK)
PARTITION BY (
CASE_N( COL1 = '100', COL1 = '200', COL1 = '300', COL1 = '400',
COL1 = '500', COL1 = '600', COL1 = '700', COL1 = '800', COL1 = '900', NO CASE, UNKNOWN)
);
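
As a quick sanity check, the mapping of rows to row partitions can be inspected via Teradata's built-in PARTITION column (with CASE_N, the first matching condition, COL1 = '100', maps to partition 1). A minimal sketch against the test table:

-- Row counts per row partition of the test table
SELECT PARTITION, COUNT(*) AS cnt
FROM TheDatabase.TheTable
GROUP BY 1
ORDER BY 1;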

The query task is to retrieve the rows where the column value of COL1 is ‘100’. We will begin with the simplest way of writing the query:

SELECT * FROM TheDatabase.TheTable WHERE COL1 = '100';

As expected, the execution plan consists of one retrieve step.

Static partition elimination is used, as the optimizer can substitute the literal ‘100’ during the creation of the execution plan.

One important remark: in our tests, all needed statistics are available (“high confidence”) and fresh.
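
For completeness, statistics like these can be collected with the statements below (a sketch; it also includes statistics on the system column PARTITION, which are generally recommended for row-partitioned tables):

COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;
COLLECT STATISTICS COLUMN (PARTITION) ON TheDatabase.TheTable;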

1) First, we lock a distinct TheDatabase."pseudo table" for read on a RowHash
to prevent global deadlock for TheDatabase.TheTable.
2) Next, we lock TheDatabase.TheTable for read.
3) We do an all-AMPs RETRIEVE step from a single partition of
TheDatabase.TheTable with a condition of (
"TheDatabase.TheTable.COL1 = '100'") into Spool 1
(group_amps), which is built locally on the AMPs.  The input table
will not be cached in memory, but it is eligible for synchronized
scanning.  The result spool file will not be cached in memory.
The size of Spool 1 is estimated with high confidence to be
108,121,098 rows (23,354,157,168 bytes).  The estimated time for
this step is 18.85 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.  The total estimated time is 18.85 seconds.

The performance diagram below shows us the optimal situation and represents our benchmark:

Rows          Disk IOs   CPU Seconds   Spool Space
108,121,087   294,425    489.64        20,831,922,176
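
Figures like these can be pulled from the DBQL tables, assuming query logging is enabled on the system; a minimal sketch using the DBC.QryLogV view:

-- IO, CPU, and spool consumption of the current user's recently logged queries
SELECT QueryText, TotalIOCount, AMPCPUTime, SpoolUsage
FROM DBC.QryLogV
WHERE UserName = USER
ORDER BY StartTime DESC;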


Now let’s assume that we can’t use a literal for COL1, but we need to pass the value ‘100’ dynamically from another table into the query.

There are several ways of rewriting this query. Below is our first attempt:

-- a volatile table is assumed here (ON COMMIT PRESERVE ROWS; the plans reference TheUser.TMP_COL1)
CREATE VOLATILE TABLE TMP_COL1
(
COL1 CHAR(03) NOT NULL
) PRIMARY INDEX (COL1) ON COMMIT PRESERVE ROWS;

INSERT INTO TMP_COL1 VALUES ('100');

SELECT * FROM TheDatabase.TheTable WHERE COL1 IN (SELECT COL1 FROM TMP_COL1);

This is the resulting execution plan:
1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for
TheDatabase.TheTable.
2) Next, we lock TheDatabase.TheTable for read.
3) We do an all-AMPs RETRIEVE step from TheUser.TMP_COL1 by way
of an all-rows scan with no residual conditions into Spool 3
(all_amps), which is built locally on the AMPs.  Then we do a SORT
to order Spool 3 by the sort key in spool field1 (
TheUser.TMP_COL1.COL1) eliminating duplicate rows.  The
size of Spool 3 is estimated with high confidence to be 1 row (20
bytes).  The estimated time for this step is 0.02 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs.  The size of Spool 2 is
estimated with high confidence to be 90 rows (1,800 bytes).
5) We do an all-AMPs JOIN step from TheDatabase.TheTable by way
of an all-rows scan with no residual conditions, which is joined
to Spool 2 (Last Use) by way of an all-rows scan.
TheDatabase.TheTable and Spool 2 are joined using a inclusion
dynamic hash join, with a join condition of (
"TheDatabase.TheTable.COL1 = COL1").  The input table
TheDatabase.TheTable will not be cached in memory.  The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs.  The size of Spool 1 is estimated with low confidence to
be 27,071,447 rows (5,847,432,552 bytes).  The estimated time for
this step is 24.66 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.

In a first step, the temporary table TMP_COL1, which holds the single row with the value for COL1, is spooled and deduplicated, and the resulting row is duplicated to all AMPs.

Subsequently, an inclusion dynamic hash join is done.
The important insight is that no partition elimination takes place.

The query from our first attempt creates 97% more IOs (197% of the benchmark) and consumes 45% more CPU seconds (145% of the benchmark) when compared against our benchmark query; the spool usage is the same:

Rows          Disk IOs   CPU Seconds   Spool Space
108,121,087   580,757    711.81        20,831,922,176

That’s bad. Let’s see if we can rewrite the query in a way that triggers partition elimination.

Here is another revised version of the benchmarking query. This time, we are using a correlated subquery instead of the “IN (SELECT … FROM …)” construct:

SELECT * FROM TheDatabase.TheTable t01
WHERE EXISTS (SELECT COL1 FROM TMP_COL1 t02 WHERE t01.COL1 = t02.COL1);

A quick check of the execution plan shows that it doesn’t improve:

Explain SELECT * FROM TheDatabase.TheTable t01
WHERE EXISTS (SELECT COL1 FROM TMP_COL1 t02 WHERE t01.COL1 = t02.COL1)

 

1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for TheDatabase.t01.
2) Next, we lock TheDatabase.t01 for read.
3) We do an all-AMPs RETRIEVE step from TheUser.t02 by way of an
all-rows scan with no residual conditions into Spool 3 (all_amps),
which is built locally on the AMPs.  Then we do a SORT to order
Spool 3 by the sort key in spool field1 (TheUser.t02.COL1)
eliminating duplicate rows.  The size of Spool 3 is estimated with
high confidence to be 1 row (20 bytes).  The estimated time for
this step is 0.02 seconds.
4) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps) (compressed columns
allowed), which is duplicated on all AMPs.  The size of Spool 2 is
estimated with high confidence to be 90 rows (1,800 bytes).
5) We do an all-AMPs JOIN step from TheDatabase.t01 by way of an
all-rows scan with no residual conditions, which is joined to
Spool 2 (Last Use) by way of an all-rows scan.  TheDatabase.t01
and Spool 2 are joined using a inclusion dynamic hash join, with a
join condition of ("TheDatabase.t01.COL1 = COL1").  The
input table TheDatabase.t01 will not be cached in memory.  The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs.  The size of Spool 1 is estimated with low confidence to
be 27,071,447 rows (5,847,432,552 bytes).  The estimated time for
this step is 24.66 seconds.
6) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.

The performance diagram below shows that the performance metrics are almost the same as in the previous query:

Rows          Disk IOs   CPU Seconds   Spool Space
108,121,087   585,999    743.45        20,831,922,176

In a third attempt, I rewrote the query into a plain INNER JOIN:

SELECT * FROM TheDatabase.TheTable t01
INNER JOIN
TMP_COL1 t02
ON
t01.COL1 = t02.COL1;

Below you can see the new execution plan:

Explain SELECT * FROM TheDatabase.TheTable t01
INNER JOIN
TMP_COL1 t02
ON t01.COL1 = t02.COL1

1) First, we lock a distinct TheDatabase."pseudo table" for read
on a RowHash to prevent global deadlock for TheDatabase.t01.
2) Next, we lock TheDatabase.t01 for read.
3) We do an all-AMPs RETRIEVE step from TheUser.t02 by way of an
all-rows scan with no residual conditions into Spool 2 (all_amps)
(compressed columns allowed), which is duplicated on all AMPs.
Then we do a SORT to partition by rowkey.  The size of Spool 2 is
estimated with high confidence to be 180 rows (2,880 bytes).  The
estimated time for this step is 0.01 seconds.
4) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of an
all-rows scan, which is joined to TheDatabase.t01 by way of an
all-rows scan with no residual conditions.  Spool 2 and
TheDatabase.t01 are joined using a product join, with a join
condition of ("TheDatabase.t01.COL1 = COL1") enhanced by
dynamic partition elimination.  The input table TheDatabase.t01
will not be cached in memory, but it is eligible for synchronized
scanning.  The result goes into Spool 1 (group_amps), which is
built locally on the AMPs.  The size of Spool 1 is estimated with
low confidence to be 54,142,893 rows (11,857,293,567 bytes).  The
estimated time for this step is 26.79 seconds.
5) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.  The total estimated time is 26.80 seconds.

The execution plan for this version of our benchmark query looks promising.

The join step is now a product join but with dynamic partition elimination.

As we know that the left table (TMP_COL1) contains only one row, we can expect better performance than in our previous two queries:

The single row from TMP_COL1 is kept in memory (FSG cache) and compared only against the matching partitions (i.e., the one partition where COL1 = ‘100’).

The impact of the product join should therefore be small, while dynamic partition elimination should improve query performance a lot.
Indeed, the performance diagram below shows that we reach almost the same performance characteristics as in the benchmarking query with static partition elimination (only the CPU seconds are a little higher):

Rows          Disk IOs   CPU Seconds   Spool Space
108,121,087   295,099    596.86        21,214,113,280
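
To sum up, here are all four variants side by side (numbers taken from the performance diagrams above):

Query variant                                  Disk IOs   CPU Seconds   Spool Space
Literal (static partition elimination)         294,425    489.64        20,831,922,176
IN subquery (no partition elimination)         580,757    711.81        20,831,922,176
Correlated EXISTS (no partition elimination)   585,999    743.45        20,831,922,176
INNER JOIN (dynamic partition elimination)     295,099    596.86        21,214,113,280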

Conclusion:

While the optimizer usually does a great job, it is always worth analyzing the execution plan and looking for possible improvements. In our example, we were able to reduce disk IOs by about 50% (from 580,757 to 295,099) with a simple query rewrite.
