Teradata Statistics – A Case Study

The Optimizer usually does a good job when it comes to using statistics. Nevertheless, it sometimes pays off to take a closer look at the execution plan and the Optimizer’s cardinality estimations.

Since Teradata 14.10, I have made it a habit to include the SHOW STATISTICS statement in my analysis, as its output metrics can help a lot in discovering statistics problems.
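
For example, the histogram details shown later in this article can be displayed as follows (a minimal example using the table from the case study below; adjust the names to your own objects):

SHOW STATISTICS VALUES
COLUMN ( CalendarDate )
ON TheDatabase.TheCalendar;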

Teradata Statistics – Avoid the Heuristics

The following case study shows how the Optimizer can be convinced (or forced) to use collected statistics instead of applying inaccurate heuristics.

The case study is based on the following two tables:

SHOW TABLE TheDatabase.TableOneDate;

CREATE SET TABLE TheDatabase.TableOneDate
(
OneDate DATE FORMAT 'YYYY-MM-DD'
)  UNIQUE PRIMARY INDEX ( OneDate );

 

SELECT COUNT(*) FROM TheDatabase.TableOneDate;
-> 1 row

SHOW TABLE TheDatabase.TheCalendar;

CREATE MULTISET TABLE TheDatabase.TheCalendar
(
CalendarDate DATE FORMAT 'yyyy-mm-dd'
) PRIMARY INDEX ( CalendarDate );

SELECT COUNT(*) FROM TheDatabase.TheCalendar;

-> 36,889 rows

I removed from both tables all columns that are not relevant for this example, and collected statistics on the remaining columns:

COLLECT STATS ON TheDatabase.TableOneDate COLUMN OneDate;
COLLECT STATS ON TheDatabase.TheCalendar COLUMN CalendarDate;
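
A quick way to verify that the statistics are in place is HELP STATISTICS (a minimal check; the exact output layout depends on your Teradata release):

HELP STATISTICS TheDatabase.TheCalendar;
-- Shows the collection date/time, the number of unique values, and the column name(s)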


The following query is used in several hundred badly performing reports and therefore caught my attention (it is not the exact query, but you should get the idea):

EXPLAIN SELECT * FROM TheDatabase.TheCalendar
WHERE CalendarDate >=
(
SELECT
OneDate
FROM TheDatabase.TableOneDate
);

1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.TableOneDate.
2) Next, we lock a distinct TheDatabase."pseudo table" for
read on a RowHash to prevent global deadlock for
TheDatabase.TheCalendar.
3) We lock TheDatabase.TableOneDate for read, and we
lock TheDatabase.TheCalendar for read.
4) We do an all-AMPs RETRIEVE step from
TheDatabase.TableOneDate by way of an all-rows
scan with no residual conditions into Spool 1 (all_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
high confidence to be 1 row (25 bytes). The estimated time for
this step is 0.03 seconds.
5) We do an all-AMPs DISPATCHER RETRIEVE step from Spool 1 (Last Use)
by way of an all-rows scan and send the rows back to the
Dispatcher. The size is estimated with high confidence to be 1
row. The estimated time for this step is 0.01 seconds.
6) We do an all-AMPs RETRIEVE step from
TheDatabase.TheCalendar by way of an all-rows scan with a
condition of ("TheDatabase.TheCalendar.CalendarDate >= :%SSQ20")
into Spool 2 (group_amps), which is built locally on the AMPs.
The size of Spool 2 is estimated with no confidence to be 12,297
rows (4,943,394 bytes). The estimated time for this step is 0.04
seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 2 are sent back to the user as the result of
statement 1. The total estimated time is 0.08 seconds.

Although the above execution plan doesn’t look that bad, there was something I did not like about it: the estimated number of rows in step 6, namely 12,297, is far too low. The result set in reality has 32,482 rows.
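
The actual number of result rows is easy to verify once you know the date stored in TableOneDate (2012-01-26, as we will see shortly):

SELECT COUNT(*)
FROM TheDatabase.TheCalendar
WHERE CalendarDate >= DATE '2012-01-26';

-> 32,482 rows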

The reason is that the Optimizer is not using the available statistics we collected above, but instead applies heuristics (hence the “no confidence”):

12,297 (estimation) / 36,889 (table cardinality) = 33%

You can easily reproduce this with a similar setup: the Optimizer will always estimate the number of rows in the result set to be 33% of the total table rows.
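
Applying this heuristic to our calendar table reproduces the estimate from step 6 (the Optimizer apparently rounds the result up):

36,889 rows * 1/3 = 12,296.33 ≈ 12,297 rows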

Below you can see the statistics histogram for the collected statistics on CalendarDate, as returned by SHOW STATISTICS VALUES:

COLLECT STATISTICS
COLUMN ( CalendarDate )
ON TheDatabase.TheCalendar
VALUES
(
/** SummaryInfo **/
/* NumOfBiasedValues */ 0,
/* NumOfEHIntervals */ 249,
/* NumOfHistoryRecords */ 7,
/* MinVal */ DATE '2000-01-02',
/* MaxVal */ DATE '2100-12-31',
/* ModeVal */ DATE '2000-01-02',
/* HighModeFreq */ 1,
/* NumOfDistinctVals */ 36889,
/* NumOfRows */ 36889,
/** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/
/* 1 */ DATE '2000-10-04', DATE '2000-01-02', 1, 1, 276, 276,
..
/* 249 */ DATE '2100-12-31', DATE '2100-09-08', 1, 1, 114, 114,

While it seems strange that the Optimizer does not use the available statistics, I had to accept this fact. I changed the query, replacing the subquery with a date literal (2012-01-26 is the date the subquery of our previous query returns):

EXPLAIN SELECT * FROM TheDatabase.TheCalendar
WHERE CalendarDate >=
(
SELECT
DATE '2012-01-26'
);

1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.TheCalendar.
2) Next, we lock TheDatabase.TheCalendar for read.
3) We do an all-AMPs RETRIEVE step from
TheDatabase.TheCalendar by way of an all-rows scan with a
condition of ("TheDatabase.TheCalendar.CalendarDate >= DATE
'2012-01-26'") into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with high
confidence to be 32,482 rows (1,006,942 bytes). The estimated
time for this step is 0.03 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.

The new execution plan shows that the Optimizer uses the collected statistics and correctly estimates the number of result rows. Unfortunately, using a literal is not an option for my report optimization task.

To get a better understanding of how the Optimizer works, I changed the query again, selecting the literal from a “dummy table”:

EXPLAIN SELECT * FROM TheDatabase.TheCalendar
WHERE CalendarDate >=
(
SELECT DATE '2012-01-26' FROM
(
SELECT 1 AS x
) x
);

1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.TheCalendar.
2) Next, we lock TheDatabase.TheCalendar for read.
3) We do an INSERT into Spool 2.
4) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 1 (used to materialize view, derived
table, table function or table operator x) (group_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
high confidence to be 1 row (22 bytes). The estimated time for
this step is 0.01 seconds.
5) We do a group-AMP RETRIEVE step from Spool 1 (Last Use) by way of
an all-rows scan into Spool 3 (group_amps), which is built locally
on that AMP. The size of Spool 3 is estimated with high
confidence to be 1 row (25 bytes). The estimated time for this
step is 0.01 seconds.
6) We do a group-AMP DISPATCHER RETRIEVE step from Spool 3 (Last Use)
by way of an all-rows scan and send the rows back to the
Dispatcher. The size is estimated with high confidence to be 1
row. The estimated time for this step is 0.01 seconds.
7) We do an all-AMPs RETRIEVE step from
TheDatabase.TheCalendar by way of an all-rows scan with a
condition of ("TheDatabase.TheCalendar.CalendarDate >= :%SSQ20")
into Spool 4 (group_amps), which is built locally on the AMPs.
The size of Spool 4 is estimated with no confidence to be 12,297
rows (381,207 bytes). The estimated time for this step is 0.03
seconds.
8) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 4 are sent back to the user as the result of
statement 1.

The execution plan shows that we are back to the heuristics. This is not what I wanted: by optimizing the “calendar view part” shared by all my reports, I expected an improvement for all of them.

Finally, I came up with the following query, using a correlated subquery:

EXPLAIN SELECT * FROM TheDatabase.TheCalendar t01
WHERE EXISTS
(
SELECT
*
FROM TheDatabase.TableOneDate t02
WHERE t01.CalendarDate >= OneDate
);

1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.t01.
2) Next, we lock a distinct TheDatabase."pseudo table" for
read on a RowHash to prevent global deadlock for
TheDatabase.t02.
3) We lock TheDatabase.t01 for read, and we lock
TheDatabase.t02 for read.
4) We do an all-AMPs SUM step to aggregate from
TheDatabase.t02 by way of an all-rows scan with no
residual conditions. Aggregate Intermediate Results are computed
globally, then placed in Spool 3.
5) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps), which is duplicated on
all AMPs. The size of Spool 2 is estimated with high confidence
to be 72 rows (1,224 bytes).
6) We do an all-AMPs JOIN step from TheDatabase.t01 by way
of an all-rows scan with no residual conditions, which is joined
to Spool 2 (Last Use) by way of an all-rows scan.
TheDatabase.t01 and Spool 2 are joined using an
inclusion product join, with a join condition of (
"TheDatabase.t01.CalendarDate >= OneDate"). The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs. The size of Spool 1 is estimated with low confidence to be
32,482 rows (1,006,942 bytes). The estimated time for this
step is 3.77 seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.

The execution plan changed completely, but it allows the Optimizer to use the statistics: the estimate exactly matches the number of rows in the result set.

Although the new execution plan involves a product join, this is not a big deal given the small amount of data. More important: the estimates are now excellent. As the “calendar view part” is cross joined several times in each report, having correct estimates is crucial.
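
To reuse this trick across all reports, the “calendar view part” can be wrapped into a view once. Below is a sketch of how such a view might look; the view name CalendarFromOneDate is made up for this example:

REPLACE VIEW TheDatabase.CalendarFromOneDate AS
SELECT t01.*
FROM TheDatabase.TheCalendar t01
WHERE EXISTS
(
SELECT
*
FROM TheDatabase.TableOneDate t02
WHERE t01.CalendarDate >= t02.OneDate
);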

I hope this case study motivates you to dive deeper into the Optimizer’s details. As mentioned above, the SHOW STATISTICS statement in particular offers a lot of useful information.
