Teradata Statistics – A Case Study

The Optimizer usually does a good job when it comes to using statistics. Nevertheless, it sometimes pays off to take a closer look at the execution plan and the Optimizer’s cardinality estimations.

Since Teradata 14.10, I have made it a habit to include the SHOW STATISTICS statement in my analysis, as its output metrics can help a lot in discovering statistics problems.
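
For example, the histogram details shown later in this article can be displayed as follows (a minimal example using the table from the case study below; adjust the names to your own objects):

SHOW STATISTICS VALUES
COLUMN ( CalendarDate )
ON TheDatabase.TheCalendar;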

Teradata Statistics – Avoid the Heuristics

The following case study shows how the Optimizer can be convinced (or forced) to use collected statistics instead of applying inaccurate heuristics.

The case study is based on the following two tables:

SHOW TABLE TheDatabase.TableOneDate;

CREATE SET TABLE TheDatabase.TableOneDate
(
OneDate DATE FORMAT 'YYYY-MM-DD'
)  UNIQUE PRIMARY INDEX ( OneDate );

 

SELECT COUNT(*) FROM TheDatabase.TableOneDate;
-> 1 row

SHOW TABLE TheDatabase.TheCalendar;

CREATE MULTISET TABLE TheDatabase.TheCalendar
(
CalendarDate DATE FORMAT 'yyyy-mm-dd'
) PRIMARY INDEX ( CalendarDate );

SELECT COUNT(*) FROM TheDatabase.TheCalendar;

-> 36,889 rows

I removed from both tables all columns that are not relevant for this example, and collected statistics on the remaining columns:

COLLECT STATS ON TheDatabase.TableOneDate COLUMN OneDate;
COLLECT STATS ON TheDatabase.TheCalendar COLUMN CalendarDate;
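
A quick way to verify that the statistics are in place is HELP STATISTICS (a minimal check; the exact output layout depends on your Teradata release):

HELP STATISTICS TheDatabase.TheCalendar;
-- Shows the collection date/time, the number of unique values, and the column name(s)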


The following query is used in several hundred badly performing reports and therefore caught my attention (it is not the exact query, but you should get the idea):

EXPLAIN SELECT * FROM TheDatabase.TheCalendar
WHERE CalendarDate >=
(
SELECT
OneDate
FROM TheDatabase.TableOneDate
);

1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.TableOneDate.
2) Next, we lock a distinct TheDatabase."pseudo table" for
read on a RowHash to prevent global deadlock for
TheDatabase.TheCalendar.
3) We lock TheDatabase.TableOneDate for read, and we
lock TheDatabase.TheCalendar for read.
4) We do an all-AMPs RETRIEVE step from
TheDatabase.TableOneDate by way of an all-rows
scan with no residual conditions into Spool 1 (all_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
high confidence to be 1 row (25 bytes). The estimated time for
this step is 0.03 seconds.
5) We do an all-AMPs DISPATCHER RETRIEVE step from Spool 1 (Last Use)
by way of an all-rows scan and send the rows back to the
Dispatcher. The size is estimated with high confidence to be 1
row. The estimated time for this step is 0.01 seconds.
6) We do an all-AMPs RETRIEVE step from
TheDatabase.TheCalendar by way of an all-rows scan with a
condition of ("TheDatabase.TheCalendar.CalendarDate >= :%SSQ20")
into Spool 2 (group_amps), which is built locally on the AMPs.
The size of Spool 2 is estimated with no confidence to be 12,297
rows (4,943,394 bytes). The estimated time for this step is 0.04
seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 2 are sent back to the user as the result of
statement 1. The total estimated time is 0.08 seconds.

Although the above execution plan doesn’t look that bad, there was something I did not like about it: the estimated number of rows in step 6, namely 12,297, is far too low. The result set in reality has 32,482 rows.
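
The actual number of result rows is easy to verify once you know the date stored in TableOneDate (2012-01-26, as we will see shortly):

SELECT COUNT(*)
FROM TheDatabase.TheCalendar
WHERE CalendarDate >= DATE '2012-01-26';

-> 32,482 rows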

The reason is that the Optimizer is not using the available statistics we collected above, but instead applies heuristics (hence the “no confidence”):

12,297 (estimation) / 36,889 (table cardinality) = 33%

You can easily reproduce this with a similar setup: the Optimizer will always estimate the number of rows in the result set to be 33% of the total table rows.
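
Applying this heuristic to our calendar table reproduces the estimate from step 6 (the Optimizer apparently rounds the result up):

36,889 rows * 1/3 = 12,296.33 ≈ 12,297 rows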

Below you can see the statistics histogram for the collected statistics on CalendarDate, as returned by SHOW STATISTICS VALUES:

COLLECT STATISTICS
COLUMN ( CalendarDate )
ON TheDatabase.TheCalendar
VALUES
(
/** SummaryInfo **/
/* NumOfBiasedValues */ 0,
/* NumOfEHIntervals */ 249,
/* NumOfHistoryRecords */ 7,
/* MinVal */ DATE '2000-01-02',
/* MaxVal */ DATE '2100-12-31',
/* ModeVal */ DATE '2000-01-02',
/* HighModeFreq */ 1,
/* NumOfDistinctVals */ 36889,
/* NumOfRows */ 36889,
/** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/
/* 1 */ DATE '2000-10-04', DATE '2000-01-02', 1, 1, 276, 276,
..
/* 249 */ DATE '2100-12-31', DATE '2100-09-08', 1, 1, 114, 114,

While it seems strange that the Optimizer does not use the available statistics, I had to accept this fact. I changed the query, replacing the subquery with a date literal (2012-01-26 is the date the subquery of our previous query returns):

EXPLAIN SELECT * FROM TheDatabase.TheCalendar
WHERE CalendarDate >=
(
SELECT
DATE '2012-01-26'
);

1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.TheCalendar.
2) Next, we lock TheDatabase.TheCalendar for read.
3) We do an all-AMPs RETRIEVE step from
TheDatabase.TheCalendar by way of an all-rows scan with a
condition of ("TheDatabase.TheCalendar.CalendarDate >= DATE
'2012-01-26'") into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with high
confidence to be 32,482 rows (1,006,942 bytes). The estimated
time for this step is 0.03 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.03 seconds.

The new execution plan shows that the Optimizer uses the collected statistics and correctly estimates the number of result rows. Unfortunately, using a literal is not an option for my report optimization task.

To get a better understanding of how the Optimizer works, I changed the query again, selecting the literal from a “dummy table”:

EXPLAIN SELECT * FROM TheDatabase.TheCalendar
WHERE CalendarDate >=
(
SELECT DATE '2012-01-26' FROM
(
SELECT 1 AS x
) x
);

1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.TheCalendar.
2) Next, we lock TheDatabase.TheCalendar for read.
3) We do an INSERT into Spool 2.
4) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
an all-rows scan into Spool 1 (used to materialize view, derived
table, table function or table operator x) (group_amps), which is
built locally on the AMPs. The size of Spool 1 is estimated with
high confidence to be 1 row (22 bytes). The estimated time for
this step is 0.01 seconds.
5) We do a group-AMP RETRIEVE step from Spool 1 (Last Use) by way of
an all-rows scan into Spool 3 (group_amps), which is built locally
on that AMP. The size of Spool 3 is estimated with high
confidence to be 1 row (25 bytes). The estimated time for this
step is 0.01 seconds.
6) We do a group-AMP DISPATCHER RETRIEVE step from Spool 3 (Last Use)
by way of an all-rows scan and send the rows back to the
Dispatcher. The size is estimated with high confidence to be 1
row. The estimated time for this step is 0.01 seconds.
7) We do an all-AMPs RETRIEVE step from
TheDatabase.TheCalendar by way of an all-rows scan with a
condition of ("TheDatabase.TheCalendar.CalendarDate >= :%SSQ20")
into Spool 4 (group_amps), which is built locally on the AMPs.
The size of Spool 4 is estimated with no confidence to be 12,297
rows (381,207 bytes). The estimated time for this step is 0.03
seconds.
8) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 4 are sent back to the user as the result of
statement 1.

The execution plan shows that we are back to the heuristics. This is not what I wanted: by optimizing the “calendar view part” shared by all my reports, I expected an improvement for all of them.

Finally, I came up with the following query, using a correlated subquery:

EXPLAIN SELECT * FROM TheDatabase.TheCalendar t01
WHERE EXISTS
(
SELECT
*
FROM TheDatabase.TableOneDate t02
WHERE t01.CalendarDate >= OneDate
);

1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.t01.
2) Next, we lock a distinct TheDatabase."pseudo table" for
read on a RowHash to prevent global deadlock for
TheDatabase.t02.
3) We lock TheDatabase.t01 for read, and we lock
TheDatabase.t02 for read.
4) We do an all-AMPs SUM step to aggregate from
TheDatabase.t02 by way of an all-rows scan with no
residual conditions. Aggregate Intermediate Results are computed
globally, then placed in Spool 3.
5) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
an all-rows scan into Spool 2 (all_amps), which is duplicated on
all AMPs. The size of Spool 2 is estimated with high confidence
to be 72 rows (1,224 bytes).
6) We do an all-AMPs JOIN step from TheDatabase.t01 by way
of an all-rows scan with no residual conditions, which is joined
to Spool 2 (Last Use) by way of an all-rows scan.
TheDatabase.t01 and Spool 2 are joined using an
inclusion product join, with a join condition of (
"TheDatabase.t01.CalendarDate >= OneDate"). The
result goes into Spool 1 (group_amps), which is built locally on
the AMPs. The size of Spool 1 is estimated with low confidence to be
32,482 rows (1,006,942 bytes). The estimated time for this
step is 3.77 seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1.

The execution plan changed completely, but it allows the Optimizer to use the statistics: the estimate exactly matches the number of rows in the result set.

Although the new execution plan involves a product join, this is not a big deal given the small amount of data. More important: the estimates are now excellent. As the “calendar view part” is cross joined several times in each report, having correct estimates is crucial.
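
To reuse this trick across all reports, the “calendar view part” can be wrapped into a view once. Below is a sketch of how such a view might look; the view name CalendarFromOneDate is made up for this example:

REPLACE VIEW TheDatabase.CalendarFromOneDate AS
SELECT t01.*
FROM TheDatabase.TheCalendar t01
WHERE EXISTS
(
SELECT
*
FROM TheDatabase.TableOneDate t02
WHERE t01.CalendarDate >= t02.OneDate
);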

I hope this case study motivates you to dive deeper into the Optimizer’s details. As mentioned above, the SHOW STATISTICS statement in particular offers a lot of useful information.
