# How to Optimize Teradata Statistics and Avoid Heuristics: A Case Study

Roland Wenzlofsky

April 28, 2023

The Optimizer usually makes good use of statistics, yet it can pay off to examine the execution plan and its cardinality estimates.

Since Teradata 14.10, I have made it a habit to include the SHOW STATISTICS statement in my analysis. The metrics it returns can help pinpoint statistical issues.

### Teradata Statistics – Avoid the Heuristics

This case study demonstrates how the Optimizer can be steered to use precise statistics instead of falling back on imprecise heuristics.

The case study is based on two tables:

```sql
SHOW TABLE TheDatabase.TableOneDate;

CREATE SET TABLE TheDatabase.TableOneDate
(
   OneDate DATE FORMAT 'YYYY-MM-DD'
)
UNIQUE PRIMARY INDEX ( OneDate );

SELECT COUNT(*) FROM TheDatabase.TableOneDate; -- 1 row
```

```sql
SHOW TABLE TheDatabase.TheCalendar;

CREATE MULTISET TABLE TheDatabase.TheCalendar
(
   CalendarDate DATE FORMAT 'yyyy-mm-dd'
)
PRIMARY INDEX ( CalendarDate );

SELECT COUNT(*) FROM TheDatabase.TheCalendar; -- 36,889 rows
```

I excluded irrelevant columns from both tables and gathered statistics on the remaining ones.

```sql
COLLECT STATS ON TheDatabase.TableOneDate COLUMN OneDate;
COLLECT STATS ON TheDatabase.TheCalendar COLUMN CalendarDate;
```

The following query (simplified here, but semantically equivalent to the original) caught my attention because it is used in numerous badly performing reports:

```sql
EXPLAIN
SELECT *
FROM TheDatabase.TheCalendar
WHERE CalendarDate >= (SELECT OneDate FROM TheDatabase.TableOneDate);
```

```
1) First, we lock a distinct TheDatabase."pseudo table" for read on a
   RowHash to prevent global deadlock for TheDatabase.TableOneDate.
2) Next, we lock a distinct TheDatabase."pseudo table" for read on a
   RowHash to prevent global deadlock for TheDatabase.TheCalendar.
3) We lock TheDatabase.TableOneDate for read, and we lock
   TheDatabase.TheCalendar for read.
4) We do an all-AMPs RETRIEVE step from TheDatabase.TableOneDate by way
   of an all-rows scan with no residual conditions into Spool 1
   (all_amps) built locally on the AMPs. The size of Spool 1 is
   estimated with high confidence to be 1 row (25 bytes). The estimated
   time for this step is 0.03 seconds.
5) We do an all-AMPs DISPATCHER RETRIEVE step from Spool 1 (Last Use)
   by way of an all-rows scan and send the rows back to the Dispatcher.
   The size is estimated with high confidence to be 1 row. The
   estimated time for this step is 0.01 seconds.
6) We do an all-AMPs RETRIEVE step from TheDatabase.TheCalendar by way
   of an all-rows scan with a condition of
   ("TheDatabase.TheCalendar.CalendarDate >= :%SSQ20") into Spool 2
   (group_amps), which is built locally on the AMPs. The size of
   Spool 2 is estimated with no confidence to be 12,297 rows
   (4,943,394 bytes). The estimated time for this step is 0.04 seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 2 are sent back to the user as the result of
   statement 1. The total estimated time is 0.08 seconds.
```

Although the execution plan appears satisfactory, there is a concern regarding the estimated number of rows in step 6. The estimate of 12,297 is significantly lower than the actual result set of 32,482 rows.
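The actual count is easy to verify. According to the statistics histogram shown further below, TheCalendar holds exactly one row per calendar day (NumOfRows = NumOfDistinctVals = 36,889) between 2000-01-02 and 2100-12-31, so the true cardinality of the predicate is just a date difference. A quick sketch in Python, using the dates from the case study:

```python
from datetime import date

# TheCalendar contains exactly one row per day (per the statistics
# histogram: NumOfRows = NumOfDistinctVals = 36889).
first_day = date(2000, 1, 2)    # MinVal
last_day  = date(2100, 12, 31)  # MaxVal
cutoff    = date(2012, 1, 26)   # value returned by the subquery

table_cardinality = (last_day - first_day).days + 1
qualifying_rows   = (last_day - cutoff).days + 1  # CalendarDate >= cutoff

print(table_cardinality)  # 36889
print(qualifying_rows)    # 32482 - the actual result set size
```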

The Optimizer does not use the collected statistics and instead relies on low-confidence heuristics.

12,297 (estimate) / 36,889 (table cardinality) ≈ 33%

This is easy to reproduce in any comparable setup: whenever the Optimizer falls back on heuristics for this kind of predicate, it estimates the result set at one-third of the table's rows.
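The one-third rule is simple arithmetic. The sketch below reproduces the estimate from step 6 of the plan; the rounding mode (ceiling) is my assumption, chosen because it matches the estimate exactly:

```python
import math

table_cardinality = 36889  # rows in TheDatabase.TheCalendar

# Without usable statistics, the Optimizer assumes a range predicate
# qualifies one-third of the table's rows (rounding up is an assumption).
heuristic_estimate = math.ceil(table_cardinality / 3)

print(heuristic_estimate)  # 12297 - the "no confidence" estimate in step 6
```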

Shown below is the histogram information of the collected statistics on CalendarDate:

```sql
COLLECT STATISTICS
    COLUMN ( CalendarDate )
ON TheDatabase.TheCalendar
VALUES
(
 /** SummaryInfo **/
 /* NumOfBiasedValues   */ 0,
 /* NumOfEHIntervals    */ 249,
 /* NumOfHistoryRecords */ 7,
 /* MinVal              */ DATE '2000-01-02',
 /* MaxVal              */ DATE '2100-12-31',
 /* ModeVal             */ DATE '2000-01-02',
 /* HighModeFreq        */ 1,
 /* NumOfDistinctVals   */ 36889,
 /* NumOfRows           */ 36889,
 /** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/
 /*   1 */ DATE '2000-10-04', DATE '2000-01-02', 1, 1, 276, 276,
 ...
 /* 249 */ DATE '2100-12-31', DATE '2100-09-08', 1, 1, 114, 114
);
```

Although it may seem odd that the Optimizer ignores available statistics, I have learned to accept this reality. I modified the query, replacing the subquery with a date literal (2012-01-26, the same date the previous query's subquery returned):

```sql
EXPLAIN
SELECT *
FROM TheDatabase.TheCalendar
WHERE CalendarDate >= (SELECT DATE '2012-01-26');
```

```
1) First, we lock a distinct TheDatabase."pseudo table" for read on a
   RowHash to prevent global deadlock for TheDatabase.TheCalendar.
2) Next, we lock TheDatabase.TheCalendar for read.
3) We do an all-AMPs RETRIEVE step from TheDatabase.TheCalendar by way
   of an all-rows scan with a condition of
   ("TheDatabase.TheCalendar.CalendarDate >= DATE '2012-01-26'") into
   Spool 1 (group_amps), which is built locally on the AMPs. The size
   of Spool 1 is estimated with high confidence to be 32,482 rows
   (1,006,942 bytes). The estimated time for this step is 0.03 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 0.03 seconds.
```

The updated execution plan shows that the Optimizer uses the collected statistics and projects the number of result rows accurately. Unfortunately, my report optimization task does not allow a literal.

To better understand the Optimizer’s design, I modified the query once more and chose the literal from a “dummy table.”

```sql
EXPLAIN
SELECT *
FROM TheDatabase.TheCalendar
WHERE CalendarDate >= (SELECT DATE '2012-01-26' FROM (SELECT 1 AS x) x);
```

```
1) First, we lock a distinct TheDatabase."pseudo table" for read on a
   RowHash to prevent global deadlock for TheDatabase.TheCalendar.
2) Next, we lock TheDatabase.TheCalendar for read.
3) We do an INSERT into Spool 2.
4) We do an all-AMPs RETRIEVE step from Spool 2 (Last Use) by way of
   an all-rows scan into Spool 1 (used to materialize view, derived
   table, table function or table operator x) (group_amps), which is
   built locally on the AMPs. The size of Spool 1 is estimated with
   high confidence to be 1 row (22 bytes). The estimated time for this
   step is 0.01 seconds.
5) We do a group-AMP RETRIEVE step from Spool 1 (Last Use) by way of
   an all-rows scan into Spool 3 (group_amps), which is built locally
   on that AMP. The size of Spool 3 is estimated with high confidence
   to be 1 row (25 bytes). The estimated time for this step is 0.01
   seconds.
6) We do a group-AMP DISPATCHER RETRIEVE step from Spool 3 (Last Use)
   by way of an all-rows scan and send the rows back to the
   Dispatcher. The size is estimated with high confidence to be 1 row.
   The estimated time for this step is 0.01 seconds.
7) We do an all-AMPs RETRIEVE step from TheDatabase.TheCalendar by way
   of an all-rows scan with a condition of
   ("TheDatabase.TheCalendar.CalendarDate >= :%SSQ20") into Spool 4
   (group_amps), which is built locally on the AMPs. The size of
   Spool 4 is estimated with no confidence to be 12,297 rows
   (381,207 bytes). The estimated time for this step is 0.03 seconds.
8) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 4 are sent back to the user as the result of
   statement 1.
```

The execution plan falls back to heuristics, which is bad news: my goal was to improve the performance of all reports by optimizing the shared "calendar view" section.

After much deliberation, I rewrote the query with a correlated subquery:

```sql
EXPLAIN
SELECT *
FROM TheDatabase.TheCalendar t01
WHERE EXISTS
 (
   SELECT *
   FROM TheDatabase.TableOneDate t02
   WHERE t01.CalendarDate >= t02.OneDate
 );
```

```
1) First, we lock a distinct TheDatabase."pseudo table" for read on a
   RowHash to prevent global deadlock for TheDatabase.t01.
2) Next, we lock a distinct TheDatabase."pseudo table" for read on a
   RowHash to prevent global deadlock for TheDatabase.t02.
3) We lock TheDatabase.t01 for read, and we lock TheDatabase.t02 for
   read.
4) We do an all-AMPs SUM step to aggregate from TheDatabase.t02 by way
   of an all-rows scan with no residual conditions. Aggregate
   Intermediate Results are computed globally, then placed in Spool 3.
5) We do an all-AMPs RETRIEVE step from Spool 3 (Last Use) by way of
   an all-rows scan into Spool 2 (all_amps), which is duplicated on
   all AMPs. The size of Spool 2 is estimated with high confidence to
   be 72 rows (1,224 bytes).
6) We do an all-AMPs JOIN step from TheDatabase.t01 by way of an
   all-rows scan with no residual conditions, which is joined to
   Spool 2 (Last Use) by way of an all-rows scan. TheDatabase.t01 and
   Spool 2 are joined using an inclusion product join, with a join
   condition of ("TheDatabase.t01.CalendarDate >= OneDate"). The
   result goes into Spool 1 (group_amps), which is built locally on
   the AMPs. The size of Spool 1 is estimated with low confidence to
   be 32,482 rows (1,006,942 bytes). The estimated time for this step
   is 3.77 seconds.
7) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1.
```

With this rewrite, the Optimizer uses the statistics effectively: the estimate exactly matches the number of rows in the result set.

The product join in the new plan is harmless given the small amount of data. What matters more are the improved estimates: since the "calendar view" part is cross-joined in many reports, precise estimates are essential.

I hope this case study encourages you to delve further into the intricacies of the Optimizer. As noted, the SHOW STATISTICS statement offers a wealth of valuable information.

• Badshah Rehman says:

I think we should move to using Teradata's Automatic Statistics feature to avoid any manual intervention.

