# Analyzing and Fixing Teradata Statistics Estimation Problem

Roland Wenzlofsky

April 23, 2023

This article will thoroughly examine a statistical problem. We will investigate the inaccuracies in the estimates for our sample query and propose solutions to rectify them.

The initial scenario is as follows:

We have executed an EXPLAIN statement on the query below, resulting in the following execution plan:

SELECT * FROM TheDatabase.TMP_STATS WHERE  CHAR1 = ‘X’;

The Optimizer estimates the result set to contain 36,707 rows with high confidence

Regrettably, the query only returns 501 rows. The Optimizer has overestimated the number of resulting rows.

SELECT COUNT(*) FROM TheDatabase.TMP_STATS WHERE  CHAR1 = ‘X’;

Our objective is to identify the issue and develop a resolution. We will commence by examining the data provided in the EXPLAIN plan.

The available statistics in column CHAR1 have a high confidence level as estimated by the Optimizer. However, their freshness is not known. To confirm their validity, we will conduct a prompt check.

HELP STATS TheDatabase.TMP_STATS;

We are utilizing summary statistics to compare estimations with the actual table data. Specifically, we are comparing the distinct values (2) in column CHAR1 to the actual number of distinct values:

SELECT COUNT(DISTINCT CHAR1) FROM TheDatabase.TMP_STATS;

There is a discrepancy between the Optimizer’s estimated values and the three values presented in the table. To address this, we will conduct a brief verification check:

SELECT CHAR1,COUNT(*) FROM BOOK.TMP_STATS GROUP BY 1;

Column CHAR1 presents three unique values. We must determine the source of inaccurate estimations and devise a solution. We will employ the SHOW STATISTICS command to examine the statistical histograms:

SHOW STATISTICS VALUES COLUMN(CHAR1) ON BOOK.TMP_STATS;

``COLLECT STATISTICSCOLUMN ( Char1 )ON BOOK.TMP_STATSVALUES(/** SummaryInfo **//* Data Type and Length: 'CF:26' *//* TimeStamp */ TIMESTAMP '2016-02-12 18:32:09-00:00',/* Version */ 6,/* OriginalVersion */ 6,/* DBSVersion */ '15.10.00.07',/* UsageType */ 'D',/* ComplexStatInfo */ 'ComplexStatInfo',/* NumOfBiasedValues */ 2,/* NumOfEHIntervals */ 0,/* NumOfHistoryRecords */ 0,/* SamplePercent */ 0.00,/* NumOfNulls */ 0,/* NumOfAllNulls */ 0,/* NumOfPartialNullVals */ 0,/* PartialNullHMF */ 0,/* AvgAmpRPV */ 0.000000,/* MinVal */ 'A ',/* MaxVal */ 'B ',/* ModeVal */ 'A ',/* HighModeFreq */ 36916,/* NumOfDistinctVals */ 2,  /* NumOfRows */ 73414,/* CPUUsage */ 0.000000,/* IOUsage */ 0.000000,/* Reserved */ 0,/* Reserved */ 0,/* Reserved */ 0.000000,/* Reserved */ 0.000000,/* Reserved */ 'T0000D00000S00000',/* StatsSkipCount */ 0,/* SysInsertCnt */ 0,/* SysDeleteCnt */ 0,/* SysUpdateCnt */ 0,/* SysInsDelLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',/* SysUpdLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',/* IsSampleFollowingTrend*/ 0,/** Biased: Value, Frequency **//* 1 */ 'A ', 36916,  /* 2 */ 'B ', 36498);``

The SHOW STATISTIC statement displays the statistics output.

The value ‘X’ in column CHAR1 is missing from the histograms due to its unavailability during the statistics collection process.

Why do the totals and estimated rows match? The answer is that an UPDATE statement modified the value while the total number of rows remained the same.

Now that we comprehend the issue of outdated statistics due to an UPDATE statement, an unanswered question remains: What is the origin of the estimated count of 37,707 rows, and how is it computed?

Upon examining the SHOW STATISTICS command output, it becomes apparent that two distinct values are present.

/* NumOfDistinctVals */ 2,
/* NumOfRows */ 73414,

Suppose the predicate value, denoted as ‘X’, is not present in any histogram due to outdated statistics. In that case, the Optimizer will estimate the number of retrieved rows by dividing the value in the column NumOfRows by the value in the column NumOfDistinctVals:

73,414 / 2 = 36,707

This example illustrates the potential for misleading statistical estimations when one does not understand Teradata’s statistical implementation.

Extrapolation is ineffective in this scenario as the number of rows remains constant.

Our task is almost complete as we have identified the cause of our stagnant statistics and the reason behind the Optimizer’s estimation of 37,707 rows. The solution to our issue is to refresh the statistics in the CHAR1 column.

COLLECT STATS COLUMN(CHAR1) ON TheDatabase.TMP_STATS;

We can validate the accuracy of our estimations by executing the SHOW STATISTICS statement once more.

``SHOW STATISTICS VALUES COLUMN(CHAR1) ON BOOK.TMP_STATS;COLLECT STATISTICSCOLUMN ( Char1 )ON BOOK.TMP_STATSVALUES(/** SummaryInfo **//* Data Type and Length: 'CF:26' *//* TimeStamp */ TIMESTAMP '2016-02-13 12:31:06-00:00',/* Version */ 6,/* OriginalVersion */ 6,/* DBSVersion */ '15.10.00.07',/* UsageType */ 'D',/* ComplexStatInfo */ 'ComplexStatInfo',/* NumOfBiasedValues */ 3,/* NumOfEHIntervals */ 0,/* NumOfHistoryRecords */ 1,/* SamplePercent */ 0.00,/* NumOfNulls */ 0,/* NumOfAllNulls */ 0,/* NumOfPartialNullVals */ 0,/* PartialNullHMF */ 0,/* AvgAmpRPV */ 0.000000,/* MinVal */ 'A ',/* MaxVal */ 'X ',/* ModeVal */ 'A ',/* HighModeFreq */ 36656,/* NumOfDistinctVals */ 3,  /* NumOfRows */ 73414,/* CPUUsage */ 0.000000,/* IOUsage */ 0.000000,/* Reserved */ 0,/* Reserved */ 0,/* Reserved */ 0.000000,/* Reserved */ 0.000000,/* Reserved */ 'T0000D00003S00003',/* StatsSkipCount */ 1,/* SysInsertCnt */ 0,/* SysDeleteCnt */ 0,/* SysUpdateCnt */ 501,/* SysInsDelLastResetTS */ TIMESTAMP '2016-02-13 10:24:09-00:00',/* SysUpdLastResetTS */ TIMESTAMP '2016-02-13 10:24:09-00:00',/* IsSampleFollowingTrend*/ 0,/** Biased: Value, Frequency **//* 1 */ 'A ', 36656,  /* 2 */ 'B ', 36257,  /* 3 */ 'X ', 501,/** HistoryOfSummaryInfo **//* SummaryRecord[1] *//* TimeStamp */ TIMESTAMP '2016-02-12 18:32:09-00:00',/* NumOfBiasedValues */ 0,/* NumOfEHIntervals */ 0,/* NumOfHistoryRecords */ 0,/* SamplePercent */ 0.00,/* NumOfNulls */ 0,/* NumOfAllNulls */ 0,/* NumOfPartialNullVals */ 0,/* PartialNullHMF */ 0,/* AvgAMPRPV */ 0.000000,/* ValuesNotPresengFlg */ 0,/* MinVal */ 'A ',/* MaxVal */ 'B ',/* ModeVal */ 'A ',/* HighModeFreq */ 36916,/* NumDistinctVals */ 2,/* NumRows */ 73414,/* CPUUsage */ 0.000000,/* IOUsage */ 0.000000,/* Reserved */ 0,/* Reserved */ 0,/* Reserved */ 0.000000,/* Reserved */ 0.000000,/* AvgEHDeviation */ 0.000000,/* UsageType */ 'D',/* StatsSkipCount */ 0,/* SysInsertCnt */ 0,/* SysDeleteCnt */ 0,/* SysUpdateCnt */ 0,/* SysInsDelLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',/* SysUpdLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',/* IsSampleFollowingTrend*/ 0);``

The displayed output confirms the successful update of 501 rows.

/* SysUpdateCnt */ 501,

We are now reviewing the updated explain plan.

SELECT * FROM TheDatabase.TMP_STATS WHERE CHAR1 = ‘X’;

Upon matching the value ‘X’ in the biased histogram, the estimation coincides with the number of rows containing ‘X’ in column CHAR1. This illustration’s main point is that a sense of “high confidence” does not guarantee accurate estimations.