Roland Wenzlofsky

February 13, 2016

In this article, we will analyze a statistics problem in detail. We will figure out why estimations for our example query are wrong and how we can fix them.

The first situation is the following:

We are running an explain statement on the below query, and get the following execution plan:

SELECT * FROM TheDatabase.TMP_STATS WHERE  CHAR1 = ‘X’;

The Optimizer estimates the result set to contain 36,707 rows. The estimation is with high confidence.

Unfortunately, when we execute the query, only 501 rows are returned. The Optimizer is over-estimating the number of result set rows:

SELECT COUNT(*) FROM TheDatabase.TMP_STATS WHERE  CHAR1 = ‘X’;

Now it is our task to find out what is happening and how we can fix it. Let’s start with the information the explain plan gives us.

We can immediately say that collected statistics are available inÂ column CHAR1 as the Optimizer’s estimation is with “high confidence”. Nevertheless, we don’t know if these statisticsÂ are stale. Therefore, we will do a quick check:

HELP STATS TheDatabase.TMP_STATS;

We are using the above summary statistics information to compare estimations against the actual table data. We compare the number of distinct values (2) for column CHAR1 against the actual number of distinct values:

SELECT COUNT(DISTINCT CHAR1) FROM TheDatabase.TMP_STATS;

As you can see, there is a mismatch. The Optimizer estimates two distinct values, but the table has three distinct values. We run a quick check:

SELECT CHAR1,COUNT(*) FROM BOOK.TMP_STATS GROUP BY 1;

Indeed, column CHAR1 has three distinct values. Now it’s time to find out where the wrong estimations are coming from and how we can fix them. We will use the SHOW STATISTICS statement to take a closer look at the statistic histograms:

SHOW STATISTICS VALUES COLUMN(CHAR1) ON BOOK.TMP_STATS;

COLLECT STATISTICS
COLUMN ( Char1 )
ON BOOK.TMP_STATS
VALUES
(
/** SummaryInfo **/
/* Data Type and Length: ‘CF:26’ */
/* TimeStamp */ TIMESTAMP ‘2016-02-12 18:32:09-00:00’,
/* Version */ 6,
/* OriginalVersion */ 6,
/* DBSVersion */ ‘15.10.00.07’,
/* UsageType */ ‘D’,
/* ComplexStatInfo */ ‘ComplexStatInfo’,
/* NumOfBiasedValues */ 2,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 0,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAmpRPV */ 0.000000,
/* MinVal */ ‘A ‘,
/* MaxVal */ ‘B ‘,
/* ModeVal */ ‘A ‘,
/* HighModeFreq */ 36916,
/* NumOfDistinctVals */ 2,
/* NumOfRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* Reserved */ ‘T0000D00000S00000’,
/* StatsSkipCount */ 0,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 0,
/* SysInsDelLastResetTS */ TIMESTAMP ‘9999-12-31 23:59:59-00:00’,
/* SysUpdLastResetTS */ TIMESTAMP ‘9999-12-31 23:59:59-00:00’,
/* IsSampleFollowingTrend*/ 0,
/** Biased: Value, Frequency **/
/* 1 */ ‘A ‘, 36916,
/* 2 */ ‘B ‘, 36498
);

The output of the SHOW STATISTIC statement shows:

The value ‘X’ of column CHAR1 is not available in histograms. WhenÂ the statistics were collected, these rows with value ‘X’ were not available in column CHAR1.

But why are the number of total and estimated rows matching? The conclusion is: The value has been changed with an UPDATE statement, leaving the total number of rows unchanged.

As we understand now that we are dealing with stale statistics caused by an UPDATE statement, one question is unanswered: Where does this estimation of 37,707 rows come from? How is it calculated?

If you take a closer look at the output of the SHOW STATISTICS command, you will see that we have two values there:

/* NumOfDistinctVals */ 2,
/* NumOfRows */ 73414,

If the predicate value (in our case ‘X’) is not available in any statistic histogram (as it is in our case, caused by stale statistics), the Optimizer estimates the retrieved number of rows by dividing the value of column NumOfRows by the value of column NumOfDistinctVals:

73,414 / 2 = 36,707

We learned from this example that statistic estimations could be very misleading if we don’t understand how statistics are implemented in Teradata.

You might wonder why extrapolation is not helping us in this case? Extrapolation only kicks in if we add rows, but in our case, the number of rows stays unchanged.

We are almost finished. We identified the reason for our stale statistics, and we know why the Optimizer estimated 37,707 rows. Finally, we can fix our problem by refreshing statistics on column CHAR1:

COLLECT STATS COLUMN(CHAR1) ON TheDatabase.TMP_STATS;

We can rerun the SHOW STATISTICS statement and prove that we have correct estimations now:

SHOW STATISTICS VALUES COLUMN(CHAR1) ON BOOK.TMP_STATS;
COLLECT STATISTICS
COLUMN ( Char1 )
ON BOOK.TMP_STATS
VALUES
(
/** SummaryInfo **/
/* Data Type and Length: ‘CF:26’ */
/* TimeStamp */ TIMESTAMP ‘2016-02-13 12:31:06-00:00’,
/* Version */ 6,
/* OriginalVersion */ 6,
/* DBSVersion */ ‘15.10.00.07’,
/* UsageType */ ‘D’,
/* ComplexStatInfo */ ‘ComplexStatInfo’,
/* NumOfBiasedValues */ 3,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 1,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAmpRPV */ 0.000000,
/* MinVal */ ‘A ‘,
/* MaxVal */ ‘X ‘,
/* ModeVal */ ‘A ‘,
/* HighModeFreq */ 36656,
/* NumOfDistinctVals */ 3,
/* NumOfRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* Reserved */ ‘T0000D00003S00003’,
/* StatsSkipCount */ 1,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 501,
/* SysInsDelLastResetTS */ TIMESTAMP ‘2016-02-13 10:24:09-00:00’,
/* SysUpdLastResetTS */ TIMESTAMP ‘2016-02-13 10:24:09-00:00’,
/* IsSampleFollowingTrend*/ 0,
/** Biased: Value, Frequency **/
/* 1 */ ‘A ‘, 36656,
/* 2 */ ‘B ‘, 36257,
/* 3 */ ‘X ‘, 501,
/** HistoryOfSummaryInfo **/
/* SummaryRecord[1] */
/* TimeStamp */ TIMESTAMP ‘2016-02-12 18:32:09-00:00’,
/* NumOfBiasedValues */ 0,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 0,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAMPRPV */ 0.000000,
/* ValuesNotPresengFlg */ 0,
/* MinVal */ ‘A ‘,
/* MaxVal */ ‘B ‘,
/* ModeVal */ ‘A ‘,
/* HighModeFreq */ 36916,
/* NumDistinctVals */ 2,
/* NumRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* AvgEHDeviation */ 0.000000,
/* UsageType */ ‘D’,
/* StatsSkipCount */ 0,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 0,
/* SysInsDelLastResetTS */ TIMESTAMP ‘9999-12-31 23:59:59-00:00’,
/* SysUpdLastResetTS */ TIMESTAMP ‘9999-12-31 23:59:59-00:00’,
/* IsSampleFollowingTrend*/ 0
);

Above output shows that the 501 rows have been indeed updated:

/* SysUpdateCnt */ 501,

Finally, we are checking the new explain plan:

SELECT * FROM TheDatabase.TMP_STATS WHERE CHAR1 = ‘X’;

As the selected value ‘X’ matches a biased histogram value, the estimation now matches the number of rows with value ‘X’ in column CHAR1.
The central message of this example is thatÂ “high confidence” doesn’t ensureÂ correct estimations.