This article will thoroughly examine a statistical problem. We will investigate the inaccuracies in the estimates for our sample query and propose solutions to rectify them.
The initial scenario is as follows:
We have executed an EXPLAIN statement on the query below, resulting in the following execution plan:
SELECT * FROM TheDatabase.TMP_STATS WHERE CHAR1 = ‘X’;
1) First, we lock BOOK.TMP_STATS for read on a reserved RowHash to |
prevent global deadlock. |
2) Next, we lock BOOK.TMP_STATS for read. |
3) We do an all-AMPs RETRIEVE step from BOOK.TMP_STATS by way of an |
all-rows scan with a condition of (“BOOK.TMP_STATS.Char1 = ‘X ‘”) |
into Spool 1 (group_amps), which is built locally on the AMPs. |
The size of Spool 1 is estimated with high confidence to be 36,707 |
rows (16,040,959 bytes). The estimated time for this step is 1.70 |
seconds. |
4) Finally, we send out an END TRANSACTION step to all AMPs involved |
in processing the request. |
-> The contents of Spool 1 are sent back to the user as the result of |
statement 1. The total estimated time is 1.70 seconds. |
The Optimizer estimates the result set to contain 36,707 rows with high confidence
Regrettably, the query only returns 501 rows. The Optimizer has overestimated the number of resulting rows.
SELECT COUNT(*) FROM TheDatabase.TMP_STATS WHERE CHAR1 = ‘X’;
501 |
Our objective is to identify the issue and develop a resolution. We will commence by examining the data provided in the EXPLAIN plan.
The available statistics in column CHAR1 have a high confidence level as estimated by the Optimizer. However, their freshness is not known. To confirm their validity, we will conduct a prompt check.
HELP STATS TheDatabase.TMP_STATS;
16/02/12 | 19:08:34 | 73,414 | * | * | “*” | null |
16/02/12 | 16:41:39 | 73,414 | PK | PK | PK | null |
16/02/12 | 16:41:39 | 73,414 | TheDate | TheDate | TheDate | null |
16/02/12 | 19:08:34 | 100 | Int1 | Int1 | Int1 | null |
16/02/12 | 16:41:40 | 100 | Int2 | Int2 | Int2 | null |
16/02/12 | 18:32:09 | 2 | Char1 | Char1 | Char1 | null |
16/02/12 | 18:36:52 | 957 | Char2 | CHar2 | CHar2 | null |
We are utilizing summary statistics to compare estimations with the actual table data. Specifically, we are comparing the distinct values (2) in column CHAR1 to the actual number of distinct values:
SELECT COUNT(DISTINCT CHAR1) FROM TheDatabase.TMP_STATS;
3 |
There is a discrepancy between the Optimizer’s estimated values and the three values presented in the table. To address this, we will conduct a brief verification check:
SELECT CHAR1,COUNT(*) FROM BOOK.TMP_STATS GROUP BY 1;
B | 36257 |
X | 501 |
A | 36656 |
Column CHAR1 presents three unique values. We must determine the source of inaccurate estimations and devise a solution. We will employ the SHOW STATISTICS command to examine the statistical histograms:
SHOW STATISTICS VALUES COLUMN(CHAR1) ON BOOK.TMP_STATS;
COLLECT STATISTICS
COLUMN ( Char1 )
ON BOOK.TMP_STATS
VALUES
(
/** SummaryInfo **/
/* Data Type and Length: 'CF:26' */
/* TimeStamp */ TIMESTAMP '2016-02-12 18:32:09-00:00',
/* Version */ 6,
/* OriginalVersion */ 6,
/* DBSVersion */ '15.10.00.07',
/* UsageType */ 'D',
/* ComplexStatInfo */ 'ComplexStatInfo',
/* NumOfBiasedValues */ 2,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 0,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAmpRPV */ 0.000000,
/* MinVal */ 'A ',
/* MaxVal */ 'B ',
/* ModeVal */ 'A ',
/* HighModeFreq */ 36916,
/* NumOfDistinctVals */ 2,
/* NumOfRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* Reserved */ 'T0000D00000S00000',
/* StatsSkipCount */ 0,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 0,
/* SysInsDelLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',
/* SysUpdLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',
/* IsSampleFollowingTrend*/ 0,
/** Biased: Value, Frequency **/
/* 1 */ 'A ', 36916,
/* 2 */ 'B ', 36498
);
The SHOW STATISTIC statement displays the statistics output.
The value ‘X’ in column CHAR1 is missing from the histograms due to its unavailability during the statistics collection process.
Why do the totals and estimated rows match? The answer is that an UPDATE statement modified the value while the total number of rows remained the same.
Now that we comprehend the issue of outdated statistics due to an UPDATE statement, an unanswered question remains: What is the origin of the estimated count of 37,707 rows, and how is it computed?
Upon examining the SHOW STATISTICS command output, it becomes apparent that two distinct values are present.
/* NumOfDistinctVals */ 2,
/* NumOfRows */ 73414,
Suppose the predicate value, denoted as ‘X’, is not present in any histogram due to outdated statistics. In that case, the Optimizer will estimate the number of retrieved rows by dividing the value in the column NumOfRows by the value in the column NumOfDistinctVals:
73,414 / 2 = 36,707
This example illustrates the potential for misleading statistical estimations when one does not understand Teradata’s statistical implementation.
Extrapolation is ineffective in this scenario as the number of rows remains constant.
Our task is almost complete as we have identified the cause of our stagnant statistics and the reason behind the Optimizer’s estimation of 37,707 rows. The solution to our issue is to refresh the statistics in the CHAR1 column.
COLLECT STATS COLUMN(CHAR1) ON TheDatabase.TMP_STATS;
We can validate the accuracy of our estimations by executing the SHOW STATISTICS statement once more.
SHOW STATISTICS VALUES COLUMN(CHAR1) ON BOOK.TMP_STATS;
COLLECT STATISTICS
COLUMN ( Char1 )
ON BOOK.TMP_STATS
VALUES
(
/** SummaryInfo **/
/* Data Type and Length: 'CF:26' */
/* TimeStamp */ TIMESTAMP '2016-02-13 12:31:06-00:00',
/* Version */ 6,
/* OriginalVersion */ 6,
/* DBSVersion */ '15.10.00.07',
/* UsageType */ 'D',
/* ComplexStatInfo */ 'ComplexStatInfo',
/* NumOfBiasedValues */ 3,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 1,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAmpRPV */ 0.000000,
/* MinVal */ 'A ',
/* MaxVal */ 'X ',
/* ModeVal */ 'A ',
/* HighModeFreq */ 36656,
/* NumOfDistinctVals */ 3,
/* NumOfRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* Reserved */ 'T0000D00003S00003',
/* StatsSkipCount */ 1,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 501,
/* SysInsDelLastResetTS */ TIMESTAMP '2016-02-13 10:24:09-00:00',
/* SysUpdLastResetTS */ TIMESTAMP '2016-02-13 10:24:09-00:00',
/* IsSampleFollowingTrend*/ 0,
/** Biased: Value, Frequency **/
/* 1 */ 'A ', 36656,
/* 2 */ 'B ', 36257,
/* 3 */ 'X ', 501,
/** HistoryOfSummaryInfo **/
/* SummaryRecord[1] */
/* TimeStamp */ TIMESTAMP '2016-02-12 18:32:09-00:00',
/* NumOfBiasedValues */ 0,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 0,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAMPRPV */ 0.000000,
/* ValuesNotPresengFlg */ 0,
/* MinVal */ 'A ',
/* MaxVal */ 'B ',
/* ModeVal */ 'A ',
/* HighModeFreq */ 36916,
/* NumDistinctVals */ 2,
/* NumRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* AvgEHDeviation */ 0.000000,
/* UsageType */ 'D',
/* StatsSkipCount */ 0,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 0,
/* SysInsDelLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',
/* SysUpdLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',
/* IsSampleFollowingTrend*/ 0
);
The displayed output confirms the successful update of 501 rows.
/* SysUpdateCnt */ 501,
We are now reviewing the updated explain plan.
SELECT * FROM TheDatabase.TMP_STATS WHERE CHAR1 = ‘X’;
1) First, we lock BOOK.TMP_STATS for read on a reserved RowHash to |
prevent global deadlock. |
2) Next, we lock BOOK.TMP_STATS for read. |
3) We do an all-AMPs RETRIEVE step from BOOK.TMP_STATS by way of an |
all-rows scan with a condition of (“BOOK.TMP_STATS.Char1 = ‘X ‘”) |
into Spool 1 (group_amps), which is built locally on the AMPs. |
The size of Spool 1 is estimated with high confidence to be 501 |
Rows (218,937 bytes). The estimated time for this step is 0.89 |
seconds. |
4) Finally, we send out an END TRANSACTION step to all AMPs involved |
in processing the request. |
-> The contents of Spool 1 are sent back to the user as the result of |
statement 1. The total estimated time is 0.89 seconds. |
Upon matching the value ‘X’ in the biased histogram, the estimation coincides with the number of rows containing ‘X’ in column CHAR1. This illustration’s main point is that a sense of “high confidence” does not guarantee accurate estimations.