In this article, we will analyze a statistics problem in detail. Step by step, we will figure out why the estimates for our example query are wrong and how we can fix them.

The situation is the following: we run an EXPLAIN on the query below and get this execution plan:

```sql
SELECT * FROM TheDatabase.TMP_STATS WHERE CHAR1 = 'X';
```

```
1) First, we lock BOOK.TMP_STATS for read on a reserved RowHash to
   prevent global deadlock.
2) Next, we lock BOOK.TMP_STATS for read.
3) We do an all-AMPs RETRIEVE step from BOOK.TMP_STATS by way of an
   all-rows scan with a condition of ("BOOK.TMP_STATS.Char1 = 'X '")
   into Spool 1 (group_amps), which is built locally on the AMPs.
   The size of Spool 1 is estimated with high confidence to be 36,707
   rows (16,040,959 bytes). The estimated time for this step is 1.70
   seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 1.70 seconds.
```

The Optimizer estimates the result set to contain 36,707 rows, with high confidence.

Unfortunately, when we execute the query, **only 501 rows are returned**. The Optimizer is over-estimating the number of result set rows:

```sql
SELECT COUNT(*) FROM TheDatabase.TMP_STATS WHERE CHAR1 = 'X';
```

```
501
```

Now it is our task to find out what is happening and how we can fix it. Let's start with the information the explain plan gives us.

What we can say immediately is that **collected** statistics are available on column CHAR1, because the Optimizer's estimate is made with "high confidence". Nevertheless, we don't know whether these statistics are stale, so we do a quick check:

```sql
HELP STATS TheDatabase.TMP_STATS;
```

```
16/02/12 | 19:08:34 | 73,414 | *       | *       | "*"     | null
16/02/12 | 16:41:39 | 73,414 | PK      | PK      | PK      | null
16/02/12 | 16:41:39 | 73,414 | TheDate | TheDate | TheDate | null
16/02/12 | 19:08:34 |    100 | Int1    | Int1    | Int1    | null
16/02/12 | 16:41:40 |    100 | Int2    | Int2    | Int2    | null
16/02/12 | 18:32:09 |      2 | Char1   | Char1   | Char1   | null
16/02/12 | 18:36:52 |    957 | Char2   | Char2   | Char2   | null
```

We use the summary statistics information above to compare the estimate against the real table data. We compare the number of distinct values recorded for column CHAR1 (2) against the actual number of distinct values:

```sql
SELECT COUNT(DISTINCT CHAR1) FROM TheDatabase.TMP_STATS;
```

```
3
```

As you can see, there is a mismatch: the Optimizer assumes 2 distinct values, but the table contains 3. We run a quick check:

```sql
SELECT CHAR1, COUNT(*) FROM BOOK.TMP_STATS GROUP BY 1;
```

```
B | 36257
X |   501
A | 36656
```
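The staleness check we just did by hand can be sketched in code. This is only an illustration of the comparison logic, not Teradata internals: we flag a statistic as potentially stale when the histogram's distinct-value count no longer matches what the table actually contains.

```python
# Simplified staleness check (illustrative, not Teradata internals):
# compare the distinct-value count stored in the statistics histogram
# (from HELP STATS) with the distinct-value count observed in the table
# (from COUNT(DISTINCT ...)).

def stats_look_stale(stats_distinct: int, actual_distinct: int) -> bool:
    """Return True when the histogram's distinct-value count no longer
    matches the table, as in our CHAR1 example (2 vs. 3)."""
    return stats_distinct != actual_distinct

print(stats_look_stale(2, 3))  # True: CHAR1 histogram says 2, table has 3
```

A mismatch in either direction is suspicious; real staleness detection would also look at row counts and collection timestamps.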

Indeed, column CHAR1 has 3 distinct values. Now it's time to find out where the wrong estimate comes from and how it can be fixed. We use the SHOW STATISTICS statement to take a closer look at the statistics histograms:

```sql
SHOW STATISTICS VALUES COLUMN(CHAR1) ON BOOK.TMP_STATS;
```

```
COLLECT STATISTICS
COLUMN ( Char1 )
ON BOOK.TMP_STATS
VALUES
(
/** SummaryInfo **/
/* Data Type and Length: 'CF:26' */
/* TimeStamp */ TIMESTAMP '2016-02-12 18:32:09-00:00',
/* Version */ 6,
/* OriginalVersion */ 6,
/* DBSVersion */ '15.10.00.07',
/* UsageType */ 'D',
/* ComplexStatInfo */ 'ComplexStatInfo',
/* NumOfBiasedValues */ 2,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 0,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAmpRPV */ 0.000000,
/* MinVal */ 'A ',
/* MaxVal */ 'B ',
/* ModeVal */ 'A ',
/* HighModeFreq */ 36916,
/* NumOfDistinctVals */ 2,
/* NumOfRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* Reserved */ 'T0000D00000S00000',
/* StatsSkipCount */ 0,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 0,
/* SysInsDelLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',
/* SysUpdLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',
/* IsSampleFollowingTrend */ 0,
/** Biased: Value, Frequency **/
/* 1 */ 'A ', 36916,
/* 2 */ 'B ', 36498
);
```

The output of the SHOW STATISTICS statement shows that the value 'X' of column CHAR1 is not present in any of the histograms. It looks like, at the time the statistics were collected, no rows with value 'X' existed in column CHAR1.

But why do the total row counts in the statistics and in the table still match? The conclusion is: the value was changed with an UPDATE statement, which leaves the total number of rows unchanged.

Now that we understand we are dealing with stale statistics caused by an UPDATE statement, one question remains unanswered: where does the estimate of 36,707 rows come from? How is it calculated?

If you take a closer look at the output of the SHOW STATISTICS command, you will see two values there:

**/* NumOfDistinctVals */ 2,**

**/* NumOfRows */ 73414,**

If the predicate value (in our case 'X') is not present in any statistics histogram (as here, caused by stale statistics), the Optimizer estimates the number of retrieved rows by dividing NumOfRows by NumOfDistinctVals:

73,414 / 2 = 36,707
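This fallback logic can be sketched as follows. The dictionary and variable names mirror the SHOW STATISTICS output above, but the function itself is a simplified model of the Optimizer's behavior, not its actual implementation:

```python
# Sketch of the Optimizer's fallback estimate when the predicate value is
# missing from every histogram: NumOfRows / NumOfDistinctVals.
# Simplified model for illustration; names mirror the SHOW STATISTICS output.

biased_values = {"A": 36916, "B": 36498}  # from the stale histogram
num_of_rows = 73414
num_of_distinct_vals = 2

def estimate_rows(predicate_value: str) -> int:
    if predicate_value in biased_values:
        # Exact match on a biased value: use its stored frequency.
        return biased_values[predicate_value]
    # Value unknown to the histogram: assume a uniform distribution
    # over the known distinct values.
    return num_of_rows // num_of_distinct_vals

print(estimate_rows("X"))  # 36707 -- the over-estimate from the explain plan
print(estimate_rows("A"))  # 36916 -- exact biased-value frequency
```

The uniform-distribution assumption is what turns 501 actual rows into an estimate of 36,707.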

We learned from this example that statistics-based estimates can be very misleading if we don't understand in detail how statistics are implemented in Teradata.

You might wonder why extrapolation does not help us in this case. Extrapolation only kicks in if rows are added, but in our case, the number of rows stayed unchanged.
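A rough model of this behavior, again as an illustration of the idea rather than Teradata's real extrapolation logic: extrapolation reacts to table growth since the last collection, and an UPDATE leaves the row count untouched.

```python
# Rough model of why extrapolation does not trigger here: it reacts to
# table growth (inserts) since statistics were collected, but an UPDATE
# changes values without changing the row count.
# Simplified illustration, not Teradata's actual extrapolation logic.

def extrapolation_applies(rows_at_collect: int, rows_now: int) -> bool:
    return rows_now > rows_at_collect

print(extrapolation_applies(73414, 73414))  # False: UPDATE left the count unchanged
print(extrapolation_applies(73414, 80000))  # True: table grew, counts get scaled up
```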

We are almost finished. We have identified the reason for our stale statistics, and we know why the Optimizer estimated 36,707 rows. Finally, we can fix the problem by refreshing statistics on column CHAR1:

```sql
COLLECT STATS COLUMN(CHAR1) ON TheDatabase.TMP_STATS;
```

We rerun the SHOW STATISTICS statement to prove that the histograms are now correct:

```sql
SHOW STATISTICS VALUES COLUMN(CHAR1) ON BOOK.TMP_STATS;
```

```
COLLECT STATISTICS
COLUMN ( Char1 )
ON BOOK.TMP_STATS
VALUES
(
/** SummaryInfo **/
/* Data Type and Length: 'CF:26' */
/* TimeStamp */ TIMESTAMP '2016-02-13 12:31:06-00:00',
/* Version */ 6,
/* OriginalVersion */ 6,
/* DBSVersion */ '15.10.00.07',
/* UsageType */ 'D',
/* ComplexStatInfo */ 'ComplexStatInfo',
/* NumOfBiasedValues */ 3,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 1,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAmpRPV */ 0.000000,
/* MinVal */ 'A ',
/* MaxVal */ 'X ',
/* ModeVal */ 'A ',
/* HighModeFreq */ 36656,
/* NumOfDistinctVals */ 3,
/* NumOfRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* Reserved */ 'T0000D00003S00003',
/* StatsSkipCount */ 1,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 501,
/* SysInsDelLastResetTS */ TIMESTAMP '2016-02-13 10:24:09-00:00',
/* SysUpdLastResetTS */ TIMESTAMP '2016-02-13 10:24:09-00:00',
/* IsSampleFollowingTrend */ 0,
/** Biased: Value, Frequency **/
/* 1 */ 'A ', 36656,
/* 2 */ 'B ', 36257,
/* 3 */ 'X ', 501,
/** HistoryOfSummaryInfo **/
/* SummaryRecord[1] */
/* TimeStamp */ TIMESTAMP '2016-02-12 18:32:09-00:00',
/* NumOfBiasedValues */ 0,
/* NumOfEHIntervals */ 0,
/* NumOfHistoryRecords */ 0,
/* SamplePercent */ 0.00,
/* NumOfNulls */ 0,
/* NumOfAllNulls */ 0,
/* NumOfPartialNullVals */ 0,
/* PartialNullHMF */ 0,
/* AvgAMPRPV */ 0.000000,
/* ValuesNotPresentFlg */ 0,
/* MinVal */ 'A ',
/* MaxVal */ 'B ',
/* ModeVal */ 'A ',
/* HighModeFreq */ 36916,
/* NumDistinctVals */ 2,
/* NumRows */ 73414,
/* CPUUsage */ 0.000000,
/* IOUsage */ 0.000000,
/* Reserved */ 0,
/* Reserved */ 0,
/* Reserved */ 0.000000,
/* Reserved */ 0.000000,
/* AvgEHDeviation */ 0.000000,
/* UsageType */ 'D',
/* StatsSkipCount */ 0,
/* SysInsertCnt */ 0,
/* SysDeleteCnt */ 0,
/* SysUpdateCnt */ 0,
/* SysInsDelLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',
/* SysUpdLastResetTS */ TIMESTAMP '9999-12-31 23:59:59-00:00',
/* IsSampleFollowingTrend */ 0
);
```

The output above confirms that 501 rows have indeed been updated:

**/* SysUpdateCnt */ 501,**

Finally, we check the new explain plan:

```sql
SELECT * FROM TheDatabase.TMP_STATS WHERE CHAR1 = 'X';
```

```
1) First, we lock BOOK.TMP_STATS for read on a reserved RowHash to
   prevent global deadlock.
2) Next, we lock BOOK.TMP_STATS for read.
3) We do an all-AMPs RETRIEVE step from BOOK.TMP_STATS by way of an
   all-rows scan with a condition of ("BOOK.TMP_STATS.Char1 = 'X '")
   into Spool 1 (group_amps), which is built locally on the AMPs.
   The size of Spool 1 is estimated with high confidence to be 501
   rows (218,937 bytes). The estimated time for this step is 0.89
   seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
   in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
   statement 1. The total estimated time is 0.89 seconds.
```

As the selected value 'X' now exactly matches a biased histogram value, the estimate correctly matches the number of rows with value 'X' in column CHAR1.

The central message of this example is that "high confidence" does not guarantee correct estimates.