When evaluating the accuracy of a set of statistics for a given table, people often rely solely on the collection date. However, this may not always be sufficient. Some tables, such as static lookup tables, may not have stale Teradata statistics even if they are never refreshed. On the other hand, other tables may undergo various changes, with some fields having expanding value ranges and others remaining constant with only a few codes.
To better understand the validity of your table’s statistics, one approach is to run a simple query on the table with a WHERE condition on the column you want to test. For example:
SELECT * FROM my_table WHERE test_column = 'X';
You can then compare the estimated number of records (visible, for example, in Teradata Administrator or in the EXPLAIN output) with the number of records actually returned. If the divergence is more than a low single-digit percentage, depending on the overall size of the table, the statistics may be stale.
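The comparison can be sketched as follows, assuming a hypothetical table `my_table` and column `test_column`: the EXPLAIN output contains the optimizer's estimated row count for the step, which you can set against the actual count.

```sql
-- Ask the optimizer for its plan; the step text includes an
-- estimated row count based on the collected statistics
EXPLAIN SELECT * FROM my_table WHERE test_column = 'X';

-- Get the actual number of qualifying rows for comparison
SELECT COUNT(*) FROM my_table WHERE test_column = 'X';
```

If the estimate in the EXPLAIN text diverges substantially from the COUNT(*) result, that is a strong hint the statistics on `test_column` are stale.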
Remember that the optimizer may not always be accurate down to the individual row level. However, the divergence between the estimated and actual number of records should not be significant.
One example comes to mind: the Teradata system configuration changes (for example, nodes are added) and the rows of a PPI table are redistributed to new partitions.
This happens, for example, if you use Teradata's hashing functions in the PARTITION BY definition of your table DDL:
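A minimal sketch of such a DDL, assuming a hypothetical `sales_hashed` table (the exact number of partitions and the supported hash functions depend on your Teradata release):

```sql
CREATE TABLE sales_hashed (
    sale_id    INTEGER NOT NULL,
    sale_date  DATE,
    amount     DECIMAL(10,2)
)
PRIMARY INDEX (sale_id)
-- Partition assignment is derived from the row hash, so it depends
-- on the system's hash configuration, not only on the data itself
PARTITION BY HASHBUCKET(HASHROW(sale_id)) MOD 50 + 1;
```

Because the partitioning expression is based on `HASHROW`/`HASHBUCKET` rather than on a column value directly, a reconfiguration of the system can change which partition a given row lands in.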
Because a bigger system has more AMPs, rows are distributed differently across the partitions; the statistics become stale and have to be recollected.
This is an example where the table content stayed the same, but the statistics became stale.
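In such a case, the remedy is simply to recollect the affected statistics. A hedged sketch, reusing the hypothetical `sales_hashed` table from above:

```sql
-- Recollect statistics on the system-derived PARTITION column,
-- which the optimizer uses for partition elimination estimates
COLLECT STATISTICS COLUMN (PARTITION) ON sales_hashed;

-- Recollect on the primary index / partitioning source column as well
COLLECT STATISTICS COLUMN (sale_id) ON sales_hashed;
```

After recollection, the optimizer's partition-level estimates again reflect the actual row distribution.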
I was once advised to recollect statistics on all the static ‘lookup’ tables held in a database on a monthly basis, ‘just in case’ the statistics needed to be refreshed. Needless to say, I never saw a reason to do this, as it never had any impact. Would there ever be a case where statistics become ‘stale’, and are therefore ignored by the optimizer, in such circumstances?