Teradata Statistics – The USING option in Teradata 14
Teradata 14.00 comes with three improvements, allowing you to tailor the collection of statistics more specifically to your needs:
COLLECT STATISTICS USING SAMPLE n PERCENT COLUMN(column_name) ON table_name;
COLLECT STATISTICS USING MAXINTERBALS n COLUMN(column_name) ON table_name;
COLLECT STATISTICS USING MAXVALUELENGTH n COLUMN(column_name) ON table_name;
The USING SAMPLE option
Although SAMPLE statistics have been available for a while, one global percentage of table rows to be used for sampling was set for all statistics. Starting with Teradata 14.00, you can set a different sample size for each collection statement.
The USING MAXVALUELENGTH option
There is a restriction on how many bytes of a column (starting with the most-left character) are considered when creating the Teradata statistics information.
Teradata 14.00 shifted the threshold of differentiation between individual row entries from 16 to 25 bytes, but still the columns consuming more bytes, are being regarded as equal. Furthermore, in the case of character columns, no padding with spaces takes place anymore. Padding unnecessarily took space before Teradata 14.00 and occurred for all row entries consuming fewer than the threshold level of bytes.
Even if the default value is now 25 bytes, you may be confronted with wider columns – such as VARCHAR(2000) – where the majority of entries differ in details at the far right end of the string.
Capturing the true difference between such entries requires that all characters from a position beyond 25 and up to 2000 should be considered for histogram creation, etc. In such cases, the MAXVALUELENGTH option is useful, by allowing you to increase the number of considered bytes.
The USING MAXINTERVALS option
Another new Teradata statistics feature is related to the number of intervals being used for building the statistics histogram. In such a histogram, each interval contains the most frequent value plus the record count for this value available in the interval.
While in previous releases the histogram class number was a fixed value (default value currently on Teradata 14 is 250 intervals), now you can choose a value between 0 and 500. Why might it be useful to change the number of intervals?
Slicing the standard histogram finer, i.e. allowing for more classes, increases the likelihood of catching local concentrations of values. This will be of great help for the optimizer when the overall distribution of row entries shows many local sharp peaks and valleys narrowly concentrated on a small area of the entire range of possible values. Catching them all in one histogram class is somewhat distorting.