Introduction to Teradata Sample Statistics
This article will show you when, how, and why you can/should use Teradata Sample Statistics.
What are Teradata Sample Statistics?
Sample statistics take only a percentage of rows into account when producing estimates.
What are the Advantages of Sample Statistics?
We use Teradata Sample Statistics for performance tuning. Collecting full statistics requires a full table scan. In a second step, Teradata must sort all rows to calculate the various metrics. Sample statistics are limited to a part of the rows and are therefore more efficient but can only be used with many distinct values.
When is it worth using Sample Statistics?
Sample statistics are beneficial for large tables with many different column values. These include UPI, USI, etc. On small tables, we should avoid sample statistics. They do not save resources, and the risk of wrong estimations is high. The rows should be well distributed over all AMPs to use sample statistics.
When shouldn’t you consider Sample Statistics?
As already mentioned, sample statistics should not be collected on small tables. Sample statistics are not a substitute for full statistics collection. The Optimizer can consistently achieve the best execution plan with full statistics. Sample statistics should be used when the resource consumption for full statistics is not justified, and the plan is usable with sample statistics. The figure shows a skewed table that is not suitable for sample statistics:
How is the Size of the Sample determined?
With SAMPLE or SYSTEM SAMPLE, the Optimizer determines the sample size. The first time the Optimizer continuously collects full statistics. Later the Optimizer switches to Sample Statistics and reduces the sample size until it detects that Skew is a problem. If we want to decide the sample size, we can use COLLECT STATISTICS USING SAMPLE n PERCENT.
Do I have to specify the desired Collection Method when the Statistics are collected again?
The straightforward COLLECT STATISTICS without specifying the method is sufficient. Teradata memorizes the last used method (complete or sample) and the sample size for the sample statistics and reuses it in the future.
Columns must have as many different values as possible when using sample statistics. For example, a UPI fulfills this requirement. We should collect sample statistics for the NUPI only if it has as many distinct values.
We should not use sample statistics on skewed columns!
The Syntax for Sample Statistics
If we want to use the sample size specified by Teradata, then with this syntax:
COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN (CustomerId) ON Customer;
If we want to use a sample size of 5% of the rows of the table customer below syntax is needed:
COLLECT STATISTICS USING SYSTEM 5 PERCENT COLUMN (CustomerId) ON Customer;