Introduction to Teradata Sample Statistics
In this article, I will show you when, how, and why you can/should use Teradata Sample Statistics.
Sample statistics take only a percentage of rows into account when producing estimates.
Sample Statistics are used for performance tuning. Collecting full statistics requires that a full table scan be performed on the table. Then the rows must be sorted to count the number of different values.
This can be prevented by sample statistics.
If the tables are large and the resource consumption causes performance problems by collecting the statistics.
The rows should be evenly distributed across all AMPs.
Especially suitable for this are e.g. indexed columns. Below is a
Full statistics should always be collected on small tables. Sample statistics are not a general substitute for full statistics! The figure shows a skewed table that is not suitable for sample statistics:
With SAMPLE or SYSTEM SAMPLE the sample size is determined by the optimizer. The first time full statistics are collected, then the optimizer switches to Sample Statistics and reduces the sample size until it detects that Skew is a problem.
If you want to determine the size of the sample yourself, use COLLECT STATISTICS USING SAMPLE n PERCENT.
The simple COLLECT STATISTICS without specifying the method again is sufficient. Teradata remembers the last used method (full or sample) as well as the sample size for the sample statistics.
When using sample statistics, it is important that the columns have as many different values as possible. A UPI, for example, fulfills this requirement, but the NUPI can also be used if it has as many distinct values as possible.
Unusable for sample statistics are skewed columns
The Syntax for Sample Statistics
Use the sample size determined by Teradata:
COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN (CustomerId) ON Customer;
Use a sample size of 5% of the rows of the table customer
COLLECT STATISTICS USING SYSTEM 5 PERCENT COLUMN (CustomerId) ON Customer;