# Teradata Sample Statistics: When, How, and Why to Use Them

Roland Wenzlofsky

April 28, 2023

## Introduction to Teradata Sample Statistics

Discover the optimal utilization of Teradata Sample Statistics, including when, how, and why to implement them.

### What is Teradata Sample Statistics?

Sample statistics take only a percentage of rows into account when producing estimates.

### What are the Advantages of Sample Statistics?

We use Teradata Sample Statistics for performance tuning. Collecting full statistics requires a full table scan. In the second step, Teradata must sort all rows to calculate the various metrics. Sample statistics are limited to a part of the rows and are therefore more efficient but can only be used with many distinct values.

### When is it worth using Sample Statistics?

Sample statistics are beneficial for large tables with many different column values. These include UPI, USI, etc. On small tables, we should avoid sample statistics. They do not save resources, and the risk of wrong estimations is high. The rows should be well distributed over all AMPs to use sample statistics.

### When shouldn’t you consider Sample Statistics?

As already mentioned, sample statistics should not be collected on small tables. Sample statistics are not a substitute for full statistics collection. The Optimizer can consistently achieve the best execution plan with full statistics. Sample statistics should be used when the resource consumption for full statistics is not justified and the plan is usable with sample statistics. The figure shows a skewed table that is not suitable for sample statistics:

### How is the Size of the Sample determined?

With SAMPLE or SYSTEM SAMPLE, the Optimizer determines the sample size. The first time the Optimizer continuously collects full statistics. Later the Optimizer switches to Sample Statistics and reduces the sample size until it detects that Skew is a problem. If we want to decide the sample size, we can use COLLECT STATISTICS USING SAMPLE n PERCENT.

### Do I have to specify the desired Collection Method when the Statistics are collected again?

The straightforward COLLECT STATISTICS without specifying the method is sufficient. Teradata memorizes the last used method (complete or sample) and the sample size for the sample statistics and reuses it in the future.

Sample statistics require columns with a high degree of diversity in values. A UPI satisfies this criterion, and only columns with numerous unique values should be considered for collecting sample statistics for the NUPI.

Avoid employing sample statistics for columns with a skewed distribution.

## The Syntax for Sample Statistics

#### If we want to use the sample size specified by Teradata, then with this syntax:

`COLLECT STATISTICS USING SYSTEM SAMPLE COLUMN (CustomerId) ON Customer;`

#### If we want to use a sample size of 5% of the rows of the table customer below syntax is needed:

`COLLECT STATISTICS USING SYSTEM 5 PERCENT COLUMN (CustomerId) ON Customer;`