# 8 Rules you should follow when collecting Teradata Statistics

## What are Statistics in Teradata?

Statistics are information about data such as the number of rows in a table, the number of rows per value of a particular column, the number of different values in a column, the information about skewed columns, and much more. Statistics are used by the optimizer to generate the best possible execution plan. Statistics determine which existing indexes are used, in which order joins are executed, which join strategy is used, etc.

## Methods used by Teradata to Collect Statistics

### Random AMP Sampling

For indexed columns (such as the Primary Index), where no statistics are collected, Teradata applies Random AMP Sampling.

Random AMP sampling takes place when the execution plan is created, i.e. it must be fast. By default, one AMP per table is defined, on which a sample of the data is taken to estimate data demographics. Only the most important measures are calculated, such as the number of rows per table.

The collection of statistics is one of the most important tasks on a Teradata system, as proper statistics ensure consistent and excellent performance.

This post will give you all the details needed to create an optimized statistics landscape. While the information provided is related to Teradata Release 14.10 and above, most of the ideas can be applied in the same way to any previous release of Teradata.

## Teradata Statistics – Full or Sample?

The decision in which scenarios **full statistics** should be collected, and in which cases **sample statistics** are sufficient, is mainly driven by one consideration: How likely is it, that the optimizer's estimation about data demographics is wrong? Expect the risk of a wrong estimation to be high for skewed tables.

**1. Tiny tables** always require fully collected statistics.

Imagine the following scenario: You system consists of 300 AMPs, but your small table contains only five rows. As you may recognize immediately, the risk is quite high that the **random AMP sample** is taken from an AMP which is not containing any row at all.

As a logical consequence, the optimizer will assume that the table is empty and will create a bad execution plan. Since the collection of statistics on small tables only consumes little resources, never spend a single thought about not collecting full statistics!

**2.** In the case of any** unique index (UPI, USI)** it is sufficient to collect sample statistics. You might even decide to omit statistics for huge tables at all (remember: gathering statistics may require plenty of resources). Runtime or random AMP sampling gives the optimizer an undistorted picture about data demographics. Nevertheless, the execution plan might be more conservative in the case of random AMP sampling.

One exception: if you are using non-equality constraints, you should always collect full statistics on the constraint column(s).

**3.** In the case of any** not unique index (NUPI, NUSI) **we are back again to assess the skewness of the table. If the table is evenly distributed across all AMPs, **sample statistics may be sufficient.**

**4.**You should always go for full statistics on any

**columns used in predicates**.

## Teradata Statistics on Partitioned Tables

**5.** Collecting complete statistics on the dummy PARTITION column is an efficient way to give the optimizer information about empty partitions and the number of rows per partition. You should always collect them.

**6.** You should always collect statistics on **each partitioning column**.

**7.** If the partition columns are not part of the Primary Index, you should additionally collect the following statistics:

**PARTITION & Primary Index**, as this tells the optimizer in how many partitions a particular primary index value is available. The optimizer uses this information in the case of dynamic partition elimination and sliding window merge joins.

**PARTITION & Primary Index & each partitioning column**: The optimizer uses this statistics to estimate costs for row key based joins.

## Teradata Multicolumn Statistics

**8.** As the optimizer assumes independence for statistics collected on single columns, it makes sense to gather statistics on columns together **if there is a high correlation** between them. The higher the correlation, the bigger is the advantage of collecting multi-column statistics.

Final words: In opposition to what its title might lead one to conclude, the list only contains eight topics. People love to read “Top 10 Lists” and this helps us to rank high in Google 😉

See also:

Collect Stats in Teradata

Teradata Statistics Basics

The collection of statistics is one of the most important tasks on a Teradata system, as proper statistics ensure consistent and excellent performance.

What is dummy PARTITION?How to collect stats on them?