Teradata Statistics Extrapolation: What It Is And How It Works

Table of Contents

What is Teradata Statistics Extrapolation?

In data warehousing, populating historized tables with snapshot data valid for the current day is typical.

This is a common illustration:

SELECT Snapshot_Date,Count(*) CNT FROM The_Snapshot_Table;

Snapshot Date	CNT
2015-04-10	10
2015-04-11	15
2015-04-12	10
2015-04-13	25
2015-04-14	15
2015-04-15	30

— We assume in our example that statistics are collected on 2015-05-15, after the daily snapshot for 2015-01-15 was loaded:

COLLECT STATISTICS ON THE_SNAPSHOT_TABLE COLUMN(SNAPSHOT_DATE);

Without extrapolation, querying the snapshot date of 2015-04-16 after loading it, the optimizer anticipates zero rows being returned (assuming statistics are not recollected). An instance of such a query is shown below: SELECT * FROM table WHERE date = ‘2015-04-16’

SELECT * FROM The_Snapshot_Table WHERE Snapshot_Date=DATE'2015-04-16'

Extrapolation on DATE columns allows the optimizer to make reasonable estimations in such cases, even when statistics are stale. The optimizer considers date extrapolation only if it assumes that the current statistics are stale.

Which Statistics Information is Extrapolated?

The optimizer extrapolates the following statistical information:

The number of table rows, based on the available statistics summary information and random-AMP sampling
The number of distinct primary index values and the maximum value per statistic histogram
The number of rows per value for the date column

How is the Statistics Information extrapolated?

It is important to understand that the optimizer estimates a limited number of future dates (i.e., any date after the last recorded statistics). The highest predicted date for extrapolation relies on a random AMP sample, which indicates the number of distinct dates in the extrapolation process.

The following extrapolation formula calculates the maximum future date:

Maximum Extrapolated Date = Maximum Date with Collected Statistics + ((Maximum Date with Collected Statistics – Minimum Date with Collected Statistics) * ((Distinct Random AMP Sample Dates * Average Rows Per Date from Collected Statistics) / (Distinct Dates from Collected Statistics * Average Rows Per Date from Collected Statistics))

The collected statistics indicate an average of 1000 rows per date, as demonstrated by this example. We assume that the random AMP sample will provide 10 distinct dates in the future.

Maximum Extrapolated Date = 2015-03-30 + ((2015-03-30 – 2015-01-01) * ((10 * 1000) / (89 * 1000)) = 2015-04-09

Assuming our query selects a date range spanning from March 25th, 2015, to April 9th, 2015, we can estimate the number of rows.

Estimated Rows from Collected Stats + Estimated Rows from Extrapolation: (6*1000) + (10*1000) = 16000 Rows.

The extrapolation has improved the estimation significantly compared to without it (where the optimizer would have only estimated 6000 rows based on the outdated statistics).

The quality of the estimation based on a random AMP sample highly depends on the skew factor of the table.

Random AMP sampling that delivers only unique future dates is an ideal scenario. However, a biased table could result in greatly inaccurate estimations of future dates.

Another crucial factor is the coverage of the date range in your query by the extrapolated dates.

Ideally, the extrapolated statistics should encompass the entire query date range. However, even if only a portion of the statistics are extrapolated, including them will improve the estimated statistics overall.

Extrapolation is not beneficial if the extrapolated data does not encompass your desired timeframe.

Utilizing statistics extrapolation can be beneficial for daily snapshot tables; however, it should not replace the collection of statistics entirely. Extrapolation can aid in moving the recollection process to a less loaded point in the Teradata system.

Related Services

⚡ Need Help Optimizing Your Data Platform?

We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.

Explore Our Services →

📋 Considering a Move From Teradata?

Get a personalized migration roadmap in 2 minutes. We have migrated billions of rows from Teradata to Snowflake, Databricks, and more.

Free Migration Assessment →

2 thoughts on “Teradata Statistics Extrapolation: What it is and How it Works”

Roland Wenzlofsky

04/21/2015 at 3:03 pm

You are right regarding point (1) and I always wondered why this may be the reason? My assumption was that during the process it’s easier as “division by zero” handling can be avoided?

Regarding point (2), as far as I remember, the calculation formula is available in the official documentation.
David Wellman

04/21/2015 at 8:27 am

Hi Roland,
Interesting article, thanks. I have a couple of points/questions:
1) (minor point), AFAIK the optimizer never estimates ‘0 rows’, the minimum value it uses is 1 row or ‘1 row per amp’.
2) (question) where did you get the extrapolation calculation from?
Cheers,
Dave

Teradata Statistics Extrapolation: What it is and How it Works

What is Teradata Statistics Extrapolation?

Which Statistics Information is Extrapolated?

How is the Statistics Information extrapolated?

📊 Data Platform Migration Survey

Stay Ahead in Data Warehousing

2 thoughts on “Teradata Statistics Extrapolation: What it is and How it Works”

Leave a Comment Cancel reply