How to Use Biased Value Histograms for Accurate Statistics Estimation on Teradata SQL

Biased histogram entries store accurate counts and should be used to update statistics reflecting the number of rows obtained for that particular value.

The Optimizer can rapidly ascertain the number of retrieved rows by referencing the histogram’s bias value.

Regrettably, this is not always true.

I generated a test scenario by populating a table with 50,000 rows, each with one of 100 unique dates.

CREATE MULTISET TABLE Statistics_Estimation1
(
TheDate DATE FORMAT ‘YYYY-MM-DD’ NOT NULL,
) PRIMARY INDEX (TheDate);
INSERT INTO Statistics_Estimation1
SELECT DATE’2010-01-01′ + (ROW_NUMBER() OVER (ORDER BY 1) MOD 100) AS TheDate,
FROM SYS_CALENDAR.CALENDAR
SAMPLE 50000;

I executed the following statement to generate a new statistics index. The system can condense 50,000 rows into 100 histograms, which will be saved as biased values.

Ensuring the estimated number of rows matches the retrieved number of rows is crucial in our test setup.

COLLECT STATS ON Statistics_Estimation1 COLUMN(TheDate);

The histograms can be checked using the following statement:

SHOW  STATISTICS VALUES ON Statistics_Estimation1 ;

/** Biased: Value, Frequency **/
/* 1 */ DATE ‘2010-01-01’, 515,
/* 2 */ DATE ‘2010-01-02’, 498,
/* 3 */ DATE ‘2010-01-03’, 510,
/* 4 */ DATE ‘2010-01-04’, 482,
/* 5 */ DATE ‘2010-01-05’, 490,
/* 6 */ DATE ‘2010-01-06’, 518,
/* 7 */ DATE ‘2010-01-07’, 510,
/* 8 */ DATE ‘2010-01-08’, 512,
/* 9 */ DATE ‘2010-01-09’, 493,
/* 10 */ DATE ‘2010-01-10’, 513,
/* 11 */ DATE ‘2010-01-11’, 498,
/* 12 */ DATE ‘2010-01-12’, 513,
/* 13 */ DATE ‘2010-01-13’, 509,
/* 14 */ DATE ‘2010-01-14’, 500,
/* 15 */ DATE ‘2010-01-15’, 491,
/* 16 */ DATE ‘2010-01-16’, 489,
/* 17 */ DATE ‘2010-01-17’, 481,
/* 18 */ DATE ‘2010-01-18’, 508,
/* 19 */ DATE ‘2010-01-19’, 496,
/* 20 */ DATE ‘2010-01-20’, 505,
/* 21 */ DATE ‘2010-01-21’, 498,
/* 22 */ DATE ‘2010-01-22’, 496,
/* 23 */ DATE ‘2010-01-23’, 505,
/* 24 */ DATE ‘2010-01-24’, 504,
/* 25 */ DATE ‘2010-01-25’, 485,
/* 26 */ DATE ‘2010-01-26’, 508,
/* 27 */ DATE ‘2010-01-27’, 494,
/* 28 */ DATE ‘2010-01-28’, 507,
/* 29 */ DATE ‘2010-01-29’, 495,
/* 30 */ DATE ‘2010-01-30’, 501,
/* 31 */ DATE ‘2010-01-31’, 486,
/* 32 */ DATE ‘2010-02-01’, 517,
/* 33 */ DATE ‘2010-02-02’, 501,
/* 34 */ DATE ‘2010-02-03’, 505,
/* 35 */ DATE ‘2010-02-04’, 496,
/* 36 */ DATE ‘2010-02-05’, 514,
/* 37 */ DATE ‘2010-02-06’, 492,
/* 38 */ DATE ‘2010-02-07’, 496,
/* 39 */ DATE ‘2010-02-08’, 504,
/* 40 */ DATE ‘2010-02-09’, 486,
/* 41 */ DATE ‘2010-02-10’, 509,
/* 42 */ DATE ‘2010-02-11’, 490,
/* 43 */ DATE ‘2010-02-12’, 499,
/* 44 */ DATE ‘2010-02-13’, 487,
/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,
/* 48 */ DATE ‘2010-02-17’, 494,
/* 49 */ DATE ‘2010-02-18’, 484,
/* 50 */ DATE ‘2010-02-19’, 500,
/* 51 */ DATE ‘2010-02-20’, 488,
/* 52 */ DATE ‘2010-02-21’, 483,
/* 53 */ DATE ‘2010-02-22’, 484,
/* 54 */ DATE ‘2010-02-23’, 497,
/* 55 */ DATE ‘2010-02-24’, 487,
/* 56 */ DATE ‘2010-02-25’, 493,
/* 57 */ DATE ‘2010-02-26’, 515,
/* 58 */ DATE ‘2010-02-27’, 479,
/* 59 */ DATE ‘2010-02-28’, 505,
/* 60 */ DATE ‘2010-03-01’, 518,
/* 61 */ DATE ‘2010-03-02’, 491,
/* 62 */ DATE ‘2010-03-03’, 484,
/* 63 */ DATE ‘2010-03-04’, 510,
/* 64 */ DATE ‘2010-03-05’, 511,
/* 65 */ DATE ‘2010-03-06’, 494,
/* 66 */ DATE ‘2010-03-07’, 493,
/* 67 */ DATE ‘2010-03-08’, 500,
/* 68 */ DATE ‘2010-03-09’, 531,
/* 69 */ DATE ‘2010-03-10’, 497,
/* 70 */ DATE ‘2010-03-11’, 512,
/* 71 */ DATE ‘2010-03-12’, 481,
/* 72 */ DATE ‘2010-03-13’, 500,
/* 73 */ DATE ‘2010-03-14’, 510,
/* 74 */ DATE ‘2010-03-15’, 487,
/* 75 */ DATE ‘2010-03-16’, 511,
/* 76 */ DATE ‘2010-03-17’, 515,
/* 77 */ DATE ‘2010-03-18’, 513,
/* 78 */ DATE ‘2010-03-19’, 517,
/* 79 */ DATE ‘2010-03-20’, 516,
/* 80 */ DATE ‘2010-03-21’, 516,
/* 81 */ DATE ‘2010-03-22’, 504,
/* 82 */ DATE ‘2010-03-23’, 493,
/* 83 */ DATE ‘2010-03-24’, 510,
/* 84 */ DATE ‘2010-03-25’, 486,
/* 85 */ DATE ‘2010-03-26’, 495,
/* 86 */ DATE ‘2010-03-27’, 497,
/* 87 */ DATE ‘2010-03-28’, 481,
/* 88 */ DATE ‘2010-03-29’, 501,
/* 89 */ DATE ‘2010-03-30’, 486,
/* 90 */ DATE ‘2010-03-31’, 512,
/* 91 */ DATE ‘2010-04-01’, 487,
/* 92 */ DATE ‘2010-04-02’, 501,
/* 93 */ DATE ‘2010-04-03’, 490,
/* 94 */ DATE ‘2010-04-04’, 508,
/* 95 */ DATE ‘2010-04-05’, 518,
/* 96 */ DATE ‘2010-04-06’, 478,
/* 97 */ DATE ‘2010-04-07’, 494,
/* 98 */ DATE ‘2010-04-08’, 503,
/* 99 */ DATE ‘2010-04-09’, 494,
/* 100 */ DATE ‘2010-04-10’, 512,

Now we execute the explain for the below query:

Explain SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE’2010-02-14′ AND DATE’2010-02-16′;

1) First, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-14′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 510 rows (15,810 bytes). The estimated time for
this step is 0.01 seconds.
2) Next, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-15′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,020 rows (31,620 bytes). The estimated time
for this step is 0.01 seconds.
3) We do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-16′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,530 rows (47,430 bytes). The estimated time
for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.02 seconds.

The Optimizer obtains primary index values through single-AMP primary index access because the selected rows are significantly fewer than the total number of rows in the table.

Did you know that the estimated count is 1530 rows? It’s peculiar because the total of the three histogram values for the dates (2010-02-14, 2010-02-15, and 2010-02-16) amounts to only 1528 rows!

/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,

513+517+498 = 1528

Additionally, the Optimizer has estimated 510 rows for each single-AMP access. Is it possible that a bug exists within the Optimizer?

No.

Internal optimization occurs, resulting in a slightly imprecise but faster estimation. The Optimizer aggregates histogram values and calculates the mean by dividing the sum by their quantity. As in our example:

1528/3 = 509,33

The ceiling value determines the even distribution of the estimation among the single-AMP estimations, such as 510 – 510 – 510. Consequently, the estimation is slightly inaccurate.

We can demonstrate this by increasing the retrieved row count, compelling the Optimizer to perform a full table scan rather than individual-AMP lookups.

SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE'2010-02-14' AND DATE'2010-02-28';

The plan changes when the date range is extended to 2010-02-28.

Explain SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE’2010-02-14′ AND DATE’2010-02-28′;

1) First, we lock DWHPRO.Statistics_Estimation1 for read on a
reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.Statistics_Estimation1 for read.
3) We do an all-AMPs RETRIEVE step from DWHPRO.Statistics_Estimation1
by way of an all-rows scan with a condition of (
“(DWHPRO.Statistics_Estimation1.TheDate >= DATE ‘2010-02-14’) AND
(DWHPRO.Statistics_Estimation1.TheDate <= DATE ‘2010-02-28’)”)
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with high confidence to be 7,438
rows (230,578 bytes). The estimated time for this step is 0.11
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.11 seconds.

Summing up the histogram values below confirms that the estimation of 7,438 matches the number of retrieved rows precisely.

/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,
/* 48 */ DATE ‘2010-02-17’, 494,
/* 49 */ DATE ‘2010-02-18’, 484,
/* 50 */ DATE ‘2010-02-19’, 500,
/* 51 */ DATE ‘2010-02-20’, 488,
/* 52 */ DATE ‘2010-02-21’, 483,
/* 53 */ DATE ‘2010-02-22’, 484,
/* 54 */ DATE ‘2010-02-23’, 497,
/* 55 */ DATE ‘2010-02-24’, 487,
/* 56 */ DATE ‘2010-02-25’, 493,
/* 57 */ DATE ‘2010-02-26’, 515,
/* 58 */ DATE ‘2010-02-27’, 479,
/* 59 */ DATE ‘2010-02-28’, 505,

When analyzing an execution plan, strive to understand how the estimations are constructed. This skill improves with practice, allowing you to tackle statistical problems proficiently.

Also, check out:
Teradata Statistics – Case Study
Extrapolating Teradata Statistics for Equality Predicates

Related Services

⚡ Need Help Optimizing Your Data Platform?

We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.

Explore Our Services →

📋 Considering a Move From Teradata?

Get a personalized migration roadmap in 2 minutes. We have migrated billions of rows from Teradata to Snowflake, Databricks, and more.

Free Migration Assessment →

📊 Data Platform Migration Survey

Help us map where the industry is heading. Results are public — see what others chose.

1. What is your current data platform?

2. Where are you migrating to (or evaluating)?

Migrating FROM
Migrating TO

Thanks for voting! Share this with your network.

Follow me on LinkedIn for daily insights on data warehousing and platform migrations.

Stay Ahead in Data Warehousing

Get expert insights on Teradata, Snowflake, BigQuery, Databricks, Microsoft Fabric, and modern data architecture — delivered to your inbox.

3 thoughts on “How to Use Biased Value Histograms for Accurate Statistics Estimation on Teradata SQL”

  1. Thanks, Roland for a wonderful article.

    It may sound silly but I didn’t get why taking avg of histogram values is considered better by optimizer than using the value itself.

    /* 45 */ DATE ‘2010-02-14’, 513,
    /* 46 */ DATE ‘2010-02-15’, 517,
    /* 47 */ DATE ‘2010-02-16’, 498,

    I mean taking exact values of histograms should be better than taking the values, dividing it by no histograms, rounding off to nearest decimal and then multiply by no of histograms used.

    It would have made more sense if it has used avg rows per histogram( in this case 50000/100=500)* no histogram values being retrieved (500*3=1500) which is clearly not the case.

    Cheers !!

    Reply
  2. “By extending the date range to 2016-02-28, the plan changes”
    the Date must be 2010-02-28 and not 2016-02-28.

    Reply

Leave a Comment

DWHPro

Expert network for enterprise data platforms. Senior consultants, project teams built for your challenge — across Teradata, Snowflake, Databricks, and more.

📍Vienna, Austria & Jacksonville, Florida

Quick Links
Services Team Teradata Book Blog Contact Us
Connect
LinkedIn → [email protected]
Newsletter

Join 4,000+ data professionals.
Weekly insights on Teradata, Snowflake & data architecture.