Teradata Statistics: Estimates For Missing Values

Roland Wenzlofsky

December 21, 2016

Biased value histogram entries store exact cardinalities. With up-to-date statistics, the estimated number of rows in a retrieve step on a biased value should therefore match the number of rows actually returned.

In other words: the Optimizer can quickly determine the number of rows being retrieved by looking up the biased value in the histogram.

Unfortunately, this is not always the case.

I created a test scenario, populating a test table with 50,000 rows spread across 100 distinct dates:

CREATE MULTISET TABLE Statistics_Estimation1
(
TheDate DATE FORMAT 'YYYY-MM-DD' NOT NULL
) PRIMARY INDEX (TheDate);

INSERT INTO Statistics_Estimation1
SELECT DATE '2010-01-01' + (ROW_NUMBER() OVER (ORDER BY 1) MOD 100) AS TheDate
FROM SYS_CALENDAR.CALENDAR
SAMPLE 50000;
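
To verify the distribution, here is a quick sanity check (my addition, not part of the original test script); it should return 100 distinct dates with roughly 500 rows each:

SELECT TheDate, COUNT(*) AS RowsPerDate
FROM Statistics_Estimation1
GROUP BY 1
ORDER BY 1;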

To start with fresh statistics, I executed the statement below. Because there are only 100 distinct values, the system can pack all 50,000 rows into 100 biased value entries of the histogram.

This is critical for our test setup, as it ensures that the estimated number of rows should always match 100% of the retrieved number of rows:

COLLECT STATS ON Statistics_Estimation1 COLUMN(TheDate);
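
If you only need a quick summary first, HELP STATISTICS shows when statistics were collected and the number of unique values found (100 in our setup):

HELP STATISTICS Statistics_Estimation1;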

We can now inspect the histogram by issuing the statement below:

SHOW STATISTICS VALUES ON Statistics_Estimation1;

/** Biased: Value, Frequency **/
/* 1 */ DATE '2010-01-01', 515,
/* 2 */ DATE '2010-01-02', 498,
/* 3 */ DATE '2010-01-03', 510,
/* 4 */ DATE '2010-01-04', 482,
/* 5 */ DATE '2010-01-05', 490,
/* 6 */ DATE '2010-01-06', 518,
/* 7 */ DATE '2010-01-07', 510,
/* 8 */ DATE '2010-01-08', 512,
/* 9 */ DATE '2010-01-09', 493,
/* 10 */ DATE '2010-01-10', 513,
/* 11 */ DATE '2010-01-11', 498,
/* 12 */ DATE '2010-01-12', 513,
/* 13 */ DATE '2010-01-13', 509,
/* 14 */ DATE '2010-01-14', 500,
/* 15 */ DATE '2010-01-15', 491,
/* 16 */ DATE '2010-01-16', 489,
/* 17 */ DATE '2010-01-17', 481,
/* 18 */ DATE '2010-01-18', 508,
/* 19 */ DATE '2010-01-19', 496,
/* 20 */ DATE '2010-01-20', 505,
/* 21 */ DATE '2010-01-21', 498,
/* 22 */ DATE '2010-01-22', 496,
/* 23 */ DATE '2010-01-23', 505,
/* 24 */ DATE '2010-01-24', 504,
/* 25 */ DATE '2010-01-25', 485,
/* 26 */ DATE '2010-01-26', 508,
/* 27 */ DATE '2010-01-27', 494,
/* 28 */ DATE '2010-01-28', 507,
/* 29 */ DATE '2010-01-29', 495,
/* 30 */ DATE '2010-01-30', 501,
/* 31 */ DATE '2010-01-31', 486,
/* 32 */ DATE '2010-02-01', 517,
/* 33 */ DATE '2010-02-02', 501,
/* 34 */ DATE '2010-02-03', 505,
/* 35 */ DATE '2010-02-04', 496,
/* 36 */ DATE '2010-02-05', 514,
/* 37 */ DATE '2010-02-06', 492,
/* 38 */ DATE '2010-02-07', 496,
/* 39 */ DATE '2010-02-08', 504,
/* 40 */ DATE '2010-02-09', 486,
/* 41 */ DATE '2010-02-10', 509,
/* 42 */ DATE '2010-02-11', 490,
/* 43 */ DATE '2010-02-12', 499,
/* 44 */ DATE '2010-02-13', 487,
/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,
/* 48 */ DATE '2010-02-17', 494,
/* 49 */ DATE '2010-02-18', 484,
/* 50 */ DATE '2010-02-19', 500,
/* 51 */ DATE '2010-02-20', 488,
/* 52 */ DATE '2010-02-21', 483,
/* 53 */ DATE '2010-02-22', 484,
/* 54 */ DATE '2010-02-23', 497,
/* 55 */ DATE '2010-02-24', 487,
/* 56 */ DATE '2010-02-25', 493,
/* 57 */ DATE '2010-02-26', 515,
/* 58 */ DATE '2010-02-27', 479,
/* 59 */ DATE '2010-02-28', 505,
/* 60 */ DATE '2010-03-01', 518,
/* 61 */ DATE '2010-03-02', 491,
/* 62 */ DATE '2010-03-03', 484,
/* 63 */ DATE '2010-03-04', 510,
/* 64 */ DATE '2010-03-05', 511,
/* 65 */ DATE '2010-03-06', 494,
/* 66 */ DATE '2010-03-07', 493,
/* 67 */ DATE '2010-03-08', 500,
/* 68 */ DATE '2010-03-09', 531,
/* 69 */ DATE '2010-03-10', 497,
/* 70 */ DATE '2010-03-11', 512,
/* 71 */ DATE '2010-03-12', 481,
/* 72 */ DATE '2010-03-13', 500,
/* 73 */ DATE '2010-03-14', 510,
/* 74 */ DATE '2010-03-15', 487,
/* 75 */ DATE '2010-03-16', 511,
/* 76 */ DATE '2010-03-17', 515,
/* 77 */ DATE '2010-03-18', 513,
/* 78 */ DATE '2010-03-19', 517,
/* 79 */ DATE '2010-03-20', 516,
/* 80 */ DATE '2010-03-21', 516,
/* 81 */ DATE '2010-03-22', 504,
/* 82 */ DATE '2010-03-23', 493,
/* 83 */ DATE '2010-03-24', 510,
/* 84 */ DATE '2010-03-25', 486,
/* 85 */ DATE '2010-03-26', 495,
/* 86 */ DATE '2010-03-27', 497,
/* 87 */ DATE '2010-03-28', 481,
/* 88 */ DATE '2010-03-29', 501,
/* 89 */ DATE '2010-03-30', 486,
/* 90 */ DATE '2010-03-31', 512,
/* 91 */ DATE '2010-04-01', 487,
/* 92 */ DATE '2010-04-02', 501,
/* 93 */ DATE '2010-04-03', 490,
/* 94 */ DATE '2010-04-04', 508,
/* 95 */ DATE '2010-04-05', 518,
/* 96 */ DATE '2010-04-06', 478,
/* 97 */ DATE '2010-04-07', 494,
/* 98 */ DATE '2010-04-08', 503,
/* 99 */ DATE '2010-04-09', 494,
/* 100 */ DATE '2010-04-10', 512,

Now we request the execution plan for the query below:

EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-16';

1) First, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-14'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 510 rows (15,810 bytes). The estimated time for
this step is 0.01 seconds.
2) Next, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-15'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,020 rows (31,620 bytes). The estimated time
for this step is 0.01 seconds.
3) We do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-16'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,530 rows (47,430 bytes). The estimated time
for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.02 seconds.

The Optimizer retrieves the requested primary index values via single-AMP primary index accesses, as the number of selected rows is small compared with the total number of table rows.

Did you notice that the estimate is 1,530 rows? Isn’t that strange? If we sum the three histogram values for our three dates (2010-02-14, 2010-02-15, and 2010-02-16), we can see that it should be 1,528 rows:

/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,

513 + 517 + 498 = 1,528
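
Because biased values store exact cardinalities, we can cross-check the true row count directly; the following query should return 1,528:

SELECT COUNT(*) AS RetrievedRows
FROM Statistics_Estimation1
WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-16';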

Furthermore, the Optimizer estimates 510 rows for each single-AMP access. Did we discover a bug in the Optimizer?

I would say no.

What happens is an internal optimization that can produce a slightly inaccurate estimate but speeds up the estimation process: the Optimizer sums the biased values covered by the predicate and divides the result by the number of values. In our example:

1528 / 3 = 509.33

The ceiling of this number is used to distribute the estimate evenly across the single-AMP retrieve steps, i.e., 510, 510, and 510. Since Spool 1 grows cumulatively, the plan shows 510, 1,020, and 1,530 rows. As a result, we end up with a slightly inaccurate estimate of 1,530 instead of 1,528 rows.
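
Here is a minimal arithmetic sketch of this rounding, assuming the averaging behavior described above (it reproduces the observed numbers but is not official Optimizer internals):

-- Sum the biased values in the range, average them, round up,
-- and distribute the result evenly across the retrieve steps.
SELECT
CAST(CEILING((513 + 517 + 498) / 3.000) AS INTEGER) AS EstimatePerStep, -- 510
CAST(CEILING((513 + 517 + 498) / 3.000) AS INTEGER) * 3 AS TotalEstimate; -- 1530 instead of 1528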

We can easily prove this by increasing the number of retrieved rows, forcing the Optimizer to do a full-table scan instead of single-AMP lookups. Extending the date range to 2010-02-28 changes the plan:

EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';

1) First, we lock DWHPRO.Statistics_Estimation1 for read on a
reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.Statistics_Estimation1 for read.
3) We do an all-AMPs RETRIEVE step from DWHPRO.Statistics_Estimation1
by way of an all-rows scan with a condition of (
"(DWHPRO.Statistics_Estimation1.TheDate >= DATE '2010-02-14') AND
(DWHPRO.Statistics_Estimation1.TheDate <= DATE '2010-02-28')")
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with high confidence to be 7,438
rows (230,578 bytes). The estimated time for this step is 0.11
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.11 seconds.

As you can verify by summing the histogram values below (they add up to 7,437), the full-table-scan estimate of 7,438 now matches the number of retrieved rows almost exactly; only a single row of rounding remains.

/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,
/* 48 */ DATE '2010-02-17', 494,
/* 49 */ DATE '2010-02-18', 484,
/* 50 */ DATE '2010-02-19', 500,
/* 51 */ DATE '2010-02-20', 488,
/* 52 */ DATE '2010-02-21', 483,
/* 53 */ DATE '2010-02-22', 484,
/* 54 */ DATE '2010-02-23', 497,
/* 55 */ DATE '2010-02-24', 487,
/* 56 */ DATE '2010-02-25', 493,
/* 57 */ DATE '2010-02-26', 515,
/* 58 */ DATE '2010-02-27', 479,
/* 59 */ DATE '2010-02-28', 505,
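
Again, the true cardinality can be cross-checked directly; since the biased values add up to 7,437, the following query should return 7,437 rows:

SELECT COUNT(*) AS RetrievedRows
FROM Statistics_Estimation1
WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';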

Whenever you read an execution plan, try to figure out how the estimates were derived. Over time, you will get better at interpreting estimates and at solving statistics-related problems.

See also:
Teradata Statistics – A Case Study
Teradata Statistics Extrapolation for Equality Predicates

  • rohit Khattri says:

    Thanks, Roland for a wonderful article.

    It may sound silly, but I didn’t get why taking the average of the histogram values is considered better by the Optimizer than using the values themselves.

    /* 45 */ DATE '2010-02-14', 513,
    /* 46 */ DATE '2010-02-15', 517,
    /* 47 */ DATE '2010-02-16', 498,

    I mean, taking the exact values of the histograms should be better than summing the values, dividing by the number of histograms, rounding up, and then multiplying by the number of histograms used.

    It would have made more sense if it had used the average rows per histogram value (in this case 50000/100 = 500) multiplied by the number of histogram values being retrieved (500 * 3 = 1500), which is clearly not the case.

    Cheers !!

  • Soumen Chatterjee says:

    “By extending the date range to 2016-02-28, the plan changes”
    the Date must be 2010-02-28 and not 2016-02-28.
