fbpx

Improved Teradata Statistics Extrapolation

By Roland Wenzlofsky

August 24, 2020


When Teradata introduced several new features with version 14.10, one of them, based on the so-called object use counts (OUC), caught our attention. This feature can significantly improve our query performance by optimizing the process of calculating extrapolated statistics. Up to version 13.10 of Teradata, the changes made by DML statements are not logged; the optimizer relies solely on dynamic amp sampling. As we know, this can lead to incorrect estimates for skewed tables. Another major drawback is how Teradata 13.10 and below performs extrapolation by using heuristics. We will explain with an example of how Teradata 14.10 improved the extrapolation process. First, we create our test table, populate it with data, and collect statistics on the relevant columns:

CREATE TABLE TheDB.TheTable
(
PK INTEGER NOT NULL,
TheCol CHAR(01)
) PRIMARY INDEX (PK);

INSERT INTO TheDB.TheTable
SELECT
ROW_NUMBER() OVER (ORDER BY 1) AS PK,
TRIM(CASE WHEN RANDOM(1,1000) <= 500 THEN 'A' ELSE 'B' END) AS TheCol,
FROM
(
SELECT NULL AS X FROM TheDB.AnyBigTable SAMPLE 200000
) x;

COLLECT STATISTICS ON TheDB.TheTable COLUMN(PK);
COLLECT STATISTICS ON TheDB.TheTable COLUMN(TheCol);

Teradata Statistics Extrapolation with up to version 13.10

EXPLAIN SELECT * FROM TheDB.TheTable  WHERE TheCol  = 'C';

 3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of an all-rows scan with a condition of ("TheDB.TheTable.TheCol =  'C '") into Spool 1 (group_amps),
which is built locally on the   AMPs.  The size of Spool 1 is estimated with high confidence to be  20,000 rows (42,580,000 bytes).
The estimated time for this step is 0.31 seconds.

In the query above, we select a value in the WHERE condition that does not exist in the column. The optimizer will always try to avoid underestimating the table cardinality.

In this case, it will apply a simple heuristic that assumes 10% of rows in the result set, even with high confidence. The correct estimation would be 0 rows. The idea behind this is clear: it is better to overestimate the size of the table than to underestimate it; a conservative execution plan is a result.

Teradata Statistics Extrapolation starting with Teradata version 14.10

Now we come to a new feature that allows the optimizer since Teradata version 14.10 to produce better estimates: the UDI counts. The UDI counts log all changes to tables made by DML statements. This allows the changes since the last time statistics were collected to be determined precisely.

The optimizer does not use extrapolation, because, in our example, it did not detect any change, and estimates the number of rows correctly (well, almost correct; often instead of 0 rows, exactly one row is delivered as result estimate, why I could not explain until today; if you know the reason, please leave a comment at the blog post). Below you can see again our example select statement, this time executed on a Teradata version 16.20:

3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of
an all-rows scan with a condition of ("TheDB.TheTable.TheCol = 'C '") into Spool 1 (group_amps),
which is built locally on the AMPs.  The size of Spool 1 is estimated with high confidence to be 1 row (1,079 bytes).
The estimated time for this step is 0.04 seconds.

Based on the facts mentioned above, we recommend starting with Teradata 14.10 to always activate Object Use Counts for all important databases. The costs for maintenance are minimal compared to the benefits. Object Use Counts require negligible resources because rows are cached and written in intervals (by default every 10 minutes or when the cache is full) to the table DBC.ObjectUsage. If OUC is disabled, the newer Teradata versions will have to rely solely on dynamic AMP sampling.

__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"62516":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"62516":{"val":"var(--tcb-skin-color-0)"}},"gradients":[]},"original":{"colors":{"62516":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"b4fbe":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"b4fbe":{"val":"rgb(241, 99, 52)"}},"gradients":[]},"original":{"colors":{"b4fbe":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
Previous Article
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"b4fbe":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"b4fbe":{"val":"rgb(241, 99, 52)"}},"gradients":[]},"original":{"colors":{"b4fbe":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
Next Article
Buy the Book Teradata Query Performance Tuning

Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.

  • how come the second time with high confidence e with one row whereas there is now for the result in fact?
    ideally, it should say zero rows or no rows returned right?

    • You are right. Nevertheless, usually, the result is at least estimated to be one row. I don’t know why but I assume (and I am guessing here) that the reasons might be:

      (1)

      Division by zero error handling when creating the execution plan can be avoided.

      (2)

      It makes calculations much easier if at least 1 row is assumed. For example:
      How to estimated the percentage for 0 rows being returned? Assuming one row simplifies things a lot.

      But as I said, I am just guessing here and this is the reason I could think about it.

      Maybe somebody working at R & D in Teradata is reading this and has a better answer?

      Roland

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >