How Teradata Optimizer Uses Multi-Column Statistics

A recent question came in about how the Teradata Optimizer uses multi-column statistics.

Here are the essential details:

The Optimizer uses multi-column statistics when the query’s WHERE clause covers all columns.

COLLECT STATISTICS ON DB.THE_TABLE COLUMN (first_column,second_column);
EXPLAIN
SELECT *  FROM DB.THE_TABLE WHERE first_column ='A'  AND second_column = 'B';

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way of
an all-rows scan with a condition of ("(DB.THE_TABLE.second_column =
'B ') AND (DB.THE_TABLE.first_column = 'A ')") into Spool 1
(group_amps), which is built locally on the AMPs.  The size of
Spool 1 is estimated with low confidence to be 669,939 rows (
1,426,300,131 bytes).  The estimated time for this step is 9.58
seconds.

This example pertains to Teradata 13.10. The query was executed without gathering Primary Index statistics, resulting in low confidence from the Optimizer. To boost confidence to a high level, collecting Primary Index statistics is necessary. These statistics provide the Optimizer with details on table cardinality.

Primary Index statistics are unnecessary in Teradata 14.10 due to the availability of summary statistics that include table cardinalities. These statistics are obtained automatically when any statistics are collected. To add summary statistics in Teradata 14.10 without collecting statistics, execute the following statement to achieve a high level of confidence:

HELP STATS DB.THE_TABLE; -- Adds summary statistics for free…

Changing the order of statistic columns has no impact on Teradata 13.10 or Teradata 14.10. The example below delivers the same result as the previous one:

DROP STATISTICS ON DB.THE_TABLE;
COLLECT STATISTICS ON DB.THE_TABLE COLUMN (second_column,first_column);

EXPLAIN
SELECT *  FROM DB.THE_TABLE WHERE first_column ='A'  AND second_column = 'B';

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way of
an all-rows scan with a condition of ("(DB.THE_TABLE.second_column =
'B ') AND (DB.THE_TABLE.first_column = 'A ')") into Spool 1
(group_amps), which is built locally on the AMPs.  The size of
Spool 1 is estimated with low confidence to be 669,939 rows (
1,426,300,131 bytes).  The estimated time for this step is 9.58
seconds.

Can the Optimizer use multi-column statistics if a query only references some of the WHERE condition’s columns?

In Teradata 13.10, the Optimizer can only use a multi-column statistic when all columns in the WHERE condition are referenced in the query.

The Optimizer’s heuristics assume that Teradata will retrieve 10% of the table’s rows in the following example:

COLLECT STATS ON DB.The_Table COLUMN (first_column,second_column);

EXPLAIN SELECT * FROM DB.The_Table WHERE first_column = ‘A’;

3) We do an all-AMPs RETRIEVE step from DB.The_Table by way of
an all-rows scan with a condition of (“DB.The_Table.first_column =
‘A'”) into Spool 1 (group_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with no confidence to be
669,939 rows (1,426,300,131 bytes). The estimated time for this
step is 9.58 seconds.

Teradata 14.10 introduces substantial enhancements that help the Optimizer make more informed decisions. Specifically, the Optimizer uses multi-column statistics when columns are referenced, but only if the referenced columns are located at the start of the statistical definition.

COLLECT STATS ON DB.The_Table COLUMN (first_column,second_column);

EXPLAIN SELECT * FROM DB.The_Table WHERE first_column = ‘A’;

3) We do an all-AMPs RETRIEVE step from DB.The_Table by way of
an all-rows scan with a condition of ("DB.The_Table.first_column =
'A'") into Spool 1 (group_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with low confidence to be
309,034 rows (333,447,686 bytes).  The estimated time for this
step is 0.09 seconds.
Here is an example where the Optimizer can't use the multi-column statistics: The referenced column is not located at the beginning of the definition of the multi-column statistics:

COLLECT STATS ON DB.THE_TABLE COLUMN (second_column,first_column);

EXPLAIN SELECT * FROM DB.THE_TABLE WHERE first_column = 'A';

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way of
an all-rows scan with a condition of ("DB.THE_TABLE.first_column =
'A'") into Spool 1 (group_amps), which is built locally on the
AMPs. The size of Spool 1 is estimated with no confidence to be
61,853 rows (66,739,387 bytes).  The estimated time for this step
is 0.05 seconds.

I hope this post has clarified how multi-column statistics are used. Feel free to ask if anything remains unclear.

Related Services

⚡ Need Help Optimizing Your Data Platform?

We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.

Explore Our Services →

📋 Considering a Move From Teradata?

Get a personalized migration roadmap in 2 minutes. We have migrated billions of rows from Teradata to Snowflake, Databricks, and more.

Free Migration Assessment →

6 thoughts on “How Teradata Optimizer Uses Multi-Column Statistics”

bhasker

07/03/2015 at 7:36 am

excellent explanation. much helpful
Carlos

06/26/2015 at 10:21 am

I see your point.

The optimizer does know the summary statistics after HELP STATISTICS but SHOW SUMMARY STATISTICS VALUES… does not show them.

If I issue a COLLECT SUMMARY both optimizer and SHOW SUMMARY STATISTICS VALUES know the statistics.

Cheers.

Carlos.
Carlos..

06/24/2015 at 1:16 pm

>>”HELP STATS DB.THE_TABLE; — Adds summary statistics for free…”

Are you sure about that? It doesn’t work this way in my TD 14.10.

“COLLECT SUMMARY STATISTICS ON ” is what I use.

Cheers.

Carlos.
Roland Wenzlofsky

06/24/2015 at 1:28 pm

It should work. If you run the statement and there are no statistics defined on the table you will receive the error message:

“HELP failed. 3624: There are no statistics defined for the table”

Nevertheless, silently the summary statistics have been refreshed. You can prove this by explaining the SELECT * FROM TABLE before and after the HELP STATS statement.
- CarlosAL
  
  06/25/2015 at 7:17 am
  
  Nope.
  
  If I run a Collect Statistics I get Summary Statistics too, but if I do some INSERTS and execute HELP STATISTICS (statistics presumed refreshed, as you say) and then SHOW SUMMARY STATISTICS VALUES it still shows the row count that existed BEFORE the INSERT was run. Then If I run COLLECT SUMMARY STATISTICS and then again SHOW SUMMARY STATISTICS VALUES it shows the real row count with the new rows inserted.
  
  This behavior happens in my TD 14.10.
  
  Cheers.
  - Roland Wenzlofsky
    
    06/25/2015 at 12:54 pm
    
    Strange. I tried it again and it works. If you run below SQL, what version/release to you have?
    
    SELECT * FROM DBC.DBCINFO
    
    VERSION 14.10.05.05
    RELEASE 14.10.05.05

How Teradata Optimizer Uses Multi-Column Statistics

Can the Optimizer use multi-column statistics if a query only references some of the WHERE condition’s columns?

📊 Data Platform Migration Survey

Stay Ahead in Data Warehousing

6 thoughts on “How Teradata Optimizer Uses Multi-Column Statistics”

Leave a Comment Cancel reply