6

Deep Dive: Teradata Multi-Column Statistics

Recently somebody asked me how the Optimizer employs multi-column statistics.

Here are the most important facts:

The Optimizer uses multi-column statistics if the query’s WHERE condition references all columns:

COLLECT STATISTICS ON DB.THE_TABLE COLUMN (first_column,second_column);
EXPLAIN
SELECT *  FROM DB.THE_TABLE WHERE first_column =’A’  AND second_column = ‘B’;

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way of
an all-rows scan with a condition of (“(DB.THE_TABLE.second_column =
‘B ‘) AND (DB.THE_TABLE.first_column = ‘A ‘)”) into Spool 1
(group_amps), which is built locally on the AMPs.  The size of
Spool 1 is estimated with low confidence to be 669,939 rows (
1,426,300,131 bytes).  The estimated time for this step is 9.58
seconds.

Above example relates to Teradata 13.10. The query executed without collected Primary Index statistics. The Optimizer’s confidence is “low.” You can increase it to “high” by collecting Primary Index statistics. Primary index statistics give the Optimizer information about the table cardinality.

Teradata 14.10 wouldn’t need Primary Index statistics: Summary statistics hold the information about table cardinalities. They are available as soon as you collect any statistics.
A neat trick for Teradata 14.10 to add summary statistics without collecting statistics: Just execute below statement, and you will achieve “high confidence.” For example:

HELP STATS DB.THE_TABLE; — Adds summary statistics for free…
Changing the order of the statistic columns doesn’t have any impact, neither on Teradata 13.10 nor Teradata 14.10.  Below example delivers the same result as the previous one:

DROP STATISTICS ON DB.THE_TABLE;
COLLECT STATISTICS ON DB.THE_TABLE COLUMN (second_column,first_column);

EXPLAIN
SELECT *  FROM DB.THE_TABLE WHERE first_column =’A’  AND second_column = ‘B’;

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way of
an all-rows scan with a condition of (“(DB.THE_TABLE.second_column =
‘B ‘) AND (DB.THE_TABLE.first_column = ‘A ‘)”) into Spool 1
(group_amps), which is built locally on the AMPs.  The size of
Spool 1 is estimated with low confidence to be 669,939 rows (
1,426,300,131 bytes).  The estimated time for this step is 9.58
seconds.

Can the Optimizer use multi-column statistics if a query only references some of the WHERE condition’s columns?

In Teradata 13.10, the Optimizer only can use multi-column statistic if the query references all WHERE condition columns.

In our example below, the Optimizer’s heuristics will assume that 10% of the table rows will be retrieved:

COLLECT STATS ON DB.The_Table COLUMN (first_column,second_column);

EXPLAIN SELECT * FROM DB.The_Table WHERE first_column = ‘A’;

3) We do an all-AMPs RETRIEVE step from DB.The_Table by way of
an all-rows scan with a condition of (“DB.The_Table.first_column =
‘A ‘”) into Spool 1 (group_amps), which is built locally on the
AMPs.  The size of Spool 1 is estimated with no confidence to be
669,939 rows (1,426,300,131 bytes).  The estimated time for this
step is 9.58 seconds.

In Teradata 14.10 significant improvements are helping the Optimizer to make better decisions.  It uses multi-column statistics if some of the columns are referenced. But only if these columns are at the begin of the statistics definition:

COLLECT STATS ON DB.The_Table COLUMN (first_column,second_column);

EXPLAIN SELECT * FROM DB.The_Table WHERE first_column = ‘A’;

3) We do an all-AMPs RETRIEVE step from DB.The_Table by way of
an all-rows scan with a condition of (“DB.The_Table.first_column =
‘A ‘”) into Spool 1 (group_amps), which is built locally on the
AMPs.  The size of Spool 1 is estimated with low confidence to be
309,034 rows (333,447,686 bytes).  The estimated time for this
step is 0.09 seconds.
Here is an example where the Optimizer can’t use the multi-column statistics: The referenced column is not located at the begin of the multi-column statistics definition:

COLLECT STATS ON DB.THE_TABLE COLUMN (second_column,first_column);

EXPLAIN SELECT * FROM DB.THE_TABLE WHERE first_column = ‘A’;

3) We do an all-AMPs RETRIEVE step from DB.THE_TABLE by way of
an all-rows scan with a condition of (“DB.THE_TABLE.first_column =
‘A ‘”) into Spool 1 (group_amps), which is built locally on the
AMPs.  The size of Spool 1 is estimated with no confidence to be
61,853 rows (66,739,387 bytes).  The estimated time for this step
is 0.05 seconds.

I hope this post clarified the usage of multi-column statistics. As always, feel free to ask if something is still unclear.

 One last plea to our readers: Please use the possibility of rating our postings. Having some feedback is helping us to improve this website. 
Roland Wenzlofsky
 

Roland Wenzlofsky is a graduated computer scientist and Data Warehouse professional working with the Teradata database system for more than 15 years. He is experienced in the fields of banking and telecommunication with a strong focus on performance optimization.

  • Avatar Carlos.. says:

    >>”HELP STATS DB.THE_TABLE; — Adds summary statistics for free…”

    Are you sure about that? It doesn’t work this way in my TD 14.10.

    “COLLECT SUMMARY STATISTICS ON ” is what I use.

    Cheers.

    Carlos.

  • It should work. If you run the statement and there are no statistics defined on the table you will receive the error message:

    “HELP failed. 3624: There are no statistics defined for the table”

    Nevertheless, silently the summary statistics have been refreshed. You can prove this by explaining the SELECT * FROM TABLE before and after the HELP STATS statement.

    • Avatar CarlosAL says:

      Nope.

      If I run a Collect Statistics I get Summary Statistics too, but if i do some INSERTS and execute HELP STATISTICS (statistics presumed refreshed, as you say) and then SHOW SUMMARY STATISTICS VALUES it still shows the rowcount that existed BEFORE the INSERT was run. Then If I run COLLECT SUMMARY STATISTICS and then again SHOW SUMMARY STATISTICS VALUES it shows the real rowcount with the new rows inserted.

      This behaviour happens in my TD 14.10.

      Cheers.

  • Avatar Carlos says:

    I see your point.

    The optimizer does know the summary statistics after HELP STATISTICS but SHOW SUMMARY STATISTICS VALUES… does not show them.

    If I issue a COLLECT SUMMARY both optimizer and SHOW SUMMARY STATISTICS VALUES know the statistics.

    Cheers.

    Carlos.

  • Avatar bhasker says:

    excellent explanation. much helpful

  • >