Teradata introduced several new features, including one that caught our attention: object use counts (OUC). This feature optimizes the calculation of extrapolated statistics, improving query performance significantly. Before version 13.10, changes made by DML statements were not logged, and the optimizer relied solely on dynamic amp sampling, leading to incorrect estimates for skewed tables. Additionally, Teradata 13.10 and below relied on heuristics for extrapolation, which was a major drawback. This example will explain how Teradata 14.10 improved the extrapolation process. First, we create a test table, populate it with data, and gather statistics on relevant columns.
CREATE TABLE TheDB.TheTable ( PK INTEGER NOT NULL, TheCol CHAR(01) ) PRIMARY INDEX (PK); INSERT INTO TheDB.TheTable SELECT ROW_NUMBER() OVER (ORDER BY 1) AS PK, TRIM(CASE WHEN RANDOM(1,1000) <= 500 THEN 'A' ELSE 'B' END) AS TheCol, FROM ( SELECT NULL AS X FROM TheDB.AnyBigTable SAMPLE 200000 ) x; COLLECT STATISTICS ON TheDB.TheTable COLUMN(PK); COLLECT STATISTICS ON TheDB.TheTable COLUMN(TheCol);
Teradata Statistics Extrapolation with up to version 13.10
EXPLAIN SELECT * FROM TheDB.TheTable WHERE TheCol = 'C'; 3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of an all-rows scan with a condition of ("TheDB.TheTable.TheCol = 'C '") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 20,000 rows (42,580,000 bytes). The estimated time for this step is 0.31 seconds.
We choose a value in the WHERE clause that is absent from the queried column. The optimizer endeavors to prevent underestimating the table’s cardinality.
The heuristic applied in this instance assumes that 10% of the rows in the result set will be present, even with high confidence. However, in this case, the accurate estimation is that there will be 0 rows. The rationale behind this approach is straightforward: overestimating the table size is preferable to underestimating it, as it will result in a conservative execution plan.
Teradata Statistics Extrapolation starting with Teradata version 14.10
Teradata version 14.10 introduced a novel feature, the UDI counts, which enables the optimizer to generate more accurate estimates. These counts record all alterations to tables carried out by DML statements, precisely determining changes made since the previous statistics collection.
The optimizer refrains from extrapolation as it correctly estimates the number of rows without any detected changes. However, in certain cases, it delivers exactly one row instead of delivering no rows, which has remained unexplained until now. Kindly comment in the blog post if you have insights on the matter. Here is the same select statement example executed on Teradata version 16.20:
3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of an all-rows scan with a condition of ("TheDB.TheTable.TheCol = 'C '") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 1 row (1,079 bytes). The estimated time for this step is 0.04 seconds.
Based on the aforementioned data, we suggest activating Object Use Counts for all significant databases by beginning with Teradata 14.10. The advantages considerably outweigh the maintenance expenses. Object Use Counts utilize minimal resources as rows are cached and written periodically (typically every 10 minutes or when the cache reaches full capacity) to the DBC.ObjectUsage table. If disabled, newer Teradata versions entirely depend on dynamic AMP sampling.