When Teradata introduced several new features with version 14.10, one of them, based on the so-called object use counts (OUC), caught our attention. This feature can significantly improve our query performance by optimizing the process of calculating extrapolated statistics. Up to version 13.10 of Teradata, the changes made by DML statements are not logged; the optimizer relies solely on dynamic amp sampling. As we know, this can lead to incorrect estimates for skewed tables. Another major drawback is how Teradata 13.10 and below performs extrapolation by using heuristics. We will explain with an example of how Teradata 14.10 improved the extrapolation process. First, we create our test table, populate it with data, and collect statistics on the relevant columns:
CREATE TABLE TheDB.TheTable ( PK INTEGER NOT NULL, TheCol CHAR(01) ) PRIMARY INDEX (PK); INSERT INTO TheDB.TheTable SELECT ROW_NUMBER() OVER (ORDER BY 1) AS PK, TRIM(CASE WHEN RANDOM(1,1000) <= 500 THEN 'A' ELSE 'B' END) AS TheCol, FROM ( SELECT NULL AS X FROM TheDB.AnyBigTable SAMPLE 200000 ) x; COLLECT STATISTICS ON TheDB.TheTable COLUMN(PK); COLLECT STATISTICS ON TheDB.TheTable COLUMN(TheCol);
Teradata Statistics Extrapolation with up to version 13.10
EXPLAIN SELECT * FROM TheDB.TheTable WHERE TheCol = 'C'; 3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of an all-rows scan with a condition of ("TheDB.TheTable.TheCol = 'C '") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 20,000 rows (42,580,000 bytes). The estimated time for this step is 0.31 seconds.
In the query above, we select a value in the WHERE condition that does not exist in the column. The optimizer will always try to avoid underestimating the table cardinality.
In this case, it will apply a simple heuristic that assumes 10% of rows in the result set, even with high confidence. The correct estimation would be 0 rows. The idea behind this is clear: it is better to overestimate the size of the table than to underestimate it; a conservative execution plan is a result.
Teradata Statistics Extrapolation starting with Teradata version 14.10
Now we come to a new feature that allows the optimizer since Teradata version 14.10 to produce better estimates: the UDI counts. The UDI counts log all changes to tables made by DML statements. This allows the changes since the last time statistics were collected to be determined precisely.
The optimizer does not use extrapolation, because, in our example, it did not detect any change, and estimates the number of rows correctly (well, almost correct; often instead of 0 rows, exactly one row is delivered as result estimate, why I could not explain until today; if you know the reason, please leave a comment at the blog post). Below you can see again our example select statement, this time executed on a Teradata version 16.20:
3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of an all-rows scan with a condition of ("TheDB.TheTable.TheCol = 'C '") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 1 row (1,079 bytes). The estimated time for this step is 0.04 seconds.
Based on the facts mentioned above, we recommend starting with Teradata 14.10 to always activate Object Use Counts for all important databases. The costs for maintenance are minimal compared to the benefits. Object Use Counts require negligible resources because rows are cached and written in intervals (by default every 10 minutes or when the cache is full) to the table DBC.ObjectUsage. If OUC is disabled, the newer Teradata versions will have to rely solely on dynamic AMP sampling.