Teradata introduced several new features, including one that caught our attention: object use counts (OUC). This feature optimizes the calculation of extrapolated statistics, improving query performance significantly. Before version 13.10, changes made by DML statements were not logged, and the optimizer relied solely on dynamic amp sampling, leading to incorrect estimates for skewed tables. Additionally, Teradata 13.10 and below relied on heuristics for extrapolation, which was a major drawback. This example will explain how Teradata 14.10 improved the extrapolation process. First, we create a test table, populate it with data, and gather statistics on relevant columns.
CREATE TABLE TheDB.TheTable ( PK INTEGER NOT NULL, TheCol CHAR(01) ) PRIMARY INDEX (PK); INSERT INTO TheDB.TheTable SELECT ROW_NUMBER() OVER (ORDER BY 1) AS PK, TRIM(CASE WHEN RANDOM(1,1000) <= 500 THEN 'A' ELSE 'B' END) AS TheCol, FROM ( SELECT NULL AS X FROM TheDB.AnyBigTable SAMPLE 200000 ) x; COLLECT STATISTICS ON TheDB.TheTable COLUMN(PK); COLLECT STATISTICS ON TheDB.TheTable COLUMN(TheCol);
Teradata Statistics Extrapolation with up to version 13.10
EXPLAIN SELECT * FROM TheDB.TheTable WHERE TheCol = 'C'; 3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of an all-rows scan with a condition of ("TheDB.TheTable.TheCol = 'C '") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 20,000 rows (42,580,000 bytes). The estimated time for this step is 0.31 seconds.
We choose a value in the WHERE clause that is absent from the queried column. The optimizer endeavors to prevent underestimating the table’s cardinality.
The heuristic applied in this instance assumes that 10% of the rows in the result set will be present, even with high confidence. However, in this case, the accurate estimation is that there will be 0 rows. The rationale behind this approach is straightforward: overestimating the table size is preferable to underestimating it, as it will result in a conservative execution plan.
Teradata Statistics Extrapolation starting with Teradata version 14.10
Teradata version 14.10 introduced a novel feature, the UDI counts, which enables the optimizer to generate more accurate estimates. These counts record all alterations to tables carried out by DML statements, precisely determining changes made since the previous statistics collection.
The optimizer refrains from extrapolation as it correctly estimates the number of rows without any detected changes. However, in certain cases, it delivers exactly one row instead of delivering no rows, which has remained unexplained until now. Kindly comment in the blog post if you have insights on the matter. Here is the same select statement example executed on Teradata version 16.20:
3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of an all-rows scan with a condition of ("TheDB.TheTable.TheCol = 'C '") into Spool 1 (group_amps), which is built locally on the AMPs. The size of Spool 1 is estimated with high confidence to be 1 row (1,079 bytes). The estimated time for this step is 0.04 seconds.
Based on the aforementioned data, we suggest activating Object Use Counts for all significant databases by beginning with Teradata 14.10. The advantages considerably outweigh the maintenance expenses. Object Use Counts utilize minimal resources as rows are cached and written periodically (typically every 10 minutes or when the cache reaches full capacity) to the DBC.ObjectUsage table. If disabled, newer Teradata versions entirely depend on dynamic AMP sampling.
how come the second time with high confidence e with one row whereas there is now for the result in fact?
ideally, it should say zero rows or no rows returned right?
You are right. Nevertheless, usually, the result is at least estimated to be one row. I don’t know why but I assume (and I am guessing here) that the reasons might be:
Division by zero error handling when creating the execution plan can be avoided.
It makes calculations much easier if at least 1 row is assumed. For example:
How to estimated the percentage for 0 rows being returned? Assuming one row simplifies things a lot.
But as I said, I am just guessing here and this is the reason I could think about it.
Maybe somebody working at R & D in Teradata is reading this and has a better answer?