We have to be careful when applying SQL functions to WHERE condition columns. They often prevent the Optimizer from using available statistics. Without usable statistics, estimations are based on random-AMP sampling (indexed columns) or heuristics (non-indexed columns). The downside of heuristics is described in this article: Teradata Heuristics.
A flawed physical data model is almost always the reason functions end up being applied to WHERE condition columns. We live in a time of cost minimization, but keep the famous saying “bought cheaply will be paid dearly” in mind.
In the example below, basic normalization rules have been violated:
The column CustomerType contains two kinds of information, the customer class (e.g., “Gold”) and the sector (e.g., “Private”), and would better be split into two columns:
COLLECT STATISTICS ON Customer COLUMN(CustomerType);
SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = 'GOLDPRIV';
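To make the example self-contained, here is a minimal sketch of what such a table could look like. The DDL and the fixed-width layout of CustomerType (customer class in positions 1-4, sector starting at position 10) are assumptions derived from the SUBSTR offsets in the query, not taken from a real system:

-- Hypothetical table; layout assumed from the SUBSTR offsets used above.
CREATE TABLE Customer
(
   CustomerId   INTEGER NOT NULL,
   CustomerType CHAR(13)  -- e.g. 'GOLD00001PRIV': class (1-4), filler (5-9), sector (10-13)
) PRIMARY INDEX (CustomerId);

-- Sample rows illustrating the overloaded column:
INSERT INTO Customer VALUES (1, 'GOLD00001PRIV');
INSERT INTO Customer VALUES (2, 'SILV00002CORP');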
The Optimizer can’t use the collected statistics on the column CustomerType. It has to apply simple heuristic estimations.
If we assume that our Customer table contains 100,000 rows and we store 100 “Gold” customers in the “Private” sector, the resulting execution plan will look like this:
“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of
("((SUBSTR(Customer.CustomerType,1 ,4 ))||(SUBSTR(Customer.CustomerType,10 ,4 )))= 'GOLDPRIV'")
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with no confidence to be 10,000 rows.”
The Optimizer assumes that 10% of the table rows will be selected (heuristics). This estimation is entirely wrong; in reality, we are selecting only 100 rows!
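One way to see this gap yourself is to compare the estimate in the EXPLAIN output with an actual row count (a quick check, using the example table and predicate from above):

-- The plan shows the heuristic estimate ("no confidence", 10,000 rows):
EXPLAIN SELECT * FROM Customer
WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = 'GOLDPRIV';

-- The real number of qualifying rows (100 in our example):
SELECT COUNT(*) FROM Customer
WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = 'GOLDPRIV';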
Our example is straightforward. The execution plan contains only one retrieve step. The WHERE condition column is not indexed, and the estimation doesn’t impact the execution plan. Assuming that no alternative access paths (join indexes, secondary indexes) are available, the plan will always be the same: a full table scan.
The execution plans of queries are usually much more complex than in our simple example, and the above retrieve step would be just one step out of many.
Unlike in our simple example, an entirely wrong estimation in one of many steps can then propagate and lead to a poorly performing query.
How can we improve our example query? We have to improve the estimations for the retrieving step.
Below is an example of how the query could be rewritten. The optimized query allows the Optimizer to use the collected statistics on column CustomerType, and as a consequence, the retrieving step evaluates to “low confidence.” The estimated number of selected rows reflects reality:
SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) = 'GOLD' AND SUBSTR(CustomerType,10,4) = 'PRIV';
“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of ("((SUBSTR(Customer.CustomerType,1 ,4 ))= 'GOLD')
AND ((SUBSTR(Customer.CustomerType,10 ,4 ))= 'PRIV')")
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 100 rows.”
There is a basic rule you should follow when designing your queries using complex WHERE condition expressions:
If functions on a WHERE condition column are nested, the SUBSTR or LIKE operation should always be applied first, i.e., directly to the column.
The reason: the Optimizer can use statistics on a column even when only part of a character string is queried; the way the statistics histograms store their information makes this possible.
Unfortunately, statistics can’t be used if a TRIM function is applied first (this behavior is not fully consistent; there are exceptions, such as the functions UPPER and LOWER).
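Here is a small sketch of the rule, reusing the example table from above; whether statistics are actually usable in each variant can depend on the Teradata release, so treat it as an illustration of the nesting order, not a guarantee:

-- SUBSTR applied directly to the column: collected statistics can be used.
SELECT * FROM Customer WHERE TRIM(SUBSTR(CustomerType,1,4)) = 'GOLD';

-- TRIM applied to the column first: statistics are typically not usable.
SELECT * FROM Customer WHERE SUBSTR(TRIM(CustomerType),1,4) = 'GOLD';

-- LIKE on the leading part of the column can also make use of the histograms.
SELECT * FROM Customer WHERE CustomerType LIKE 'GOLD%';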
The best advice I can give: Before ending up fixing bad queries, try to fix your data model!
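To make that advice concrete, here is one possible normalized design; the table and column names below are illustrative assumptions, not part of the original example:

-- One attribute per column, each with its own statistics.
CREATE TABLE Customer_New
(
   CustomerId    INTEGER NOT NULL,
   CustomerClass CHAR(4),  -- e.g. 'GOLD'
   Sector        CHAR(4)   -- e.g. 'PRIV'
) PRIMARY INDEX (CustomerId);

COLLECT STATISTICS ON Customer_New COLUMN(CustomerClass);
COLLECT STATISTICS ON Customer_New COLUMN(Sector);

-- No functions in the WHERE condition; the Optimizer can use both statistics.
SELECT * FROM Customer_New WHERE CustomerClass = 'GOLD' AND Sector = 'PRIV';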
Hi –
Thanks for the post.
This is fine for retrieval SQL, but if the SUBSTRING() is part of a join condition, how will the statistics histogram behave?
SUBSTRING() consumes more CPU. What equivalent function would produce the same result with less CPU and I/O?