fbpx

We have to be careful when applying SQL functions on WHERE condition columns.  Many times they will take the possibility from the Optimizer to use available statistics. Without collected statistics, estimations are based on random-AMP sampling (indexed columns) or Heuristics (non-indexed columns). The downside of Heuristics is described in this article: Teradata Heuristics.

Wrong physical data models are almost always the reason behind using functions on WHERE condition columns. We live in a time of cost minimization,  but keeping the famous sentence “bought cheaply will be paid dearly” in mind.

In the example below, basic normalization rules have been violated:

The column CustomerType contains two kinds of information, and should be better split into two columns:

COLLECT STATISTICS ON Customer COLUMN(CustomerType);
SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = ‚GOLDPRIV‘;

The Optimizer can’t use the collected statistics on column CustomerType.  It has to apply simple heuristic estimations.

If we assume that our Customer table contains 100,000 rows and we store 100 “Gold” customers in the “Private” sector, the resulting execution plan will look like this:

“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of
(“((SUBSTR(Customer .ProductCode,1 ,4 ))||(SUBSTR(Customer .ProductCode ,10 ,4)))=‚GOLDPRIV'”)
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with no confidence to be 10,000 rows.

The Optimizer assumes that 10% of the table rows will be selected (Heuristics). This estimation is completely wrong. In reality, we are selecting only 100 rows!

Our example is very simple. The execution plan contains only one retrieve step. The WHERE condition column is not indexed, and the estimation doesn’t impact the execution plan. Assuming that there are not alternative access paths (join indexes, secondary indexes) available, the plan will always be the same:  A full table scan.

The execution plans of queries are usually much more complex than in our simple example, and the above retrieve step would be just one step out of many.
Like the one in our example, an entirely wrong estimation would most likely lead to a bad performing query.

How can we improve our example query? We have to improve the estimations for the retrieving step.

Below is an example of how the query could be rewritten. The optimized query allows the Optimizer to use the collected statistics on column CustomerType, and as a consequence, the retrieving step evaluates to “low confidence.” The estimated number of selected rows is reflecting reality:

SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) = ‚GOLD‘ AND SUBSTR(CustomerType,10,4) = ‚PRIV‘;

“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of (“((SUBSTR(Customer.CustomerType,1 ,3 ))= ‘001’)
AND ((SUBSTR(Customer .CustomerType,6 ,3 ))= ‘999’)”)
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 100 rows.

There is a basic rule you should follow when designing your queries using complex WHERE condition expressions:

If WHERE condition column functions are nested, the SUBSTR and LIKE functions should always be applied first.

The reason: The Optimizer can use statistics of columns where only a part of a character string is queried – the way the statistic histograms are storing information makes this possible.

Unfortunately, statistics can’t be used if a TRIM function is the first applied (not always, there are some exceptions like the functions UPPER or LOWER).

The best advice I can give: Before ending up fixing bad queries, try to fix your data model!

__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"62516":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"62516":{"val":"var(--tcb-skin-color-0)"}},"gradients":[]},"original":{"colors":{"62516":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"b4fbe":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"b4fbe":{"val":"rgb(241, 99, 52)"}},"gradients":[]},"original":{"colors":{"b4fbe":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
Previous Article
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"b4fbe":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"b4fbe":{"val":"rgb(241, 99, 52)"}},"gradients":[]},"original":{"colors":{"b4fbe":{"val":"rgb(19, 114, 211)","hsl":{"h":210,"s":0.83,"l":0.45}}},"gradients":[]}}]}__CONFIG_colors_palette__
Next Article
__CONFIG_colors_palette__{"active_palette":0,"config":{"colors":{"62516":{"name":"Main Accent","parent":-1}},"gradients":[]},"palettes":[{"name":"Default Palette","value":{"colors":{"62516":{"val":"rgb(255, 0, 0)"}},"gradients":[]}}]}__CONFIG_colors_palette__
GET OUR TERADATA BOOK

Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.

  • Hi –

    Thanks for the post.

    This is fine for the retrieval SQL if the substring() is part of join operation how the stats histogram will act.

    SUBSTRING() consumes more CPU, What is the equivalent function would result in the same with less CPU and I/O.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >