Functions applied to join columns are always the result of a wrong physical data model.
It is often the case that not even the first normal form was adhered to in the data model.
But apart from causing the characteristic abnormalities and problems, it also hurts query performance. Here is a summary:
What is the effect of functions on the join performance?
Most functions prevent the optimizer from using existing statistics on the join columns when functions are applied to them.
Functions such as SUBSTR(), TRIM(), COALESCE(), CASE statements, and often the LIKE function cause the optimizer to estimate the number of rows based on heuristics.
What should I do if I cannot prevent the use of functions?
There are different techniques to prevent bad join performance. All of them are just workarounds for a wrong data model.
One way is to work with volatile tables by normalizing the columns in a volatile table before the join.
What other disadvantages do functions on join columns have?
Most functions force the optimizer to perform a full table scan.
Are there functions that have no negative impact on performance?
Yes, for example, the optimizer statistic can use histograms if a LIKE function is used in which the placeholder is at the end, like in this statement:
SELECT * FROM Customer WHERE LastName LIKE ‘Wen%’;
But here, too, various methods can be used to improve the implementation plans:
LIKE ‘a%’ versus SUBSTR(column,1,1) = ‘a
In conclusion, I would like to stress that the problems mentioned above are mainly due to a poor data model. So this is the primary starting point. Workarounds are seldom suitable for long-term survival.
Functions on join columns should be avoided only in the join statement? Or throughout the entire SQL statement?
I’m asking from a statistics usage perspective.
If my join is clean but I’m doing a function on join column while projecting it in the SELECT part, does the optimizer still not use column’s statistics?
Only in the join part. The select list doesn’t matter.
Still, if you can avoid functions on your WHERE conditions it’s an advantage.
Since Teradata 14.10 you can apply statistics on functions, still, they are most likely only useful for exactly the one SQL you designed them for. On the other hand, simple column statistics are normally useful in many queries.
If no statistics can be used, heuristics are applied, which can be completely wrong as they assume a fixed amount of rows being returned.