Teradata GROUP BY or DISTINCT – stop guessing what’s better
I the past have seen many discussions about which statement is better from a performance point of view:
SELECT <COLUMN> GROUP BY 1 or
SELECT DISTINCT <COLUMN>
Many personal experiences are announced. Cause and effect are often confused. People tend to create just one test setup and based on this, general rules are concluded.
The guessing is over now. The truth is:
None of this statements is better in general. Data demographics decide for the winner.
To understand, in which situations you should use DISTINCT and when it should be the Teradata GROUP BY, you have to understand how Teradata executes each of this statements.
In principle one can say, DISTINCT means that data is distributed across the responsible AMPs immediately, rows will be sorted (to remove duplicates). In the case of GROUP BY, Teradata at first does an AMP local grouping and is distributing the remaining rows afterward.
After understanding basic principles, it is easy to conclude when to use which statement on a Teradata system:
If there are many distinct rows for the columns used for grouping, there is little value in doing the AMP local aggregation first, it is better that you use the DISTINCT statement.
If there are just a few rows for the columns used for grouping, the AMP local grouping step will cut a lot the number of rows moved to the AMPs in the second (last) aggregation step. This is the situation when you should use the GROUP BY statement.
Just one remark:
If of a high skew on the grouped columns it could happen, that you end up with an AMP local “out of spool space” situation (as many rows may be moved to a few or just one AMP). This is the one exception where you should use the GROUP BY statement although in principle the DISTINCT statement should be utilized.
I hope for many of you guessing is over There is no winner between DISTINCT and GROUP BY.