In the past, we have seen many discussions about which statement is better from a performance point of view:
SELECT <COLUMN> GROUP BY 1 or
SELECT DISTINCT <COLUMN>
Many personal experiences are announced. Cause and effect are often confused. People tend to create just one test setup, and based on this, general rules are concluded.
The guessing is over now. The truth is:
None of these statements is better in general. Data demographics decide the winner.
To understand in which situations you should use DISTINCT and when it should be the Teradata GROUP BY, you have to know how Teradata executes each of these statements.
In principle, one can say, DISTINCT means that data is distributed across the responsible AMPs immediately, and rows will be sorted (to remove duplicates). In the case of GROUP BY, Teradata first does an AMP local grouping and distributes the remaining rows afterward.
After understanding basic principles, it is easy to conclude when to use which statement on a Teradata system:
If there are many distinct rows for the columns used for grouping, there is little value in doing the AMP local aggregation. First, it is better that you use the DISTINCT statement.
If there are just a few rows for the columns used for grouping, the AMP local grouping step will cut the number of rows moved to the AMPs in the second (last) aggregation step. This is the situation when you should use the GROUP BY statement.
Just one remark:
If of a high skew on the grouped columns, you might end up with an AMP local “out of spool space” situation (as many rows may be moved to a few or just one AMP). This is the one exception where you should use the GROUP BY statement, although in principle, the DISTINCT statement should be utilized.
I hope many of your guessings are over. There is no winner between DISTINCT and GROUP BY.
For 14.0,14.10,15, these two are basically the same thing