The question of which of the following two statements performs better has been debated many times:

SELECT <COLUMN> FROM <TABLE> GROUP BY 1
or
SELECT DISTINCT <COLUMN> FROM <TABLE>

Many people share personal experience but misattribute cause and effect: they build a single test scenario and draw sweeping conclusions from it.

The guessing can stop. Here is the answer:

Which statement performs better depends on the data demographics.
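A quick way to look at the demographics of a column is to compare the total row count with the number of distinct values. The following is a minimal sketch, assuming a hypothetical table sales_fact with a column customer_id (both names are illustrative and not from the original discussion):

```sql
-- Hypothetical table and column names, for illustration only.
-- A ratio close to 1 means almost every row has its own value;
-- a ratio close to 0 means a few values repeat many times.
SELECT COUNT(*)                          AS total_rows,
       COUNT(DISTINCT customer_id)       AS distinct_values,
       CAST(COUNT(DISTINCT customer_id) AS FLOAT)
         / NULLIFZERO(COUNT(*))          AS distinct_ratio
FROM sales_fact;
```

On a large table this query is itself expensive, so it may be better to run it against a sample or to rely on collected statistics where they exist.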

To decide between DISTINCT and GROUP BY in Teradata, you need to understand how each statement is executed.

DISTINCT redistributes the rows to the responsible AMPs right away and removes the duplicates there, while GROUP BY first groups the rows locally on each AMP and redistributes only the remaining rows.
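To see which path the optimizer takes on a given system, the plans of both forms can be compared with EXPLAIN. This is a sketch using a hypothetical table sales_fact and column customer_id; the exact plan wording differs between releases, so no output is reproduced here:

```sql
-- Hypothetical names, for illustration only.
-- Compare the two plans step by step and check whether rows are
-- aggregated before or after they are redistributed across the AMPs.
EXPLAIN SELECT DISTINCT customer_id FROM sales_fact;

EXPLAIN SELECT customer_id FROM sales_fact GROUP BY 1;
```

If both plans turn out to be identical, the optimizer has already made the choice and the way the statement is written does not matter for that query.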

Once the basic principles are understood, it becomes simple to identify the appropriate statement to use on a Teradata system.

AMP-local aggregation does not help when the grouping columns contain many distinct values, because hardly any rows are eliminated before the redistribution. In that case, use DISTINCT.

When the grouping columns contain only a few distinct values, use GROUP BY: the AMP-local grouping step reduces the number of rows that have to be redistributed for the final aggregation step.
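The two cases can be sketched side by side. The table sales_fact and the columns customer_id (assumed to have many distinct values) and region_code (assumed to have only a few) are hypothetical:

```sql
-- Hypothetical names and demographics, for illustration only.

-- Many distinct values: local grouping would remove few rows,
-- so redistributing first with DISTINCT is the cheaper path.
SELECT DISTINCT customer_id FROM sales_fact;

-- Few distinct values: local grouping collapses most rows on each
-- AMP, so only a small remainder has to be redistributed.
SELECT region_code FROM sales_fact GROUP BY 1;
```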

One comment:

Heavy skew in the grouping columns can cause an “out of spool space” error on a single AMP, because DISTINCT moves all rows with the same value to one or a few AMPs. In this scenario, use GROUP BY, even if the number of distinct values would otherwise favor DISTINCT.
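One way to check for this kind of skew in advance is to count how many rows would land on each AMP if the table were distributed by the grouping column. This is a sketch that assumes a hypothetical table sales_fact and column region_code, and uses Teradata's hash functions HASHROW, HASHBUCKET, and HASHAMP:

```sql
-- Hypothetical names, for illustration only.
-- Shows how many rows each AMP would receive if the rows were
-- redistributed by the hash of region_code.
SELECT HASHAMP(HASHBUCKET(HASHROW(region_code))) AS target_amp,
       COUNT(*)                                  AS row_count
FROM sales_fact
GROUP BY 1
ORDER BY 2 DESC;
```

A few AMPs with far higher counts than the rest indicate skew on that column.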

I hope this clears up most of the guesswork. There is no overall winner between DISTINCT and GROUP BY.

  • For 14.0, 14.10, and 15, these two are basically the same thing.
