We can often choose between GROUP BY and DISTINCT for aggregations. In terms of content, both options return the same result. Looking at previous versions of Teradata, for a specific query one of the two options might have been more efficient depending on the data demographics. This is because the chosen algorithm was fixed in earlier Teradata versions. One algorithm for DISTINCT, another for GROUP BY.
Times have changed, so I had to adapt this article. The Teradata Optimizer now takes the work out of deciding whether to use GROUP BY or DISTINCT. Therefore, the following only describes the two algorithms and when the optimizer should ideally choose which one.
In older versions of Teradata, when aggregating with DISTINCT, the rows for the final aggregation were constantly redistributed first to the target AMPs. No preaggregation took place.
When we used GROUP BY for aggregation, Teradata always performed preaggregation and distributed the rows to the target AMPs.
As already mentioned, this fixed assignment of the algorithm to aggregation syntax (DISTINCT vs. GROUP BY) is no longer given. Instead, the Teradata Optimizer selects the appropriate algorithm depending on the data demographics.
In the case of DISTINCT, the rows are redistributed immediately without any preaggregation. In contrast, in the case of GROUP BY, in the first step, a preaggregation is done, and only then are the distinct values redistributed across the AMPs.
We must therefore ask how the data demographics affect the algorithm. It is relatively simple: The algorithm which does pre-aggregation is preferable if the aggregated columns can be efficiently pre-aggregated (i.e., a few different values). If this is not the case, distributing the rows and aggregating them afterward is often better.
When does the Teradata Optimizer use which method?
Pre-aggregation has the advantage that Teradata can do part of the work AMP-locally, and fewer rows must be distributed over the BYNET to the target AMPs. A disadvantage of the initial preaggregation arises if the aggregated columns contain many different values and can hardly be compressed. In addition, each AMP must first sort the rows according to the ROWHASH of the aggregated columns and remove duplicates, which is an additional effort. This algorithm is more suitable if the aggregated columns have many different values.
With many different values, it is therefore often advantageous if the rows are immediately distributed to the target AMPs.
But the optimizer has to consider another problem: If the aggregated columns are skewed, there is a risk that one or a few AMPs are “out of spool” if the rows are redistributed immediately. Then again, it is advantageous to perform a preaggregation.
Fortunately, for some time now, we no longer have to worry about whether to use GROUP BY or DISTINCT. Nevertheless, collecting statistics on the aggregated columns is essential so that the optimizer can detect skew and choose a suitable method.
Earlier, we described which decision criteria the optimizer mainly uses to choose the aggregation method. Our task is to provide the optimizer with all the necessary statistics and to check whether the choice has been made as described below:
- GROUP BY for many duplicates or skew
- DISTINCT for a few duplicates only and no skew