Teradata DISTINCT vs. GROUP BY

Roland Wenzlofsky

June 26, 2022

minutes reading time


Introduction

We can often choose between GROUP BY and DISTINCT for aggregations. In terms of content, both options return the same result. Looking at previous versions of Teradata, for a specific query one of the two options might have been more efficient depending on the data demographics. This is because the chosen algorithm was fixed in earlier Teradata versions. One algorithm for DISTINCT, another for GROUP BY.

Times have changed, so I had to adapt this article. The Teradata Optimizer now takes the work out of deciding whether to use GROUP BY or DISTINCT. Therefore, the following only describes the two algorithms and when the optimizer should ideally choose which one.

In older versions of Teradata, when aggregating with DISTINCT, the rows for the final aggregation were constantly redistributed first to the target AMPs. No preaggregation took place.

When we used GROUP BY for aggregation, Teradata always performed preaggregation and distributed the rows to the target AMPs.

As already mentioned, this fixed assignment of the algorithm to aggregation syntax (DISTINCT vs. GROUP BY) is no longer given. Instead, the Teradata Optimizer selects the appropriate algorithm depending on the data demographics.

In the case of DISTINCT, the rows are redistributed immediately without any preaggregation. In contrast, in the case of GROUP BY, in the first step, a preaggregation is done, and only then are the distinct values redistributed across the AMPs.

We must therefore ask how the data demographics affect the algorithm. It is relatively simple: The algorithm which does pre-aggregation is preferable if the aggregated columns can be efficiently pre-aggregated (i.e., a few different values). If this is not the case, distributing the rows and aggregating them afterward is often better.

When does the Teradata Optimizer use which method?

Pre-aggregation has the advantage that Teradata can do part of the work AMP-locally, and fewer rows must be distributed over the BYNET to the target AMPs. A disadvantage of the initial preaggregation arises if the aggregated columns contain many different values and can hardly be compressed. In addition, each AMP must first sort the rows according to the ROWHASH of the aggregated columns and remove duplicates, which is an additional effort. This algorithm is more suitable if the aggregated columns have many different values.

With many different values, it is therefore often advantageous if the rows are immediately distributed to the target AMPs.

But the optimizer has to consider another problem: If the aggregated columns are skewed, there is a risk that one or a few AMPs are “out of spool” if the rows are redistributed immediately. Then again, it is advantageous to perform a preaggregation.

Fortunately, for some time now, we no longer have to worry about whether to use GROUP BY or DISTINCT. Nevertheless, collecting statistics on the aggregated columns is essential so that the optimizer can detect skew and choose a suitable method.

Summary

Earlier, we described which decision criteria the optimizer mainly uses to choose the aggregation method. Our task is to provide the optimizer with all the necessary statistics and to check whether the choice has been made as described below:

  • GROUP BY    for many duplicates or skew
  • DISTINCT     for a few duplicates only and no skew
  • As far as I remember we can use GROUP BY in any case, optimizer deside what will be best solution. Depends on table size, row size, row count etc. That working from Teradata 14.10 .

    • Yes, you are right. That’s why I adapted this article a little bit. In older releases of Teradata, it was our task to decide, based on data demographics which approach to use. Since Release 14.10 Teradata takes care to choose the approach but it will only choose the best approach if the statistics are correct.

  • Avatar
    MALIK HASSAN QAYYUM says:

    In case of Distinct Clause, Rows are distributed immediately without any pre-aggregation. I got this point. I know what redistribution is and how it happens behind the scene.

    In the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.

    I am confused what really preaggregation are here? and How they are done? Any Explanation Thanks.

    • Let’s assume we group a table per country code:

      In the first step (Local Aggregation), each AMP locally sums up the counts in its table rows per country code. Teradata uses a dedicated cache memory only for local aggregations. It devotes 90% of its capacity to the so-called main cache.

      The main cache is progressively filled and updated as the table is scanned for matching rows, assuming that the most frequent group column values will be stored there (e.g., the country codes with the most rows). The remaining 10% capacity is devoted to the so-called overflow cache.

      A second local aggregation step may be needed for those group column values, especially when the aggregation remains with many distinct values that overflow the main cache.

      When the overflow cache is full, the current contents are swapped to the disk to make space for new cache entries.

      As soon as the main cache aggregation is ready, Teradata recalls all the overflow cache snapshots from the disk, sorts the entries, and performs a second and final aggregation step over all overflow rows.

      As Teradata assumes only a few distinct country code values when. using this approach (based on statistics), the expectation is that swapping the overflow cache to the disk happens not at all or only limited.

      After the above-described process is finished, the remaining aggregated rows are distributed to their respective AMPs to do the final aggregation.

      I hope this was clear. If not, just please ask me for further details.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >