Teradata GROUP BY vs. DISTINCT- Aggregation Methods

Roland Wenzlofsky

May 3, 2023

minutes reading time


Introduction

When performing data analysis, deciding between using GROUP BY and DISTINCT for aggregations is often necessary. While both approaches yield similar results, earlier versions of Teradata allowed users to choose the more efficient option based on specific query data characteristics. This was feasible because previous versions utilized distinct algorithms for each method.

This article has been updated to reflect advancements in technology. The Teradata Optimizer alleviates the need to choose between GROUP BY and DISTINCT. Therefore, this article will outline the two algorithms and their optimal selection by the optimizer.

Earlier versions of Teradata did not pre-aggregate data when using DISTINCT for aggregation. Instead, the rows intended for final aggregation were repeatedly redistributed to the target AMPs.

Teradata preaggregates data and distributes it among target AMPs using GROUP BY aggregation.

As previously announced, the fixed algorithm assignment for aggregation syntax (DISTINCT vs. GROUP BY) has been discontinued. The Teradata Optimizer now determines the appropriate algorithm based on data demographics.

For DISTINCT, rows are immediately redistributed without preaggregation. However, GROUP BY has a preliminary preaggregation before redistributing distinct values among the AMPs.

We must analyze how data demographics affect the algorithm. It is simple: Pre-aggregation with the algorithm is advisable if the aggregated columns comprise a small number of distinct values. Alternatively, distributing and aggregating the rows afterward is often the optimal option if there are many distinct values.

When does the Teradata Optimizer use which method?

Pre-aggregation offers the advantage of localized AMP processing and minimized BYNET distribution of rows. However, this method may prove disadvantageous if compressed columns contain diverse values, as each AMP must sort and eliminate duplicates based on ROWHASH. Pre-aggregation is best utilized when columns possess numerous distinct values.

Distributing rows directly to the target AMPs is beneficial because of their varied values.

The optimizer must also account for skewed aggregated columns which may cause certain AMPs to be “out of spool” if rows are redistributed too quickly. Therefore, performing a preaggregation can be beneficial.

Thankfully, the question of whether to use GROUP BY or DISTINCT is no longer up for debate. However, collecting statistics on the aggregated columns is essential so that the optimizer can identify any imbalances and choose the appropriate method.

Summary

We outlined the decision criteria for the optimizer’s aggregation method selection. Our duty is to provide the optimizer with necessary statistics and ensure the selection meets the criteria.

  • GROUP BY    for many duplicates or skew
  • DISTINCT     for a few duplicates only and no skew
  • As far as I remember we can use GROUP BY in any case, optimizer deside what will be best solution. Depends on table size, row size, row count etc. That working from Teradata 14.10 .

    • Yes, you are right. That’s why I adapted this article a little bit. In older releases of Teradata, it was our task to decide, based on data demographics which approach to use. Since Release 14.10 Teradata takes care to choose the approach but it will only choose the best approach if the statistics are correct.

  • Avatar
    MALIK HASSAN QAYYUM says:

    In case of Distinct Clause, Rows are distributed immediately without any pre-aggregation. I got this point. I know what redistribution is and how it happens behind the scene.

    In the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.

    I am confused what really preaggregation are here? and How they are done? Any Explanation Thanks.

    • Let’s assume we group a table per country code:

      In the first step (Local Aggregation), each AMP locally sums up the counts in its table rows per country code. Teradata uses a dedicated cache memory only for local aggregations. It devotes 90% of its capacity to the so-called main cache.

      The main cache is progressively filled and updated as the table is scanned for matching rows, assuming that the most frequent group column values will be stored there (e.g., the country codes with the most rows). The remaining 10% capacity is devoted to the so-called overflow cache.

      A second local aggregation step may be needed for those group column values, especially when the aggregation remains with many distinct values that overflow the main cache.

      When the overflow cache is full, the current contents are swapped to the disk to make space for new cache entries.

      As soon as the main cache aggregation is ready, Teradata recalls all the overflow cache snapshots from the disk, sorts the entries, and performs a second and final aggregation step over all overflow rows.

      As Teradata assumes only a few distinct country code values when. using this approach (based on statistics), the expectation is that swapping the overflow cache to the disk happens not at all or only limited.

      After the above-described process is finished, the remaining aggregated rows are distributed to their respective AMPs to do the final aggregation.

      I hope this was clear. If not, just please ask me for further details.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >