Teradata GROUP BY vs. DISTINCT: Aggregation Methods

Introduction

When performing data analysis, choosing between GROUP BY and DISTINCT for aggregations is often necessary. Both approaches return the same result set, but earlier versions of Teradata used a fixed, distinct algorithm for each syntax, so users could select the more efficient option based on the query's data characteristics.

This article has been updated to reflect that change: the Teradata Optimizer now removes the need to choose between GROUP BY and DISTINCT. The following sections outline the two algorithms and how the optimizer selects between them.

Earlier versions of Teradata did not pre-aggregate data when using DISTINCT for aggregation. Instead, the rows intended for final aggregation were repeatedly redistributed to the target AMPs.

With GROUP BY aggregation, those versions instead pre-aggregated the data locally on each AMP before distributing it to the target AMPs.

As noted above, the fixed algorithm assignment for aggregation syntax (DISTINCT vs. GROUP BY) has been discontinued. The Teradata Optimizer now determines the appropriate algorithm based on data demographics.

For DISTINCT, rows are immediately redistributed without pre-aggregation. For GROUP BY, a preliminary pre-aggregation step occurs before redistributing distinct values among the AMPs.

We must consider how data demographics affect the algorithm choice. The rule of thumb is straightforward: pre-aggregation is advisable if the aggregated columns contain only a few distinct values (i.e., many duplicates per AMP), while redistributing and aggregating the rows directly is usually the better option when there are many distinct values.
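To make this tradeoff concrete, here is a small Python toy model (an illustration, not Teradata's actual implementation) that counts how many rows would travel over the BYNET under each strategy; the AMP count and data sizes are made-up assumptions:

```python
import random

def rows_shipped(values, n_amps, preaggregate):
    """Toy model: count rows sent over the BYNET for each strategy.
    This is an illustration only, not Teradata internals."""
    # Rows start out spread evenly across the AMPs.
    amps = [values[i::n_amps] for i in range(n_amps)]
    if preaggregate:
        # GROUP BY style: each AMP first collapses its local duplicates,
        # then ships only one row per locally distinct value.
        return sum(len(set(rows)) for rows in amps)
    # DISTINCT style: every row is redistributed immediately.
    return sum(len(rows) for rows in amps)

random.seed(0)
few_distinct  = [random.randrange(10)     for _ in range(100_000)]
many_distinct = [random.randrange(90_000) for _ in range(100_000)]

for name, data in [("few distinct", few_distinct), ("many distinct", many_distinct)]:
    direct = rows_shipped(data, 8, preaggregate=False)
    preagg = rows_shipped(data, 8, preaggregate=True)
    print(f"{name}: direct={direct}, pre-aggregated={preagg}")
```

With few distinct values, pre-aggregation collapses 100,000 rows to at most 80 shipped rows (8 AMPs times 10 values); with many distinct values, almost every row is locally unique, so pre-aggregation barely reduces the traffic.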

When does the Teradata Optimizer use which method?

Pre-aggregation offers the advantage of localized AMP processing and reduced BYNET traffic. However, it is least beneficial when the columns contain many distinct values, because each AMP must still sort and eliminate duplicates based on the ROWHASH while shipping almost as many rows as before.

Distributing rows directly to the target AMPs is beneficial when columns have highly varied values.

The optimizer must also account for skewed aggregated columns, which may cause certain AMPs to run out of spool space if rows are redistributed directly. In such cases, performing a pre-aggregation first can be beneficial.
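The skew effect can also be illustrated with a Python toy model, where 90% of the rows share one grouping value; the `hash(...) % N_AMPS` bucketing is a stand-in assumption for Teradata's actual row hashing:

```python
from collections import Counter

N_AMPS = 8

def rows_arriving_per_amp(values, preaggregate):
    """Toy model: how many rows each target AMP receives when rows are
    redistributed by the hash of the grouping value (illustration only)."""
    if preaggregate:
        # Pre-aggregation: each source AMP ships at most one row per
        # distinct value, so a heavily skewed value arrives at most
        # N_AMPS times instead of once per original row.
        shipped = []
        for i in range(N_AMPS):
            shipped.extend(set(values[i::N_AMPS]))
    else:
        shipped = values
    return Counter(hash(v) % N_AMPS for v in shipped)

# 90% of all rows share one grouping value -> heavy skew.
skewed = ["DE"] * 90_000 + [f"C{i}" for i in range(10_000)]

print(max(rows_arriving_per_amp(skewed, preaggregate=False).values()))
print(max(rows_arriving_per_amp(skewed, preaggregate=True).values()))
```

Without pre-aggregation, the single AMP that owns the hash of the skewed value receives at least 90,000 rows and risks running out of spool; with pre-aggregation, each source AMP ships that value only once, so the load stays balanced.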

Fortunately, the question of whether to use GROUP BY or DISTINCT no longer requires manual consideration. However, collecting statistics on the aggregated columns is essential so that the optimizer can identify any imbalances and choose the appropriate method.

Summary

We outlined the decision criteria for the optimizer’s aggregation method selection. Our responsibility is to provide the optimizer with the necessary statistics so it can make the right choice.

  • GROUP BY for many duplicates or skewed data
  • DISTINCT for few duplicates and no skew


6 thoughts on “Teradata GROUP BY vs. DISTINCT: Aggregation Methods”

  1. As far as I remember, we can use GROUP BY in any case; the optimizer decides what the best solution will be. It depends on table size, row size, row count, etc. That works from Teradata 14.10 on.

    Reply
    • Yes, you are right. That’s why I adapted this article a little bit. In older releases of Teradata, it was our task to decide, based on data demographics, which approach to use. Since Release 14.10, Teradata chooses the approach itself, but it will only pick the best one if the statistics are correct.

      Reply
  2. In the case of the DISTINCT clause, rows are distributed immediately without any pre-aggregation. I got this point. I know what redistribution is and how it happens behind the scenes.

    In the case of GROUP BY, in a first step a pre-aggregation is done, and only then are the unique values redistributed across the AMPs.

    I am confused about what pre-aggregation really means here and how it is done. Any explanation? Thanks.

    Reply
    • Let’s assume we group a table per country code:

      In the first step (Local Aggregation), each AMP locally sums up the counts in its table rows per country code. Teradata uses a dedicated cache memory only for local aggregations. It devotes 90% of its capacity to the so-called main cache.

      The main cache is progressively filled and updated as the table is scanned for matching rows, assuming that the most frequent group column values will be stored there (e.g., the country codes with the most rows). The remaining 10% capacity is devoted to the so-called overflow cache.

      A second local aggregation step may be needed when there are many distinct group column values that do not fit into the main cache.

      When the overflow cache is full, the current contents are swapped to the disk to make space for new cache entries.

      As soon as the main cache aggregation is ready, Teradata recalls all the overflow cache snapshots from the disk, sorts the entries, and performs a second and final aggregation step over all overflow rows.

      As Teradata assumes only a few distinct country code values when using this approach (based on statistics), the expectation is that the overflow cache is swapped to disk rarely or not at all.

      After the above-described process is finished, the remaining aggregated rows are distributed to their respective AMPs to do the final aggregation.
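The two-step local aggregation described above can be sketched as a Python toy model; the cache size, eviction behavior, and data are simplified assumptions, not Teradata internals:

```python
def local_aggregate(rows, main_cache_slots):
    """Toy sketch of the two-step local aggregation: a bounded 'main
    cache' holds running counts for group values that fit; everything
    else goes to an 'overflow' area, is sorted, and is aggregated in a
    second pass. Illustration only, not Teradata's implementation."""
    main_cache = {}   # group value -> running count
    overflow = []     # rows that did not fit into the main cache
    for value in rows:
        if value in main_cache:
            main_cache[value] += 1
        elif len(main_cache) < main_cache_slots:
            main_cache[value] = 1
        else:
            overflow.append(value)  # would be swapped to disk when full
    # Second step: sort the overflow rows, then aggregate them as well.
    for value in sorted(overflow):
        main_cache[value] = main_cache.get(value, 0) + 1
    return main_cache

rows = ["AT"] * 5 + ["DE"] * 3 + ["US", "FR", "IT", "DE"]
print(local_aggregate(rows, main_cache_slots=2))
# -> {'AT': 5, 'DE': 4, 'FR': 1, 'IT': 1, 'US': 1}
```

Note that the frequent country codes ("AT", "DE") stay in the main cache for the whole scan, while the rare ones take the slower overflow path; this is why the approach works best when there are only a few hot group values.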

      I hope this was clear. If not, just please ask me for further details.

      Reply
