Teradata GROUP BY Vs. DISTINCT- Aggregation Methods

Table of Contents

Introduction

When performing data analysis, choosing between GROUP BY and DISTINCT for aggregations is often necessary. While both approaches yield similar results, earlier versions of Teradata allowed users to select the more efficient option based on specific query data characteristics. This was possible because those versions used distinct algorithms for each method.

This article has been updated to reflect advancements in technology. The Teradata Optimizer removes the need to choose between GROUP BY and DISTINCT. This article outlines the two algorithms and how the optimizer selects between them.

Earlier versions of Teradata did not pre-aggregate data when using DISTINCT for aggregation. Instead, the rows intended for final aggregation were repeatedly redistributed to the target AMPs.

Teradata pre-aggregates data and distributes it among target AMPs when using GROUP BY aggregation.

As noted above, the fixed algorithm assignment for aggregation syntax (DISTINCT vs. GROUP BY) has been discontinued. The Teradata Optimizer now determines the appropriate algorithm based on data demographics.

For DISTINCT, rows are immediately redistributed without pre-aggregation. For GROUP BY, a preliminary pre-aggregation step occurs before redistributing distinct values among the AMPs.

We must consider how data demographics affect the algorithm choice. It is straightforward: pre-aggregation is advisable if the aggregated columns contain a small number of distinct values. Alternatively, distributing and aggregating rows directly is often the better option when there are many distinct values.

When does the Teradata Optimizer use which method?

Pre-aggregation offers the advantage of localized AMP processing and reduced BYNET distribution of rows. However, this method can be disadvantageous when columns contain highly diverse values, as each AMP must sort and eliminate duplicates based on ROWHASH. Pre-aggregation is least beneficial when columns have many distinct values.

Distributing rows directly to the target AMPs is beneficial when columns have highly varied values.

The optimizer must also account for skewed aggregated columns, which may cause certain AMPs to run out of spool space if rows are redistributed too quickly. In such cases, performing a pre-aggregation can be beneficial.

Fortunately, the question of whether to use GROUP BY or DISTINCT no longer requires manual consideration. However, collecting statistics on the aggregated columns is essential so that the optimizer can identify any imbalances and choose the appropriate method.

Summary

We outlined the decision criteria for the optimizer’s aggregation method selection. Our responsibility is to provide the optimizer with the necessary statistics so it can make the right choice.

GROUP BY for many duplicates or skew
DISTINCT for a few duplicates only and no skew

Related Services

⚡ Need Help Optimizing Your Data Platform?

We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.

Explore Our Services →

📋 Considering a Move From Teradata?

Get a personalized migration roadmap in 2 minutes. We have migrated billions of rows from Teradata to Snowflake, Databricks, and more.

Free Migration Assessment →

6 thoughts on “Teradata GROUP BY vs. DISTINCT- Aggregation Methods”

As far as I remember we can use GROUP BY in any case, optimizer deside what will be best solution. Depends on table size, row size, row count etc. That working from Teradata 14.10 .

DWH Pro Admin

09/20/2022 at 7:30 pm

Yes, you are right. That’s why I adapted this article a little bit. In older releases of Teradata, it was our task to decide, based on data demographics which approach to use. Since Release 14.10 Teradata takes care to choose the approach but it will only choose the best approach if the statistics are correct.
Reply

In case of Distinct Clause, Rows are distributed immediately without any pre-aggregation. I got this point. I know what redistribution is and how it happens behind the scene.

In the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.

I am confused what really preaggregation are here? and How they are done? Any Explanation Thanks.

DWH Pro Admin

09/20/2022 at 7:50 pm

Let’s assume we group a table per country code:

In the first step (Local Aggregation), each AMP locally sums up the counts in its table rows per country code. Teradata uses a dedicated cache memory only for local aggregations. It devotes 90% of its capacity to the so-called main cache.

The main cache is progressively filled and updated as the table is scanned for matching rows, assuming that the most frequent group column values will be stored there (e.g., the country codes with the most rows). The remaining 10% capacity is devoted to the so-called overflow cache.

A second local aggregation step may be needed for those group column values, especially when the aggregation remains with many distinct values that overflow the main cache.

When the overflow cache is full, the current contents are swapped to the disk to make space for new cache entries.

As soon as the main cache aggregation is ready, Teradata recalls all the overflow cache snapshots from the disk, sorts the entries, and performs a second and final aggregation step over all overflow rows.

As Teradata assumes only a few distinct country code values when. using this approach (based on statistics), the expectation is that swapping the overflow cache to the disk happens not at all or only limited.

After the above-described process is finished, the remaining aggregated rows are distributed to their respective AMPs to do the final aggregation.

I hope this was clear. If not, just please ask me for further details.
Reply

Really Good Information 🙂

DWH Pro Admin

09/20/2022 at 8:03 pm

Thank you very much
Reply

Introduction

When does the Teradata Optimizer use which method?

Summary

📊 Data Platform Migration Survey

Stay Ahead in Data Warehousing

6 thoughts on “Teradata GROUP BY vs. DISTINCT- Aggregation Methods”

Leave a Comment Cancel reply