DISTINCT vs. GROUP BY in Teradata

2
1451

From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.

From a performance point of view, it is not the same.

To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.

In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.

Don’t get fooled by my statement above and think now that GROUP BY is always better from a performance point of view. When you have many different values, the preaggregation step of GROUP BY is not very efficient. Teradata has to sort the data to remove duplicates. In this case, it may be better to the redistribution first, i.e. use the DISTINCT statement. Only if there are many duplicate values, the GROUP BY statement is probably the better choice as only once the deduplication step takes place, after redistribution.

In short, DISTINCT vs. GROUP BY in Teradata means:

  • GROUP BY   ->  for many duplicates
  • DISTINCT    -> no or a few duplicates only

At times, when using DISTINCT, you run out of spool space on an AMP.  The reason is that redistribution takes place immediately, and skewing could cause AMPs to run out of space.

If this happens, you have probably a better chance with GROUP BY, as duplicates are already removed in a first step, and less data is moved across the AMPs.

 

Our Reader Score
[Total: 16    Average: 4.4/5]
DISTINCT vs. GROUP BY in Teradata written by Roland Wenzlofsky on March 10, 2014 average rating 4.4/5 - 16 user ratings

2 COMMENTS

  1. In case of Distinct Clause, Rows are distributed immediately without any pre-aggregation. I got this point. I know what redistribution is and how it happens behind the scene.

    In the case of GROUP BY, in a first step a preaggregation is done and only then are the unique values redistributed across the AMPs.

    I am confused what really preaggregation are here? and How they are done? Any Explanation Thanks.

LEAVE A REPLY

Please enter your comment!
Please enter your name here