
DISTINCT vs. GROUP BY in Teradata

From a result set point of view, it does not matter if you use DISTINCT or GROUP BY in Teradata. The answer set will be the same.

From a performance point of view, it is not the same.

To understand what impacts performance, you need to know what happens on Teradata when executing a statement with DISTINCT or GROUP BY.

In the case of DISTINCT, the rows are redistributed immediately without any preaggregation taking place, while in the case of GROUP BY, a preaggregation is done first and only then are the unique values redistributed across the AMPs.

Don’t let my statement above fool you into thinking that GROUP BY is always better from a performance point of view. When there are many different values, the preaggregation step of GROUP BY is not very efficient: Teradata has to sort the data to remove duplicates. In this case, it may be better to do the redistribution first, i.e. use DISTINCT, as the deduplication step then takes place only once, after redistribution. Only if there are many duplicate values is GROUP BY probably the better choice.

In short, DISTINCT vs. GROUP BY in Teradata means:

  • GROUP BY  ->  many duplicates
  • DISTINCT  ->  few or no duplicates

At times, when using DISTINCT, you may run out of spool space on an AMP. The reason is that redistribution takes place immediately, and skewed data can overwhelm a single AMP’s spool.

If this happens, you probably have a better chance with GROUP BY, as duplicates are already removed in a first step and less data is moved across the AMPs.
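
As a minimal sketch of the rewrite (the table and column names, sales and customer_id, are hypothetical), both statements below return the same answer set; only the execution strategy differs:

-- DISTINCT: rows are redistributed across the AMPs immediately,
-- duplicates are removed only afterwards
SELECT DISTINCT customer_id
FROM sales;

-- GROUP BY: each AMP preaggregates locally first,
-- then only the unique values are redistributed
SELECT customer_id
FROM sales
GROUP BY customer_id;

If the DISTINCT version runs out of spool space on a skewed column, the GROUP BY version is the rewrite to try first.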



Teradata Collect Statistics Using Sample

The Teradata Collect Statistics Using Sample feature explained

Collecting sample statistics on a Teradata system helps to cut workload. Used cautiously, it can deliver execution plans as good as those based on full statistics.

What are sample statistics? By default, the optimizer uses a sample of 2% of the rows to get information about the data demographics. We can easily change the sample size by running the statement below (the new sample size is valid until you log off your session):

DIAGNOSTIC "COLLECTSTATS, SAMPLESIZE=n" ON FOR SESSION;

When can sample statistics be applied? For sample statistics, it’s crucial that the table is not skewed. Skew can be found by executing the query below:

SELECT HASHAMP (HASHBUCKET (HASHROW (<COLUMN>))), COUNT (*)
FROM <TABLE>
GROUP BY 1;

If the skew is small (<= 5%), sample statistics can be used. I would not be concerned about collecting full statistics on small tables, but collecting comprehensive statistics on big tables consumes a lot of CPU and disk IOs; such tables should be considered as candidates for sample statistics:

COLLECT STATISTICS USING SAMPLE ON <TABLE> COLUMN (<COLUMN>);

Sample statistics are not possible on GLOBAL TEMPORARY TABLES, JOIN INDICES and VOLATILE TABLES.
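
One common way to condense the per-AMP counts from the skew check above into a single percentage is sketched below (sales_fact and customer_id are hypothetical names; the 5% threshold is the rule of thumb mentioned earlier):

-- Skew expressed as a percentage: 0 means a perfectly even distribution.
-- Values <= 5 suggest the table is a candidate for sample statistics.
SELECT (1 - AVG(CAST(cnt AS FLOAT)) / MAX(cnt)) * 100 AS skew_percent
FROM
(
   SELECT HASHAMP (HASHBUCKET (HASHROW (customer_id))) AS the_amp,
          COUNT(*) AS cnt
   FROM sales_fact
   GROUP BY 1
) t;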

See also:
All you need to know about Teradata Statistics


Functions on JOIN columns in Teradata

There are two major problems when using functions on join columns.

  1. The Teradata Optimizer cannot make use of statistics. String manipulation functions like SUBSTR() and TRIM(), the LIKE predicate, COALESCE(), and CASE WHEN expressions force the optimizer to use heuristics for its estimations. In case you cannot avoid using a function, at least try different strategies and figure out which function allows for the best execution plan. Nevertheless, using functions in your joins means that you have a bad physical data model: either it’s technically outdated, inaccurate on the logical data model, or it has to serve two masters.

    Ask yourself: why is there no separate column that stores the essential information in the form you need it?

  2. Using functions on your join columns forces the optimizer to do a full table scan, slowing down your queries.

Not every function is equal in its negative impact on performance. I experienced different results for LIKE 'a%' versus SUBSTR(column,1,1) = 'a'.
You have to experiment a little bit with the various alternatives and choose the one that impairs performance the least. The best approach, though, is to avoid the usage of functions on any join column by design.
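
As a minimal sketch of the by-design approach (all object names, such as orders and cust_code, are hypothetical):

-- Anti-pattern: the function on the join column invalidates statistics
-- and typically forces a full table scan
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c
ON SUBSTR(o.cust_code, 1, 8) = c.customer_id;

-- By design: persist the derived value in its own column once,
-- collect statistics on it, and join on the plain column
ALTER TABLE orders ADD customer_id CHAR(8);

UPDATE orders
SET customer_id = SUBSTR(cust_code, 1, 8);

COLLECT STATISTICS ON orders COLUMN (customer_id);

SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c
ON o.customer_id = c.customer_id;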


Teradata AMP Flow Control Mode

Teradata manages the flow of work in a scalable and decentralized way, on each AMP. Each AMP handles all the workload assigned to it on its own. Each AMP keeps track of the amount of work it accepted, how much work is active, and how much work is waiting in a queue to be executed.

Flow control mode happens when the workload on an AMP reaches a certain limit: the AMP will decline any further work assigned to it by the parsing engine. The state of an AMP not accepting any further work is called flow control.

In flow control mode, the AMP will decline any newly arriving messages, which can be, for example, new work, rows being sent by other AMPs, or internal system work.

Any message arriving on an AMP tries to get an AMP worker task to fulfill the required task.  In the case of an All-AMP operation, such as a full table scan, it might happen that not all AMPs have free AMP worker tasks available.

Messages received by an AMP can be queued, and not all AMPs have to begin immediately to work on the All-AMP task.

Still, the message queues on each AMP can only hold a limited number of messages per work type (new work, second-level work such as row distribution, etc.). Grouping queue messages by work type is used to prioritize them. New work, for example, is not as important as ongoing work which is needed to finish an already started task.

When any of the message queues reaches its limit, further messages of that work type are no longer accepted. The AMP informs the dispatcher, which will retry sending the messages until the AMP finally accepts them again. If only one AMP closes the gates, none of the AMPs will accept further work for the same message.

One or more AMPs in the state of flow control usually are a sign of system overload. Proper workload management has to be applied to bring the AMPs back to a normal state.

Flow control mode can cause severe performance problems if it happens repeatedly and over an extended period. Occasional flow control events, every once in a while and lasting a couple of seconds, are acceptable.

Frequent flow control situations over an extended period should be analyzed in detail.

Below is a simple test query, which gives detailed information about flow control on a Teradata system.

It delivers, per sample period (i.e. at the end of each period), the maximum number of tasks of type “new work” and “continued work,” the average mailbox queue depth, the number of flow control events, and the maximum number of parallel flow control events which occurred during the sample period.

SELECT
   TheDate
   ,TheTime
   ,MAX(WorkTypeMax00) AS max_wrk0
   ,MAX(WorkTypeMax01) AS max_wrk1
   ,MAX(MailBoxDepth / CollectIntervals) AS max_queue_depth
   ,SUM(FlowControlled) AS flow_ctl_events
   ,MAX(FlowCtlCnt) AS max_parallel_flow_ctl
FROM DBC.ResUsageSAWT
WHERE TheDate BETWEEN DATE '2014-01-01' AND DATE
GROUP BY 1,2
ORDER BY 1,2;

The Teradata Fastload

In my career as a Teradata professional, I have seen a lot of performance problems caused by the inappropriate use of fastloads. As we all know, each Teradata system has a limit on the number of parallel fastloads allowed. We should be careful in choosing how we get data into the Teradata database.

I would never recommend using the Teradata fastload utility for loads of less than 100,000 rows. 100,000 rows is just a rule of thumb, but a good starting point for loading on Teradata.

There are several options for getting data into a Teradata database, each of them with its advantages and disadvantages. Usually, I use BTEQ loads if I load less than 100,000 rows, TPUMP to load up to 500,000 rows, and fastloads for more than 500,000 rows. Always pay attention to very wide rows, as these can be a reason to use fastload even if the number of rows points to a TPUMP or BTEQ load.

I remember one client who was loading almost every single table with fastloads, even the empty ones. As a result, nightly batch loads were heavily delayed, as load jobs had to wait for free fastload slots and got queued. After a redesign, changing all small tables (with an arbitrarily chosen limit of 100,000 rows) to TPUMP loads, load times decreased by about 50%!

My experience is that the easiest way is to use DBC.DBQLOGTBL to determine the load type. Query the rows where the column StatementType equals “Fastload.” The column UtilityRowCount is the number of rows which have been loaded. All loads with a UtilityRowCount of less than 100,000 rows should be candidates for being changed to BTEQ or TPUMP.
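
A minimal sketch of such a check (the 100,000-row threshold is the rule of thumb from above; adjust it to your environment):

SELECT UserName
   ,QueryID
   ,StartTime
   ,UtilityRowCount
FROM DBC.DBQLOGTBL
WHERE StatementType = 'Fastload'
AND UtilityRowCount < 100000
ORDER BY UtilityRowCount;

Every query returned moved fewer rows than the threshold and is therefore a candidate for being changed to BTEQ or TPUMP.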


Teradata Stale Statistics

When the question is whether a given set of statistics is still a good representation of the current table demographics, I often experience that people rely solely on the collection date.

Some tables will not have stale statistics even if you never refresh them; just think about lookup tables, which are quite static. For other tables, various degrees of change can occur, with some columns having ever-expanding value ranges and others a constant share of only a few codes. How can we get a better picture about statistics and find out if they are stale? There are several possibilities. Maybe the easiest is to run a simple query on the table, using a WHERE condition on the column to be tested:

SELECT * FROM <TABLE> WHERE TEST_COLUMN = 'X';

Compare the number of estimated records (you can see this number in the Teradata Administrator or in the EXPLAIN output) with the real number returned, and you know if the statistics are stale.

Note that you cannot expect the optimizer to be right down to the individual row on every occasion, but the divergence should not be larger than a low single-digit percentage, depending on the overall table size.
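
A minimal sketch of the comparison (sales_fact and region_code are hypothetical names):

-- Step 1: the optimizer's estimate, taken from the EXPLAIN output
EXPLAIN SELECT * FROM sales_fact WHERE region_code = 'X';
-- look for a phrase like: "... is estimated with high confidence
-- to be 1,250 rows ..."

-- Step 2: the real number of rows
SELECT COUNT(*) FROM sales_fact WHERE region_code = 'X';

If estimate and count diverge by more than a low single-digit percentage, the statistics on region_code are likely stale.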
