Teradata Set operators simply explained

Roland Wenzlofsky

June 3, 2022

minutes reading time

In this article, we will learn about Teradata set operators, how they work, how we use them to achieve optimal performance, and how Teradata optimizes them.

Teradata UNION vs. UNION ALL

What is the difference between UNION and UNION all in Teradata?

The Teradata UNION statement combines the result sets of at least two queries. Unlike UNION ALL, UNION also removes row duplicates. Each SELECT statement must retrieve the same number of columns. We must mention the columns in the same order, and the column data types for all SELECT statements must be compatible (more about this later). Here is an example for UNION and UNION ALL

Input Table1
Input Table2
SELECT col1 FROM TheTable1 UNION SELECT col1 FROM TheTable2;

SELECT col1 FROM TheTable1 UNION ALL SELECT col1 FROM TheTable2; 

If the datatypes for the same column of different select statements don’t match, the very first SELECT statements will define the datatype for all rows of the column; if needed, Teradata does an implicit casting:

Table Customer
Table Customer_Hist
    CustomerId INTEGER,
    LastName CHAR(10)
) PRIMARY INDEX (CustomerId);
CREATE TABLE Customer_Hist
  CustomerId INTEGER,
  LastName CHAR(200)
) PRIMARY INDEX (CustomerId);
SELECT LastName FROM Customer 
SELECT LastName FROM Customer_Hist;

'A_very_lon'  --> Truncated after 10 characters
SELECT LastName FROM Customer_Hist 
SELECT LastName FROM Customer;

'A_very_long_last_name' --> Not truncated

The last names of longer than ten characters are truncated from the Customer_Hist table in the example above. An error is difficult to detect because the query runs without errors; incompatible data types lead to runtime errors.

Truncation in UNION statements is one of the few cases where the query order of an SQL statement influences the result set. The only other case I know of is using the RANK() OVER function.

Teradata UNION ALL vs. UNION Performance

teradata union

The performance of UNION ALL is always better than that of UNION.

Since Teradata UNION only returns distinct rows, Teradata must sort all select statements’ rows into a common spool. For this purpose, it might be necessary to redistribute the rows because the rows of all result sets must have the same primary index for the sort step. The redistribution of the rows and the subsequent sorting require many IOs for large tables. Therefore we only should use UNION if we need to remove row duplicates. If there are no row duplicates in the data, we should always avoid the additional sort (and eventually redistribution) step by using UNION ALL. Use Teradata UNION ALL to avoid sorting whenever possible, especially when retrieving data for a branch from outside Teradata, for example, from the HDFS filesystem.

Significance of the Character Set

We have discussed how different data types can affect the UNION or UNION ALL result set. The same is true for the character set. In the example shown below, the column “col1” of table “Table1” is defined as a character set LATIN, but column “col1” of table “Table2” is defined as a character set UNICODE.

A runtime error will occur if table “Table2” contains UNICODE characters in column “col1”, because the character set of the first select statement defines the character set of the result set:


Teradata UNION ALL Enhancements for Performance

The simplest method to merge all rows of a UNION ALL is to put the rows of all branches into a common spool. This approach is still used in Teradata, but nowadays (starting with Teradata Release 16), many additional optimizations can make the UNION ALL operator more efficient. These optimizations include:

  • Pushing of Joins into the UNION ALL branches
  • Grouping branches before join pushing
  • Iterative join decomposition of a pushed join into multiple joins
  • Optimized counting of rows
  • Multisource Joins

We will look at these optimizations in detail to understand how they work and their benefits. Please note that all described optimizations are performed in the context of cost-based optimization. Therefore, it is crucial to provide the Optimizer with all required statistics. All optimizations can be applied to views and derived sub queries as well.

Teradata Join Pushing

Join Pushing aims to prevent spooling of all UNION ALL branches into a common spool before further processing. Instead, the Optimizer pushes the joins into all or some branches. Here is an example. The figure below shows the execution plan without join pushing (both branches of the UNION ALL are spooled into a common spool); in the next step, the joins with Table 1 and Table 2 are done:

Teradata Set operators simply explained 1
UNION ALL without Join Pushing

The following figure shows a possible execution plan with Join Pushing applied. Whether this plan is chosen depends on how the Optimizer estimates the costs. The Optimizer can choose between different plans. The plan below is just one of them. Two other possible plans are to join Table 2 with Table 3 and Table 4 or to join Table 1 and Table 2 first and push the result into the branches.

Teradata Set operators simply explained 2
UNION ALL with Join Pushing

The advantage of join Pushing over the classic plan is that many alternative plans are now available to the Optimizer. Spooling all branches into a common spool can create issues such as “out of spool space” for branches containing massive rows.

When can the join be pushed to the UNION branches without changing the query result?

For UNION ALL, Teradata can always apply Join Pushing. We can easily verify this statement by comparing two different approaches. In the first one, we perform the UNION ALL and then the join. In the second one, vice versa:

Table1 Join (Table2 UNION ALL Table3)

gives the same result as:

(Table1 Join Table2 UNION ALL Table1 Join Table3).

In some cases, Join Pushing can also be used for UNION still delivering the correct result. This is the case for INNER joins if the join does not increase the number of rows. As an example, a join over unique join columns fulfills this requirement. For OUTER joins, the conditions for join pushing become even more complicated.

Table1 LEFT JOIN (Table2 UNION Table3)) cannot be pushed because the number of unmatched rows can differ depending on whether the join is executed before or after the UNION.

(Table2 UNION Table3) left Join Table1, on the other hand, can be pushed because here, both methods (join after UNION vs. UNION after join) will return the same unmatched rows.

As you will have noticed, it is recommended to use UNION ALL instead of UNION whenever possible for the reasons given above. Because then it is guaranteed that join pushing is available for performance optimization.

Teradata Branch Grouping

Branch Grouping is a method to reduce the number of UNION ALL branches and is applied before Join Pushing. Teradata can use aggregations for a single or group of UNION ALL branches. Instead of spooling all branches into a common spool and doing the aggregation afterward, early aggregation can be applied for two or more branches. The performance improvement can be significant if the partial aggregations decrease spool file size. Branch grouping is used for the branches for which the data geography is determined on the fly (this is the case if they include joins or aggregations). In contrast, the branches with a known data geography (such as base table retrieves without aggregation) are not considered for grouping.

Teradata Join Decomposition

Join decomposition is another tool the Optimizer can use to improve the execution plan. If we recall the Join Pushing example above, in one plan step, Table 1 was joined with Table 3 as shown in the following figure (The Optimizer pushed the join into the UNION ALL):

Teradata Set operators simply explained 3
No Join decomposition

Join decomposition means splitting a join into several joins. Join decomposition works great with columnar tables, as shown in the figure below. The Optimizer divides the join into two joins, each of them joining a subset of the columns of Table 3:

Teradata Set operators simply explained 4
Join decomposition

Join decomposition is not only limited to columnar tables. For row-oriented tables, The Optimizer can use a secondary index for joining; in our example, it contains the columns needed for the second join of table 3 (columns c,d).

Teradata Optimized Row Counts

Optimized counting of rows is another enhancement introduced for SELECT COUNT(*) queries based on UNION ALL.

Instead of moving all UNION ALL branches into a common spool and counting the number of rows one by one, the Optimizer can now use the cylinder index (the same technique used for single table row counting) to calculate each branch’s rows very efficiently on the cylinder index and sums the final result up afterward.

Teradata Multisource Joins

Finally, there is the possibility of doing multisource joins. Let us go back to the above example in which we applied join pushing. We were joining a table against the first branch of the UNION ALL and the second branch of the UNION ALL. We moved the join into the branches. Multisource joining is an optimization that The Optimizer can apply if the data geography of both joins is the same (same join strategy, i.e., the same Join preparation and join method is required). If this prerequisite is fulfilled, The Optimizer can do the joins at once. The advantage is that many required join tasks, such as sorting, building hash tables, etc., can be done in one go.

All optimizations we discussed are only available for UNION ALL; unfortunately, deduplicating the result set for UNION queries does not allow using these optimizations.


teradata intersect

The INTERSECT operator returns as the result row duplicate of two select statements (all rows identical in both SELECT statements).

SELECT col1,col2 FROM TheTable1 
SELECT col1,col2 FROM TheTable2

Similar to Teradata UNION ALL, INTERSECT ALL returns duplicates without filtering them.

We recommend using INTERSECT ALL instead of INTERSECT for better performance (the reasons are similar to UNION ALL described above).


teradata minus

The MINUS/EXCEPT operator returns a result set containing all rows of the first select statement that are not present in the second select statement; MINUS and EXCEPT are two different names for the same functionality:

SELECT col1,col2 FROM TheTable1 MINUS SELECT col1,col2 FROM TheTable2
SELECT col1,col2 FROM TheTable1 EXCEPT SELECT col1,col2 FROM TheTable2;

MINUS is often used to determine whether two tables are entirely identical. For this, we have to execute the MINUS operator two times, as shown below:

SELECT col1,col2 FROM TheTable1 MINUS SELECT col1,col2 FROM TheTable2
SELECT col1,col2 FROM TheTable2 EXCEPT SELECT col1,col2 FROM TheTable1; We must also check whether both tables to be compared are populated. Otherwise, MINUS might provide equality of the tables, although it is not given.

Precedence of SET Operators

The set operators (UNION, UNION ALL, EXCEPT, MINUS) have the same priority. We recommend always using parentheses to ensure the desired priority when using SET operators.

See also:

  • How to connect for Teradata freelancing projects. pls help

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like