# Teradata Set Operators: Understanding UNION vs. UNION ALL for Peak Performance Optimization

Roland Wenzlofsky

May 3, 2023

In this article, we will delve into the world of Teradata set operators, exploring their functionality, applications for attaining peak performance, and the optimization techniques employed by Teradata.

## Teradata UNION vs. UNION ALL

What sets UNION apart from UNION ALL in Teradata?

The Teradata UNION operator combines the results of two or more queries, removing duplicate rows. It is important to note that each SELECT statement must have the same number of columns, which must be specified in the same order. Additionally, the data types of the columns in all SELECT statements must be compatible. Let us now consider an example with two tables:

SELECT col1 FROM TheTable1 UNION SELECT col1 FROM TheTable2;
'A'
'B'

SELECT col1 FROM TheTable1 UNION ALL SELECT col1 FROM TheTable2;
'A'
'A'
'B'

When the data types of a column differ across multiple SELECT statements, Teradata uses the data type from the first SELECT statement to define the data type for all rows in that column. If needed, Teradata will perform implicit casting to ensure compatibility.

CREATE TABLE Customer
(
CustomerId INTEGER,
LastName CHAR(10)
) PRIMARY INDEX (CustomerId);
CREATE TABLE Customer_Hist
(
CustomerId INTEGER,
LastName CHAR(200)
) PRIMARY INDEX (CustomerId);
SELECT LastName FROM Customer
UNION
SELECT LastName FROM Customer_Hist;

'Vallin'
'A_very_lon'  --> Truncated after 10 characters
SELECT LastName FROM Customer_Hist
UNION
SELECT LastName FROM Customer;

'Vallin'
'A_very_long_last_name' --> Not truncated

Last names longer than ten characters are shortened in the Customer_Hist table. Detecting these errors can be difficult because the query produces no errors, but incompatible data types can cause runtime errors.

UNION statement truncation is a rare occurrence where the query order of an SQL statement impacts the result set. Another instance is when the RANK() OVER function is utilized.

## Teradata UNION ALL vs. UNION Performance

UNION ALL consistently outperforms UNION in terms of performance.

The Teradata UNION operator solely returns unique rows, requiring sorting all SELECT statement rows into a shared spool. This process may require row redistribution, as all result sets must share the same primary index during sorting. The redistribution and sorting process can consume substantial I/O resources for larger tables.

Use UNION only when duplicates must be removed. If there are no duplicates, it is recommended to use UNION ALL to avoid the extra sorting (and potential redistribution) step. When accessing data from external sources like the HDFS filesystem, it is especially beneficial to use Teradata UNION ALL to bypass sorting whenever possible.

## Character Set Significance

Earlier, we examined how different data types affect the outcome of UNION and UNION ALL operations. Similarly, character sets also play a significant role. For instance, in the following example, the character set for column “col1” in “Table1” is LATIN, whereas the character set for column “col1” in “Table2” is UNICODE.

A runtime error may occur if column “col1” in table “Table2” contains UNICODE characters. This is because the character set of the first SELECT statement determines the character set of the result set.

SELECT col1 FROM Table1 UNION SELECT col1 FROM Table2;

## Performance Enhancements for Teradata UNION ALL

Placing all rows from all branches into a common spool is the simplest method to combine all rows in a UNION ALL operation. However, Teradata has introduced further optimizations (beginning with Teradata Release 16) to improve the efficiency of the UNION ALL operator. These optimizations comprise:

• Pushing of Joins into the UNION ALL Branches
• Grouping branches before join pushing
• Iterative join decomposition of a pushed join into multiple joins
• Optimized counting of rows
• Multisource Joins

We will thoroughly analyze these optimizations to understand how they work and their benefits. It is crucial to provide the Optimizer with all necessary statistics as these optimizations operate within the framework of cost-based optimization. Views and derived subqueries are also eligible for these optimizations.

Pushing Join eliminates the requirement of spooling all UNION ALL branches into a shared spool before additional processing. Alternatively, the Optimizer pushes the joins into some or all the branches. For instance, the diagram below represents the execution plan without Join Pushing (both branches of UNION ALL are spooled into a common spool). In the following step, the joins with Table 1 and Table 2 are executed:

The figure below displays a potential execution plan with implemented Join Pushing, chosen based on the Optimizer’s cost estimation. The Optimizer has multiple plans, and the illustrated plan is merely an example. Other potential plans consist of joining Table 2 with Table 3 and Table 4 or initially joining Table 1 and Table 2 and subsequently pushing the result into the branches.

Join Pushing offers a significant advantage over the traditional plan, providing the Optimizer with access to multiple alternate plans. Merging all branches into a shared spool can result in issues such as insufficient spool space for branches with numerous rows.

When can the join be pushed to UNION branches without altering the query result?

Teradata consistently applies to Join Pushing for UNION ALL. This can be validated by comparing two approaches: executing the join after the UNION ALL and reversing the order by executing the UNION ALL before the join.

Table1 Join (Table2 UNION ALL Table3)

Produces identical outcomes as:

(Table1 Join Table2 UNION ALL Table1 Join Table3).

Join Pushing can sometimes be used for UNION and still produce the correct outcome. This applies to INNER joins that do not increase the row count, such as a join over unique columns. The criteria for join pushing become more intricate with OUTER joins.

Table1 LEFT JOIN (Table2 UNION Table3) cannot be pushed, as the number of unmatched rows may vary depending on whether the join is executed before or after the UNION.

Conversely, (Table2 UNION Table3) LEFT JOIN Table1 can be pushed because, in this case, both methods (join after UNION vs. UNION after join) will return the same unmatched rows.

It is recommended to use UNION ALL instead of UNION whenever possible. This enables join pushing as a performance optimization choice.

Branch grouping reduces the number of UNION ALL branches before Join Pushing. Teradata can apply aggregations to an individual or group of UNION ALL branches. Instead of spooling all branches into a common spool and performing aggregation subsequently, early aggregation can be applied to two or more branches.

Partial aggregations can significantly enhance performance by reducing spool file size. Branch Grouping is applied exclusively to branches with dynamically determined data geography, typically involving joins or aggregations. Conversely, branches with established data geography, such as base table retrievals without aggregation, are excluded from grouping consideration.

The Optimizer can utilize Join Decomposition to optimize the execution plan. In a previous example of Join Pushing, Table 1 and Table 3 were joined in one step, as shown in the following figure (the Optimizer pushed the join into the UNION ALL).

Decomposition involves breaking down a join into smaller ones, which is particularly effective for columnar tables. The accompanying figure demonstrates this approach, with the Optimizer dividing the join into two individual joins that link subset of the columns in Table 3.

Decomposition extends beyond columnar tables. The Optimizer can employ a secondary index to perform joins when working with row-oriented tables. In this case, the index contains the necessary columns for the second join of Table 3, specifically columns c and d.

Optimized row counting is another enhancement introduced for SELECT COUNT(*) queries based on UNION ALL.

The Optimizer can efficiently calculate the row count for each UNION ALL branch using the cylinder index, the same technique used for single table row counting. This eliminates the need to transfer all branches into a common spool, count the number of rows individually, and allows for the final result to be summed up accurately.

Additionally, multisource joins can be utilized. In the previous example where join pushing was implemented, we joined a table with both branches of the UNION ALL.

We merged the branches using a multisource join optimization, which is only possible when both joins have the same data geography and require the same preparation and method. By meeting this requirement, the Optimizer can execute the joins simultaneously, completing many required tasks (such as sorting and building hash tables) in one step.

Regrettably, the optimizations mentioned are solely accessible for UNION ALL. Eliminating duplicates in the outcome set for UNION queries restricts the utilization of said optimizations.

## INTERSECT / INTERSECT ALL

The INTERSECT operator yields the common rows between two SELECT statements, which are the identical ones present in both.

SELECT col1,col2 FROM TheTable1
INSERTSECT
SELECT col1,col2 FROM TheTable2;

Like Teradata UNION ALL, INTERSECT ALL returns duplicates without filtering them out.

For improved performance, we recommend using INTERSECT ALL instead of INTERSECT, as the reasons for this preference align with those described earlier for UNION ALL.

## MINUS / EXCEPT

The MINUS/EXCEPT operator retrieves a result set comprising the rows from the first SELECT statement that do not appear in the second SELECT statement. MINUS and EXCEPT refer to the same function.

SELECT col1,col2 FROM TheTable1 MINUS SELECT col1,col2 FROM TheTable2;
SELECT col1,col2 FROM TheTable1 EXCEPT SELECT col1,col2 FROM TheTable2;

The MINUS operator is commonly utilized for verifying if two tables are identical. To achieve this, we must perform the MINUS operation twice, as demonstrated below:

SELECT col1,col2 FROM TheTable1 MINUS SELECT col1,col2 FROM TheTable2;
SELECT col1,col2 FROM TheTable2 EXCEPT SELECT col1,col2 FROM TheTable1;

We must verify that both compared tables are populated to ensure accurate results from MINUS. If either table is empty, MINUS may incorrectly suggest that the tables are equal.

## Precedence of SET Operators

The set operators (UNION, UNION ALL, EXCEPT, MINUS) have equal priority. Using parentheses to ensure the desired precedence when using these operators is highly recommended.

Topics of relevance:

• shikher says:

How to connect for Teradata freelancing projects. pls help