In this article, we will learn about Teradata set operators, how they work, how we use them to achieve optimal performance, and how Teradata optimizes them.
Teradata UNION vs. UNION ALL
What is the difference between UNION and UNION all in Teradata?
The Teradata UNION statement combines the result sets of at least two queries. Unlike UNION ALL, UNION also removes row duplicates. Each SELECT statement must retrieve the same number of columns. We must mention the columns in the same order, and the column data types for all SELECT statements must be compatible (more about this later). Here is an example for UNION and UNION ALL
SELECT col1 FROM TheTable1 UNION SELECT col1 FROM TheTable2; 'A' 'B' SELECT col1 FROM TheTable1 UNION ALL SELECT col1 FROM TheTable2; 'A' 'A' 'B'
If the datatypes for the same column of different select statements don’t match, the very first SELECT statements will define the datatype for all rows of the column; if needed, Teradata does an implicit casting:
CREATE TABLE Customer ( CustomerId INTEGER, LastName CHAR(10) ) PRIMARY INDEX (CustomerId);
CREATE TABLE Customer_Hist ( CustomerId INTEGER, LastName CHAR(200) ) PRIMARY INDEX (CustomerId);
SELECT LastName FROM Customer UNION SELECT LastName FROM Customer_Hist; 'Vallin' 'A_very_lon' --> Truncated after 10 characters
SELECT LastName FROM Customer_Hist UNION SELECT LastName FROM Customer; 'Vallin' 'A_very_long_last_name' --> Not truncated
The last names of longer than ten characters are truncated from the Customer_Hist table in the example above. An error is difficult to detect because the query runs without errors; incompatible data types lead to runtime errors.
Truncation in UNION statements is one of the few cases where the query order of an SQL statement influences the result set. The only other case I know of is using the RANK() OVER function.
Teradata UNION ALL vs. UNION Performance
The performance of UNION ALL is always better than that of UNION.
Since Teradata UNION only returns distinct rows, Teradata must sort all select statements’ rows into a common spool. For this purpose, it might be necessary to redistribute the rows because the rows of all result sets must have the same primary index for the sort step. The redistribution of the rows and the subsequent sorting require many IOs for large tables. Therefore we only should use UNION if we need to remove row duplicates. If there are no row duplicates in the data, we should always avoid the additional sort (and eventually redistribution) step by using UNION ALL. Use Teradata UNION ALL to avoid sorting whenever possible, especially when retrieving data for a branch from outside Teradata, for example, from the HDFS filesystem.
Significance of the Character Set
We have discussed how different data types can affect the UNION or UNION ALL result set. The same is true for the character set. In the example shown below, the column “col1” of table “Table1” is defined as a character set LATIN, but column “col1” of table “Table2” is defined as a character set UNICODE.
A runtime error will occur if table “Table2” contains UNICODE characters in column “col1”, because the character set of the first select statement defines the character set of the result set:
SELECT col1 FROM Table1 UNION SELECT col1 FROM Table2
Teradata UNION ALL Enhancements for Performance
The simplest method to merge all rows of a UNION ALL is to put the rows of all branches into a common spool. This approach is still used in Teradata, but nowadays (starting with Teradata Release 16), many additional optimizations can make the UNION ALL operator more efficient. These optimizations include:
- Pushing of Joins into the UNION ALL branches
- Grouping branches before join pushing
- Iterative join decomposition of a pushed join into multiple joins
- Optimized counting of rows
- Multisource Joins
We will look at these optimizations in detail to understand how they work and their benefits. Please note that all described optimizations are performed in the context of cost-based optimization. Therefore, it is crucial to provide the Optimizer with all required statistics. All optimizations can be applied to views and derived sub queries as well.
Teradata Join Pushing
Join Pushing aims to prevent spooling of all UNION ALL branches into a common spool before further processing. Instead, the Optimizer pushes the joins into all or some branches. Here is an example. The figure below shows the execution plan without join pushing (both branches of the UNION ALL are spooled into a common spool); in the next step, the joins with Table 1 and Table 2 are done:
The following figure shows a possible execution plan with Join Pushing applied. Whether this plan is chosen depends on how the Optimizer estimates the costs. The Optimizer can choose between different plans. The plan below is just one of them. Two other possible plans are to join Table 2 with Table 3 and Table 4 or to join Table 1 and Table 2 first and push the result into the branches.
The advantage of join Pushing over the classic plan is that many alternative plans are now available to the Optimizer. Spooling all branches into a common spool can create issues such as “out of spool space” for branches containing massive rows.
When can the join be pushed to the UNION branches without changing the query result?
For UNION ALL, Teradata can always apply Join Pushing. We can easily verify this statement by comparing two different approaches. In the first one, we perform the UNION ALL and then the join. In the second one, vice versa:
Table1 Join (Table2 UNION ALL Table3)
gives the same result as:
(Table1 Join Table2 UNION ALL Table1 Join Table3).
In some cases, Join Pushing can also be used for UNION still delivering the correct result. This is the case for INNER joins if the join does not increase the number of rows. As an example, a join over unique join columns fulfills this requirement. For OUTER joins, the conditions for join pushing become even more complicated.
Table1 LEFT JOIN (Table2 UNION Table3)) cannot be pushed because the number of unmatched rows can differ depending on whether the join is executed before or after the UNION.
(Table2 UNION Table3) left Join Table1, on the other hand, can be pushed because here, both methods (join after UNION vs. UNION after join) will return the same unmatched rows.
As you will have noticed, it is recommended to use UNION ALL instead of UNION whenever possible for the reasons given above. Because then it is guaranteed that join pushing is available for performance optimization.
Teradata Branch Grouping
Branch Grouping is a method to reduce the number of UNION ALL branches and is applied before Join Pushing. Teradata can use aggregations for a single or group of UNION ALL branches. Instead of spooling all branches into a common spool and doing the aggregation afterward, early aggregation can be applied for two or more branches. The performance improvement can be significant if the partial aggregations decrease spool file size. Branch grouping is used for the branches for which the data geography is determined on the fly (this is the case if they include joins or aggregations). In contrast, the branches with a known data geography (such as base table retrieves without aggregation) are not considered for grouping.
Teradata Join Decomposition
Join decomposition is another tool the Optimizer can use to improve the execution plan. If we recall the Join Pushing example above, in one plan step, Table 1 was joined with Table 3 as shown in the following figure (The Optimizer pushed the join into the UNION ALL):
Join decomposition means splitting a join into several joins. Join decomposition works great with columnar tables, as shown in the figure below. The Optimizer divides the join into two joins, each of them joining a subset of the columns of Table 3:
Join decomposition is not only limited to columnar tables. For row-oriented tables, The Optimizer can use a secondary index for joining; in our example, it contains the columns needed for the second join of table 3 (columns c,d).
Teradata Optimized Row Counts
Optimized counting of rows is another enhancement introduced for SELECT COUNT(*) queries based on UNION ALL.
Instead of moving all UNION ALL branches into a common spool and counting the number of rows one by one, the Optimizer can now use the cylinder index (the same technique used for single table row counting) to calculate each branch’s rows very efficiently on the cylinder index and sums the final result up afterward.
Teradata Multisource Joins
Finally, there is the possibility of doing multisource joins. Let us go back to the above example in which we applied join pushing. We were joining a table against the first branch of the UNION ALL and the second branch of the UNION ALL. We moved the join into the branches. Multisource joining is an optimization that The Optimizer can apply if the data geography of both joins is the same (same join strategy, i.e., the same Join preparation and join method is required). If this prerequisite is fulfilled, The Optimizer can do the joins at once. The advantage is that many required join tasks, such as sorting, building hash tables, etc., can be done in one go.
All optimizations we discussed are only available for UNION ALL; unfortunately, deduplicating the result set for UNION queries does not allow using these optimizations.
INTERSECT / INTERSECT ALL
The INTERSECT operator returns as the result row duplicate of two select statements (all rows identical in both SELECT statements).
SELECT col1,col2 FROM TheTable1
SELECT col1,col2 FROM TheTable2
Similar to Teradata UNION ALL, INTERSECT ALL returns duplicates without filtering them.
We recommend using INTERSECT ALL instead of INTERSECT for better performance (the reasons are similar to UNION ALL described above).
MINUS / EXCEPT
The MINUS/EXCEPT operator returns a result set containing all rows of the first select statement that are not present in the second select statement; MINUS and EXCEPT are two different names for the same functionality:
SELECT col1,col2 FROM TheTable1
MINUS is often used to determine whether two tables are entirely identical. For this, we have to execute the MINUS operator two times, as shown below:
SELECT col1,col2 FROM TheTable1
Precedence of SET Operators
The set operators (UNION, UNION ALL, EXCEPT, MINUS) have the same priority. We recommend always using parentheses to ensure the desired priority when using SET operators.