Teradata MERGE INTO – the best tricks

Roland Wenzlofsky

June 18, 2022


Teradata MERGE INTO vs. UPDATE

In this article, we compare an UPDATE statement with a MERGE INTO statement and work out the differences in performance. We also show a MERGE INTO statement’s limitations compared to an UPDATE statement.

Considering performance alone, the Teradata MERGE INTO statement has some properties that positively affect the number of IOs and thus the performance:

  • MERGE INTO does not require spool space
  • MERGE INTO typically consumes fewer CPU seconds
  • MERGE INTO typically consumes fewer IOs

The advantage of MERGE INTO in lower IOs is that Teradata must process each data block only once, and no spool is built up. The benefit is even more significant if secondary indexes are present on the table since these are also processed block by block.

To demonstrate the performance benefits, we have created a test scenario consisting of two tables that both contain approximately 3.5 million rows and have the same primary index as well as the same partitioning:

CREATE MULTISET TABLE DWHPro.Sales4 ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
SalesId INTEGER NOT NULL,
Quantity DECIMAL(18,2),
SalesDate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( SalesId )
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2022-12-31' EACH INTERVAL '1' MONTH ,NO RANGE);

CREATE MULTISET TABLE DWHPro.Sales5 ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
SalesId INTEGER NOT NULL,
Quantity DECIMAL(18,2),
SalesDate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( SalesId )
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2022-12-31' EACH INTERVAL '1' MONTH ,NO RANGE);
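The article does not show how the test rows were generated. One possible sketch, using the system calendar view Sys_Calendar.CALENDAR to produce one row per day (the RANDOM quantities are placeholders; the actual test data had about 3.5 million rows):

INSERT INTO DWHPro.Sales5 (SalesId, Quantity, SalesDate)
SELECT
    c.day_of_calendar,                               -- unique INTEGER per calendar day
    CAST(RANDOM(1, 100000) AS DECIMAL(18,2)) / 100,  -- placeholder quantity
    c.calendar_date
FROM Sys_Calendar.CALENDAR c
WHERE c.calendar_date BETWEEN DATE '1900-01-01' AND DATE '2022-12-31';

-- Copy the rows so that every row in Sales4 has a matching row in Sales5
INSERT INTO DWHPro.Sales4 SELECT * FROM DWHPro.Sales5;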

Next, we performed the same update twice: first with an UPDATE statement, then with MERGE INTO:

UPDATE Sales4 FROM Sales5 SET Quantity = Sales5.Quantity 
WHERE  
     Sales4.SalesID=Sales5.SalesID AND 
     Sales4.SalesDate=Sales5.SalesDate
;

Let us first analyze the execution plan of the UPDATE statement:


4) We do an all-AMPs JOIN step from DWHPRO.Sales5 by way of a RowHash
match scan with no residual conditions, which is joined to
DWHPRO.Sales4 by way of a RowHash match scan with no residual
conditions. DWHPRO.Sales5 and DWHPRO.Sales4 are joined using a
rowkey-based merge join, with a join condition of (
"(DWHPRO.Sales4.SalesId = DWHPRO.Sales5.SalesId) AND
(DWHPRO.Sales4.SalesDate = DWHPRO.Sales5.SalesDate)"). The input
tables DWHPRO.Sales5 and DWHPRO.Sales4 will not be cached in
memory, but DWHPRO.Sales5 is eligible for synchronized scanning.
The result goes into Spool 1 (all_amps), which is built locally on
the AMPs. Then we do a SORT to order Spool 1 by the sort key in
spool field1 (DWHPRO.Sales4.ROWID). The result spool file will
not be cached in memory. The size of Spool 1 is estimated with
low confidence to be 3,461,454 rows (89,997,804 bytes). The
estimated time for this step is 1 minute and 5 seconds.
5) We do a MERGE Update to DWHPRO.Sales4 from Spool 1 (Last Use) via
ROWID. The size is estimated with low confidence to be 3,461,454
rows (100,382,166 bytes). The estimated time for this step is 1
minute and 52 seconds.
-> No rows are returned to the user as the result of statement 1.
The total estimated time is 2 minutes and 57 seconds.

As we can see above, an AMP-local spool is created, which must be sorted by ROWID. Both steps cause a lot of IOs, especially with large tables.

In comparison, let’s look at the same task solved using MERGE INTO:

MERGE INTO Sales4 USING Sales5 
ON
Sales4.SalesID=Sales5.SalesId AND
Sales4.SalesDate=Sales5.SalesDate
WHEN MATCHED THEN UPDATE SET Quantity=Sales5.Quantity;

The following is the Execution Plan for the MERGE INTO statement:

1) First, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.Sales5.
2) Next, we lock a distinct DWHPRO."pseudo table" for write on a
RowHash to prevent global deadlock for DWHPRO.Sales4.
3) We lock DWHPRO.Sales5 for read, and we lock DWHPRO.Sales4 for
write.
4) We do an all-AMPs merge with matched updates into DWHPRO.Sales4
from DWHPRO.Sales5 with a condition of ("(DWHPRO.Sales4.SalesId =
DWHPRO.Sales5.SalesId) AND (DWHPRO.Sales4.SalesDate =
DWHPRO.Sales5.SalesDate)"). The number of rows merged is
estimated with no confidence to be 3,461,454 rows.
-> No rows are returned to the user as the result of statement 1.

After we have executed both variants, we can now compare the required resources:

             Logical IOs   CPU Seconds
UPDATE          5,991.00         5.973
MERGE INTO      4,441.00         3.570

MERGE INTO vs. UPDATE

As we can see, the MERGE INTO statement requires about 26% fewer IOs and 40% fewer CPU seconds than the UPDATE statement. By solving the task with MERGE INTO instead of UPDATE, we save considerable resources.
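Figures like these can be taken from the database query log, assuming DBQL logging is enabled on the system. A sketch using the DBC.QryLogV view (the LIKE patterns are placeholders matching our two test statements):

SELECT
    QueryText,
    TotalIOCount,                                    -- logical IOs
    AMPCPUTime,                                      -- CPU seconds
    SpoolUsage / (1024 * 1024 * 1024.00) AS SpoolGB  -- peak spool in GB
FROM DBC.QryLogV
WHERE UserName = USER
  AND (QueryText LIKE 'UPDATE Sales4%' OR QueryText LIKE 'MERGE INTO Sales4%')
ORDER BY StartTime DESC;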

We have already briefly mentioned that secondary indexes on the affected table are also maintained block by block by MERGE INTO, not row by row. The following test scenario shows that in such cases the resource savings shift even further towards the MERGE INTO statement:

CREATE INDEX (Quantity) ON Sales4; -- Create a NUSI

We changed only one thing in this test: we put a single NUSI on the table being updated.

             Logical IOs   CPU Seconds
UPDATE         12,439.00        11.103
MERGE INTO      4,459.00         4.460

MERGE INTO vs. UPDATE with NUSI

Looking at the resource figures in the query log, we can see that the IOs and CPU seconds required by the UPDATE statement have roughly doubled compared to the test without the index. The CPU seconds and IOs for the MERGE INTO statement are almost unchanged, i.e., maintaining the NUSI requires hardly any additional resources. We are now talking about 64% fewer IOs for the MERGE INTO statement, which is considerable. We could extend this example further and add more indexes to the table, but we leave this to the interested reader. Finally, we should note that the advantage depends on the size of the tables and the number of changed rows. MERGE INTO may not always be the better choice, especially since some restrictions make it impossible to substitute it for an UPDATE statement.

Since MERGE INTO can execute INSERT and UPDATE logic in a single statement, it can also outperform separate INSERT and UPDATE statements.
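Applied to our test tables, such a combined upsert could look as follows: rows of Sales5 that already exist in Sales4 are updated, and all others are inserted in the same pass:

MERGE INTO Sales4 USING Sales5
ON
    Sales4.SalesId = Sales5.SalesId AND
    Sales4.SalesDate = Sales5.SalesDate
WHEN MATCHED THEN
    UPDATE SET Quantity = Sales5.Quantity
WHEN NOT MATCHED THEN
    INSERT (SalesId, Quantity, SalesDate)
    VALUES (Sales5.SalesId, Sales5.Quantity, Sales5.SalesDate);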

But now we come to a significant limitation of MERGE INTO: it can only be used on row-partitioned tables if all partitioning columns also appear in the join condition of the two tables. For a better understanding, we take our two tables from the previous test scenario.

PRIMARY INDEX ( SalesId ) PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2022-12-31' EACH INTERVAL '1' MONTH ,NO RANGE);

The primary index is the column SalesId, and the partitioning column is SalesDate. Therefore, a MERGE INTO statement whose join condition omits SalesDate, as shown below, is not possible and leads to an error:

MERGE INTO Sales4 USING Sales5
ON
Sales4.SalesId=Sales5.SalesId  -- SalesDate is missing from the join condition, so we get an error
WHEN MATCHED THEN UPDATE SET Quantity=Sales5.Quantity;

Suppose the column SalesId is a surrogate key whose generation already incorporates SalesDate. In that case, we could redundantly include SalesDate in the join without changing the result. But this does not always make sense, for example, if it means giving up the source independence of our data. It is therefore not advisable to blindly switch all UPDATE statements to MERGE INTO; we should always consider what effect this has on our data warehouse design.
