Teradata MERGE INTO vs. UPDATE
This article compares the UPDATE statement to the MERGE INTO statement, analyzing their respective performance differences and limitations.
The Teradata MERGE INTO statement positively impacts performance by reducing I/O operations due to certain properties related to its performance.
- MERGE INTO does not require spool space
- MERGE INTO typically consumes fewer CPU seconds
- MERGE INTO typically consumes fewer IOs
MERGE INTO offers an advantage in lower IOs as Teradata processes each data block only once without building up any spool. This benefit is particularly significant when secondary indexes are present on the table, as they are also processed block by block.
We designed a test scenario comprising two tables with around 3.5 million rows each to showcase the performance advantages. These tables share identical primary indexes and partitioning.
CREATE MULTISET TABLE DWHPro.Sales4
(
SalesId INTEGER NOT NULL,
Quantity DECIMAL(18,2),
SalesDate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( SalesId ) PARTITION BY
RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2022-12-31' EACH INTERVAL '1' MONTH ,NO RANGE);
CREATE MULTISET TABLE DWHPro.Sales5
(
SalesId INTEGER NOT NULL,
Quantity DECIMAL(18,2),
SalesDate DATE FORMAT 'YYYY-MM-DD'
) PRIMARY INDEX ( SalesId )PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2022-12-31' EACH INTERVAL '1' MONTH ,NO RANGE);
I executed an UPDATE statement twice, once with UPDATE and the other with MERGE INTO.
UPDATE Sales4 FROM Sales5 SET Quantity = Sales5.Quantity
WHERE
Sales4.SalesID=Sales5.SalesID AND
Sales4.SalesDate=Sales5.SalesDate;
Firstly, let’s examine the UPDATE statement’s execution plan.
4) We do an all-AMPs JOIN step from DWHPRO.Sales5 by way of a RowHashmatch scan with no residual conditions, which is joined toDWHPRO.Sales4 by way of a RowHash match scan with no residualconditions. DWHPRO.Sales5 and DWHPRO.Sales4 are joined using arowkey-based merge join, with a join condition of ("(DWHPRO.Sales4.SalesId = DWHPRO.Sales5.SalesId) AND(DWHPRO.Sales4.SalesDate = DWHPRO.Sales5.SalesDate)"). The inputtables DWHPRO.Sales5 and DWHPRO.Sales4 will not be cached inmemory, but DWHPRO.Sales5 is eligible for synchronized scanning.The result goes into Spool 1 (all_amps), which is built locally onthe AMPs.Then we do a SORT to order Spool 1 by the sort key inspool field1 (DWHPRO.Sales4.ROWID). The result spool file willnot be cached in memory. The size of Spool 1 is estimated withlow confidence to be 3,461,454 rows (89,997,804 bytes). Theestimated time for this step is 1 minute and 5 seconds.5) We do a MERGE Update to DWHPRO.Sales4 from Spool 1 (Last Use) viaROWID. The size is estimated with low confidence to be 3,461,454rows (100,382,166 bytes). The estimated time for this step is 1minute and 52 seconds.-> No rows are returned to the user as the result of statement 1.The total estimated time is 2 minutes and 57 seconds.
An AMP-local spool is generated and subsequently sorted by ROWID, resulting in a substantial amount of IOs, particularly for extensive tables.
Now, let’s consider the same task accomplished with the usage of the MERGE INTO statement:
MERGE INTO Sales4 USING Sales5
ON
Sales4.SalesID=Sales5.SalesId AND
Sales4.SalesDate=Sales5.SalesDate
WHEN MATCHED THEN UPDATE SET Quantity=Sales5.Quantity;
Here is the Execution Plan for the MERGE INTO statement:
1) First, we lock a distinct DWHPRO."pseudo table" for read on aRowHash to prevent global deadlock for DWHPRO.Sales5.2) Next, we lock a distinct DWHPRO."pseudo table" for write on aRowHash to prevent global deadlock for DWHPRO.Sales4.3) We lock DWHPRO.Sales5 for read, and we lock DWHPRO.Sales4 forwrite.4) We do an all-AMPs merge with matched updates into DWHPRO.Sales4from DWHPRO.Sales5 with a condition of ("(DWHPRO.Sales4.SalesId =DWHPRO.Sales5.SalesId) AND (DWHPRO.Sales4.SalesDate =DWHPRO.Sales5.SalesDate)"). The number of rows merged isestimated with no confidence to be 3,461,454 rows.-> No rows are returned to the user as the result of statement 1.
Once we’ve executed both options, we can compare the necessary resources.
Logical IOs | CPU Seconds | Spool (GB) | |
UPDATE | 5991,00 | 5,97 | 3 |
MERGE INTO | 4441,00 | 3,57 | 0 |
MERGE INTO vs. UPDATE
Compared to the UPDATE method, the MERGE INTO statement necessitates fewer IOs and CPU seconds by approximately 26% and 40%, respectively. Hence, choosing MERGE INTO over UPDATE could result in significant resource savings.
As previously noted, indexes on affected tables are maintained block by block with MERGE INTO rather than row by row. A test scenario demonstrates that this results in even greater resource savings with the MERGE INTO statement.
CREATE INDEX (Quantity) ON Sales4; -- Create a NUSI
We have made a single modification to the test by introducing a solitary NUSI, causing a shift.
Logical IOs | CPU Seconds | Spool (GB) | |
UPDATE | 12439,00 | 11,10 | 3 |
MERGE INTO | 4459,00 | 4,46 | 0 |
MERGE INTO vs. UPDATE with NUSI
The QueryLog results reveal that the UPDATE statement consumes twice the required IOs and CPU seconds compared to the test that lacked an index. Notably, the NUSI maintenance necessitates minimal resources as the CPU seconds and IOs for the MERGE INTO statement remains nearly unchanged.
Using the MERGE INTO statement results in a significant reduction of 64% in IOs. It’s worth noting that the advantage depends on the tables’ size and the number of changed rows. While adding more indexes to the table can further extend this example, we will leave this to the interested reader. It’s important to consider that the use of MERGE INTO may not always be the optimal choice, as restrictions may prevent its use in place of the UPDATE statement.
Using MERGE INTO improves performance as it executes INSERT and UPDATE statements concurrently, unlike single INSERT and UPDATE statements.
However, there is a notable constraint when using MERGE INTO. This approach exclusively works with partitioned tables if all row partition columns are present in the join condition of both tables. Let’s utilize the two tables from the previous test scenario to aid comprehension.
PRIMARY INDEX ( SalesId ) PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2022-12-31' EACH INTERVAL '1' MONTH, NO RANGE);
The primary index is SalesId, and the partition column is SalesDate. Attempting a MERGE INTO statement like the one displayed below would result in an error:
MERGE INTO Sales4 USING Sales5 ON
Sales4.SalesID=Sales5.SalesId AND
Sales4.SalesDate=Sales5.SalesDate -- If SalesDate is missing in the join, we will get an error when MATCHED THEN UPDATE SET Quantity=Sales5.Quantity;
If SalesId is a surrogate key and SalesDate is already included when creating the key, including SalesDate in the joins may be redundant without affecting the outcome. However, it is not always advisable as it may compromise the source independence of our data. Blindly switching all UPDATE statements to MERGE INTO statements is not recommended, and we should always evaluate its impact on our data warehouse design.
Case1:
USING
SPV1 (DATE)
MERGE INTO DWHPRO.TEST1 AS Test_Feed
USING
(SELECT a1, a2, a3 from DWHPRO.TEST2
WHERE test1.a1 = test2.a1
)
…
…
WHEN MATCHED THEN UPDATE SET DATE = “Current_Date”
…..
WHEN NOT MATCHED THEN INSERT …. …
;
Case2: Delete from DWHPRO.TEST1 … WHERE A1, A2, A3 IN
(SEL A1, A2 , A3 FROM DWHPRO.TEST2 FROM DWHPRO.TEST2);
Insert into DWHPRO.Test1…. ( );
@Roland, In the above case (If Tables are Non partitioned),
Wouldn’t Deleting all the matching records and then INSERT the new records will be more beneficial?
I see, most of the Developers prefer writing MERGE statements. Can you please share your thoughts?
Thanks for this. Can we use a merge into using values rather than a subquery as the source table?
Thanks for the information, it is very helpful.