Teradata Performance – Don’t UPDATE, MERGE!

By Roland Wenzlofsky

October 10, 2015


Using MERGE INTO requests instead of UPDATE statements offers several performance benefits:

  • No spool is required
  • Fewer CPU seconds are consumed
  • Fewer disk IOs are needed

The IO reduction arises because each data block is touched only once and no spooling is required during the merge operation.

The advantage of the MERGE INTO statement over the traditional UPDATE statement becomes even bigger when secondary indexes are present, as index maintenance for the MERGE INTO statement is done for whole data blocks, not per row.

As usual, I will demonstrate this with a small example, which is based on the following tables:

CREATE MULTISET TABLE DWHPro.Sales4 ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
SalesId INTEGER NOT NULL,
Quantity DECIMAL(18,2),
SalesDate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( SalesId )
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-31' EACH INTERVAL '1' MONTH ,NO RANGE);

CREATE MULTISET TABLE DWHPro.Sales5 ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
SalesId INTEGER NOT NULL,
Quantity DECIMAL(18,2),
SalesDate DATE FORMAT 'YYYY-MM-DD')
PRIMARY INDEX ( SalesId )
PARTITION BY RANGE_N(SalesDate BETWEEN DATE '1900-01-01' AND DATE '2016-12-31' EACH INTERVAL '1' MONTH ,NO RANGE);

Both tables have the same primary index and partitioning and contain about 3.5 million rows.
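The article does not show how the test tables were populated; the following is just a minimal sketch of one way to generate roughly 3.5 million rows, assuming the sys_calendar.CALENDAR view is available (the date windows and the RANDOM quantities are my own choices, not taken from the original setup):

/* Hypothetical data generation: ~9,500 days x 369 days gives about
   3.5 million rows, all within the tables' partition range. */
INSERT INTO DWHPro.Sales5 (SalesId, Quantity, SalesDate)
SELECT
ROW_NUMBER() OVER (ORDER BY c.calendar_date, n.calendar_date) AS SalesId,
RANDOM(1, 99999) / 100.00 AS Quantity,
c.calendar_date AS SalesDate
FROM sys_calendar.CALENDAR c
CROSS JOIN sys_calendar.CALENDAR n
WHERE c.calendar_date BETWEEN DATE '1990-01-01' AND DATE '2015-12-31'
AND n.calendar_date BETWEEN DATE '2015-01-01' AND DATE '2016-01-04';

/* Copy identical rows into Sales4 so both tables start out the same. */
INSERT INTO DWHPro.Sales4 SELECT * FROM DWHPro.Sales5;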

I executed two statements that perform exactly the same update; the first is a traditional UPDATE, the second the equivalent MERGE INTO:

UPDATE Sales4
FROM Sales5
SET Quantity = Sales5.Quantity
WHERE Sales4.SalesId = Sales5.SalesId
AND Sales4.SalesDate = Sales5.SalesDate;
1) First, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.Sales5.
2) Next, we lock a distinct DWHPRO."pseudo table" for write on a
RowHash to prevent global deadlock for DWHPRO.Sales4.
3) We lock DWHPRO.Sales5 for read, and we lock DWHPRO.Sales4 for
write.
4) We do an all-AMPs JOIN step from DWHPRO.Sales5 by way of a RowHash
match scan with no residual conditions, which is joined to
DWHPRO.Sales4 by way of a RowHash match scan with no residual
conditions. DWHPRO.Sales5 and DWHPRO.Sales4 are joined using a
rowkey-based merge join, with a join condition of (
"(DWHPRO.Sales4.SalesId = DWHPRO.Sales5.SalesId) AND
(DWHPRO.Sales4.SalesDate = DWHPRO.Sales5.SalesDate)"). The input
tables DWHPRO.Sales5 and DWHPRO.Sales4 will not be cached in
memory, but DWHPRO.Sales5 is eligible for synchronized scanning.
The result goes into Spool 1 (all_amps), which is built locally on
the AMPs. Then we do a SORT to order Spool 1 by the sort key in
spool field1 (DWHPRO.Sales4.ROWID). The result spool file will
not be cached in memory. The size of Spool 1 is estimated with
low confidence to be 3,461,454 rows (89,997,804 bytes). The
estimated time for this step is 1 minute and 5 seconds.
5) We do a MERGE Update to DWHPRO.Sales4 from Spool 1 (Last Use) via
ROWID. The size is estimated with low confidence to be 3,461,454
rows (100,382,166 bytes). The estimated time for this step is 1
minute and 52 seconds.
-> No rows are returned to the user as the result of statement 1.
The total estimated time is 2 minutes and 57 seconds.

As the EXPLAIN statement reveals, a spool is created and has to be sorted before the MERGE update takes place. Now let's take a look at the MERGE INTO example request:

MERGE INTO Sales4
USING Sales5
ON Sales4.SalesId = Sales5.SalesId
AND Sales4.SalesDate = Sales5.SalesDate
WHEN MATCHED THEN
UPDATE SET Quantity = Sales5.Quantity;
1) First, we lock a distinct DWHPRO."pseudo table" for read on a
RowHash to prevent global deadlock for DWHPRO.Sales5.
2) Next, we lock a distinct DWHPRO."pseudo table" for write on a
RowHash to prevent global deadlock for DWHPRO.Sales4.
3) We lock DWHPRO.Sales5 for read, and we lock DWHPRO.Sales4 for
write.
4) We do an all-AMPs merge with matched updates into DWHPRO.Sales4
from DWHPRO.Sales5 with a condition of ("(DWHPRO.Sales4.SalesId =
DWHPRO.Sales5.SalesId) AND (DWHPRO.Sales4.SalesDate =
DWHPRO.Sales5.SalesDate)"). The number of rows merged is
estimated with no confidence to be 3,461,454 rows.
-> No rows are returned to the user as the result of statement 1.

Now let's compare the resource usage of both statements:

             Logical IOs   CPU Seconds   Spool (GB)
UPDATE          5,991.00          5.97            3
MERGE INTO      4,441.00          3.57            0

Obviously, the MERGE INTO consumes about 26% fewer IOs and 40% fewer CPU seconds than the UPDATE statement.
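As a side note, the figures above come from the Query Log. Assuming DBQL logging is enabled, they can be retrieved with a query along the following lines (the LIKE filters are purely illustrative):

/* Sketch: pull IO, CPU, and spool consumption from the query log.
   SpoolUsage is logged in bytes, hence the conversion to GB. */
SELECT QueryText,
TotalIOCount AS LogicalIOs,
AMPCPUTime AS CPUSeconds,
SpoolUsage / (1024*1024*1024.00) AS SpoolGB
FROM DBC.DBQLogTbl
WHERE UserName = USER
AND (QueryText LIKE 'UPDATE Sales4%' OR QueryText LIKE 'MERGE INTO Sales4%')
ORDER BY StartTime DESC;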

As mentioned earlier, MERGE INTO performs index maintenance on whole data blocks. Therefore, I will show you another example, which highlights the impact of a NUSI being maintained at the same time:

CREATE INDEX (Quantity) ON Sales4;

Everything else stays the same. Here is the excerpt from the Query Log:

             Logical IOs   CPU Seconds   Spool (GB)
UPDATE         12,439.00         11.10            3
MERGE INTO      4,459.00          4.46            0

The table above shows that index maintenance was almost free with the MERGE INTO approach, while the IOs of the traditional UPDATE statement suffered from maintaining the NUSI. The MERGE INTO statement needs almost 64% fewer IOs than the UPDATE statement. The NUSI did not impact spool usage.

Beyond the example above, MERGE INTO can also be used to replace INSERT…SELECT, or to apply the UPDATE and INSERT steps at once, which increases performance even further.
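Here is a minimal sketch of such a combined update-and-insert ("upsert"), based on the tables from this article; this statement is my illustration and was not part of the measurements above:

/* Matched rows get updated; rows missing in Sales4 get inserted,
   all in a single pass over the target table. */
MERGE INTO Sales4
USING Sales5
ON Sales4.SalesId = Sales5.SalesId
AND Sales4.SalesDate = Sales5.SalesDate
WHEN MATCHED THEN
UPDATE SET Quantity = Sales5.Quantity
WHEN NOT MATCHED THEN
INSERT (SalesId, Quantity, SalesDate)
VALUES (Sales5.SalesId, Sales5.Quantity, Sales5.SalesDate);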

Conclusion:

While MERGE INTO may not be the right choice in every situation, and is not always possible (Teradata requires the ON clause to include an equality condition on the target table's primary index), you should consider it as an alternative whenever you are dealing with INSERTs and UPDATEs. It can drastically improve performance!

Roland Wenzlofsky


Roland Wenzlofsky is a graduate computer scientist and data warehouse professional who has been working with the Teradata database system for more than 20 years. He is experienced in the banking and telecommunications industries, with a strong focus on performance optimization.
