
Teradata vs Redshift Merge Join Strategies

By DWH Pro Admin

February 13, 2020


Teradata and Redshift are very similar in architecture and how data is distributed. For Teradata, it is the AMPs that hold part of the data in a table and for Redshift, it is the slices.

There are major differences when it comes to how the data is stored on the file system.

Teradata can also be used as a column store (and this decision can be made per table), but its strength is definitely the great performance of storing tables in the traditional row format, without having to switch to columnar tables and accept their disadvantages in Teradata (which we will cover in another blog post).

We won’t go into Teradata Columnar Storage in this comparison of join strategies, especially since Teradata transforms the columns of a columnar table into rows before processing the data anyway (early materialization). There is therefore no strategic difference in the processing of joins, even if only the required columns are read from columnar tables.

Merge Join Preparation: Co-Locating Rows

Both Teradata and Redshift use hashing to distribute data evenly among the parallel units (Teradata AMPs, Redshift Slices).
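The hashing idea above can be sketched in a few lines. This is a minimal illustration, not either vendor's actual hash algorithm; the number of units and the use of MD5 are assumptions for the example:

```python
# Sketch: distributing rows across parallel units (AMPs in Teradata,
# slices in Redshift) by hashing the distribution key.
import hashlib

NUM_UNITS = 4  # e.g. 4 AMPs or 4 slices (illustrative)

def unit_for(distribution_key: str) -> int:
    """Map a distribution-key value to a parallel unit via hashing."""
    digest = hashlib.md5(distribution_key.encode()).hexdigest()
    return int(digest, 16) % NUM_UNITS

rows = ["cust_1001", "cust_1002", "cust_1003", "cust_1004"]
placement = {key: unit_for(key) for key in rows}
```

The key property both systems rely on is that the mapping is deterministic: the same key value always lands on the same unit, so rows that join on equal values are automatically co-located.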

As we know from Teradata, rows can only be joined if they are on the same AMP.

Similarly, Redshift requires that the data be on the same slice. So there is not much difference.

Join preparation in Redshift is similar to that in Teradata: copying the entire table to all AMPs in Teradata is equivalent to a broadcast in Redshift.

Redistribution works the same way in both systems: Teradata hashes the join columns, while Redshift determines a new DISTKEY and distributes the rows accordingly.

As in Teradata, it is possible that none of the tables needs to be redistributed at all because both have the same DISTKEY, or both tables need to be redistributed.

Like Teradata, Redshift always tries to copy or redistribute the smaller table.
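The redistribution cases described above can be sketched as a small planning function. This is purely illustrative pseudologic, not either optimizer's real decision code; the parameter names and return strings are my own:

```python
# Sketch: deciding whether tables must be moved before a co-located join.
# "distkey" stands for the Teradata primary index or Redshift DISTKEY.
def plan_colocation(left_distkey: str, right_distkey: str, join_key: str) -> str:
    if left_distkey == join_key and right_distkey == join_key:
        return "no redistribution needed"          # rows already co-located
    if left_distkey == join_key:
        return "redistribute right table on join key"
    if right_distkey == join_key:
        return "redistribute left table on join key"
    return "redistribute both tables (or broadcast the smaller one)"
```

The fourth case is where the cost-based choice between redistributing both tables and broadcasting the smaller one comes in.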

Where the two systems differ is in Redshift’s ability to keep copies of a table permanently on all slices (DISTSTYLE ALL). This may sound like an advantage at first, but it only makes sense for small tables, and I don’t expect it to result in an enormous performance advantage over Teradata.

Comparison Of Join Methods

The Teradata Merge Join

The requirement is that both tables have the same primary index.

For non-row-partitioned tables, this automatically means that the rows are sorted by row hash, so a binary search can be performed from a hash value of the left table into the right table and vice versa. (Depending on whether a slow-path or fast-path merge join is performed, the joining of rows differs slightly, but the method is the same.)
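The hash-ordered merge with binary search can be sketched as follows. This is a simplified model of the technique, not Teradata's AMP internals; the tuple layout is an assumption for the example:

```python
# Sketch: merge join over two inputs already sorted by row hash.
# Each input is a list of (row_hash, payload) tuples in ascending hash order.
from bisect import bisect_left

def merge_join(left, right):
    """Return matched (left_payload, right_payload) pairs for equal hashes."""
    results = []
    right_hashes = [h for h, _ in right]
    for h, left_payload in left:
        # Binary search for the first matching hash in the right table.
        i = bisect_left(right_hashes, h)
        while i < len(right) and right[i][0] == h:
            results.append((left_payload, right[i][1]))
            i += 1
    return results
```

Because both inputs are sorted by the same hash order, each probe is a logarithmic-time search rather than a full scan, which is what makes the merge join cheap once the preparation (co-location and sorting) is done.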

For row-partitioned tables, the partition columns must also be part of the join condition, so that partitions can be compared directly (because within each partition, the rows are sorted by hash value).

The Redshift Merge Join

Merge Join in Redshift requires that the DISTKEY of both tables be the same. So there is no difference between the two systems. Accordingly, in both systems, one or both tables may have to be redistributed to a new primary index (Teradata) or DISTKEY (Redshift).

In Redshift, both tables must also have the same SORTKEY. The problem in Redshift is that the sorting of data is not as elegant as in Teradata.

In Teradata, the rows are always automatically sorted in ascending order by row hash. In Redshift, the DBA has to ensure that deleted data is finally removed and that the data stays sorted (both done with the VACUUM command).

If more than 20% of the data is not sorted, Redshift does not use a merge join, as sorting a huge amount of data is expensive.
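The 20% rule above amounts to a simple threshold check. The function below is a hedged sketch of that decision, not Redshift's actual planner logic; in practice you would read the unsorted fraction from the system tables:

```python
# Sketch: is a merge join still worthwhile given the unsorted fraction?
def prefers_merge_join(total_rows: int, unsorted_rows: int,
                       threshold: float = 0.20) -> bool:
    """Return True if the unsorted share is small enough for a merge join."""
    if total_rows == 0:
        return True  # an empty table is trivially "sorted"
    return unsorted_rows / total_rows <= threshold
```

This is why regular maintenance matters in Redshift: letting the unsorted share creep past the threshold silently downgrades joins to a more expensive method.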

Both systems rely on statistics when it comes to the creation of join plans. But as you can see, Redshift is much more maintenance-intensive than Teradata because there are additional things to consider.

Summary

Although both Redshift and Teradata offer the Merge Join method, Teradata is more mature and easier to maintain.


DWH Pro Admin

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.
