Teradata vs Redshift Merge Join Strategies
Teradata and Redshift are very similar in architecture and how data is distributed. For Teradata, it is the AMPs that hold part of the data in a table and for Redshift, it is the slices.
There are major differences when it comes to how the data is stored on the file system.
Teradata can also be used as a Column Store (and this decision can be made per table), but the strengths are definitely the great performance that comes from storing tables in the traditional row format without having to switch to columnar tables with all their disadvantages in Teradata (we will cover in another blog post).
We won't go into Teradata Columnar Storage in this comparison of join strategies, especially since Teradata anyway transforms the columns of a columnar table into rows before processing the data (early materialization) and therefore there is no strategic difference in the processing of the joins (even if only required columns are selected for columnar tables).
Merge Join Preparation: Co-Locating Rows
Both Teradata and Redshift use hashing to distribute data evenly among the parallel units (Teradata AMPs, Redshift Slices).
As we know from Teradata, rows can only be joined if they are on the same AMP.
Similarly, Redshift requires that the data be on the same slice. So there is not much difference.
Join Preparation in Redshift is similar to that in Teradata.
Copying the entire table to all AMPs in Teradata is equivalent to broadcast in Redshift.
Redistribution in Teradata is the same as in Redshift and happens by hashing the join columns in Teradata, in Redshift a new DISTKEY is determined and then distributed.
As in Teradata, it is possible that none of the tables needs to be redistributed at all because both have the same DISTKEY, or both tables need to be redistributed.
Like Teradata, Redshift always tries to copy or distribute the smaller table,
Where the two systems differ is in Redshift's ability to keep copies of a table permanently on all slices. This may sound like an advantage at first, but I would like to point out that this only makes sense for small tables. I don't think that this will result in an enormous performance advantage over Teradata.
Comparison Of Join Methods
The Teradata Merge Join
The requirements for the data are that both tables have the same primary index.
This automatically means that for non-row-partitioned tables, since rows are then sorted by hash, a binary search can be performed from a hash value of the left table in the right table and vice versa (depending on whether slow path or fast path merge joins are performed the joining of rows is slightly different, but the method is the same).
For Row-Partitioned Tables, there is also the requirement to join the partition columns, so that partitions can be compared directly (because within each partition, the rows are sorted by hash value).
The Redshift Merge Join
Merge Join in Redshift requires that the DISTKEY of both tables be the same. So there is no difference between the two systems. Accordingly, in both systems, one or both tables may have to be redistributed to a new primary index (Teradata) or DISTKEY (Redshift).
In Redshift, both tables must also have the same SORTKEY. The problem in Redshift is that the sorting of data is not as elegant as in Teradata.
In Teradata, the rows are always automatically sorted in ascending order by Rowhash. In Redshift, the DBA has to take care that the deleted data is finally removed but also that the data is sorted.
If more than 20% of the data is not sorted Redshift does not use a merge join as sorting a huge amount of data is expensive.
Both systems rely on statics when it comes to the creation of join plans. But as you can see Redshift is much more maintenance intensive than Teradata because there are additional things to consider.
Although both Redshift and Teradata offer the Merge Join method, Teradata is more mature and easier to maintain.