Teradata and Redshift share similar architectures and data distribution methods. Teradata’s AMPs store portions of table data, while Redshift utilizes slices.
There are notable variations in the way data is stored on file systems.
Teradata can function as a Column Store, which can be determined per-table basis. However, the primary advantage lies in the superior performance achieved by storing tables in the standard row format without switching to columnar tables and their associated drawbacks in Teradata (which will be discussed in a future blog post).
We will not delve into Teradata Columnar Storage in this comparison of join strategies, as Teradata converts the columns of a columnar table into rows before processing the data (early materialization). As a result, there is no difference in the processing of the joins, even if only the necessary columns are chosen for columnar tables.
Merge Join Preparation: Co-Locating Rows
Teradata and Redshift utilize hashing to distribute data evenly among parallel units, such as Teradata AMPs and Redshift Slices.
Teradata dictates that rows must be on the same AMP to be joined.
Redshift also necessitates data on the same slice, resulting in little disparity.
Preparation in Redshift resembles that of Teradata. Broadcasting in Redshift is the equivalent of copying the entire table to all AMPs in Teradata.
Teradata and Redshift both use hashing to redistribute data based on join columns. However, in Redshift, a new DISTKEY is established before distribution.
In Teradata, it is possible that table redistribution may not be required if both tables share the same DISTKEY. Alternatively, both tables may require redistribution.
When copying or distributing tables, Redshift operates similarly to Teradata.
Redshift and Teradata vary in their approach to storing table copies on slices. While Redshift has the capability to do so permanently, it is only suitable for smaller tables. Thus, I do not believe this feature provides a significant performance advantage over Teradata.
Comparison Of Join Methods
The Teradata Merge Join
Both tables must possess identical primary indices to fulfill the data requirements.
For non-row-partitioned tables, rows are sorted by hash, enabling a binary search to occur from a hash value of the left table in the right table and vice versa. The joining of rows differs slightly depending on whether slow-path or fast-path merge joins are performed, but the method remains the same.
To compare partitions directly in Row-Partitioned Tables, it is necessary to join the partition columns, as the rows within each partition are sorted by hash value.
The Redshift Merge Join
For Merge Join in Redshift, the tables must have matching DISTKEYs, making the two systems interchangeable. Therefore, either one or both tables may need to be redistributed to a new primary index (Teradata) or DISTKEY (Redshift) in both systems.
Redshift requires both tables to have identical SORTKEYs. However, data sorting in Redshift is less sophisticated than in Teradata.
Teradata automatically sorts rows in ascending order by Rowhash, whereas in Redshift, the DBA must guarantee the deletion and sorting of data.
Redshift doesn’t employ a merge join if over 20% of the data is unsorted since sorting large amounts is costly.
Both systems utilize statics for creating join plans. However, Redshift requires more maintenance due to the need to consider additional factors compared to Teradata.
Teradata is a more mature and easier-to-maintain option for the Merge Join method.