This post contrasts the architecture of two widely used database systems: Teradata and Amazon Redshift.

The similarities between Teradata and Amazon Redshift are notable, as much of your knowledge about Teradata can be applied to Amazon Redshift.

Although Amazon Redshift stores data exclusively in columnar format, the similarities remain significant.

Teradata can store data in columns, though it was not originally intended for this and has some inherent design drawbacks.

How does the high-level architecture of Amazon Redshift compare to Teradata?

For Teradata's nodes, AMPs, BYNET, and Parsing Engine, there is a corresponding counterpart in Amazon Redshift: the slices.

Slices in Amazon Redshift can be viewed as standalone computers. Each slice has its own CPU, memory, and disks. This will remind you of the concept of AMPs and nodes in Teradata.

As in Teradata, where the nodes are connected over a dedicated physical network (the BYNET), the slices are connected over a network. However, one should not forget that the BYNET performs particular tasks, such as merging and sorting data before it is returned to the client. This makes the BYNET unique; Amazon Redshift, in contrast, runs on commodity hardware.

How is the data of a table distributed in Amazon Redshift?

The initial distribution of the data is analogous to Teradata. In this respect, all massively parallel systems are similar: hashing is used.

In Teradata, the primary index is used for this purpose:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100)
) PRIMARY INDEX (CustomerId);

In Amazon Redshift, it does not look much different:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100)
) DISTKEY (CustomerId);

Therefore, the equivalent of the Primary Index in Teradata is the DISTKEY. As we can see above, the two DDL statements are almost identical!

Just as the Primary Index identifies the AMP holding a row, the DISTKEY determines the slice on which a row is stored, and the correct slice can be found directly if the WHERE condition of the SQL statement includes the DISTKEY column.
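A minimal sketch of this access pattern, using the Customer table defined above (the customer number is only an illustrative value):

SELECT CustomerId, Customer_Name
FROM Customer
WHERE CustomerId = 4711;
-- Teradata: the hash of the Primary Index identifies the single AMP holding this row.
-- Redshift: the DISTKEY hash places this row on exactly one slice.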

What makes Teradata Columnar different from Redshift Columnar?

Amazon Redshift distributes the rows across all slices and splits them into columns. The column values of a row are assigned to blocks so that they can easily be found and put back together. Each block contains metadata that stores the value range of the block, which makes it possible to skip blocks that cannot contain the values you are looking for.

Teradata offers columnar table storage in different flavors, but it retrieves the data from the column containers, decompresses it, and reconstructs the rows. The traditional row-based database engine then operates on the already decompressed and reconstructed data. This shows that Teradata was not initially designed as a columnar database.

Is there an equivalent to Teradata Partitioned Primary Index Tables?

That’s what the SORTKEY is for. The DDL below roughly corresponds to a PPI table with date partitions:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100),
OpenDate DATE NULL sortkey
) DISTKEY (CustomerId);
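For comparison, a roughly equivalent Teradata PPI definition could look like this (the partition range and interval are only illustrative assumptions):

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100),
OpenDate DATE
) PRIMARY INDEX (CustomerId)
PARTITION BY RANGE_N (OpenDate BETWEEN DATE '2020-01-01' AND DATE '2024-12-31' EACH INTERVAL '1' MONTH);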

Are there secondary indexes in Amazon Redshift?

No, but let’s be honest:

How often do you use a NUSI or USI in Teradata? And if you do, isn’t it always hard to design it so that it is actually used? Statistics must be correct, selectivity must be suitable, etc.

Amazon Redshift uses an ingenious method for performance tuning: the value range of each data block is saved in its metadata. This allows Amazon Redshift to restrict the scan to the blocks that can match the WHERE condition.
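A sketch of a query that benefits from this, assuming the Customer table with OpenDate as SORTKEY shown above (the date range is only an example):

SELECT CustomerId, Customer_Name
FROM Customer
WHERE OpenDate BETWEEN '2023-01-01' AND '2023-01-31';
-- Because the table is sorted on OpenDate, Redshift can skip every block
-- whose stored min/max range for OpenDate does not overlap the filter,
-- much like partition elimination on a Teradata PPI table.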

What do Amazon Redshift data blocks look like?

As in Teradata, the size of the data blocks is dynamic. In Amazon Redshift, a data block grows up to one megabyte, and then the data block is divided into two blocks of equal size.

How do joins work in Amazon Redshift?

In this respect, Redshift is not very different from Teradata: The data must be on the same slice to be joined.

If both tables are distributed on the same DISTKEY column and this column is used in the join condition, the matching rows of both tables are already on the same slice.
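A sketch of such a co-located join, assuming a hypothetical Sales table that is distributed on the same column as Customer:

CREATE TABLE Sales
(
SalesId INTEGER NOT NULL,
CustomerId INTEGER NOT NULL,
Amount DECIMAL(18,2)
) DISTKEY (CustomerId);

SELECT c.Customer_Name, SUM(s.Amount)
FROM Customer c
JOIN Sales s ON c.CustomerId = s.CustomerId
GROUP BY c.Customer_Name;
-- Both tables are hashed on CustomerId, so the matching rows are already
-- on the same slice and no rows have to be redistributed for the join.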

But how can you prevent data from being redistributed at join time? Amazon Redshift allows you to copy a table to every compute node in advance:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100),
OpenDate DATE NULL sortkey
) DISTSTYLE ALL;

While Teradata may choose this strategy at join time to bring the rows onto a common AMP, in Redshift it can be pre-defined.

However, this is not a general performance advantage for Redshift, as it rarely makes sense to copy huge tables completely to every node.
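A sketch of the typical use case, assuming a large hypothetical Orders table distributed on its own key and joined to the ALL-distributed Customer table:

CREATE TABLE Orders
(
OrderId INTEGER NOT NULL,
CustomerId INTEGER NOT NULL,
OrderDate DATE
) DISTKEY (OrderId);

SELECT c.Customer_Name, COUNT(*)
FROM Orders o
JOIN Customer c ON o.CustomerId = c.CustomerId
GROUP BY c.Customer_Name;
-- Customer already exists as a full copy on every compute node (DISTSTYLE ALL),
-- so the large Orders table does not have to be redistributed for the join.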

Is there an equivalent to Teradata NOPI Tables in Amazon Redshift?

Yes, we can evenly and randomly distribute data across all slices without distkey hashing:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100),
OpenDate DATE NULL sortkey
) DISTSTYLE EVEN;

Just as Teradata must instruct all AMPs to search for data in NOPI tables, all slices must be searched when using DISTSTYLE EVEN.

How does workload management differ between Teradata and Redshift?

Until 2019, Redshift workload management could only be described as minimalistic.

It was controlled exclusively via manually managed queues. Queries were assigned to queues, and each queue could only execute a certain number of queries simultaneously. If the limit was reached, a query had to wait until a slot became free.

Each queue also had a share of the main memory reserved, divided among its query slots. Simple, but inefficient for maintaining a scaling system.

Since the end of 2019, workload management has been extended considerably.

The essential addition is the new automatic workload management with queue prioritization. Query prioritization is essential for good workload management and for meeting service-level agreements.

Queues are now assigned a priority. Queries in higher-priority queues spend less time waiting in the queue than queries in lower-priority queues and receive more resources (I/Os and CPU seconds).

For Teradata people, this is nothing new; it was available even before TASM (and is still used on the appliance systems today).

A significant new feature is concurrency scaling, with which Redshift tries to improve the system’s scalability (which was always one of its weaknesses).

The disadvantage of MPP systems in the cloud is that compute and storage only scale together. Of course, you could use more nodes per cluster to cover peaks, but this incurs permanent costs and is no different from the problem of an on-premises system that is idle part of the time yet cannot be dimensioned smaller because it must cover peak loads.

Concurrency scaling takes a different approach: one or more additional clusters (of the same size as the original cluster) are started in the background and take over a part of the queries.

This way, peaks can be covered. The advantage is that costs are only incurred while the additional clusters are provisioned. However, only read queries can be executed on these parallel clusters.

Finally, the “short query acceleration” feature moves short queries forward in the queue. We could most likely compare this to Teradata’s tactical workload.

In short, Teradata workload management is much more sophisticated and offers workload groups, throttles, queues, dynamic reordering of queries, etc. I don’t want to go into all the details here because that would be too extensive. You can find all the details here:
Teradata Workload Management with TASM

A fundamental weakness of Redshift is that the available main memory is statically distributed among the queues: memory is wasted when a queue is not being used. In general, optimizing workload management in Redshift is much more complex; Teradata is much better at this.

Amazon Redshift is still far behind but is catching up, for example by using artificial intelligence for query prioritization.

What about space management in Redshift?

Unfortunately, space management is necessary. Rows logically deleted by UPDATE or DELETE statements are not physically removed right away.

We must execute a VACUUM (for example, VACUUM DELETE ONLY) to remove deleted rows permanently.
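A minimal example for the Customer table used above:

VACUUM DELETE ONLY Customer; -- reclaims the space of deleted rows
VACUUM FULL Customer;        -- reclaims space and re-sorts the table along its SORTKEY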

This may seem strange to Teradata users (except those using Teradata columnar tables, where deleted rows are likewise not cleaned up unless the table is completely deleted), but that is how it works.

As you have seen, the similarities between Teradata and Amazon Redshift are remarkable. This does not mean that there are no significant differences in the implementation.

But for you as a user, it will not be too difficult to switch to Amazon Redshift and understand how the system works.
