Amazon Redshift Or Teradata?

DWH Pro Admin

January 19, 2020


This blog post will compare the architecture of these two popular database systems.

As you will see, they are not that far apart, and you will find much of what you know about Teradata in Amazon Redshift.

Even though Amazon Redshift stores its data permanently in columns, the similarities are significant.

Meanwhile, Teradata can also store data in columns, although it was not initially designed for this purpose and its design has certain disadvantages here.

How does the high-level architecture of Amazon Redshift compare to Teradata?

Teradata's nodes, AMPs, BYNET, and Parsing Engine all have a corresponding counterpart in Amazon Redshift: the slices.

Slices in Amazon Redshift can be viewed as standalone computers. Each slice has its own CPU, memory, and disks. This will remind you of the concept of AMPs and nodes in Teradata.

Like Teradata, where nodes are connected over a physical network (the BYNET), slices are connected over a network. However, one should not forget that the BYNET performs special tasks, such as merging and sorting data before it is returned to the client. This makes the BYNET unique. Amazon Redshift, in contrast, runs on commodity hardware.

How is the data of a table distributed in Amazon Redshift?

The initial distribution of the data is analogous to Teradata. Here, all massively parallel systems are similar: hashing is used.
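To make the idea concrete, here is a minimal sketch of hash distribution; the hash function, slice count, and names are hypothetical, not the actual Teradata or Redshift implementation:

```python
import hashlib

NUM_SLICES = 8  # hypothetical cluster with 8 slices (or AMPs)

def slice_for(dist_key_value):
    """Hash the distribution key value and map it to one parallel unit."""
    digest = hashlib.md5(str(dist_key_value).encode()).hexdigest()
    return int(digest, 16) % NUM_SLICES

# The same key value always lands on the same slice:
assert slice_for(4711) == slice_for(4711)

# Many distinct key values spread across all slices:
used_slices = {slice_for(customer_id) for customer_id in range(10_000)}
print(sorted(used_slices))  # all 8 slices receive data
```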

In Teradata, the primary index is used for this purpose, for example:

CREATE TABLE Customer
(
CustomerId INTEGER,
Customer_Name CHAR(100)
) PRIMARY INDEX (CustomerId);

In Amazon Redshift, it does not look much different:

CREATE TABLE Customer
(
CustomerId INTEGER,
Customer_Name CHAR(100)
) DISTKEY (CustomerId);

Therefore, the equivalent of the Primary Index in Teradata is the DISTKEY. As we see above, the two DDL statements are almost identical!

Just as the Primary Index identifies the AMP holding a row, the DISTKEY is used to determine the correct slice if the WHERE condition of the SQL statement includes the column defined as the DISTKEY.

What makes Teradata Columnar different from Redshift Columnar?

Amazon Redshift distributes the data across all slices and divides it into columns. The columns of a row are assigned to blocks so they can easily be found together again. Each block contains metadata that stores the value range of the block. This helps to avoid reading blocks that do not include the values you are looking for.
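The block metadata mechanism (often called a zone map) can be sketched as follows; the block layout and field names are hypothetical:

```python
# Each block records only the minimum and maximum value it contains.
blocks = [
    {"min": 1,   "max": 100},
    {"min": 101, "max": 200},
    {"min": 201, "max": 300},
]

def blocks_to_read(blocks, search_value):
    """Skip every block whose value range cannot contain the search value."""
    return [b for b in blocks if b["min"] <= search_value <= b["max"]]

# A query for the value 150 only has to read the second block:
matching = blocks_to_read(blocks, 150)
print(matching)  # [{'min': 101, 'max': 200}]
```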

Teradata offers columnar table storage in different ways, but it retrieves the data from the column containers, decompresses it, and reconstructs rows. In a subsequent step, the traditional row-based database engine operates on the already decompressed data. This shows that Teradata was not initially designed as a columnar database.

Is there an equivalent to Teradata Partitioned Primary Index Tables?

That’s what the SORTKEY is for. The DDL below corresponds to a PPI table with date partitions:

CREATE TABLE Customer
(
CustomerId INTEGER,
Customer_Name CHAR(100),
OpenDate DATE NULL SORTKEY
) DISTKEY (CustomerId);
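The effect is comparable to partition elimination: because the table is stored sorted by OpenDate, each block covers a narrow, non-overlapping date range, so a date predicate skips most blocks. A minimal sketch, with hypothetical block sizes and dates:

```python
from datetime import date

# OpenDate values, stored in sorted order (the SORTKEY effect)
rows = [date(2020, 1, day) for day in range(1, 31)]

BLOCK_ROWS = 10  # hypothetical rows per block
blocks = [rows[i:i + BLOCK_ROWS] for i in range(0, len(rows), BLOCK_ROWS)]

# Sorted storage gives each block a tight, non-overlapping min/max range:
zone_map = [(b[0], b[-1]) for b in blocks]

def matching_blocks(zone_map, day):
    """Return the indexes of the blocks whose date range contains the day."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= day <= hi]

print(matching_blocks(zone_map, date(2020, 1, 15)))  # [1]: one block scanned
```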

Are there secondary indexes in Amazon Redshift?

No, but let’s be honest:

How often do you use a NUSI or USI in Teradata? And if you do, isn't it always hard to design it so that it is actually used? Statistics must be correct, selectivity must be suitable, etc.

Amazon Redshift uses an ingenious method for performance tuning: the value range of each data block is saved in its metadata. This allows Amazon Redshift to restrict the search to blocks that match the WHERE condition.

What do Amazon Redshift data blocks look like?

As in Teradata, the size of the data blocks is dynamic. In Amazon Redshift, a data block grows up to one megabyte, and then the data block is divided into two blocks of equal size.
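The described grow-and-split behavior can be sketched like this (sizes in bytes; the bookkeeping is hypothetical):

```python
BLOCK_LIMIT = 1_048_576  # one megabyte, as described above

def append(blocks, row_bytes):
    """Grow the last block; split it into two equal halves past the limit."""
    blocks[-1] += row_bytes
    if blocks[-1] > BLOCK_LIMIT:
        full = blocks.pop()
        blocks.extend([full // 2, full - full // 2])
    return blocks

blocks = [0]
for _ in range(5):
    append(blocks, 300_000)  # five inserts of 300 KB each
print(blocks)  # [600000, 900000]: the overflowing block was split in two
```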

How do joins work in Amazon Redshift?

In this respect, Redshift is not very different from Teradata: The data must be on the same slice to be joined.

If the distkey of both tables is the same, then the data of both tables are already on the same slice.
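A short sketch of why this works; both tables apply the same deterministic hash to the DISTKEY value (the modulo hash and the slice count are simplifications, not the real implementation):

```python
NUM_SLICES = 8  # hypothetical

def slice_for(dist_key):
    """The same hash function is applied to every table's DISTKEY."""
    return dist_key % NUM_SLICES  # simplistic stand-in for the real hash

customer_row = {"CustomerId": 4711, "Customer_Name": "Smith"}
order_row = {"CustomerId": 4711, "Amount": 99.90}

# Matching rows land on the same slice, so the join needs no data movement:
assert slice_for(customer_row["CustomerId"]) == slice_for(order_row["CustomerId"])
print("co-located")
```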

But how can you prevent data from being copied during the join? Amazon Redshift allows you to copy a table to all slices in advance:

CREATE TABLE Customer
(
CustomerId INTEGER,
Customer_Name CHAR(100),
OpenDate DATE NULL SORTKEY
) DISTSTYLE ALL;

While Teradata may choose this strategy during join planning to bring the rows onto a common AMP, in Redshift it can be pre-defined.

However, this does not mean that this is generally a performance advantage for Redshift, as it will not make much sense to copy huge tables completely to all slices.

Is there an equivalent to Teradata NOPI Tables in Amazon Redshift?

Yes, we can distribute data evenly and randomly across all slices without distkey hashing:

CREATE TABLE Customer
(
CustomerId INTEGER,
Customer_Name CHAR(100),
OpenDate DATE NULL SORTKEY
) DISTSTYLE EVEN;

Just as Teradata must instruct all AMPs to search for data in NOPI tables, all slices must be searched when using DISTSTYLE EVEN.

How does workload management differ between Teradata and Redshift?

Until 2019, Redshift workload management could only be described as minimalistic.

The system was controlled exclusively via manually managed queues. Queries were arranged in queues, and each queue could only execute a certain number of simultaneous queries. If the limit was reached, a query had to wait until a slot became free.

Each queue also had a part of the main memory reserved, divided by the number of query slots. Simple, but inefficient to maintain in a scaling system.
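The old mechanism can be approximated in a few lines; slot counts and names are hypothetical:

```python
from collections import deque

class WlmQueue:
    """Minimal sketch of a manual WLM queue: fixed slots, FIFO waiting."""

    def __init__(self, slots):
        self.slots = slots
        self.running = 0
        self.waiting = deque()

    def submit(self, query):
        """Run the query if a slot is free, otherwise queue it."""
        if self.running < self.slots:
            self.running += 1
            return "running"
        self.waiting.append(query)
        return "queued"

    def finish(self):
        """A query finished: free its slot and admit the next waiting query."""
        self.running -= 1
        if self.waiting:
            self.waiting.popleft()
            self.running += 1

q = WlmQueue(slots=2)
print(q.submit("q1"), q.submit("q2"), q.submit("q3"))  # running running queued
q.finish()  # q1 finishes, q3 gets the freed slot
print(q.running, len(q.waiting))  # 2 0
```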

Since the end of 2019, workload management has been extended considerably.

The key addition is the new automatic workload management with queue prioritization.
Query prioritization is essential for good workload management and for meeting service-level agreements.

Queues are now assigned a priority. Queries in queues with higher priority spend less time waiting than queries with lower priority and receive more resources (I/Os and CPU seconds).
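With priorities, admission is no longer first-come-first-served. A minimal sketch using a priority heap (priorities and query names are hypothetical):

```python
import heapq

waiting = []  # (priority, arrival_order, query); a lower number means higher priority
arrivals = [(3, "etl_load"), (1, "dashboard"), (2, "adhoc")]
for order, (prio, query) in enumerate(arrivals):
    heapq.heappush(waiting, (prio, order, query))

# The highest-priority query leaves the queue first, regardless of arrival order:
drain_order = [heapq.heappop(waiting)[2] for _ in range(len(waiting))]
print(drain_order)  # ['dashboard', 'adhoc', 'etl_load']
```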

For Teradata people, this is nothing new; it was available even before TASM (and still is on some systems today).

A significant new feature is concurrency scaling, with which Redshift tries to improve the system's scalability (which was always a weakness of Redshift).

The disadvantage of MPP systems in the cloud is that compute and storage scale only together. Of course, you could use more nodes per cluster to cover peaks. But this incurs permanent costs and does not differ from the problem of an on-premises system, which is idle part of the time but cannot be dimensioned smaller than its peak load.

Concurrency scaling takes a different approach: one (or more) additional clusters of the same size as the original cluster are started in the background and take over a part of the queries.

This way, peaks can be covered. The advantage is that costs are only incurred while the additional cluster is provisioned. However, only read queries can be executed on these parallel clusters.

Finally, there is the feature "short query acceleration," which moves short queries to the front of the queue. Most likely, we could compare this to Teradata's tactical workload.

In short, Teradata workload management is much more sophisticated and offers workload groups, throttles, queues, dynamic reordering of queries, etc. I don't want to go into all the details here because that would be too extensive. You can find all the details here:
Teradata Workload Management with TASM

A fundamental weakness of Redshift is that all available main memory is statically distributed among the queues. Memory can be wasted when a queue is not being used. In general, Redshift makes it much more complex to optimize workload management. Teradata is much better at this.

Amazon Redshift is still a long way off but is catching up by using artificial intelligence for query prioritization.

Is Space Management Necessary on Redshift?

Unfortunately, it is necessary. Rows affected by UPDATE or DELETE statements are only marked as deleted and remain physically stored for the time being.

We must execute a VACUUM (for example, VACUUM DELETE ONLY) to remove these rows permanently.
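Conceptually, a DELETE only flags rows, and VACUUM later rewrites the table without the flagged rows. A sketch with hypothetical structures (not Redshift's actual storage format):

```python
table = [
    {"CustomerId": 1, "deleted": False},
    {"CustomerId": 2, "deleted": False},
    {"CustomerId": 3, "deleted": False},
]

def delete_row(table, customer_id):
    """A DELETE only marks the row; its space is not reclaimed yet."""
    for row in table:
        if row["CustomerId"] == customer_id:
            row["deleted"] = True

def vacuum(table):
    """VACUUM physically removes the rows marked as deleted."""
    return [row for row in table if not row["deleted"]]

delete_row(table, 2)
print(len(table))  # 3: the deleted row still occupies space
table = vacuum(table)
print(len(table))  # 2: the space is reclaimed
```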

This may seem strange to Teradata users (except those using Teradata columnar tables, where deleted rows are likewise not cleaned up unless the table is completely deleted), but that is how it is.

As you have seen, the similarities between Teradata and Amazon Redshift are remarkable. This does not mean that there are no significant differences in the implementation.

But for you as a user, it will not be too difficult to switch to Amazon Redshift and understand how the system works.

  • sasanka ghosh says:

    Excellent, same thinking. People like me are satisfied when our thoughts match those of people like you.
    One question: if Oracle removed GCS for its data warehouse appliance and applied a similar exclusive lock, it would be the best of the best. What is your opinion?

    • sasanka ghosh says:

      I was talking about Exadata or Oracle Autonomous in the cloud.
