
Redshift Or Teradata?

By DWH Pro Admin

January 19, 2020


This blog post will compare the architecture of these two popular database systems.

As you will see, they are not that far apart, and you will find much of what you know about Teradata in Amazon Redshift.

Even though Amazon Redshift always stores data in columns, the similarities are considerable.

Meanwhile, Teradata is also capable of storing data in columns, although it was not originally designed for this purpose and its design has certain disadvantages.

How does the high-level architecture of Amazon Redshift compare to Teradata?

Teradata's nodes, AMPs, BYNET, and Parsing Engine all have a counterpart in Amazon Redshift: the slices.

Slices in Amazon Redshift can be viewed as standalone computers: each slice has its own CPU, memory, and disks. This will remind you of the concept of AMPs and nodes in Teradata.

Like Teradata, where nodes are connected over a physical network (the BYNET), slices are connected over a network. However, one should not forget that the BYNET also performs special tasks, such as merging and sorting data before it is returned to the client; this makes it unique. Amazon Redshift, in contrast, runs on commodity hardware.

How is the data of a table distributed in Amazon Redshift?

The initial distribution of the data is analogous to Teradata. Here, all massively parallel systems are similar: hashing is used.

In Teradata, the primary index is used for this purpose:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100)
) PRIMARY INDEX (CustomerId);

In Amazon Redshift, it does not look much different:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100)
) DISTKEY (CustomerId);

The equivalent of the Teradata Primary Index is therefore the distkey. As we see above, the two DDL statements are almost identical!

Just as the Primary Index identifies the AMP holding a row, the distkey identifies the correct slice whenever the WHERE condition of the SQL statement includes the column defined as the distkey.
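The principle can be sketched in a few lines of Python. This is a conceptual illustration only: the slice count is hypothetical, and md5 merely stands in for whatever hash function the real systems use.

```python
import hashlib

NUM_SLICES = 4  # hypothetical number of slices in the cluster


def slice_for(distkey_value):
    """Map a distkey value to a slice, conceptually like Teradata's
    row hash mapping a primary-index value to an AMP. md5 is only a
    stand-in; neither system's real hash function is implied."""
    digest = hashlib.md5(str(distkey_value).encode()).hexdigest()
    return int(digest, 16) % NUM_SLICES


# Equal key values always land on the same slice, so a query with
# WHERE CustomerId = 42 only needs to visit a single slice.
print(slice_for(42) == slice_for(42))
```

Because the mapping is deterministic, both data placement and single-value lookups use the same computation, in both systems.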

What makes Teradata Columnar different from Redshift Columnar?

Amazon Redshift distributes the data across all slices and then divides it into columns. The columns of a row are assigned to blocks so that they can easily be found together again. Each block contains metadata storing the block's value range; this allows blocks to be skipped entirely if they cannot contain the values you are looking for.

Teradata offers columnar table storage in different ways, but it retrieves the data from the column containers, decompresses it, and reconstructs rows. In a subsequent step, the traditional row-based database engine does its work, operating on the already decompressed data. This shows that Teradata was not originally designed as a columnar database.
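The decompress-and-reconstruct step can be sketched as follows. The container layout, the column names, and the use of zlib/JSON are purely illustrative assumptions, not Teradata's actual storage format.

```python
import json
import zlib

# Hypothetical column containers: each column compressed separately,
# with values stored in row order (illustrative layout only).
containers = {
    "CustomerId":    zlib.compress(json.dumps([1, 2, 3]).encode()),
    "Customer_Name": zlib.compress(json.dumps(["Ann", "Bob", "Carol"]).encode()),
}


def reconstruct_rows(containers):
    """Decompress each column container and zip the columns back into
    rows: after this step, a traditional row-based engine can take
    over, operating on already decompressed data."""
    cols = {name: json.loads(zlib.decompress(blob))
            for name, blob in containers.items()}
    return list(zip(cols["CustomerId"], cols["Customer_Name"]))


print(reconstruct_rows(containers))
```

The key point is the extra work: every qualifying row must be materialized from its column containers before the row engine sees it.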

Is there an equivalent to Teradata Partitioned Primary Index Tables?

That’s what the sortkey is for. The DDL below corresponds to a PPI table with Date Partitions:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100),
OpenDate DATE NULL sortkey
) DISTKEY (CustomerId);

Are there secondary indexes in Amazon Redshift?

No, but let’s be honest:

How often do you use a NUSI or USI in Teradata? And if you do, isn't it always hard to design it so that it is actually used? Statistics must be correct, selectivity must be right, and so on.

Amazon Redshift uses an ingenious method for performance tuning: for each data block, the value range is saved in metadata. This allows Amazon Redshift to restrict the search to the blocks that can match the WHERE condition.
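The pruning idea is easy to sketch. The block contents below are made-up sample data, not Redshift's on-disk format; only the min/max filtering logic is the point.

```python
# Each block's metadata records the min/max of the values it holds;
# blocks whose range cannot match the predicate are skipped without
# ever being read from disk. Illustrative data only.
blocks = [
    {"min": 1,   "max": 100, "rows": list(range(1, 101))},
    {"min": 101, "max": 200, "rows": list(range(101, 201))},
    {"min": 201, "max": 300, "rows": list(range(201, 301))},
]


def blocks_to_read(blocks, value):
    """Return only the blocks whose value range can contain `value`."""
    return [b for b in blocks if b["min"] <= value <= b["max"]]


# A lookup for 150 touches one block instead of all three.
print(len(blocks_to_read(blocks, 150)))
```

This works best when the data is sorted on the filtered column (see the sortkey above), because sorting keeps each block's value range narrow.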

What do Amazon Redshift data blocks look like?

As in Teradata, the size of the data blocks is dynamic. In Amazon Redshift, a data block grows up to a size of one megabyte, and then the data block is divided into two blocks of equal size.
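A minimal sketch of this grow-then-split behavior, with byte counts and the split rule simplified for illustration:

```python
ONE_MB = 1024 * 1024  # the one-megabyte ceiling described above


def append_value(blocks, value, nbytes):
    """Append a value to the last block; once the block exceeds
    ONE_MB, split it into two blocks of (roughly) equal size.
    A conceptual sketch of dynamic block growth, not Redshift code."""
    last = blocks[-1]
    last["values"].append(value)
    last["bytes"] += nbytes
    if last["bytes"] > ONE_MB:
        mid = len(last["values"]) // 2
        half = last["bytes"] // 2
        blocks[-1:] = [
            {"values": last["values"][:mid], "bytes": half},
            {"values": last["values"][mid:], "bytes": last["bytes"] - half},
        ]
    return blocks


blocks = [{"values": [], "bytes": 0}]
for i in range(4):                      # 4 x 300 KB pushes past 1 MB
    append_value(blocks, i, 300 * 1024)
print(len(blocks))
```

After the fourth append the block exceeds one megabyte and is split, leaving two half-full blocks that can keep growing.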

How do joins work in Amazon Redshift?

In this respect, Redshift is not very different from Teradata: The data must be on the same slice to be joined.

If the distkey of both tables is the same, then the data of both tables are already on the same slice.
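Why co-location makes the join local can be sketched as follows. The `orders` table, the stand-in hash, and the four-slice cluster are all hypothetical; the point is that equal distkey values land on the same slice, so each slice can join its own rows independently.

```python
NUM_SLICES = 4  # hypothetical cluster size


def slice_for(key):
    # Any deterministic hash works for the sketch; within one run,
    # equal keys always map to the same slice.
    return hash(str(key)) % NUM_SLICES


customer = [(1, "Ann"), (2, "Bob")]          # (CustomerId, name)
orders = [(1, "order-9"), (2, "order-7")]    # (CustomerId, order id)


def local_join(customer, orders):
    """Group both tables by slice, then join each slice on its own.
    Because both tables share the distkey, matching rows are
    guaranteed to sit on the same slice: no data is redistributed."""
    per_slice = {s: ([], []) for s in range(NUM_SLICES)}
    for row in customer:
        per_slice[slice_for(row[0])][0].append(row)
    for row in orders:
        per_slice[slice_for(row[0])][1].append(row)
    result = []
    for cust, ords in per_slice.values():
        result += [(c, o) for c in cust for o in ords if c[0] == o[0]]
    return result


print(sorted(local_join(customer, orders)))
```

If the two tables had different distkeys, the rows would first have to be copied or redistributed so that join partners meet on the same slice, which is exactly the cost the next paragraphs discuss.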

But how can you prevent data from being copied during the join? Amazon Redshift allows you to copy a table to all slices in advance:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100),
OpenDate DATE NULL sortkey
) DISTSTYLE ALL;

While Teradata may choose this strategy during the join to bring the rows onto a common AMP, this can be pre-defined in Redshift.

However, this does not mean that Redshift generally has a performance advantage here, as it makes little sense to copy huge tables completely to all slices.

Is there an equivalent to Teradata NOPI Tables in Amazon Redshift?

Yes, data can be distributed evenly across all slices in a round-robin fashion, without distkey hashing:

CREATE TABLE Customer
(
CustomerId INTEGER NOT NULL,
Customer_Name CHAR(100),
OpenDate DATE NULL sortkey
) DISTSTYLE EVEN;

Just as Teradata must instruct all AMPs to search for data in NOPI tables, all slices must be searched when using DISTSTYLE EVEN.
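The trade-off is easy to demonstrate: round-robin placement balances the load perfectly, but there is no way to compute which slice holds a given key afterwards. The slice count and rows below are hypothetical.

```python
NUM_SLICES = 4  # hypothetical cluster size


def distribute_even(rows):
    """Round-robin distribution in the spirit of DISTSTYLE EVEN:
    no hash is involved, so row placement ignores the key values."""
    slices = [[] for _ in range(NUM_SLICES)]
    for i, row in enumerate(rows):
        slices[i % NUM_SLICES].append(row)
    return slices


def lookup(slices, customer_id):
    """Without a distkey, every slice must be searched."""
    return [row for s in slices for row in s if row[0] == customer_id]


slices = distribute_even([(i, "Customer-%d" % i) for i in range(8)])
print(lookup(slices, 5))
```

Every slice ends up with the same number of rows, but even a single-key lookup has to visit all of them.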

How does workload management differ between Teradata and Redshift?

Until 2019, Redshift workload management could only be described as minimalistic.

It was controlled exclusively via manually managed queues. Queries were assigned to queues, and each queue could only execute a certain number of simultaneous queries. If the limit was reached, a query had to wait until a slot became free.

Each queue also had a portion of the main memory reserved, divided by the number of concurrent queries it allowed. Simple, but inefficient for maintaining a scaling system.
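The manual model described above can be sketched in a few lines. The slot and memory numbers are invented for illustration; only the fixed-slots, fixed-memory-share behavior is the point.

```python
class Queue:
    """Sketch of the pre-2019 manual WLM model: a queue with a fixed
    number of slots and a fixed memory share, split evenly per slot.
    Memory reserved for an idle queue is simply wasted."""

    def __init__(self, slots, memory_mb):
        self.slots = slots
        self.memory_per_slot = memory_mb // slots
        self.running = 0
        self.waiting = []

    def submit(self, query):
        if self.running < self.slots:
            self.running += 1          # runs with memory_per_slot
            return "running"
        self.waiting.append(query)     # must wait for a free slot
        return "queued"

    def finish(self):
        self.running -= 1
        if self.waiting:               # a waiting query takes the slot
            self.waiting.pop(0)
            self.running += 1


q = Queue(slots=2, memory_mb=1000)     # hypothetical numbers
print(q.submit("q1"), q.submit("q2"), q.submit("q3"))
```

The third query queues even if the rest of the cluster is idle, which is exactly the rigidity the automatic workload management addresses.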

Since the end of 2019, workload management has been extended considerably.

The essential addition is the new automatic workload management with queue prioritization. Query prioritization is a key part of good workload management and necessary to meet service level agreements.

Queues are now assigned a priority. Queries in higher-priority queues spend less time waiting than queries in lower-priority queues and receive more resources (I/Os and CPU seconds).
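The dispatch order this produces can be sketched with a priority heap. The query names and priority values are invented; the sketch only shows that higher-priority work leaves the wait queue first.

```python
import heapq

# Lower number = higher priority (hypothetical convention).
wait = []
heapq.heappush(wait, (2, "reporting query"))
heapq.heappush(wait, (1, "dashboard query"))
heapq.heappush(wait, (3, "batch query"))

# Queries are dispatched in priority order, not arrival order.
order = [heapq.heappop(wait)[1] for _ in range(len(wait))]
print(order)
```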

For Teradata people, this is nothing new; it was available even before TASM (and still is today on the appliance systems).

A major new feature is concurrency scaling, with which Redshift tries to improve the scalability of the system (which has always been one of its weaknesses).

The big disadvantage of MPP systems in the cloud is that compute and storage only scale together. Of course, you could use more nodes per cluster to cover peaks, but this comes with permanent costs and is no different from the problem of an on-premise system that is idle part of the time yet cannot be dimensioned smaller without failing to cover peak loads.

Concurrency scaling takes a different approach: one or more additional clusters of the same size as the original cluster are started in the background and take over part of the queries.

This way, peaks can be covered. The advantage is that costs are only incurred for as long as the additional cluster is provisioned. However, only read queries can be executed on these parallel clusters.

Finally, there is the “short query acceleration” feature, which moves short queries forward in the queue. It is probably best compared to Teradata's tactical workload.

In short, Teradata workload management is much more sophisticated and offers workload groups, throttles, queues, dynamic reordering of queries, etc. I don't want to go into all the details here, as that would be too extensive. You can find all the details here:
Teradata Workload Management with TASM

A real weakness of Redshift is that all available main memory is statically distributed among the queues, so memory is wasted whenever a queue is not being used. In general, Redshift makes it much more complex to optimize workload management; Teradata is simply much better at this.

Amazon Redshift is still a long way off but is catching up, using artificial intelligence for query prioritization.

Space Management on Redshift?

Unfortunately, this is really necessary. Rows affected by UPDATE or DELETE statements are only marked as deleted and remain on disk for the time being.

The VACUUM DELETE statement (for example, VACUUM DELETE ONLY Customer;) must be executed to remove these rows permanently.

This may seem strange to Teradata users (except those using Teradata columnar tables, where deleted rows are not cleaned up at all unless the table is dropped entirely), but that is how it is.

As you have seen, the similarities between Teradata and Amazon Redshift are considerable. This does not mean there are no big differences in the implementation.

But for you as a user, it will not be too difficult to switch to Amazon Redshift and understand how the system works.



Roland Wenzlofsky is an experienced freelance Teradata consultant and performance trainer. Born in Austria's capital, Vienna, he has been building and tuning some of the largest Teradata data warehouses in the European financial and telecommunications sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. He therefore knows like no other the problems and obstacles that make many data warehouse projects fail, and all the tricks and tips that will help you succeed.
