
Future And Past? Snowflake Versus Teradata

In this article, I will explain the main architectural differences between Teradata and Snowflake.

It is not meant to be a complete comparison of all features, advantages, and disadvantages; it is limited to the differences from an architectural point of view.

Snowflake vs. Teradata Architecture

How is the data distributed in Snowflake?

Snowflake does not use a shared-nothing architecture; instead, its computing resources access shared data.

Therefore, there is no concept comparable to the Primary Index: all computing resources can access all of the data.
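This difference is visible directly in the DDL. A minimal sketch (table and column names are made up for illustration):

-- Teradata: the Primary Index determines how rows are distributed across the AMPs
CREATE TABLE Customer
(
    CustomerId INTEGER,
    LastName   VARCHAR(100)
) PRIMARY INDEX (CustomerId);

-- Snowflake: no distribution clause at all; every virtual warehouse sees the shared data
CREATE TABLE Customer
(
    CustomerId INTEGER,
    LastName   VARCHAR(100)
);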

What happens if the system size has to be increased?

This is the advantage of cloud architectures such as Snowflake: It simply scales.

On a Teradata system, an upgrade always means restructuring. When AMPs are added, tables must be redistributed to ensure optimal parallelism.
A new hashmap containing the additional AMPs is needed. To be fair, it must be said that since Teradata 16 this restructuring can be deferred by using several hashmaps.
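In Snowflake, by contrast, scaling compute up or down is a single statement; a sketch with a hypothetical warehouse name:

-- resize an existing virtual warehouse; no data has to be redistributed
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';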

A big advantage over an almost full Teradata system arises if you want to save space, e.g., with multi-value compression (MVC). The paradox: on Teradata, you first need a lot of free space in order to save space. In Snowflake, the data is always compressed, and if no more space is available, new space can be added at short notice.

What about indexes?

There is no such thing as a secondary index or join index in Snowflake.

What about workload management?

Snowflake relies on the concept of virtual warehouses to separate workloads. Snowflake claims there is no need for workload management, but a look at Teradata shows that it makes sense to have both.
Teradata offers truly sophisticated workload management (TASM) as well as the ability to partition the system. A virtual partition can use the full resources (CPU) if they are not needed by another partition. A big advantage.
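In Snowflake, workload separation comes down to giving each user group its own warehouse. A minimal sketch with hypothetical names and settings:

-- ETL and reporting run on separate compute clusters over the same shared data
CREATE WAREHOUSE etl_wh    WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE report_wh WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;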

What is a disadvantage of the Teradata architecture compared to Snowflake?

Scaling is not as flexible as with a cloud database in which compute nodes and data can scale independently.

In Teradata, computing power and data scale together. This ensures that IO (usually the bottleneck) is always available in sufficient quantity.

But the disadvantage is that more storage cannot simply be added without redistributing the data. Modern database systems run as much as possible in main memory. Teradata, on the other hand, produces spool files in each step of the execution plan, which often have to be stored temporarily on mass storage.

How are statistics collected?

In contrast to Teradata, Snowflake collects the required statistics on its own, without the user having to do anything.
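For comparison, a sketch of what the user has to do on Teradata (table and column names are illustrative):

-- Teradata: statistics are defined and refreshed explicitly by the user
COLLECT STATISTICS COLUMN (CustomerId) ON Customer;
-- Snowflake: no equivalent statement exists; the metadata is maintained automatically during loading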

How does Snowflake access data?

Tables are stored in so-called micro partitions (max. 500 MB in size). Within each micro partition, the data is stored in columnar form (the individual column values are compressed).

Each micro partition has metadata. The metadata stores the offset of each column as well as its range of values. Access works by looking up in the metadata which micro partitions are needed for a query and where the data of the required columns is located within them.

This is completely different from the hashing used in Teradata. As you can see, Snowflake is a pure column store. The closest analog to Teradata's row-level partitioning is the sorting of data within micro partitions in Snowflake.
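In practice, this means a selective filter can skip most micro partitions entirely. A sketch with made-up table and column names:

-- only micro partitions whose metadata value range can contain this date are read;
-- all other micro partitions are pruned without any IO
SELECT CustomerId, OrderAmount
FROM Orders
WHERE OrderDate = DATE '2023-05-01';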

What does SQL Tuning look like in Snowflake?

The tuning options in Snowflake are limited compared to Teradata. However, the numerous options Teradata offers are not necessary in Snowflake because of its architecture.

There is no workload management comparable to Teradata (TASM) in Snowflake.

Workload tuning is done by defining virtual warehouses, which are, in principle, nothing more than independent compute clusters accessing the same data.

By forming and sizing these clusters appropriately, performance can be optimized at a high level.

Snowflake uses intensive disk-level caching. The disk cache is a combination of SSD and main memory. When creating virtual warehouses, it is advisable to assign users who execute similar queries to the same virtual warehouse. Then the disk cache can be used optimally.

Snowflake also offers a result set cache, which is available to every user independent of the virtual warehouse.

This cache holds the result sets of all queries of the last 24 hours. Each time a cached result is reused, the 24-hour counter starts from zero again. This is far more extensive than the caching of query plans in Teradata.
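A quick way to see the result cache in action (the table name is illustrative; USE_CACHED_RESULT is the session parameter that controls it):

SELECT COUNT(*) FROM Orders;  -- first run: executed on the virtual warehouse
SELECT COUNT(*) FROM Orders;  -- identical re-run within 24 hours: answered from the result cache

-- for benchmarking, the result cache can be switched off per session
ALTER SESSION SET USE_CACHED_RESULT = FALSE;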

Teradata offers a block-level cache with Teradata Intelligent Memory. The cache is available to all users of the data warehouse, but it is purged frequently and is temperature-based. It is not a result set cache; only data blocks are cached.

It may be tempting to suspend a Snowflake virtual warehouse as soon as possible to save costs, but the disk cache is lost when the warehouse is suspended. Here, cost must be weighed against performance. This is not an issue with Teradata; the system is available 24/7.
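This trade-off is controlled via the suspend timeout; a sketch with a hypothetical warehouse name:

-- a longer AUTO_SUSPEND keeps the warehouse and its disk cache alive longer,
-- trading credits for cache hits
ALTER WAREHOUSE report_wh SET AUTO_SUSPEND = 600;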

Snowflake does not use indexes or user-defined partitioning. However, it is possible to define a cluster key to co-locate specific column values within one or a few micro partitions:

ALTER TABLE Customer CLUSTER BY (CustomerId);
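How well the table is clustered afterward can be checked with a system function:

SELECT SYSTEM$CLUSTERING_INFORMATION('Customer', '(CustomerId)');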

How are user sessions handled?

As a counterpart to Teradata's parsing engine, which handles sessions and delivers result sets to the client, Snowflake has a service layer.

This manages sessions and everything that goes with them (queries, transactions, result sets, etc.) and also maintains a cache of the results of executed queries.

What about Referential Integrity?

Snowflake only offers what Teradata calls soft referential integrity. There is no enforced referential integrity as in Teradata.
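In practice this means Snowflake accepts a foreign key definition but does not check it; only NOT NULL is actually enforced. A sketch with made-up table and constraint names:

ALTER TABLE Orders ADD CONSTRAINT fk_customer
    FOREIGN KEY (CustomerId) REFERENCES Customer (CustomerId);
-- rows violating this foreign key are still accepted; the constraint serves as metadata only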

Is Skewing a problem in Snowflake?

When we speak of skew in Teradata, we mean an uneven load on the AMPs.
Skew also exists in Snowflake, at the node level. Snowflake uses a method to minimize skew automatically: file stealing.

When a worker process has finished its work (i.e., scanned its input files), it requests additional files from the other worker processes. These check how many input file sets they still have to process and hand some over to the requesting worker. This balances the skew automatically.

This kind of skew optimization is not possible on a shared-nothing system like Teradata.

Does Snowflake also come with its own hardware?

No. Snowflake runs exclusively in the cloud. In principle, Snowflake has understood how to make the best use of the various cloud platforms (AWS, Google Cloud, Microsoft Azure).

Take AWS (Amazon Web Services) as an example:
Data storage uses Amazon S3.
Computing power in the form of virtual warehouses is provided via clusters of EC2 instances.
The disk cache is provided by the SSDs and main memory of the EC2 instances.

Meanwhile, Teradata is also available in the cloud. However, it should be remembered that the disadvantages of the shared-nothing architecture still exist there.

What kind of optimizer does Snowflake use?

Like Teradata, Snowflake uses a cost-based optimizer.

However, all required statistics are maintained automatically while loading or updating the data. In Teradata, these must be defined at least once by the user (there is, however, the option of having Teradata analyze the query logs to suggest statistics).

When is the Execution Plan created in Snowflake?

In Teradata, we know IPE (Incremental Planning and Execution): the execution plan is modular, and the final plan is determined during query execution.

Snowflake also makes certain decisions only at query runtime. This includes, for example, the selection of the join method.
Both optimizers can, therefore, generate dynamic execution plans.

Will Snowflake squeeze Teradata out of the market?

I think not. I hope not.

Teradata is simply a perfectly tuned system and I don't think it will disappear from the market so quickly.

However, I do see Hadoop's chances dwindling.

For example, it is very popular to use Hadoop to store JSON and, after conversion, load it into a database system. Snowflake handles this task with flying colors. No need for Hadoop.

I must admit that I am impressed with Snowflake. I have always been skeptical of the latest hypes, like the cloud, data lakes, and Hadoop, to name a few. It's difficult to separate the marketing noise from the true value. But here the developers of Snowflake seem to have scored a big hit.

Will Teradata one day have the same advantages as Snowflake (i.e. elasticity through cloud/shared data architecture)?

The architecture of Teradata limits this. Even if Teradata is offered in the cloud, this does not mean that it is a full “cloud database”.

Here one must look very carefully at what the individual providers understand by cloud. There is a lot of marketing involved.

Snowflake vs. Teradata Summary

As you have seen, with Snowflake we are dealing with a completely different design.

Do you have experience with Teradata and Snowflake? Leave me a comment and tell me about your experience!

There are countless instructions on the internet on how to migrate from Teradata to Snowflake with seemingly no complications. To me, it almost seems like an affiliate program driven by Snowflake.

I would be surprised if migration were as easy as it is presented.

Write to me about your experiences with Snowflake and whether the high expectations raised by Snowflake's marketing are met.

Try Out Snowflake For Free

You’ll receive $400 of Snowflake credit to use over 30 days:

https://trial.snowflake.com/


  • Samrat Pandiri says:

    Hi Roland. Can you shed some light on this paragraph – “Teradata offers to the cache using Teradata Virtual Memory. The cache is available to all users of the data warehouse. This cache holds the result sets of all queries of the last 24 hours. This is far more extensive than the caching of queries in Teradata.”

    • Hi Samrat. Somehow I mixed up the sentences by mistake. The 24-hour cache is available in Snowflake, and the counter starts from zero again if the result set of a query is taken from the cache. Teradata has Intelligent Memory, which is data-temperature driven. I hope it is clearer now (and correct).

      BR

      Roland

      • Samrat Pandiri says:

        I think this needs to be tweaked a bit more to clear up the confusion.

        As I understand it, Snowflake does cache the results, but Teradata doesn't cache result sets. One thing it does store in its cache is the Plastic Steps.
        Coming to Teradata Intelligent Memory, it just moves your data blocks from cold to hot storage based on access. So this method caches frequently used data blocks in memory rather than reading them from disk.

        • You are completely right. Somehow I was mixing the different types of caching. I changed it again. Thanks for your help. I really appreciate it!
