Future And Past? Snowflake Versus Teradata
In this article, I will explain the main architectural differences between Teradata and Snowflake.
It is not meant to be a complete comparison of all features or advantages and disadvantages but is limited to the differences from an architecture point of view.
Snowflake vs. Teradata Architecture
Snowflake does not have a shared-nothing architecture; instead, its computing resources access shared data.
Therefore, there is no concept comparable to Teradata's Primary Index: all computing resources can access all of the data.
This is the advantage of cloud architectures such as Snowflake: It simply scales.
On a Teradata system, an upgrade always means restructuring. When AMPs are added, tables must be redistributed to ensure optimal parallelism.
A new hash map that includes the additional AMPs is needed. To be fair, however, it must be said that since Teradata 16 this restructuring can be deferred by using several hash maps.
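With the MAPS feature, a table can be kept on its old hash map and moved to the expanded one later, at a convenient time. A sketch, assuming a Teradata 16.10+ system; the map, table, and column names are illustrative:

```sql
-- Sketch (Teradata 16.10+ MAPS feature); names are illustrative.
-- Keep an existing table on the old, smaller hash map for now:
CREATE TABLE Sales_Old ,
     MAP = TD_Map1                  -- map from before the expansion
     ( SaleId   INTEGER
     , Amount   DECIMAL(12,2) )
PRIMARY INDEX ( SaleId );

-- Later, move it to the expanded map, i.e. onto the new AMPs:
ALTER TABLE Sales_Old , MAP = TD_Map2;
```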
Snowflake has a big advantage over a full Teradata system when it comes to saving space, e.g., with Multi-Value Compression (MVC). The paradox on Teradata: you first need a lot of free space in order to save space. In Snowflake, the data is always compressed, and if no more space is available, new space can be added at short notice.
There is no such thing as a secondary index or join index in Snowflake.
Snowflake relies on the concept of the virtual warehouse to separate workloads. Snowflake says there is no need for workload management, but when you look at Teradata, it makes sense to have both.
Teradata offers truly sophisticated workload management (TASM) as well as the ability to partition the system. A virtual partition can use the full CPU resources if they are not needed by another partition: a big advantage.
Teradata's scaling is not as flexible as that of a cloud database, in which compute nodes and data can scale independently.
In Teradata, computing power and data scale together. This ensures that IO (whatever the bottleneck is) is always available in sufficient quantities.
But the disadvantage is that storage cannot simply be added without redistributing the data. Modern database systems run as much as possible in main memory. Teradata, on the other hand, uses spool files in each step of the execution plan, and these often have to be written temporarily to mass storage.
In contrast to Teradata, Snowflake collects the required statistics on its own, without the user having to do anything.
Tables are stored in so-called micro partitions (max. 500 MB in size). Within each micro partition, the data is stored in columnar form, and the individual column values are compressed.
Each micro partition carries metadata that stores the offset of each column as well as its range of values. To answer a query, Snowflake first looks up in this metadata which micro partitions are needed and where within them the data of the required columns is located.
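In practice this metadata pruning is transparent to the user. A query such as the following (table and column names are illustrative) reads only the micro partitions whose value range for `OrderDate` overlaps the filter:

```sql
-- Only micro partitions whose min/max metadata range for OrderDate
-- overlaps March 2024 are read; all others are pruned.
SELECT CustomerId
     , SUM(Amount)
FROM   Orders
WHERE  OrderDate BETWEEN '2024-03-01' AND '2024-03-31'
GROUP  BY CustomerId;
```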
This is completely different from the hashing used in Teradata. As you can see, Snowflake is a pure column store. The closest analogue to the row-level partitioning used in Teradata is the sorting of data within the columns in Snowflake.
The possibilities of tuning in Snowflake are limited in contrast to Teradata. However, the numerous options offered by Teradata are not necessary due to the architecture.
There is no workload management comparable to Teradata (TASM) in Snowflake.
Workload tuning is done by defining virtual warehouses: in principle, nothing more than independent clusters of compute nodes accessing the same data.
Performance is optimized at the level of these clusters, by sizing them and assigning workloads appropriately.
Snowflake uses intensive disk-level caching. The disk cache is a combination of SSD and memory. When creating virtual warehouses, it is advisable to place users that execute the same queries in the same virtual warehouse; the disk cache can then be used optimally.
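Creating and sizing such a warehouse is plain DDL. A sketch with illustrative names and values:

```sql
-- Illustrative: a dedicated warehouse for the reporting users.
CREATE WAREHOUSE report_wh WITH
  WAREHOUSE_SIZE = 'MEDIUM'   -- cluster size
  AUTO_SUSPEND   = 300        -- suspend after 300 seconds idle
  AUTO_RESUME    = TRUE;      -- wake up on the next query

-- Route a session (and thus its queries) to this warehouse:
USE WAREHOUSE report_wh;
```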
Snowflake also offers a cache of the result set, which is available to every user independent of the virtual warehouse.
This cache holds the result sets of all queries of the last 24 hours; each time a cached result is reused, the 24-hour period starts again from zero. This is far more extensive than the caching of query plans in Teradata.
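Whether the result cache may serve a query can be controlled per session; the parameter below is real Snowflake syntax, the table name is illustrative:

```sql
-- The second execution of an identical query within 24 hours is
-- served from the result cache, without a running warehouse:
SELECT COUNT(*) FROM Orders;   -- computed by the warehouse
SELECT COUNT(*) FROM Orders;   -- served from the result cache

-- Disable the result cache, e.g. for benchmarking:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```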
Teradata offers a block-level cache via Teradata Intelligent Memory. The cache is available to all users of the data warehouse, but it is purged frequently and is temperature-based. It is not a cache of result sets; only data blocks are cached.
It may be tempting to suspend a Snowflake virtual warehouse as soon as possible to save costs, but the disk cache is lost when the warehouse is suspended. Here, cost must be weighed against performance. This is not an issue with Teradata, where the system is available 24/7.
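This trade-off is configured via the suspend settings. A sketch assuming a warehouse named `report_wh` exists; the values are illustrative:

```sql
-- A longer idle timeout keeps the warm disk cache, but costs more credits:
ALTER WAREHOUSE report_wh SET AUTO_SUSPEND = 600;

-- Suspend immediately to save credits, accepting a cold cache afterwards:
ALTER WAREHOUSE report_wh SUSPEND;
```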
Snowflake does not use indexes or partitioning that can be determined by the user. However, it is possible to define a cluster key to locate specific column values within one or a few micro partitions:
ALTER TABLE Customer CLUSTER BY (CustomerId);
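How well the table is actually clustered on that key can then be checked with a system function (using the same illustrative table):

```sql
-- Returns a JSON document with depth and overlap statistics
-- for the micro partitions with respect to the given key:
SELECT SYSTEM$CLUSTERING_INFORMATION('Customer', '(CustomerId)');
```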
As a counterpart to the parsing engine which handles sessions and delivers result sets to the client, Snowflake has a service layer.
This manages sessions and everything that goes with them (queries, transactions, result sets, etc.) and also maintains a cache of the results of executed queries.
Snowflake only offers what Teradata calls soft referential integrity. There is no enforced referential integrity as in Teradata.
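Foreign keys can still be declared in Snowflake, e.g. for the benefit of BI tools and the optimizer, but they are not enforced (only NOT NULL is). Table names are illustrative:

```sql
CREATE TABLE Customer (
  CustomerId INTEGER PRIMARY KEY        -- declared, but not enforced
);

CREATE TABLE Orders (
  OrderId    INTEGER,
  CustomerId INTEGER REFERENCES Customer (CustomerId)  -- soft RI only
);

-- This insert succeeds even though customer 999 does not exist:
INSERT INTO Orders VALUES (1, 999);
```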
When we speak of skew in Teradata we mean an uneven load on the AMP.
Skew is also present in Snowflake, at the node level. Snowflake uses a method to minimize skew automatically: file stealing.
When a worker process has finished its work (i.e., scanned its input files), it requests additional files from the other workers. Those workers check how many input file sets they still have left to process and hand some over to the requesting worker. This balances the skew automatically.
This kind of skew optimization is not possible on a shared-nothing system like Teradata.
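On the Teradata side, the distribution of rows across the AMPs (and thus the skew) can at least be inspected with the hash functions; the table and primary index column are illustrative:

```sql
-- Rows per AMP for the primary index column; a large spread
-- between the smallest and largest row counts indicates skew:
SELECT HASHAMP(HASHBUCKET(HASHROW(CustomerId))) AS amp_no
     , COUNT(*)                                 AS row_count
FROM   Customer
GROUP  BY 1
ORDER  BY 2 DESC;
```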
Snowflake runs exclusively in the cloud; there is no on-premises version. In principle, Snowflake has understood how to make the best use of the various cloud services (Amazon, Google, Azure).
Take AWS (Amazon Web Services) as an example:
- Data storage uses Amazon S3.
- Computing power in the form of virtual warehouses is provided via clusters of EC2 instances.
- The disk cache is provided by the SSDs and main memory of the EC2 instances.
Meanwhile, Teradata is also available in the cloud. However, it should be remembered that the disadvantages of the shared-nothing architecture persist there as well.
Like Teradata, Snowflake uses a cost-based optimizer.
However, all required statistics are maintained automatically while loading or updating the data. In Teradata, these must be defined at least once by the user (there is, however, the option of having Teradata analyze the query logs to suggest statistics).
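In Teradata, those user-defined statistics look like this (the table and column are illustrative):

```sql
-- Teradata: statistics must be declared and refreshed explicitly.
COLLECT STATISTICS COLUMN (CustomerId) ON Customer;

-- Refresh all previously defined statistics later with the short form:
COLLECT STATISTICS ON Customer;
```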
We know IPE (Incremental Planning and Execution) in Teradata. Here, the execution plan is modular, and the final plan is determined during the execution of the query.
Snowflake also makes certain decisions only at query runtime. This includes, for example, the selection of the join method.
Both optimizers can, therefore, generate dynamic execution plans.
I think not, and I hope not: Teradata is simply a perfectly tuned system, and I don't think it will disappear from the market so quickly.
However, I do see Hadoop's chances dwindling.
For example, it is very popular to use Hadoop to store JSON and, after conversion, load it into a database system. Snowflake handles this task with flying colors. No need for Hadoop.
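Snowflake loads and queries semi-structured JSON natively via the VARIANT type, without a separate conversion step; names and data are illustrative:

```sql
-- Illustrative: store raw JSON directly in a VARIANT column.
CREATE TABLE raw_events (payload VARIANT);

INSERT INTO raw_events
SELECT PARSE_JSON('{"user": "alice", "action": "login", "ts": "2024-03-01"}');

-- Query into the JSON with path notation and a cast:
SELECT payload:user::STRING   AS user_name
     , payload:action::STRING AS action
FROM   raw_events;
```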
I must admit that I am impressed with Snowflake. I have always been skeptical of the latest hypes, like cloud, data lake, Hadoop, etc., to name a few. It's difficult to separate the marketing noise from the true value. But here the developers of Snowflake seem to have scored a big hit.
The architecture of Teradata limits this kind of flexibility. Even if Teradata is offered in the cloud, this does not mean that it is a full “cloud database”.
Here one must look very carefully at what the individual providers understand by cloud. There is a lot of marketing involved.
Snowflake vs. Teradata Summary
As you have seen, with Snowflake we are dealing with a completely different design.
Do you have experience with Teradata and Snowflake? Leave me a comment and tell me about your experience!
There are countless instructions on the internet on how to migrate from Teradata to Snowflake with seemingly no complications. To me, it almost seems like an affiliate program driven by Snowflake.
I would be surprised if migration were as easy as it is presented.
Write to me about your experiences with Snowflake and whether the high expectations of Snowflake Marketing keep their promises.
Try Out Snowflake For Free
You’ll receive $400 of Snowflake credit to use over 30 days: