In this article, I will explain the main differences in architecture between Teradata and Snowflake.
It is not meant to be a complete comparison of all features or advantages and disadvantages but is limited to the differences from an architectural perspective.
Snowflake vs. Teradata Architecture
How is the data distributed in Snowflake?
Snowflake is not a Shared Nothing architecture; instead, the computing resources access shared data.
Therefore there is no concept like the Primary Index. All computing resources can access the data.
What happens if the system size has to be increased?
This is the advantage of cloud architectures such as Snowflake: It simply scales.
On a Teradata system, an upgrade always means a restructuring. When AMPs come in, tables need to be redistributed to ensure optimal parallelism.
A new hashmap is needed, which contains the additional AMPs. However, it must be said that since Teradata 16, the restructuring can be delayed by using several hashmaps.
A significant advantage over a full Teradata system is given if you want to save space, e.g., by MVC. The paradox: In this case, you first need a lot of space to save space. In Snowflake, the data is always compressed, and if there is no more space available, new space can be added at short notice.
What about indexes?
There is no such thing as a secondary index or join index in Snowflake.
What about workload management?
Snowflake relies on the concept of a virtual warehouse that separates the workload. Snowflake says there is no need for workload management, but it makes sense to have both when you look at Teradata.
Teradata offers a genuinely sophisticated Workload Management (TASM) and the ability to partition the system. A virtual partition can access the full resources (CPU) if they are not needed by another partition—a significant advantage.
What is a disadvantage of the Teradata architecture compared to Snowflake?
Scaling is not as flexible as with a cloud database in which compute nodes and data can scale independently.
In Teradata, computing power and data scale together. This ensures that IO (whatever the bottleneck is) is always available in sufficient quantities.
But the disadvantage is that more storage cannot merely be added without distributing the data. Modern database systems run as much as possible in the main memory. On the other hand, Teradata uses spool files in each step of the execution plan, which often have to be stored temporarily on the mass storage.
How are statistics collected?
In contrast to Teradata, Snowflake collects the required statistics on its own, without the user having to do anything.
How does Snowflake access data?
Tables are stored in so-called micro partitions (max. 500MB in size). Within each micro partition, the data columnar is stored (the individual column values are compressed).
Each micro partition has metadata. In the metadata, the offset of each column is stored as well as the range of values. Access is gained by looking up the metadata in which a micro partition is needed to query the tables and where the required columns’ data are located.
This is completely different from the hashing used in Teradata. But as you can see, Snowflake is a pure column store. Similar to the row-level partitioning used in Teradata is the sorting of data within columns in Snowflake.
What does SQL Tuning look like in Snowflake?
The possibilities of tuning in Snowflake are limited in contrast to Teradata. However, the numerous options offered by Teradata are not necessary due to the architecture.
There is no workload management comparable to Teradata (TASM) in Snowflake.
Workload tuning is done by defining virtual warehouses. In principle, nothing more than independent clusters of computers accessing the same data.
By forming clusters, performance can be optimized at the highest level.
Snowflake uses intensive disk-level caching. The Disk Cache is a combination of SSD and memory. When creating virtual warehouses, it is advisable to create the users that execute the same queries in the same virtual warehouse. Then the disk cache can be used optimally.
Snowflake also offers a cache of the result set, which is available to every user independent of the virtual warehouse.
This cache holds the result sets of all queries of the last 24 hours. Each time the cache is used, the counter starts from zero again. This is far more extensive than the caching of the query plan in Teradata.
Teradata offers a block-level cache using Teradata Virtual Memory. The cache is available to all users of the data warehouse but purged frequently and temperature-based. But it’s not a cache of the result set. It’s caching of data blocks only.
It may be tempting to shut down a Snowflake virtual warehouse as soon as possible to save costs, but the disk cache is also lost when the warehouse is closed. Here, costs must be weighed up against performance. This is not an issue at Teradata. The system is available 24/7.
Snowflake does not use indexes or partitioning that can be determined by the user. However, it is possible to define a cluster key to locate specific column values within one or a few micro partitions:
ALTER TABLE Customer CLUSTER BY (CustomerId);
How are user sessions handled?
As a counterpart to the parsing engine which handles sessions and delivers result sets to the client, Snowflake has a service layer.
This manages sessions and everything that goes with them (queries, transactions, result sets, etc.) and also maintains a cache of the results of executed queries.
What about Referential Integrity?
Snowflake only offers what Teradata understands by Soft Referential Integrity. There is no forced referential integrity as in Teradata.
Is Skewing a problem in Snowflake?
When we speak of skew in Teradata, we mean an uneven load on the AMP.
Skewing is also present in Snowflake at the node level. Snowflake uses a method to minimize skew automatically: file stealing.
When a worker process has finished its work (i.e., scanned its input files), it requests additional files from other worker tasks. The other worker tasks check how many input file sets they have left to process and pass them to the requesting worker task. This automatically balances the skew.
This kind of skew optimization is not possible on a shared-nothing system like Teradata.
Does Snowflake also come with its own hardware?
No. Snowflake runs exclusively in the cloud. In principle, Snowflake has understood how to make the best use of the various cloud services (Amazon, Google, Azure).
Take AWS (Amazon Web Services) as an example:
Data Storage uses Amazon S3.
Computing power in the form of a virtual warehouse is provided via clusters of EC2 instances.
Disk cache is provided by SSD/Main Memory of the EC2 instances.
Meanwhile, there is also Teradata in the cloud. However, it should be remembered that the disadvantages of Shared Nothing Architecture exist here.
What kind of optimizer does Snowflake use?
Like Teradata, Snowflake uses a cost-based optimizer.
However, all required statistics are automatically maintained while loading or updating the data. In Teradata, these must be defined at least once by the user (there is, however, the option of having Teradata analyze the query logs to suggest statistics).
When is the Execution Plan created in Snowflake?
We know IPE (Incremental Planning and Execution) in Teradata. Here, the execution plan is modular, and the final plan is determined during the query’s execution.
Snowflake also makes certain decisions only at query runtime. This includes, for example, the selection of the join method.
Both optimizers can, therefore, generate dynamic execution plans.
Will Snowflake squeeze Teradata out of the market?
I think not. I hope not.
Teradata is simply a perfectly tuned system, and I don’t think it will quickly disappear from the market.
However, I do see Hadoop’s chances dwindling.
For example, it is trendy to use Hadoop to store JSON and load after conversion into a database system. Snowflake can handle this task with flying colors—no need for Hadoop.
I must admit that I am impressed with Snowflake. I have always been skeptical of the latest hypes like Cloud, Data lake, Hadoop, etc. to name a few. It’s difficult to separate the marketing scream from the actual values. But here, the developers of Snowflake seem to have made a big hit.
Will Teradata one day have the same advantages as Snowflake (i.e. elasticity through cloud/shared data architecture)?
The architecture of Teradata limits this. Even if Teradata is offered in the cloud, this does not mean that it is a full “cloud database”.
Here one must look very carefully at what the individual providers understand by cloud. There is a lot of marketing involved.
Snowflake vs. Teradata Summary
As you have seen, at Snowflake, we are dealing with a completely different design.
Do you have experience with Teradata and Snowflake? Leave me a comment and tell me about your experience!
[appbox googleplay com.dwhpro.quiz2020]
There are countless instructions on the internet on migrating from Teradata to Snowflake with seemingly no complications. To me, it almost seems like an affiliate program driven by Snowflake.
I would be surprised if migration were as easy as it is presented.
Please write to me about your experiences with Snowflake and whether the high expectations of Snowflake Marketing keep their promises. Here is a brief overview of what to expect when migrating from Teradata to Snowflake:
- Table DDL statements can be quite easily migrated
- View definitions can be quite easily migrated
- Teradata Load Utilities (fastload, multiload, tpump, tpt) are replaced with the COPY INTO command.
Try Out Snowflake For Free
You’ll receive $400 of Snowflake credit to use over 30-days: