Future And Past? Snowflake Versus Teradata
In this article, I will explain the main architectural differences between Teradata and Snowflake.
It is not meant to be a complete comparison of all features or advantages and disadvantages but is limited to the differences from an architecture point of view.
Snowflake vs. Teradata Architecture
Snowflake does not have a shared-nothing architecture; instead, its computing resources access shared data.
Therefore, there is no concept comparable to Teradata's Primary Index: all computing resources can access all of the data.
This is the advantage of cloud architectures such as Snowflake: It simply scales.
On a Teradata system, an upgrade always means restructuring. When AMPs are added, tables must be redistributed to ensure optimal parallelism.
A new hash map that includes the additional AMPs is needed. To be fair, however, it must be said that since Teradata 16 this restructuring can be deferred by using several hash maps.
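With the MAPS feature, a table can be kept on its old hash map and moved to the expanded one later, at a convenient time. A sketch, assuming a Teradata 16.10+ system; the map, table, and column names are illustrative:

```sql
-- Sketch (Teradata 16.10+ MAPS feature); names are illustrative.
-- Keep an existing table on the old, smaller hash map for now:
CREATE TABLE Sales_Old ,
     MAP = TD_Map1                  -- map from before the expansion
     ( SaleId   INTEGER
     , Amount   DECIMAL(12,2) )
PRIMARY INDEX ( SaleId );

-- Later, move it to the expanded map, i.e. onto the new AMPs:
ALTER TABLE Sales_Old , MAP = TD_Map2;
```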
Snowflake has a big advantage over a full Teradata system when it comes to saving space, e.g., with Multi-Value Compression (MVC). The paradox on Teradata: you first need a lot of free space in order to save space. In Snowflake, the data is always compressed, and if no more space is available, new space can be added at short notice.
There is no such thing as a secondary index or join index in Snowflake.
Snowflake relies on the concept of the virtual warehouse to separate workloads. Snowflake says there is no need for workload management, but when you look at Teradata, it makes sense to have both.
Teradata offers truly sophisticated workload management (TASM) as well as the ability to partition the system. A virtual partition can use the full CPU resources if they are not needed by another partition: a big advantage.
Teradata's scaling is not as flexible as that of a cloud database, in which compute nodes and data can scale independently.
In Teradata, computing power and data scale together. This ensures that IO (whatever the bottleneck is) is always available in sufficient quantities.
But the disadvantage is that storage cannot simply be added without redistributing the data. Modern database systems run as much as possible in main memory. Teradata, on the other hand, uses spool files in each step of the execution plan, and these often have to be written temporarily to mass storage.
In contrast to Teradata, Snowflake collects the required statistics on its own, without the user having to do anything.
Tables are stored in so-called micro partitions (max. 500 MB in size). Within each micro partition, the data is stored in columnar form, and the individual column values are compressed.
Each micro partition carries metadata that stores the offset of each column as well as its range of values. To answer a query, Snowflake first looks up in this metadata which micro partitions are needed and where within them the data of the required columns is located.
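In practice this metadata pruning is transparent to the user. A query such as the following (table and column names are illustrative) reads only the micro partitions whose value range for `OrderDate` overlaps the filter:

```sql
-- Only micro partitions whose min/max metadata range for OrderDate
-- overlaps March 2024 are read; all others are pruned.
SELECT CustomerId
     , SUM(Amount)
FROM   Orders
WHERE  OrderDate BETWEEN '2024-03-01' AND '2024-03-31'
GROUP  BY CustomerId;
```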
This is completely different from the hashing used in Teradata. As you can see, Snowflake is a pure column store. The closest analogue to the row-level partitioning used in Teradata is the sorting of data within the columns in Snowflake.
The possibilities of tuning in Snowflake are limited in contrast to Teradata. However, the numerous options offered by Teradata are not necessary due to the architecture.
There is no workload management comparable to Teradata (TASM) in Snowflake.
Workload tuning is done by defining virtual warehouses: in principle, nothing more than independent clusters of compute nodes accessing the same data.
Performance is optimized at the level of these clusters, by sizing them and assigning workloads appropriately.
Snowflake uses intensive disk-level caching. The disk cache is a combination of SSD and memory. When creating virtual warehouses, it is advisable to place users that execute the same queries in the same virtual warehouse; the disk cache can then be used optimally.
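Creating and sizing such a warehouse is plain DDL. A sketch with illustrative names and values:

```sql
-- Illustrative: a dedicated warehouse for the reporting users.
CREATE WAREHOUSE report_wh WITH
  WAREHOUSE_SIZE = 'MEDIUM'   -- cluster size
  AUTO_SUSPEND   = 300        -- suspend after 300 seconds idle
  AUTO_RESUME    = TRUE;      -- wake up on the next query

-- Route a session (and thus its queries) to this warehouse:
USE WAREHOUSE report_wh;
```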
Snowflake also offers a cache of the result set, which is available to every user independent of the virtual warehouse.
This cache holds the result sets of all queries of the last 24 hours; each time a cached result is reused, the 24-hour period starts again from zero. This is far more extensive than the caching of query plans in Teradata.
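Whether the result cache may serve a query can be controlled per session; the parameter below is real Snowflake syntax, the table name is illustrative:

```sql
-- The second execution of an identical query within 24 hours is
-- served from the result cache, without a running warehouse:
SELECT COUNT(*) FROM Orders;   -- computed by the warehouse
SELECT COUNT(*) FROM Orders;   -- served from the result cache

-- Disable the result cache, e.g. for benchmarking:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```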
Teradata offers a block-level cache via Teradata Intelligent Memory. The cache is available to all users of the data warehouse, but it is purged frequently and is temperature-based. It is not a cache of result sets; only data blocks are cached.
It may be tempting to suspend a Snowflake virtual warehouse as soon as possible to save costs, but the disk cache is lost when the warehouse is suspended. Here, cost must be weighed against performance. This is not an issue with Teradata, where the system is available 24/7.
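This trade-off is configured via the suspend settings. A sketch assuming a warehouse named `report_wh` exists; the values are illustrative:

```sql
-- A longer idle timeout keeps the warm disk cache, but costs more credits:
ALTER WAREHOUSE report_wh SET AUTO_SUSPEND = 600;

-- Suspend immediately to save credits, accepting a cold cache afterwards:
ALTER WAREHOUSE report_wh SUSPEND;
```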
Snowflake does not use indexes or partitioning that can be determined by the user. However, it is possible to define a cluster key to locate specific column values within one or a few micro partitions:
ALTER TABLE Customer CLUSTER BY (CustomerId);
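How well the table is actually clustered on that key can then be checked with a system function (using the same illustrative table):

```sql
-- Returns a JSON document with depth and overlap statistics
-- for the micro partitions with respect to the given key:
SELECT SYSTEM$CLUSTERING_INFORMATION('Customer', '(CustomerId)');
```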
As a counterpart to the parsing engine which handles sessions and delivers result sets to the client, Snowflake has a service layer.
This manages sessions and everything that goes with them (queries, transactions, result sets, etc.) and also maintains a cache of the results of executed queries.
Snowflake only offers what Teradata calls soft referential integrity. There is no enforced referential integrity as in Teradata.
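Foreign keys can still be declared in Snowflake, e.g. for the benefit of BI tools and the optimizer, but they are not enforced (only NOT NULL is). Table names are illustrative:

```sql
CREATE TABLE Customer (
  CustomerId INTEGER PRIMARY KEY        -- declared, but not enforced
);

CREATE TABLE Orders (
  OrderId    INTEGER,
  CustomerId INTEGER REFERENCES Customer (CustomerId)  -- soft RI only
);

-- This insert succeeds even though customer 999 does not exist:
INSERT INTO Orders VALUES (1, 999);
```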
When we speak of skew in Teradata we mean an uneven load on the AMP.
Skew is also present in Snowflake, at the node level. Snowflake uses a method to minimize skew automatically: file stealing.
When a worker process has finished its work (i.e., scanned its input files), it requests additional files from the other workers. Those workers check how many input file sets they still have left to process and hand some over to the requesting worker. This balances the skew automatically.
This kind of skew optimization is not possible on a shared-nothing system like Teradata.
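On the Teradata side, the distribution of rows across the AMPs (and thus the skew) can at least be inspected with the hash functions; the table and primary index column are illustrative:

```sql
-- Rows per AMP for the primary index column; a large spread
-- between the smallest and largest row counts indicates skew:
SELECT HASHAMP(HASHBUCKET(HASHROW(CustomerId))) AS amp_no
     , COUNT(*)                                 AS row_count
FROM   Customer
GROUP  BY 1
ORDER  BY 2 DESC;
```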
Snowflake runs exclusively in the cloud; there is no on-premises version. In principle, Snowflake has understood how to make the best use of the various cloud services (Amazon, Google, Azure).
Take AWS (Amazon Web Services) as an example:
- Data storage uses Amazon S3.
- Computing power in the form of virtual warehouses is provided via clusters of EC2 instances.
- The disk cache is provided by the SSDs and main memory of the EC2 instances.
Meanwhile, Teradata is also available in the cloud. However, it should be remembered that the disadvantages of the shared-nothing architecture persist there as well.
Like Teradata, Snowflake uses a cost-based optimizer.
However, all required statistics are maintained automatically while loading or updating the data. In Teradata, these must be defined at least once by the user (there is, however, the option of having Teradata analyze the query logs to suggest statistics).
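In Teradata, those user-defined statistics look like this (the table and column are illustrative):

```sql
-- Teradata: statistics must be declared and refreshed explicitly.
COLLECT STATISTICS COLUMN (CustomerId) ON Customer;

-- Refresh all previously defined statistics later with the short form:
COLLECT STATISTICS ON Customer;
```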
We know IPE (Incremental Planning and Execution) in Teradata. Here, the execution plan is modular, and the final plan is determined during the execution of the query.
Snowflake also makes certain decisions only at query runtime. This includes, for example, the selection of the join method.
Both optimizers can, therefore, generate dynamic execution plans.
I think not, and I hope not: Teradata is simply a perfectly tuned system, and I don't think it will disappear from the market so quickly.
However, I do see Hadoop's chances dwindling.
For example, it is very popular to use Hadoop to store JSON and, after conversion, load it into a database system. Snowflake handles this task with flying colors. No need for Hadoop.
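Snowflake loads and queries semi-structured JSON natively via the VARIANT type, without a separate conversion step; names and data are illustrative:

```sql
-- Illustrative: store raw JSON directly in a VARIANT column.
CREATE TABLE raw_events (payload VARIANT);

INSERT INTO raw_events
SELECT PARSE_JSON('{"user": "alice", "action": "login", "ts": "2024-03-01"}');

-- Query into the JSON with path notation and a cast:
SELECT payload:user::STRING   AS user_name
     , payload:action::STRING AS action
FROM   raw_events;
```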
I must admit that I am impressed with Snowflake. I have always been skeptical of the latest hypes, like cloud, data lake, Hadoop, etc., to name a few. It's difficult to separate the marketing noise from the true value. But here the developers of Snowflake seem to have scored a big hit.
The architecture of Teradata limits this kind of flexibility. Even if Teradata is offered in the cloud, this does not mean that it is a full “cloud database”.
Here one must look very carefully at what the individual providers understand by cloud. There is a lot of marketing involved.
Snowflake vs. Teradata Summary
As you have seen, with Snowflake we are dealing with a completely different design.
Do you have experience with Teradata and Snowflake? Leave me a comment and tell me about your experience!
There are countless instructions on the internet on how to migrate from Teradata to Snowflake with seemingly no complications. To me, it almost seems like an affiliate program driven by Snowflake.
I would be surprised if migration were as easy as it is presented.
Write to me about your experiences with Snowflake and whether the high expectations of Snowflake Marketing keep their promises.
Try Out Snowflake For Free
You’ll receive $400 of Snowflake credit to use over 30 days: