With modern cloud databases, such as Snowflake, elasticity is always emphasized as essential. It is cited as a major difference to on-premise shared-nothing databases like Teradata.
Mainly the lower cost (“pay only what you need”) is cited as one of the main advantages. I think this is a very one-sided approach, and databases like Snowflake first have to prove how they compare to Teradata when it comes to a high number of parallel queries, and the Snowflake has to be scaled up accordingly.
I don’t think that a winner can be chosen so easily. Performance depends too much on the workload as general statements can be made here.
In this blog post, I will neither go into total costs of ownership, but show how Snowflake makes it possible to accelerate the ETL process without incurring higher cost (at least with the existing pricing model)
I will also show you why the acceleration of the ETL process is not that easily possible with Teradata. But As I said, the statement here is not that Snowflake has a lower TCO.
I’m just showing that elasticity opens up new possibilities in the ETL process that on-premise systems like Teradata on-premises (or Netezza) or Teradata in the cloud do not offer.
The ETL Process
Let us first look at a typical ETL process. Data is read from a source, transformed, possibly historized, and finally written to the target database.
Although there are dependencies between ETL jobs/scripts, many of the scripts can be executed in parallel because there are no dependencies.
What is the usual setup these days? On an ETL server or cluster of servers, a certain number of ETL scripts are executed in a batch simultaneously.
Sizing The Compute Node Cluster in Snowflake
What can we do to speed up the loading process at the same cost? One possibility is to double the number of compute nodes in a Snowflake cluster.
If we assume a minute-based billing, we can halve the loading time, as is Snowflake’s case. Our loading is thus completed in half the time at the same cost.
As you can see, this model falls and stands with how cloud resources usage is charged.
But even if a cost advantage is not given, you see that we can flexibly control our loading process in Snowflake.
In the ideal case, however, the loading time can be reduced without additional costs
For example, at the end of a month load, where several jobs are executed, more compute nodes could be used to reach service level agreements. This possibility is not given with Teradata. I will explain later why.
Adding Additional Compute Node Clusters in Snowflake
Another possibility is if we run the scripts on multiple independent clusters.
This will reduce the contention, and the runtimes will be shorter than in the previous variant, where we added additional compute nodes to the cluster.
The jobs in each cluster now have access to the dedicated resources of that cluster.
We must not forget that the jobs within a cluster compete for CPU, memory, and IOs. This is connected with a not to be underestimated effort.
Why Teradata Is Not Suitable
Starting new compute nodes causes costs. Snowflake can do this within a few minutes and is therefore particularly suited to react flexibly to changing workloads.
Teradata in the cloud is not nearly as flexible. Even though Teradata is scalable, shared-nothing architecture prevents scalability in real-time.
As I mentioned before, this doesn’t say anything about the total cost of ownership, and I don’t think costs can be defined in general terms.
Snowflake’s marketing is misleading when it is claimed that Snowflake has a much lower TCO than Teradata.
But on the other hand, Teradata claims the opposite when it comes to the parallel execution of a vast amount of competing requests where many additional compute nodes are needed in Snowflake to achieve good performance.
I think they both are wrong in their marketing because, in the end, it depends on the workload.