How Snowflake Reduces Load Times And Why Teradata Can’t

Roland Wenzlofsky

February 1, 2020


Teradata Load Times vs. Snowflake Load Times

With modern cloud databases like Snowflake, elasticity is always emphasized as essential. It is cited as a significant difference from on-premise shared-nothing databases like Teradata.

Lower cost (“pay only for what you use”) is cited as one of the main advantages. I think this is a very one-sided view: databases like Snowflake first have to prove how they compare to Teradata when a high number of parallel queries must be handled and Snowflake has to be scaled up accordingly.

I don’t think that a winner can be chosen so easily. Performance depends too much on the workload for general statements to be made here.

In this blog post, I will not go into the total cost of ownership but will show how Snowflake makes it possible to accelerate the ETL process without incurring higher costs (at least with the existing pricing model).

I will also show you why accelerating the ETL process is not so easily possible with Teradata. But as I said, the point here is not that Snowflake has a lower TCO.

I’m just showing that elasticity opens up new possibilities in the ETL process that on-premises systems such as Teradata or Netezza, and even Teradata in the cloud, do not offer.

The ETL Process

Let us first look at a typical ETL process. Data is read from a source, transformed, possibly historized, and finally written to the target database.

Although there are dependencies between some ETL jobs/scripts, many of them can be executed in parallel because they are independent of each other.

What is the usual setup these days? On an ETL server or cluster of servers, several ETL scripts are executed simultaneously in a batch.
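To make this setup concrete, here is a minimal Python sketch of such a batch. The script names and the degree of parallelism are made up; in a real environment, a scheduler or ETL tool would take care of the dependencies:

```python
# Minimal sketch of a nightly batch that runs independent ETL scripts in
# parallel on an ETL server. Script names and parallelism are hypothetical;
# a real setup would use a scheduler or ETL tool to honour dependencies.
import subprocess
from concurrent.futures import ThreadPoolExecutor

ETL_SCRIPTS = [
    "load_customers.py",
    "load_orders.py",
    "load_products.py",
    "load_payments.py",
]

def run_script(script: str) -> int:
    """Run one ETL script as a separate process and return its exit code."""
    return subprocess.run(["python", script]).returncode

# The independent jobs are started concurrently, just as a batch scheduler
# would start them on an ETL server or cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    exit_codes = list(pool.map(run_script, ETL_SCRIPTS))

if any(code != 0 for code in exit_codes):
    raise RuntimeError("At least one ETL job failed")
```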

Sizing The Compute Node Cluster in Snowflake

Figure: Snowflake ETL

What can we do to speed up the loading process at the same cost? One possibility is to double the number of compute nodes in a Snowflake cluster.

If we assume per-minute billing, as is the case with Snowflake, we can halve the loading time at the same cost. Our load is thus completed in half the time.
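A quick back-of-the-envelope calculation illustrates the argument. The credit rates and runtimes in the sketch below are assumptions, and the conclusion only holds if the load really scales linearly with the number of compute nodes:

```python
# Back-of-the-envelope check of the "half the time at the same cost" claim.
# The credit rates and runtimes are assumptions for illustration only, and
# the result holds only if the load scales linearly with the compute nodes.
small_warehouse_credits_per_hour = 4   # assumed rate for the smaller warehouse
large_warehouse_credits_per_hour = 8   # assumed rate for twice the compute nodes

runtime_small_hours = 2.0                        # hypothetical batch runtime
runtime_large_hours = runtime_small_hours / 2    # ideal linear scaling

cost_small = small_warehouse_credits_per_hour * runtime_small_hours   # 8 credits
cost_large = large_warehouse_credits_per_hour * runtime_large_hours   # 8 credits

print(cost_small, cost_large)   # 8.0 8.0 -> same cost, half the loading time
```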

As you can see, this model stands or falls with how cloud resource usage is billed.

But even if there is no cost advantage, you can see that we can control our loading process flexibly in Snowflake.

In the ideal case, however, the loading time can be reduced without additional costs.

For example, during the month-end load, when many additional jobs are executed, more compute nodes could be used to meet service-level agreements. Teradata does not offer this possibility; I will explain why later.
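As an illustration, here is a minimal sketch of how such a temporary resize could be scripted. The warehouse name, the sizes, and the connection details are made up; only the ALTER WAREHOUSE statement itself is standard Snowflake SQL:

```python
# Sketch: temporarily resize the ETL warehouse around the month-end batch.
# Warehouse name, sizes, and connection details are hypothetical;
# ALTER WAREHOUSE ... SET WAREHOUSE_SIZE is standard Snowflake SQL.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ETL_USER", password="***", account="my_account"
)
cur = conn.cursor()

try:
    # Scale up just before the month-end load to meet the SLA.
    cur.execute("ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'XLARGE'")
    # ... run the month-end ETL jobs here ...
finally:
    # Scale back down so the larger size is only billed for the batch window.
    cur.execute("ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'MEDIUM'")
    conn.close()
```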

Adding Additional Compute Node Clusters in Snowflake

Figure: Snowflake Cluster

Another possibility is to run the scripts on multiple independent clusters.

This will reduce the contention, and the runtimes will be shorter than in the previous variant, where we added additional compute nodes to the cluster.

The jobs in each cluster now have access to the dedicated resources of that cluster.

We must not forget that the jobs within a single cluster compete for CPU, memory, and I/O; this contention comes with an overhead that should not be underestimated.
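Here is a minimal sketch of this variant, again with made-up warehouse names, sizes, job groups, and procedures. The second warehouse is created on demand, and each group of jobs is pinned to its own cluster:

```python
# Sketch: run two independent groups of ETL jobs on two separate warehouses
# (clusters) so they no longer compete for CPU, memory, and I/O.
# Warehouse names, sizes, job grouping, and procedures are hypothetical.
import snowflake.connector
from concurrent.futures import ThreadPoolExecutor

def run_jobs_on(warehouse: str, statements: list[str]) -> None:
    """Open a session pinned to one warehouse and run its job group there."""
    conn = snowflake.connector.connect(
        user="ETL_USER", password="***", account="my_account"
    )
    cur = conn.cursor()
    cur.execute(f"USE WAREHOUSE {warehouse}")
    for stmt in statements:
        cur.execute(stmt)
    conn.close()

# Create the second cluster on demand; it is available within minutes.
admin = snowflake.connector.connect(
    user="ETL_USER", password="***", account="my_account"
)
admin.cursor().execute(
    "CREATE WAREHOUSE IF NOT EXISTS ETL_WH_2 "
    "WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
)
admin.close()

# Each group now gets the dedicated resources of its own cluster.
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(run_jobs_on, "ETL_WH_1",
                ["CALL LOAD_CUSTOMERS()", "CALL LOAD_ORDERS()"])
    pool.submit(run_jobs_on, "ETL_WH_2",
                ["CALL LOAD_PRODUCTS()", "CALL LOAD_PAYMENTS()"])
```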

Why Teradata Is Not Suitable

Starting new compute nodes incurs costs, but Snowflake can do it within a few minutes and is therefore particularly well suited to reacting flexibly to changing workloads.

Teradata in the cloud is not nearly as flexible. Even though Teradata is scalable, its shared-nothing architecture prevents scaling in real time: adding nodes means the hash-distributed data has to be redistributed across all AMPs, which is far too slow and disruptive to do for a single batch window.

As I mentioned, this doesn’t say anything about the total cost of ownership, and I don’t think costs can be defined in general terms.

Snowflake’s marketing is misleading when it claims that Snowflake has a much lower TCO than Teradata.

On the other hand, Teradata claims the opposite for the parallel execution of a vast number of competing requests, where Snowflake needs many additional compute nodes to achieve good performance.

I think they both are wrong in their marketing because, in the end, it depends on the workload.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

You might also like

>