The 15-Year Detour: How the Data Industry Spent Billions Reinventing SQL


Somewhere around 2020, the data world quietly arrived at a conclusion that Teradata engineers could have told you in 1984: SQL on a massively parallel architecture is a pretty good way to process large volumes of data.

The path to get there was anything but quiet. It involved billions in capital, an entire generation of engineers learning to write Java MapReduce jobs, a hype cycle that convinced many Fortune 500 companies they needed a Hadoop cluster, and a slow, slightly embarrassing retreat back to the thing that worked all along.

If you are currently evaluating Databricks for your data platform and wondering why there are two ways to do the same thing — SQL and PySpark — the answer lies in this history. Understanding it matters because the choice you make will affect your team’s productivity for years.

2006 — The Year SQL Was Declared Dead

In 2006, Hadoop was released as an open-source project. Inspired by two Google papers, one on the Google File System and one on MapReduce, it promised something revolutionary: processing massive datasets across clusters of cheap commodity hardware.

There was just one catch. If you wanted to join two datasets, you had to write hundreds of lines of Java. Map functions. Reduce functions. Serialization logic. Custom partitioners. No SQL. No query optimizer. No schema. Just raw code running on raw files.

This was not a design flaw. It was a design philosophy. The people who built Hadoop — and the broader ecosystem around it — were software engineers inspired by Google’s research papers on distributed computing. They thought in code, not queries. And they carried a deep, almost ideological skepticism toward relational databases.

SQL, they said, does not scale. Relational databases are too rigid. The future belongs to schema-on-read and distributed computation written in real programming languages.

The enterprise world listened. And started hiring Java developers to do what their SQL developers had been doing perfectly well for decades.

2008 — The First Crack in the Facade

It took about two years for the friction to become unbearable.

At Facebook, the data infrastructure team had a problem. The team was growing fast, and they could not teach every new engineer and analyst to write complex Java MapReduce jobs just to answer basic questions about data. SQL was the language their people already knew. So they built Apache Hive — a layer that translated SQL-like queries into MapReduce jobs running on Hadoop.

Hive was open-sourced and quickly became one of the most widely used tools in the Hadoop ecosystem.

Let that sink in. The ecosystem intended to replace SQL needed a SQL interface within two years of its launch.

But there was a problem. Hive was painfully slow. A query that would take seconds on Teradata could take minutes — sometimes tens of minutes — on Hive. The reason was MapReduce itself: every stage wrote intermediate results to disk. The architecture intended to replace the data warehouse was slower by orders of magnitude.

2014 — Spark Arrives (Code-First, Again)

Apache Spark began as a research project at UC Berkeley, was open-sourced in 2010, and reached production maturity in 2014. Spark solved Hadoop’s biggest problem: it kept data in memory between processing steps instead of writing to disk after every stage. The performance improvement was dramatic — 10x to 100x faster than MapReduce for many workloads.


But here is where the pattern repeats. Spark’s primary API was code-first: Scala, then Python (PySpark). The people who built it were computer scientists, not data warehouse architects. The API was elegant, powerful, and completely foreign to anyone who had spent their career writing SQL.

Databricks — the company created to commercialize Spark — quickly realized this was a problem. The enterprise market they wanted to sell into — banks, telcos, insurance companies — had thousands of SQL developers and almost no PySpark developers. So Spark SQL shipped with Spark 1.0, allowing users to write familiar SQL queries that compiled to the same execution plans as PySpark code.

Once again, the SQL-free future needed SQL to survive contact with reality.

So, Why Do Both Still Exist?

This is the question that most Databricks tutorials skip over. If SQL and PySpark compile to the same Spark execution plan and deliver virtually identical performance, why does PySpark exist at all?

The answer is not about data processing. SQL handles joins, aggregations, window functions, and filtering — the bread and butter of data transformation — with less code, better readability, and easier maintenance than PySpark. For 80 to 90 percent of pipeline work, SQL is the better choice. Any SQL developer can read, debug, and maintain it. You are not dependent on one Python specialist being available forever.

PySpark earns its place when the logic goes beyond what SQL handles well: calling external APIs mid-pipeline, complex string manipulation, working with unstructured data such as nested JSON or images, machine learning feature engineering, or dynamic pipeline logic that depends on runtime conditions. In short, when you need programming, not querying.

The mistake many teams make — especially teams coming from the Hadoop era — is defaulting to PySpark for everything. They rewrite perfectly good SQL logic in Python because it feels more “engineering” and more “modern.” The result is pipelines that are harder to read, maintain, and hand over to the next team. This is not modernization. This is creating maintenance debt for no reason.

The Irony No One Talks About

Teradata shipped its first massively parallel processing system, the DBC/1012, in 1984. It ran SQL. It distributed data across nodes using hashing. It processed queries in parallel across multiple processors, each with its own storage. It scaled linearly as you added nodes.
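The hashing idea is simple enough to sketch in a few lines of plain Python. This is not Teradata's actual hash function, just an illustration of the principle: a stable hash of the distribution key decides which node owns each row, so co-located keys never need to move for joins or aggregations. Node count and key names are invented for the example.

```python
# Sketch of hash distribution, the core idea behind Teradata's MPP layout
# (and behind shuffle partitioning in Spark). Illustrative only — not
# Teradata's real hash function; node count and keys are made up.
import zlib

NUM_NODES = 4

def node_for(key: str) -> int:
    # A stable hash maps every key deterministically to one node
    return zlib.crc32(key.encode()) % NUM_NODES

rows = [("cust-001", 10), ("cust-002", 25), ("cust-001", 5)]
placement = {}
for key, amount in rows:
    placement.setdefault(node_for(key), []).append((key, amount))

# All rows sharing a key land on the same node, so a join or GROUP BY
# on that key runs locally, with no data movement between nodes.
nodes_for_cust_001 = {node_for(k) for k, _ in rows if k == "cust-001"}
assert len(nodes_for_cust_001) == 1
```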

Forty years later, when you write a SQL query in Databricks, Snowflake, or BigQuery, you are using a system that distributes data across nodes, processes queries in parallel across multiple processors, and scales by adding compute resources. The interface is SQL. The architecture is massively parallel.

The Hadoop world spent 15 years painfully reinventing what relational databases already did. First with MapReduce (parallel processing without SQL), then with Hive (SQL bolted onto MapReduce), then with Spark (faster parallel processing without SQL), then with Spark SQL (SQL bolted onto Spark). Each step brought the industry closer to what Teradata had in the 1980s — just running on cheaper commodity hardware instead of proprietary appliances.

The data warehouse architects were right all along. They just needed the hardware economics to catch up.

What This Means for Your Next Platform Decision

If you are evaluating Databricks, Snowflake, or any modern data platform, this history matters. Not as nostalgia, but as a practical guide:

Use SQL as your default for data transformation pipelines. It is more readable, more maintainable, and easier to audit — which matters enormously in regulated industries like banking and insurance. Reserve PySpark for the cases where SQL genuinely cannot do the job.

Do not let your team rewrite working SQL in PySpark because someone told them it is the modern way. The modern way is SQL. It always was.

And the next time someone tells you a new framework will replace SQL, remember: they said that in 2006, too. It took the industry 15 years and billions of dollars to find its way back.

Trends in tech are like fashion. They repeat every 20 years, just with new branding.

I’m Roland Wenzlofsky. I help organizations migrate from Teradata and other legacy platforms to Snowflake, Databricks, and BigQuery — without the surprises described above. 20+ years in enterprise data warehousing for European banks and telcos, no vendor partnerships, no platform bias. If you’re evaluating, planning, or stuck in a migration, let’s talk: [email protected] | dwhpro.com





