You Migrated From Teradata To Spark And Threw Away The One Thing That Made It Fast

If you have spent any amount of time working with Teradata, you know that the Primary Index is one of the most important design decisions you make. It determines how data is distributed across AMPs and whether your joins are fast or slow. Choosing the wrong Primary Index is one of the most common causes of poor query performance in Teradata. Choosing the right one is one of the most valuable skills a data warehouse professional can have.

When organizations migrate from Teradata to Apache Spark — whether on Databricks, Microsoft Fabric, or any other platform — a fundamental change occurs. The concept of permanent data distribution disappears. Not because it is no longer needed, but because the platform does not enforce it.

The result, in my experience, is that most Spark tables in production are the architectural equivalent of Teradata NOPI tables. And the consequences only become visible when the accumulated compute costs are compared against the performance expectations set during the migration.

Table of Contents

What the Primary Index Actually Does

For readers unfamiliar with Teradata, a brief explanation is in order.

When you create a table in Teradata with a Primary Index on customer_id, the system hashes every value of customer_id and assigns each row to a specific AMP (Access Module Processor). An AMP is a unit of parallel processing, each with its own storage and CPU. Rows with the same customer_id always reside on the same AMP.

When you join two tables that share the same Primary Index column, the join is AMP-local. AMP 3 joins its local customers with its local orders. No data moves across the interconnect, no redistribution is required, and the join is fast.

When the Primary Index columns do not match, Teradata must redistribute one or both tables. You see this in the EXPLAIN plan as: “We do an all-AMPs redistribute step.” This is expensive, and experienced Teradata professionals design their data models specifically to minimize redistributions.

The key point is this: while Teradata allows tables without a Primary Index (NOPI tables), these exist for specific purposes, such as fast bulk loading. In production analytical environments, experienced Teradata professionals define a Primary Index on virtually every table because the performance benefits of controlled data distribution are too significant to ignore. The system does not force this decision, but decades of practice have made it a standard that is rarely questioned.

What Happens in Spark

Spark follows the same parallel processing principle. Work is distributed across executors, each with its own memory and CPU cores. But there is one critical difference: Spark does not enforce any permanent data distribution by default.

When you write a table in Spark, the data lands wherever it happens to land. There is no hash distribution. There is no guaranteed placement. When you join two tables, Spark must first determine where the matching rows are. In most cases, they are scattered randomly across executors.

To perform the join, Spark executes a shuffle. It hashes every join key from both tables and sends each row across the network to the executor responsible for that hash value. Once matching rows are co-located, the join proceeds locally. This is exactly what Teradata does in a redistribution step — the same operation, the same cost, the same performance impact.

The difference is that in Teradata, a redistribution step is a warning sign. It means your PI choice was suboptimal. In Spark, a shuffle is the default. It happens on virtually every join, every aggregation, every operation that requires matching data across the cluster.

The NOPI Parallel

In Teradata, there is a specific table type called NOPI — No Primary Index. A NOPI table has no hash distribution. Rows are appended in arrival order without any placement logic. NOPI tables exist for a specific purpose: fast bulk loading via FastLoad or TPT. They are staging tables. No experienced Teradata professional would run production analytical queries against a NOPI table, because the performance would be unacceptable. Every join against a NOPI table requires a full redistribution.

In Spark, the default state of every table is essentially NOPI. There is no hash distribution, no guaranteed co-location of matching keys, and data lands wherever it lands. And yet, unlike in Teradata, where NOPI is a conscious decision for a specific staging purpose, in Spark, this is the default for everything — including production analytical tables that are joined thousands of times a day.

The platform compensates with brute force. Spark has massive memory, fast networks, and elastic clusters that can scale on demand. The shuffle happens on every join, but the hardware absorbs it. In Teradata, you would never accept this because the hardware is fixed and expensive. In the cloud, people add more resources and consider the problem solved — until the accumulated compute costs exceed the cost of the Teradata system that was replaced.

Bucketing: The PI Equivalent That Nobody Uses

Spark does have a mechanism for permanent hash distribution. It is called bucketing.

When you write a table with bucketing, Spark hashes the specified column and distributes rows into a fixed number of files — called buckets. If you bucket both customers and orders on customer_id with the same number of buckets, customer 4711 and all orders for customer 4711 end up in the same bucket file in both tables. When you join them, Spark matches bucket 0 with bucket 0, bucket 1 with bucket 1, and so on. Each executor reads one pair of matching buckets and joins them locally, without any shuffle or data movement across the network.

This is functionally identical to Teradata’s Primary Index. The hash function distributes data. Matching hash values are co-located. Joins are local.

The problem is that almost nobody uses it. In my experience, most Spark developers have never written a bucketed table. Bucketing is not the default. It requires explicit configuration at write time. It requires both sides of a join to use the same column and the same number of buckets. And it requires discipline to maintain across every insert and update. Most teams skip it entirely and accept the shuffle overhead.

Partitioning: The Same Word, Different Meaning

A common source of confusion is that both Teradata and Spark use the term “partitioning,” but it means different things in each.

In Teradata, Partitioned Primary Index (PPI) organizes data within each AMP into logical partitions, typically by date range. When you filter on the partition column, Teradata eliminates the partitions it does not need. This is partition elimination, and it operates on top of the PI distribution. The PI determines which AMP a row lives on. The PPI determines which partition within that AMP.

In Spark, partitioning is physical. When you partition a table by year, Spark creates separate directories on the file system for each value:

/table/year=2022/file1.parquet, file2.parquet
/table/year=2023/file1.parquet, file2.parquet
/table/year=2024/file1.parquet, file2.parquet

When you filter on WHERE year = 2024, Spark only reads the files in the year=2024 directory. It never opens the other directories. This is efficient and comparable to Teradata’s partition elimination.

But Spark partitioning does not address data distribution for joins. Within each partition directory, the data is not hash-distributed. If you partition by year and join on customer_id, Spark still shuffles. Partitioning and bucketing address different problems. In Teradata, PI and PPI are integrated into a single design. In Spark, they are separate mechanisms that must be consciously combined.

There is also a critical practical difference in granularity. In Teradata, you can partition by day across ten years and have 3,650 partitions without issues. In Spark, each partition is a physical directory containing files. Millions of partitions mean millions of tiny files, which destroys Spark performance. This is known as the “small files problem” and has no equivalent in Teradata.

What This Means for Migrations

When organizations migrate from Teradata to Spark-based platforms, they typically use automated migration tools that convert DDL, translate SQL, and move data. What these tools do not do is replicate the data distribution strategy.

A Teradata table with PRIMARY INDEX (customer_id) becomes a Spark table with no distribution. The data is loaded into Parquet or Delta Lake files without bucketing. Every AMP-local query on Teradata now triggers a full shuffle on Spark. The queries get slower, people add more compute, costs increase, and the conclusion is that “the new platform is not as fast as Teradata” — when in reality, nobody replicated the architecture that made Teradata fast in the first place.

In my experience, this is one of the most common and most expensive mistakes in data platform migrations. The tools change, but the fundamentals do not. Data distribution determines join performance on every parallel processing system, whether it was built in 1984 or 2024, and ignoring this because the new platform does not enforce it is not a simplification but a regression.

The Broader Pattern

This issue is a specific instance of a broader pattern I have observed repeatedly over the past twenty years. The data warehouse industry tends to adopt new platforms without transferring the architectural knowledge that made the old platforms work. The assumption is that the new platform is “smart enough” to compensate. Sometimes it is, at small scale. At enterprise scale, the physics of distributed data processing reassert themselves every time.

Anyone who understands data distribution, join strategies, and partition elimination on Teradata can learn Spark, Databricks, or Fabric in a reasonable time, because the fundamentals do not change even when the tools do. But the reverse is rarely true. A Spark developer who has never worked with a system that enforced data distribution may never think about it at all, and may never understand why the queries are slow and the costs are high.

The solution is not to mourn the Primary Index. It is to bring the same level of architectural thinking to the new platform: use bucketing where joins are critical, partition where filters are selective, and understand the cost of a shuffle the way one once understood the cost of a redistribution. The vocabulary changes. The discipline must not.

Roland Wenzlofsky is the founder of DWHPro, a network of senior data warehouse consultants who deliver end-to-end solutions across Teradata, Snowflake, and Databricks. No handovers. No fragmentation. He is the author of “Teradata Query Performance Tuning” and writes regularly about what works in data warehousing and what does not.

Related Services

⚡ Need Help Optimizing Your Data Platform?

We cut data platform costs by 30–60% without hardware changes. 25+ years of hands-on tuning experience.

Explore Our Services →

🏗️ Planning a Data Platform Migration?

Architecture-first approach: we design before a single line of code is written. Zero data loss across every migration delivered.

Our Migration Services →

You Migrated From Teradata to Spark and Threw Away the One Thing That Made It Fast