Why Z-Ordering Fails on Skewed Data — and Liquid Clustering Does Not

After twenty years of building enterprise data warehouses, I have learned that data is never evenly distributed. Customer tables are dominated by one country. Transaction tables are dominated by one product line. Event tables are dominated by one event type.

This is skew. And skew is what separates a data layout strategy that works in theory from one that works in production.

Databricks offers two approaches to organising data within Delta Lake tables: Z-ordering and Liquid Clustering. Both aim to improve query performance through data skipping — the ability to avoid reading files that cannot contain relevant rows. Both sound promising in documentation. But when your data is skewed, one of them quietly becomes useless.

This article explains why, with illustrations that make the mechanism visible.

The setup: a skewed dataset

Consider a transaction table with five countries: US, DE, ES, IT, and FR. Eighty percent of the records come from the United States. The remaining twenty percent is split evenly among the other four countries.

This is not an unusual distribution. In most enterprise datasets I have worked with, a single value dominates at least one important column.

Raw data: 80% US, 5% each DE, ES, IT, FR
Each strip represents records for that country across the date range
US

DE

ES

IT

FR

US (80%)
DE (5%)
ES (5%)
IT (5%)
FR (5%)

The US row is packed. The other rows are sparse. This is skew.

What Z-ordering does with skewed data

Z-ordering uses a Z-order space-filling curve to sort data across multiple columns simultaneously. When you run OPTIMIZE my_table ZORDER BY (country, date), the engine reads all data, interleaves the bits of the country and date values using the Z-curve, sorts the records, and writes them into new files.

The critical property of the Z-curve is that it treats all values uniformly. It does not know or care that US appears eighty times more often than FR. It zigzags through the two-dimensional space in a fixed pattern.

After Z-ordering: US records contaminate every file
The Z-curve assigns records to files uniformly. Every file contains a mix.
File 1
min: DE max: US

File 2
min: DE max: US

File 3
min: DE max: US

File 4
min: DE max: US

File 5
min: DE max: US

Every file’s min/max for country spans DE to US. No file can be skipped for any country query.

The result is visible in the illustration above. Because US records are everywhere, every file contains a mix of all countries. When the query engine checks whether a file could contain German records, it looks at the min/max statistics and concludes: yes, this file could contain DE rows. For every single file. No files can be skipped.

What Liquid Clustering does differently

Liquid Clustering uses a Hilbert space-filling curve instead of a Z-order curve. The Hilbert curve has a property that matters enormously for skewed data: better locality preservation. Points that are close together in multi-dimensional space stay closer together in the one-dimensional ordering.

In practical terms, the Hilbert curve spends more time in dense regions before moving to sparse ones. When eighty percent of the data sits in the US row, the curve fills that row first, producing files that contain only US records. Then it visits the sparse country rows and groups them tightly together.

After Liquid Clustering: each file has a tight country range
The Hilbert curve adapts to the distribution. US gets dedicated files. Minorities grouped.
File 1
US only
File 2
US only
File 3
US only
File 4
DE + ES
File 5
IT + FR
Three files contain only US. Two files group the minority countries tightly. Every file has a narrow min/max range.

Three files contain only US records. One file contains DE and ES. One file contains IT and FR. Each file has a tight min/max range for the country column. When the query engine checks whether a file could contain German records, it skips four out of five files instantly.

Z-ordering was designed for a world where data is reasonably distributed. Liquid Clustering was designed for the world we actually live in.

The skew advantage: both majority and minority queries benefit

Here is the insight that most explanations miss. Liquid Clustering does not merely help queries on minority values. It helps queries on the majority value as well.

Three queries, same data, side by side
Z-ordering always scans everything. Liquid Clustering skips files on every query — even for the dominant value.

Query 1: WHERE country = ‘DE’ (minority)

Z-ordering
Scan
Scan
Scan
Scan
Scan
5/5 scanned

Liquid Clustering
Skip
Skip
Skip
Scan
Skip
1/5 scanned

Query 2: WHERE country = ‘US’ (majority — the skewed value)

Z-ordering
Scan
Scan
Scan
Scan
Scan
5/5 scanned

Liquid Clustering
Scan
Scan
Scan
Skip
Skip
3/5 scanned

Query 3: WHERE country = ‘US’ AND date = ‘March’ (filtered majority)

Z-ordering
Scan
Scan
Scan
Scan
Scan
5/5 scanned

Liquid Clustering
Skip
Scan
Skip
Skip
Skip
1/5 scanned

With Z-ordering and skew, data skipping is ineffective for every query. With Liquid Clustering, every query benefits — including queries on the dominant value.

The maintenance difference

There is a second dimension to this comparison that matters in production: what happens when new data arrives.

Z-ordering
500 new US records arrive
Land as small unsorted files
Z-order immediately degrades
Must re-run OPTIMIZE on entire table
Expensive. Manual. Same mixed result.

Liquid Clustering
500 new US records arrive
Engine detects: all US records
Merges with existing US files only
Rewrites 1-2 files. Others untouched.
Cheap. Automatic. Clustering stays tight.

Z-ordering is a static, full-table operation. The moment new data lands — which, in a skewed dataset, will overwhelmingly be more US records — the Z-ordering starts degrading. To restore it, you must re-run OPTIMIZE on the entire table. On a large table, this is expensive. On a skewed table, it is expensive and produces the same ineffective result as before.

Liquid Clustering works incrementally. When new US records arrive, the engine merges them into the appropriate cluster, rewriting only the affected files. The minority files are never touched. The clustering remains tight without a full-table rewrite, without manual scheduling, and without the associated compute cost.

Summary

Z-ordering Liquid Clustering
Curve type Z-order (uniform) Hilbert (locality-preserving)
Skew handling Treats all values equally — dominant value contaminates every file Adapts to distribution — dominant value gets dedicated files
Minority query All files scanned Most files skipped
Majority query All files scanned Minority files skipped
Maintenance Manual full-table OPTIMIZE Automatic incremental
New data Degrades immediately Merges incrementally
Syntax OPTIMIZE t ZORDER BY (c) ALTER TABLE t CLUSTER BY (c)

The bottom line: Z-ordering was designed for a world where data is reasonably distributed. Liquid Clustering was designed for the world we actually live in.

📚 Want more like this?

Join 4,000+ data warehouse professionals who get practical insights on Teradata, Snowflake, and Databricks delivered to their inbox.



No spam. Unsubscribe at any time.

Related Services

🏗️ Planning a Data Platform Migration?

Architecture-first approach: we design before a single line of code is written. Zero data loss across every migration delivered.

Our Migration Services →

📊 Data Platform Migration Survey

Help us map where the industry is heading. Results are public — see what others chose.

1. What is your current data platform?

2. Where are you migrating to (or evaluating)?

Migrating FROM
Migrating TO

Thanks for voting! Share this with your network.

Follow me on LinkedIn for daily insights on data warehousing and platform migrations.

Stay Ahead in Data Warehousing

Get expert insights on Teradata, Snowflake, BigQuery, Databricks, Microsoft Fabric, and modern data architecture — delivered to your inbox.

Leave a Comment

DWHPro

Expert network for enterprise data platforms. Senior consultants, project teams built for your challenge — across Teradata, Snowflake, Databricks, and more.

📍Vienna, Austria & Miami, Florida

Quick Links
Services Team Teradata Book Blog Contact Us
Connect
LinkedIn → [email protected]
Newsletter

Join 4,000+ data professionals.
Weekly insights on Teradata, Snowflake & data architecture.