Why Z-Ordering Fails On Skewed Data — And Liquid Clustering Does Not

After twenty years of building enterprise data warehouses, I have learned that data is never evenly distributed. Customer tables are dominated by one country. Transaction tables are dominated by one product line. Event tables are dominated by one event type.

This is skew. And skew is what separates a data layout strategy that works in theory from one that works in production.

Databricks offers two approaches to organising data within Delta Lake tables: Z-ordering and Liquid Clustering. Both aim to improve query performance through data skipping — the ability to avoid reading files that cannot contain relevant rows. Both sound promising in documentation. But when your data is skewed, one of them quietly becomes useless.

This article explains why, with illustrations that make the mechanism visible.

Table of Contents

The setup: a skewed dataset

Consider a transaction table with five countries: US, DE, ES, IT, and FR. Eighty percent of the records come from the United States. The remaining twenty percent is split evenly among the other four countries.

This is not an unusual distribution. In most enterprise datasets I have worked with, a single value dominates at least one important column.

Raw data: 80% US, 5% each DE, ES, IT, FR

Each strip represents records for that country across the date range

US (80%)

DE (5%)

ES (5%)

IT (5%)

FR (5%)

The US row is packed. The other rows are sparse. This is skew.

What Z-ordering does with skewed data

Z-ordering uses a Z-order space-filling curve to sort data across multiple columns simultaneously. When you run OPTIMIZE my_table ZORDER BY (country, date), the engine reads all data, interleaves the bits of the country and date values using the Z-curve, sorts the records, and writes them into new files.

The critical property of the Z-curve is that it treats all values uniformly. It does not know or care that US appears eighty times more often than FR. It zigzags through the two-dimensional space in a fixed pattern.

After Z-ordering: US records contaminate every file

The Z-curve assigns records to files uniformly. Every file contains a mix.

File 1

min: DE max: US

File 2

min: DE max: US

File 3

min: DE max: US

File 4

min: DE max: US

File 5

min: DE max: US

Every file’s min/max for country spans DE to US. No file can be skipped for any country query.

The result is visible in the illustration above. Because US records are everywhere, every file contains a mix of all countries. When the query engine checks whether a file could contain German records, it looks at the min/max statistics and concludes: yes, this file could contain DE rows. For every single file. No files can be skipped.

What Liquid Clustering does differently

Liquid Clustering uses a Hilbert space-filling curve instead of a Z-order curve. The Hilbert curve has a property that matters enormously for skewed data: better locality preservation. Points that are close together in multi-dimensional space stay closer together in the one-dimensional ordering.

In practical terms, the Hilbert curve spends more time in dense regions before moving to sparse ones. When eighty percent of the data sits in the US row, the curve fills that row first, producing files that contain only US records. Then it visits the sparse country rows and groups them tightly together.

After Liquid Clustering: each file has a tight country range

The Hilbert curve adapts to the distribution. US gets dedicated files. Minorities grouped.

File 1

US only

File 2

US only

File 3

US only

File 4

DE + ES

File 5

IT + FR

Three files contain only US. Two files group the minority countries tightly. Every file has a narrow min/max range.

Three files contain only US records. One file contains DE and ES. One file contains IT and FR. Each file has a tight min/max range for the country column. When the query engine checks whether a file could contain German records, it skips four out of five files instantly.

Z-ordering was designed for a world where data is reasonably distributed. Liquid Clustering was designed for the world we actually live in.

The skew advantage: both majority and minority queries benefit

Here is the insight that most explanations miss. Liquid Clustering does not merely help queries on minority values. It helps queries on the majority value as well.

Three queries, same data, side by side

Z-ordering always scans everything. Liquid Clustering skips files on every query — even for the dominant value.

Query 1: WHERE country = ‘DE’ (minority)

Z-ordering

Scan

5/5 scanned

Liquid Clustering

Skip

Scan

Skip

1/5 scanned

Query 2: WHERE country = ‘US’ (majority — the skewed value)

Z-ordering

Scan

5/5 scanned

Liquid Clustering

Scan

Skip

3/5 scanned

Query 3: WHERE country = ‘US’ AND date = ‘March’ (filtered majority)

Z-ordering

Scan

5/5 scanned

Liquid Clustering

Skip

Scan

Skip

1/5 scanned

With Z-ordering and skew, data skipping is ineffective for every query. With Liquid Clustering, every query benefits — including queries on the dominant value.

The maintenance difference

There is a second dimension to this comparison that matters in production: what happens when new data arrives.

Z-ordering

500 new US records arrive

Land as small unsorted files

↓

Z-order immediately degrades

↓

Must re-run OPTIMIZE on entire table

↓

Expensive. Manual. Same mixed result.

Liquid Clustering

500 new US records arrive

Engine detects: all US records

↓

Merges with existing US files only

↓

Rewrites 1-2 files. Others untouched.

↓

Cheap. Automatic. Clustering stays tight.

Z-ordering is a static, full-table operation. The moment new data lands — which, in a skewed dataset, will overwhelmingly be more US records — the Z-ordering starts degrading. To restore it, you must re-run OPTIMIZE on the entire table. On a large table, this is expensive. On a skewed table, it is expensive and produces the same ineffective result as before.

Liquid Clustering works incrementally. When new US records arrive, the engine merges them into the appropriate cluster, rewriting only the affected files. The minority files are never touched. The clustering remains tight without a full-table rewrite, without manual scheduling, and without the associated compute cost.

Summary

	Z-ordering	Liquid Clustering
Curve type	Z-order (uniform)	Hilbert (locality-preserving)
Skew handling	Treats all values equally — dominant value contaminates every file	Adapts to distribution — dominant value gets dedicated files
Minority query	All files scanned	Most files skipped
Majority query	All files scanned	Minority files skipped
Maintenance	Manual full-table OPTIMIZE	Automatic incremental
New data	Degrades immediately	Merges incrementally
Syntax	`OPTIMIZE t ZORDER BY (c)`	`ALTER TABLE t CLUSTER BY (c)`

The bottom line: Z-ordering was designed for a world where data is reasonably distributed. Liquid Clustering was designed for the world we actually live in.

Related Services

🏗️ Planning a Data Platform Migration?

Architecture-first approach: we design before a single line of code is written. Zero data loss across every migration delivered.

Our Migration Services →

Why Z-Ordering Fails on Skewed Data — and Liquid Clustering Does Not

The setup: a skewed dataset

What Z-ordering does with skewed data

What Liquid Clustering does differently

The skew advantage: both majority and minority queries benefit

Query 1: WHERE country = ‘DE’ (minority)

Query 2: WHERE country = ‘US’ (majority — the skewed value)

Query 3: WHERE country = ‘US’ AND date = ‘March’ (filtered majority)

The maintenance difference

Summary

📊 Data Platform Migration Survey

Stay Ahead in Data Warehousing

Leave a Comment Cancel reply

The setup: a skewed dataset

What Z-ordering does with skewed data

What Liquid Clustering does differently

The skew advantage: both majority and minority queries benefit

Query 1: WHERE country = ‘DE’ (minority)

Query 2: WHERE country = ‘US’ (majority — the skewed value)

Query 3: WHERE country = ‘US’ AND date = ‘March’ (filtered majority)

The maintenance difference

Summary

📚 Want more like this?

📊 Data Platform Migration Survey

Stay Ahead in Data Warehousing

Leave a Comment Cancel reply