After twenty years of building enterprise data warehouses, I have learned that data is never evenly distributed. Customer tables are dominated by one country. Transaction tables are dominated by one product line. Event tables are dominated by one event type.
This is skew. And skew is what separates a data layout strategy that works in theory from one that works in production.
Databricks offers two approaches to organising data within Delta Lake tables: Z-ordering and Liquid Clustering. Both aim to improve query performance through data skipping — the ability to avoid reading files that cannot contain relevant rows. Both sound promising in documentation. But when your data is skewed, one of them quietly becomes useless.
This article explains why, with illustrations that make the mechanism visible.
The setup: a skewed dataset
Consider a transaction table with five countries: US, DE, ES, IT, and FR. Eighty percent of the records come from the United States. The remaining twenty percent is split evenly among the other four countries.
This is not an unusual distribution. In most enterprise datasets I have worked with, a single value dominates at least one important column.
What Z-ordering does with skewed data
Z-ordering uses a Z-order space-filling curve to sort data across multiple columns simultaneously. When you run OPTIMIZE my_table ZORDER BY (country, date), the engine reads all data, interleaves the bits of the country and date values using the Z-curve, sorts the records, and writes them into new files.
The critical property of the Z-curve is that it treats all values uniformly. It does not know or care that US appears eighty times more often than FR. It zigzags through the two-dimensional space in a fixed pattern.
The result is visible in the illustration above. Because US records are everywhere, every file contains a mix of all countries. When the query engine checks whether a file could contain German records, it looks at the min/max statistics and concludes: yes, this file could contain DE rows. For every single file. No files can be skipped.
What Liquid Clustering does differently
Liquid Clustering uses a Hilbert space-filling curve instead of a Z-order curve. The Hilbert curve has a property that matters enormously for skewed data: better locality preservation. Points that are close together in multi-dimensional space stay closer together in the one-dimensional ordering.
In practical terms, the Hilbert curve spends more time in dense regions before moving to sparse ones. When eighty percent of the data sits in the US row, the curve fills that row first, producing files that contain only US records. Then it visits the sparse country rows and groups them tightly together.
Three files contain only US records. One file contains DE and ES. One file contains IT and FR. Each file has a tight min/max range for the country column. When the query engine checks whether a file could contain German records, it skips four out of five files instantly.
The skew advantage: both majority and minority queries benefit
Here is the insight that most explanations miss. Liquid Clustering does not merely help queries on minority values. It helps queries on the majority value as well.
Query 1: WHERE country = ‘DE’ (minority)
Query 2: WHERE country = ‘US’ (majority — the skewed value)
Query 3: WHERE country = ‘US’ AND date = ‘March’ (filtered majority)
The maintenance difference
There is a second dimension to this comparison that matters in production: what happens when new data arrives.
Z-ordering is a static, full-table operation. The moment new data lands — which, in a skewed dataset, will overwhelmingly be more US records — the Z-ordering starts degrading. To restore it, you must re-run OPTIMIZE on the entire table. On a large table, this is expensive. On a skewed table, it is expensive and produces the same ineffective result as before.
Liquid Clustering works incrementally. When new US records arrive, the engine merges them into the appropriate cluster, rewriting only the affected files. The minority files are never touched. The clustering remains tight without a full-table rewrite, without manual scheduling, and without the associated compute cost.
Summary
| Z-ordering | Liquid Clustering | |
|---|---|---|
| Curve type | Z-order (uniform) | Hilbert (locality-preserving) |
| Skew handling | Treats all values equally — dominant value contaminates every file | Adapts to distribution — dominant value gets dedicated files |
| Minority query | All files scanned | Most files skipped |
| Majority query | All files scanned | Minority files skipped |
| Maintenance | Manual full-table OPTIMIZE | Automatic incremental |
| New data | Degrades immediately | Merges incrementally |
| Syntax | OPTIMIZE t ZORDER BY (c) |
ALTER TABLE t CLUSTER BY (c) |
The bottom line: Z-ordering was designed for a world where data is reasonably distributed. Liquid Clustering was designed for the world we actually live in.
Related Services
🏗️ Planning a Data Platform Migration?
Architecture-first approach: we design before a single line of code is written. Zero data loss across every migration delivered.
Our Migration Services →