Codecs and Pipelines: v2 Compressors vs v3 Codec Chains
Mix.install([
{:ex_zarr, path: Path.join(__DIR__, "../..")},
{:kino, "~> 0.13"}
])
Introduction
Compression is critical for Zarr arrays — it reduces storage costs, speeds up I/O, and enables working with larger datasets. Zarr v2 uses simple compressors. Zarr v3 introduces codec pipelines: composable, multi-stage transformations. This livebook explores both approaches, when to use each codec, and how to measure compression effectiveness.
What you’ll learn:
- Zarr v2 compressor model (single-stage)
- Zarr v3 codec pipelines (multi-stage)
- Common codecs: gzip, zstd, zlib, blosc
- Compression ratio vs speed trade-offs
- Choosing codecs for different data types
- Measuring compression effectiveness
Core principle: Chunks compress independently. This enables parallel compression, but also means codecs can’t exploit inter-chunk patterns.
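A minimal sketch of this principle, using Erlang's built-in :zlib module as a stand-in for ExZarr's own codecs. The ChunkDemo module and the chunk contents are made up for illustration: each chunk is compressed in its own task, and any single chunk can be decoded without touching the others.

```elixir
defmodule ChunkDemo do
  # Compress every chunk in its own task; no state is shared between chunks.
  def compress_all(chunks) do
    chunks
    |> Task.async_stream(&:zlib.gzip/1, ordered: true)
    |> Enum.map(fn {:ok, compressed} -> compressed end)
  end
end

# Four hypothetical chunks of 10_000 bytes each
chunks = for i <- 1..4, do: :binary.copy(<<i>>, 10_000)
compressed = ChunkDemo.compress_all(chunks)

# A single chunk decodes on its own, which is what keeps partial reads cheap.
third = compressed |> Enum.at(2) |> :zlib.gunzip()
IO.puts("chunk 3 restored: #{third == Enum.at(chunks, 2)}")
```

The flip side is visible here too: the four chunks are nearly identical, yet each one is compressed from scratch because no codec state carries over between them.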
Setup
alias ExZarr.Array
alias ExZarr.Gallery.{Pack, SampleData, Metrics}
Zarr v2: Simple Compressors
In Zarr v2, each chunk goes through a single compressor (the v2 spec also allows optional filters before it, which this livebook does not use):
graph LR
A[Raw Chunk Data] --> B[Compressor<br/>e.g. zstd, gzip]
B --> C[Compressed Chunk]
C --> D[Storage]
style A fill:#e1f5ff
style B fill:#fff9c4
style C fill:#c8e6c9
style D fill:#f3e5f5
Simple and effective: For most use cases, a single compressor is enough. v2 supports:
- gzip: Widely compatible, moderate speed
- zlib: Similar to gzip, slightly different format
- zstd: Faster, better ratios, modern choice
- blosc: Optimized for numerical data, multiple algorithms
- lz4: Extremely fast, lower compression ratio
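The "slightly different format" of gzip and zlib can be seen directly: both wrap the same DEFLATE algorithm but use different headers and checksums. The sketch below uses Erlang's :zlib module rather than ExZarr, and the sample payload is arbitrary.

```elixir
data = :binary.copy("zarr chunk ", 1_000)

# gzip container (RFC 1952) and zlib container (RFC 1950) around the same data
gzip_bytes = :zlib.gzip(data)
zlib_bytes = :zlib.compress(data)

<<gzip_magic::binary-size(2), _::binary>> = gzip_bytes
<<zlib_first, _::binary>> = zlib_bytes

IO.inspect(gzip_magic, label: "gzip magic", base: :hex)
IO.inspect(zlib_first, label: "zlib first byte", base: :hex)
IO.puts("gzip: #{byte_size(gzip_bytes)} bytes, zlib: #{byte_size(zlib_bytes)} bytes")

# Each container needs its matching decoder.
^data = :zlib.gunzip(gzip_bytes)
^data = :zlib.uncompress(zlib_bytes)
```

In practice this means the two are not interchangeable on read: a chunk written with one must be declared as such in the array metadata.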
Zarr v3: Codec Pipelines
Zarr v3 allows chaining codecs into pipelines:
graph LR
A[Raw Chunk Data] --> B[Codec 1<br/>Transpose]
B --> C[Codec 2<br/>ByteShuffle]
C --> D[Codec 3<br/>zstd]
D --> E[Compressed Chunk]
E --> F[Storage]
style A fill:#e1f5ff
style B fill:#fff4e1
style C fill:#ffe0b2
style D fill:#fff9c4
style E fill:#c8e6c9
style F fill:#f3e5f5
Why pipelines: Some transformations improve compression without being compressors themselves. For example:
- Transpose: Reorder dimensions to improve locality
- ByteShuffle: Separate bytes by position, helping compressors
- Delta encoding: Store differences instead of absolute values
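In v3 terms, a pipeline runs zero or more “array-to-array” codecs (for example transpose), then exactly one “array-to-bytes” codec that serializes the chunk, then zero or more “bytes-to-bytes” codecs such as a compressor. Decoding applies the inverse steps in reverse order.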
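The ordering of stages can be sketched without any Zarr library. The PipelineSketch module below is hypothetical and works on a list-of-lists chunk of integers: a transpose (array-to-array), a little-endian int32 serializer (array-to-bytes), and zlib from Erlang's standard library (bytes-to-bytes).

```elixir
defmodule PipelineSketch do
  # array-to-array: swap rows and columns of a list of lists
  def transpose(rows), do: Enum.zip_with(rows, & &1)

  # array-to-bytes: serialize integers as little-endian signed int32
  def to_bytes(rows) do
    for row <- rows, v <- row, into: <<>>, do: <<v::little-signed-32>>
  end

  # Inverse of to_bytes/1; row_len is the row length of the serialized array.
  def from_bytes(bin, row_len) do
    values = for <<v::little-signed-32 <- bin>>, do: v
    Enum.chunk_every(values, row_len)
  end

  def encode(rows) do
    rows |> transpose() |> to_bytes() |> :zlib.compress()
  end

  # Decoding undoes each stage in reverse order.
  # n_rows is the number of rows in the original (untransposed) chunk.
  def decode(bytes, n_rows) do
    bytes |> :zlib.uncompress() |> from_bytes(n_rows) |> transpose()
  end
end

chunk = for r <- 0..99, do: for(c <- 0..99, do: r * 100 + c)
encoded = PipelineSketch.encode(chunk)
IO.puts("encoded #{length(chunk) * 100 * 4} bytes into #{byte_size(encoded)}")
IO.puts("round trip ok: #{PipelineSketch.decode(encoded, length(chunk)) == chunk}")
```

A real v3 implementation reads the stage list from array metadata instead of hard-coding it, but the encode and reverse-decode shape is the same.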
Step 1: Create Arrays with Different Compressors (v2)
Let’s create the same logical array with different v2 compressors and compare.
compressors = [:zstd, :gzip, :zlib, :blosc]
arrays =
Enum.map(compressors, fn comp ->
{:ok, array} =
Array.create(
shape: {500, 500},
chunks: {100, 100},
dtype: :float32,
compressor: comp,
zarr_version: 2,
storage: :memory
)
{comp, array}
end)
IO.puts("Created #{length(arrays)} arrays with different compressors")
:ok
Same structure, different compression: All arrays have identical shape and chunks; only the compressor differs.
Step 2: Write Data and Measure Compression
We’ll write the same data to each array and measure compressed size.
# Generate test data: 500×500 matrix
data = SampleData.matrix(500, 500)
binary = Pack.pack(data, :float32)
uncompressed_size = byte_size(binary)
IO.puts("Uncompressed data size: #{div(uncompressed_size, 1024)} KB\n")
compression_results =
Enum.map(arrays, fn {comp, array} ->
# Write data
{_result, write_time_us} =
Metrics.time(fn ->
Array.set_slice(array, binary, start: {0, 0}, stop: {500, 500})
end)
# In-memory storage does not give us a compressed size here (Step 5 saves to disk for that),
# so we read the data back and time decompression instead
{_read_data, read_time_us} =
Metrics.time(fn ->
Array.get_slice(array, start: {0, 0}, stop: {500, 500})
end)
%{
compressor: comp,
write_time_us: write_time_us,
read_time_us: read_time_us
}
end)
IO.puts("Compression performance:\n")
Enum.each(compression_results, fn result ->
IO.puts("#{result.compressor}:")
IO.puts(" Write: #{Metrics.human_us(result.write_time_us)}")
IO.puts(" Read: #{Metrics.human_us(result.read_time_us)}")
IO.puts("")
end)
What we measured:
- Write time: Includes compression time
- Read time: Includes decompression time
Expected patterns:
- zstd: Fast, good compression
- gzip/zlib: Slower, decent compression
- blosc: Very fast, excellent for numerical data
- lz4 (not measured above): Fastest, lower compression ratio
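The timings above rely on the gallery's Metrics.time helper, whose implementation is not shown here. A comparable helper can be sketched with :timer.tc/1 from the Erlang standard library, assuming the helper returns {result, microseconds} as its usage suggests. TimingSketch is a hypothetical name.

```elixir
defmodule TimingSketch do
  # Returns {result, elapsed_microseconds}.
  # Note that :timer.tc/1 itself returns {microseconds, result}.
  def time(fun) when is_function(fun, 0) do
    {elapsed_us, result} = :timer.tc(fun)
    {result, elapsed_us}
  end
end

# 250_000 float32 values, about 1 MB
payload = :binary.copy(<<1.5::float-32-little>>, 250_000)

{compressed, compress_us} = TimingSketch.time(fn -> :zlib.gzip(payload) end)
{restored, decompress_us} = TimingSketch.time(fn -> :zlib.gunzip(compressed) end)

IO.puts("compress: #{compress_us} us, decompress: #{decompress_us} us")
IO.puts("round trip ok: #{restored == payload}")
```

Single measurements like these are noisy; repeating each call several times and taking the median gives steadier numbers.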
Step 3: Understanding Blosc
Blosc is special: it’s a meta-compressor that combines:
- Blocking: Splits data into blocks for cache efficiency
- Shuffling: Reorders bytes to improve compression
- Compression: Uses an underlying codec (lz4, zstd, etc.)
graph TD
A[Raw Data] --> B[Blosc Meta-Compressor]
B --> C[Split into Blocks]
C --> D[Byte Shuffle Each Block]
D --> E[Compress with LZ4/zstd]
E --> F[Compressed Blocks]
style A fill:#e1f5ff
style B fill:#f3e5f5
style F fill:#c8e6c9
Why Blosc works well: Numerical arrays have structure (nearby values are similar). Shuffling bytes by position (all high bytes together, all low bytes together) creates long runs that compress better.
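The shuffling idea can be sketched in a few lines. This is a plain byte shuffle written for illustration, not Blosc's actual implementation (which works per block and is heavily optimized). ShuffleSketch is a hypothetical module, and the input length is assumed to be a multiple of the item size.

```elixir
defmodule ShuffleSketch do
  # Group byte 0 of every element, then byte 1, and so on.
  def shuffle(bin, item_size) do
    elements = for <<el::binary-size(item_size) <- bin>>, do: el

    for pos <- 0..(item_size - 1)//1, el <- elements, into: <<>> do
      binary_part(el, pos, 1)
    end
  end

  # Inverse: split into item_size planes and interleave them again.
  def unshuffle(<<>>, _item_size), do: <<>>

  def unshuffle(bin, item_size) do
    count = div(byte_size(bin), item_size)
    planes = for <<plane::binary-size(count) <- bin>>, do: plane

    for i <- 0..(count - 1)//1, plane <- planes, into: <<>> do
      binary_part(plane, i, 1)
    end
  end
end

# 10_000 sequential little-endian int32 values
ints = for i <- 0..9_999, into: <<>>, do: <<i::little-32>>
shuffled = ShuffleSketch.shuffle(ints, 4)

IO.puts("plain:    #{byte_size(:zlib.compress(ints))} bytes")
IO.puts("shuffled: #{byte_size(:zlib.compress(shuffled))} bytes")
```

After shuffling, the two high-byte planes of these integers are all zeros and the low-byte plane repeats every 256 values, which is the kind of regularity a general-purpose compressor can exploit.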
Step 4: Zarr v3 Codec Pipelines
Let’s create a v3 array. An explicit codec pipeline for this data could chain the stages below; this step configures only the final zstd stage:
- Transpose (if beneficial)
- ByteShuffle
- zstd compression
# For this example, we'll create a simple v3 array with zstd
# Full pipeline support depends on ExZarr codec implementation
{:ok, array_v3} =
Array.create(
shape: {500, 500},
chunks: {100, 100},
dtype: :float32,
compressor: :zstd,
zarr_version: 3,
storage: :memory
)
# Write the same data
Array.set_slice(array_v3, binary, start: {0, 0}, stop: {500, 500})
IO.puts("Created v3 array with a zstd codec")
v3 advantage: Explicit codec specification in metadata makes the transformation pipeline transparent. You can see exactly what operations are applied.
Future extensibility: v3’s codec model allows custom codecs without format changes. Want a domain-specific transformation? Implement it as a codec and register it.
Step 5: Comparing Compression Ratios
Let’s save arrays to disk and measure actual compressed sizes.
base_dir = Path.join(System.tmp_dir!(), "exzarr_codec_comparison")
File.rm_rf!(base_dir)
File.mkdir_p!(base_dir)
size_results =
Enum.map(arrays, fn {comp, array} ->
path = Path.join(base_dir, "array_#{comp}")
File.mkdir_p!(path)
Array.save(array, path: path)
# Measure directory size (metadata + chunks); du reports allocated disk blocks in KB, so small files round up
{size_output, 0} = System.cmd("du", ["-sk", path])
size_kb = size_output |> String.trim() |> String.split("\t") |> hd() |> String.to_integer()
%{compressor: comp, size_kb: size_kb}
end)
IO.puts("Compressed sizes on disk:\n")
Enum.each(size_results, fn result ->
ratio = Float.round(uncompressed_size / 1024 / result.size_kb, 2)
IO.puts("#{result.compressor}: #{result.size_kb} KB (#{ratio}x compression)")
end)
Compression ratio: Original size / compressed size. Higher is better (more compression).
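Where an exact byte count is preferred, or du is unavailable, the file sizes can be summed directly. The sketch below is self-contained and uses a throwaway directory with made-up file names; DirSize is a hypothetical helper, not part of ExZarr.

```elixir
defmodule DirSize do
  # Sum the byte sizes of all regular files below dir (dotfiles included).
  def bytes(dir) do
    dir
    |> Path.join("**/*")
    |> Path.wildcard(match_dot: true)
    |> Enum.filter(&File.regular?/1)
    |> Enum.map(fn path -> File.stat!(path).size end)
    |> Enum.sum()
  end

  # Original size divided by compressed size, rounded to two decimals.
  def ratio(uncompressed_bytes, compressed_bytes) when compressed_bytes > 0 do
    Float.round(uncompressed_bytes / compressed_bytes, 2)
  end
end

dir = Path.join(System.tmp_dir!(), "dir_size_sketch_#{System.unique_integer([:positive])}")
File.mkdir_p!(Path.join(dir, "nested"))

# One small metadata file and one compressed "chunk" of 40_000 zero bytes
File.write!(Path.join(dir, "zarr.json"), "{}")
File.write!(Path.join([dir, "nested", "0.0"]), :zlib.gzip(:binary.copy(<<0>>, 40_000)))

total = DirSize.bytes(dir)
IO.puts("#{total} bytes on disk, ratio #{DirSize.ratio(40_000, total)}x")
File.rm_rf!(dir)
```

The guard on ratio/2 avoids a division by zero when a directory turns out to be empty.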
Real-world ratios depend on:
- Data type (integers compress better than random floats)
- Data patterns (repeated values, smooth gradients)
- Chunk size (larger chunks can find more patterns)
- Compressor settings (level, window size)
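The effect of the compression level can be explored with Erlang's :zlib module, used here as a stand-in because its level parameter behaves similarly in spirit to zstd's (the actual numbers will differ). LevelSketch is a hypothetical helper and the input data is synthetic.

```elixir
defmodule LevelSketch do
  # Compress data as a zlib stream at the given level
  # (0 = store without compressing, 9 = slowest, usually smallest).
  def deflate(data, level) when level in 0..9 do
    z = :zlib.open()

    try do
      :ok = :zlib.deflateInit(z, level)
      compressed = :zlib.deflate(z, data, :finish)
      :ok = :zlib.deflateEnd(z)
      IO.iodata_to_binary(compressed)
    after
      :zlib.close(z)
    end
  end
end

# 50_000 int32 values that repeat every 500 elements
data = for i <- 0..49_999, into: <<>>, do: <<rem(i, 500)::little-32>>

for level <- [0, 1, 6, 9] do
  {us, out} = :timer.tc(fn -> LevelSketch.deflate(data, level) end)
  IO.puts("level #{level}: #{byte_size(out)} bytes in #{us} us")
end
```

Higher levels typically cost more CPU on write while decompression speed changes little, which is why high levels suit write-once, read-many datasets.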
Step 6: When to Use Each Codec
graph TD
A[Choosing a Codec] --> B{Priority?}
B -->|Speed| C[lz4 or blosc:lz4]
B -->|Compression| D[zstd level 9+]
B -->|Compatibility| E[gzip]
B -->|Numerical Data| F[blosc:zstd]
style A fill:#e1f5ff
style C fill:#c8e6c9
style D fill:#fff9c4
style E fill:#f3e5f5
style F fill:#ffe0b2
Decision guide:
- zstd (default): Best general-purpose choice. Fast, good compression, widely supported.
- blosc: Excellent for numerical arrays (satellite data, simulation results). Use with zstd or lz4 backend.
- gzip: Maximum compatibility. Slower, but readable by any tool.
- zlib: Similar to gzip, slight differences in format.
- lz4: When speed is critical and compression ratio is secondary (e.g., temporary intermediate data).
For Zarr v3: Start with zstd. Add byte shuffle if working with numerical data. Consider transpose for multidimensional arrays with non-uniform access patterns.
Step 7: Codec Choice by Data Type
Different data types benefit from different codecs:
# Integer data (highly compressible)
{:ok, int_array} =
Array.create(
shape: {1000, 1000},
chunks: {100, 100},
dtype: :int32,
compressor: :zstd,
storage: :memory
)
# Integer data with pattern: sequential IDs
int_data =
for r <- 0..999, c <- 0..999 do
r * 1000 + c
end
int_binary = Pack.pack(int_data, :int32)
Array.set_slice(int_array, int_binary, start: {0, 0}, stop: {1000, 1000})
# Random floats (barely compressible)
{:ok, float_array} =
Array.create(
shape: {1000, 1000},
chunks: {100, 100},
dtype: :float64,
compressor: :zstd,
storage: :memory
)
float_data = SampleData.rand_floats(1_000_000, 12345)
float_binary = Pack.pack(float_data, :float64)
Array.set_slice(float_array, float_binary, start: {0, 0}, stop: {1000, 1000})
IO.puts("Created arrays with different data patterns")
IO.puts("Integer array: sequential pattern (highly compressible)")
IO.puts("Float array: pseudo-random (low compressibility)")
Compression by data type (rough, data-dependent ranges):
- Integers with patterns: 5-20x compression (counters, IDs, sparse data)
- Floats from simulations: 2-5x compression (smooth fields, sensor data)
- Random data: 1-1.5x compression (encrypted, already compressed, true random)
- Sparse arrays: 100-1000x compression with appropriate fill values
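These differences are easy to reproduce on synthetic data. The sketch below measures ratios with plain zlib from the Erlang standard library, so the absolute numbers will not match zstd or blosc; only the ordering between data patterns is the point.

```elixir
# Ratio = original size / compressed size
ratio = fn bin -> Float.round(byte_size(bin) / byte_size(:zlib.compress(bin)), 2) end

# Sequential int32 IDs, an all-zero (sparse) array, and pseudo-random bytes
sequential = for i <- 0..99_999, into: <<>>, do: <<i::little-32>>
sparse = :binary.copy(<<0::32>>, 100_000)

:rand.seed(:exsss, {1, 2, 3})
random = for _ <- 1..400_000, into: <<>>, do: <<:rand.uniform(256) - 1>>

IO.puts("sequential int32: #{ratio.(sequential)}x")
IO.puts("all zeros:        #{ratio.(sparse)}x")
IO.puts("random bytes:     #{ratio.(random)}x")
```

Random bytes come out at roughly 1x because there is no redundancy for the compressor to remove.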
Step 8: Codec Pipeline Benefits (v3)
Zarr v3 codec pipelines shine when you need multi-stage transformations:
Example pipeline for climate data:
- Transpose: Reorder dimensions to group similar values
- Delta encoding: Store differences between adjacent values
- ByteShuffle: Separate bytes by position
- zstd: Final compression
Each stage improves compression or processing efficiency.
# Conceptual example (actual pipeline depends on ExZarr codec support)
IO.puts("""
Example v3 codec pipeline for climate model output:
1. Transpose: (time, lat, lon) → (lat, lon, time)
- Groups spatial neighbors for better compression
2. Delta encoding: Store temporal differences
- Temperature changes slowly over time
3. ByteShuffle: Separate high/low bytes
- Creates long runs of similar bytes
4. zstd: Final compression
- Compresses the transformed data
Result: 10-30x compression for typical climate data
""")
When to use pipelines:
- Data has specific structure (spatial, temporal)
- Multiple dimensions with different access patterns
- Need to balance compression and decoding speed
- Custom transformations for domain-specific data
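The delta stage is the easiest of these to sketch. The example below uses a synthetic, slowly varying series stored as integer hundredths of a degree; DeltaSketch is a hypothetical module and zlib stands in for the final compressor.

```elixir
defmodule DeltaSketch do
  # Keep the first value, then store each value minus its predecessor.
  def encode([]), do: []

  def encode([first | rest] = values) do
    [first | Enum.zip_with(rest, values, fn cur, prev -> cur - prev end)]
  end

  # A running sum restores the original values.
  def decode(deltas), do: Enum.scan(deltas, &+/2)
end

to_bytes = fn values -> for v <- values, into: <<>>, do: <<v::little-signed-32>> end

# Synthetic temperature series that changes by a few hundredths per step
series = for t <- 0..9_999, do: 28_000 + round(500 * :math.sin(t / 200))
deltas = DeltaSketch.encode(series)

raw_size = byte_size(:zlib.compress(to_bytes.(series)))
delta_size = byte_size(:zlib.compress(to_bytes.(deltas)))

IO.puts("raw: #{raw_size} bytes, delta-encoded: #{delta_size} bytes")
IO.puts("round trip ok: #{DeltaSketch.decode(deltas) == series}")
```

Delta encoding only pays off when neighbouring values are close; on noisy data the differences can be as large as the values themselves.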
Step 9: Measuring Codec Overhead
Compression isn’t free. Let’s measure the overhead.
# No compression
{:ok, uncompressed_array} =
Array.create(
shape: {500, 500},
chunks: {100, 100},
dtype: :float32,
compressor: nil,
storage: :memory
)
# With compression
{:ok, compressed_array} =
Array.create(
shape: {500, 500},
chunks: {100, 100},
dtype: :float32,
compressor: :zstd,
storage: :memory
)
test_data = SampleData.matrix(500, 500)
test_binary = Pack.pack(test_data, :float32)
# Time writes
{_, uncompressed_write_us} =
Metrics.time(fn ->
Array.set_slice(uncompressed_array, test_binary, start: {0, 0}, stop: {500, 500})
end)
{_, compressed_write_us} =
Metrics.time(fn ->
Array.set_slice(compressed_array, test_binary, start: {0, 0}, stop: {500, 500})
end)
# Time reads
{_, uncompressed_read_us} =
Metrics.time(fn ->
Array.get_slice(uncompressed_array, start: {0, 0}, stop: {500, 500})
end)
{_, compressed_read_us} =
Metrics.time(fn ->
Array.get_slice(compressed_array, start: {0, 0}, stop: {500, 500})
end)
IO.puts("Codec overhead comparison:\n")
IO.puts("Uncompressed:")
IO.puts(" Write: #{Metrics.human_us(uncompressed_write_us)}")
IO.puts(" Read: #{Metrics.human_us(uncompressed_read_us)}")
IO.puts("\nCompressed (zstd):")
IO.puts(" Write: #{Metrics.human_us(compressed_write_us)}")
IO.puts(" Read: #{Metrics.human_us(compressed_read_us)}")
Overhead insights:
- Write time: Compression adds CPU cost
- Read time: Decompression adds CPU cost
- For remote storage: Compression reduces transfer time, often making total time faster despite CPU overhead
- For local storage: Compression reduces disk I/O, especially on slow drives
Rule of thumb: If I/O time > CPU time, compression is almost always worth it.
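The rule of thumb can be turned into a small back-of-the-envelope model. The sketch below assumes transfer and decoding happen one after the other, and the 1,000 MB/s decode throughput is an assumed figure for illustration, not a measurement. TransferModel is a hypothetical module.

```elixir
defmodule TransferModel do
  # Seconds to fetch and decode size_mb of logical data.
  #   ratio       - compression ratio (1.0 means uncompressed)
  #   decode_mb_s - decompression throughput, or :infinity when nothing is decoded
  def seconds(size_mb, bandwidth_mb_s, ratio, decode_mb_s) do
    transfer = size_mb / ratio / bandwidth_mb_s
    decode = if decode_mb_s == :infinity, do: 0.0, else: size_mb / decode_mb_s
    transfer + decode
  end
end

# 1_000 MB over a 100 MB/s link
plain = TransferModel.seconds(1_000, 100, 1.0, :infinity)
packed = TransferModel.seconds(1_000, 100, 10.0, 1_000)

IO.puts("uncompressed: #{plain} s, 10x compressed: #{packed} s")
```

With these inputs the uncompressed read takes 10 s and the compressed one 2 s (1 s of transfer plus 1 s of decoding); the advantage shrinks as bandwidth rises or decode throughput falls.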
Why This Matters
Cost reduction: Cloud storage costs $0.02-0.10 per GB/month. 10x compression saves 90% of storage costs.
Performance: Transferring 1 GB at 100 MB/s takes 10 seconds. With 10x compression, it’s 1 second. Compression trades cheap CPU cycles for expensive I/O bandwidth.
Accessibility: Smaller datasets are easier to share, download, and distribute. Compressed arrays enable collaboration across labs, teams, and continents.
Use cases:
- Scientific data: Climate, astronomy, genomics datasets are huge. Compression makes them tractable.
- Machine learning: Embedding matrices, training datasets, checkpoints benefit from compression.
- Finance: Tick data compresses extremely well (repeated timestamps, small price changes).
- Crypto: On-chain data has structure (sequential blocks, repeated addresses).
Key Takeaways
- v2 uses simple compressors — one-stage, easy to configure
- v3 uses codec pipelines — multi-stage, composable, extensible
- zstd is the default — fast, good compression, widely supported
- blosc excels for numerical data — built-in shuffling improves compression
- Compression ratios depend on data — structured data compresses better
- Codec overhead is usually worth it — I/O savings outweigh CPU cost
- Choose based on priority — speed (lz4), compression (zstd high level), compatibility (gzip)
What’s Next
Explore concurrency:
- 02_concurrency/02_01_parallel_reads.livemd: Parallel chunk reads with task supervision
- 02_concurrency/02_03_profiling_exzarr.livemd: Understanding I/O vs decode vs compression costs
Apply to domains:
- 04_ai_genai/04_01_embeddings_in_zarr.livemd: Storing embedding matrices with optimal compression
- 05_finance/05_01_tick_data_cubes.livemd: Compressing financial time series
- 07_geospatial/07_02_cmip_climate_slices.livemd: Climate data compression strategies