
A pure Zig implementation of the XET protocol for efficient file storage and retrieval through content-defined chunking and deduplication.

XET is a protocol for handling large files by breaking them into chunks based on their content (not fixed sizes), compressing them, and storing them in a way that eliminates duplicates.

It's particularly useful for managing large models and datasets, like those hosted on HuggingFace.

This library implements the full XET protocol specification in Zig, including content-defined chunking, BLAKE3 hashing, LZ4 compression, deduplication, and the xorb and shard storage formats.

The implementation has been cross-verified against the Rust reference implementation to ensure correctness.

It can also be compiled to WebAssembly, where it runs at about 45% of single-threaded native speed.

The most common use case is downloading models efficiently.

The parallel version uses concurrent I/O to fetch, decompress, and hash chunks simultaneously, providing significant performance improvements for large models.

Add to your build.zig.zon:
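A dependency entry typically looks like the following sketch. The package name, URL, and hash are placeholders: they depend on where the package is hosted and on the release you pin, and `zig fetch` will report the correct hash for you.

```zig
.dependencies = .{
    // Placeholder values: substitute the real repository archive URL and
    // the hash that `zig fetch` reports for the release you pin.
    .xet = .{
        .url = "https://example.com/path/to/xet/archive.tar.gz",
        .hash = "<hash reported by zig fetch>",
    },
},
```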

Then in your code:
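A minimal import sketch; `"xet"` is a placeholder module name and must match whatever name you expose via `addImport` in your `build.zig`.

```zig
const std = @import("std");
// "xet" is a hypothetical module name; use the name you registered
// with addImport in your build.zig.
const xet = @import("xet");
```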

The XET protocol processes files in several stages:

  1. Chunking: Files are split using a rolling hash algorithm. Instead of fixed-size chunks, boundaries are determined by content patterns, which means similar files share many identical chunks.

  2. Hashing: Each chunk gets a BLAKE3 hash. A Merkle tree combines these hashes to create a single file identifier.

  3. Compression: Chunks are compressed with LZ4, optionally with byte grouping preprocessing for better ratios.

  4. Deduplication: Identical chunks (same hash) are stored only once, saving space when you have multiple similar files.

  5. Storage: Chunks are bundled into "xorbs" and metadata is stored in "MDB shards" for efficient retrieval.
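Step 1 above can be sketched with a toy rolling hash. This is an illustrative stand-in written in Rust (the language of the reference implementation): the hash function, mask, and size limits below are assumptions for the demo, not the algorithm or constants the XET spec actually defines.

```rust
// Content-defined chunking sketch: a toy rolling hash decides boundaries.
// The multiplier, mask, and min/max sizes are illustrative only; the real
// XET chunker uses its own rolling hash and constants.
fn chunk_boundaries(data: &[u8], mask: u64, min: usize, max: usize) -> Vec<usize> {
    let mut boundaries = Vec::new();
    let mut hash: u64 = 0;
    let mut start = 0;
    for (i, &byte) in data.iter().enumerate() {
        // Fold the next byte into the rolling hash.
        hash = hash.wrapping_mul(31).wrapping_add(byte as u64);
        let len = i + 1 - start;
        // Cut when the hash matches the mask (content-defined),
        // or force a cut at the maximum chunk size.
        if (len >= min && (hash & mask) == 0) || len >= max {
            boundaries.push(i + 1);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        boundaries.push(data.len());
    }
    boundaries
}

fn main() {
    let data: Vec<u8> = (0..10_000u32).map(|i| (i % 251) as u8).collect();
    let cuts = chunk_boundaries(&data, 0xFF, 64, 4096);
    println!("{} chunks", cuts.len());
}
```

Because boundaries depend on the bytes themselves rather than on absolute offsets, inserting data near the front of a file shifts everything, yet later boundaries realign, so most chunks remain identical between the two versions.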

When downloading from HuggingFace, the library queries the CAS (content-addressable storage) API to find which chunks are needed, fetches them (optionally in parallel using concurrent I/O), decompresses, and reconstructs the original file.
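The dedup-and-reconstruct idea behind this flow can be sketched as a toy in-memory content-addressable store. The 64-bit stand-in hash and fixed-size chunks below are simplifications for the demo; XET itself uses BLAKE3 and content-defined chunks.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for BLAKE3 so the example stays std-only: hash a chunk to a u64.
fn chunk_key(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    chunk.hash(&mut h);
    h.finish()
}

// Toy content-addressable store: identical chunks are stored once,
// and a "file" is just an ordered list of chunk keys.
struct Store {
    chunks: HashMap<u64, Vec<u8>>,
}

impl Store {
    fn put(&mut self, chunk: &[u8]) -> u64 {
        let key = chunk_key(chunk);
        // Deduplication: only insert if this content hasn't been seen.
        self.chunks.entry(key).or_insert_with(|| chunk.to_vec());
        key
    }
    fn reconstruct(&self, keys: &[u64]) -> Vec<u8> {
        keys.iter()
            .flat_map(|k| self.chunks[k].iter().copied())
            .collect()
    }
}

fn main() {
    let mut store = Store { chunks: HashMap::new() };
    let file: Vec<u8> = b"hello world hello world ".repeat(4);
    // Fixed 8-byte chunks just for the demo; real chunking is content-defined.
    let keys: Vec<u64> = file.chunks(8).map(|c| store.put(c)).collect();
    assert_eq!(store.reconstruct(&keys), file);
    println!("{} chunks referenced, {} stored", keys.len(), store.chunks.len());
}
```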

Uploading works in reverse: data is chunked, hashed, compressed into xorbs, and uploaded to CAS along with shard metadata.
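The hashing step on both paths combines per-chunk hashes into a single file identifier via a Merkle tree. A minimal sketch, again using a 64-bit stand-in hash rather than BLAKE3, and a simple pairwise fold rather than the node layout the spec defines:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash so the example stays std-only; XET itself uses BLAKE3.
fn leaf(chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    chunk.hash(&mut h);
    h.finish()
}

// Combine two child hashes into a parent hash.
fn node(a: u64, b: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (a, b).hash(&mut h);
    h.finish()
}

// Fold the leaf level pairwise until one root remains: the file identifier.
fn merkle_root(mut level: Vec<u64>) -> u64 {
    assert!(!level.is_empty());
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|p| if p.len() == 2 { node(p[0], p[1]) } else { p[0] })
            .collect();
    }
    level[0]
}

fn main() {
    let root = merkle_root(vec![leaf(b"chunk-a"), leaf(b"chunk-b"), leaf(b"chunk-c")]);
    println!("file id: {:016x}", root);
}
```

Any change to any chunk changes the root, so the root works as a compact identity for the whole file.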

The parallel fetching implementation uses concurrent I/O to fetch, decompress, and hash chunks simultaneously.

This provides significant speedup for large models, especially with good network bandwidth.
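The fan-out-then-reassemble pattern can be sketched with OS threads; `fetch_chunk` here is a simulated placeholder for an HTTP range request plus LZ4 decompression, and the real implementation uses concurrent I/O rather than one thread per chunk.

```rust
use std::thread;

// Simulated fetch+decompress of one chunk; a real client would issue an
// HTTP range request to CAS and LZ4-decompress the payload.
fn fetch_chunk(id: usize) -> Vec<u8> {
    vec![id as u8; 4]
}

fn main() {
    let ids: Vec<usize> = (0..8).collect();
    // Scoped threads: fetch all chunks concurrently, then join in order
    // so the reconstructed file preserves chunk order.
    let chunks: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = ids
            .iter()
            .map(|&id| s.spawn(move || fetch_chunk(id)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    let file: Vec<u8> = chunks.concat();
    println!("reassembled {} bytes", file.len());
}
```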

This implementation follows the official XET protocol specification exactly.

All constants, algorithms, and formats match the reference Rust implementation byte-for-byte. The test suite includes cross-verification tests to ensure continued compatibility.

  1. Go to https://huggingface.co/settings/tokens
  2. Create a new token with "Read access to contents of all public gated repos you can access"
  3. Copy the token and set it as the HF_TOKEN environment variable
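On a Unix shell, step 3 looks like the following; the token value is a placeholder.

```shell
# Placeholder token: paste the real token you copied from the settings page.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
echo "HF_TOKEN is set (${#HF_TOKEN} characters)"
```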
