git-sfs: High-Performance Large File Storage via Symlinks and rclone


These articles are AI-generated summaries. Please check the original sources for full details.

git-sfs: Large File Storage Without the LFS Server

git-sfs is a Symbolic File Storage tool that swaps the Git LFS server for native filesystem symlinks and rclone transport. It hashes large files with SHA-256 and converts them into relative symlinks, keeping repository clones fast and history lightweight.
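
The hash-then-symlink idea can be illustrated with a minimal Python sketch. This is not git-sfs's actual code (the tool itself is a Go binary); the function names, the flat cache layout, and the choice to name cache entries by their hex digest are assumptions made for illustration only.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replace_with_symlink(path: Path, cache_dir: Path) -> str:
    """Move `path` into a content-addressed cache and leave a relative symlink.

    Hypothetical helper: the cache layout (one file per digest) is assumed.
    """
    file_hash = sha256_of(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / file_hash
    if target.exists():
        # Identical content is already cached; drop the duplicate copy.
        path.unlink()
    else:
        shutil.move(str(path), str(target))
    # A relative target keeps the link valid wherever the repo is cloned.
    relative_target = os.path.relpath(target, start=path.parent)
    os.symlink(relative_target, path)
    return file_hash
```

Because Git stores a symlink as a tiny blob holding only the link target, the commit records the path and the hash-named target rather than the file's bytes.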

Why This Matters

Git LFS solves the storage problem but introduces a server problem: it requires a dedicated endpoint that speaks an LFS-specific protocol and often comes with per-GB transfer fees. Tools like DVC add complexity through a Python runtime and manifest files that frequently cause merge conflicts in pull requests. git-sfs starts from the premise that large files do not belong in Git objects and represents them as standard symlinks, which Git understands natively. Because the bytes travel through rclone, engineers can use any existing remote (S3, SFTP, or a local path) without the overhead of a dedicated LFS endpoint or opaque pointer files.

Key Insights

  • Hash-verify at every boundary: git-sfs re-hashes files after the initial hash, after download, and after copy, so corrupted files are rejected.
  • Atomic write operations: The system uses a temp-file-plus-rename strategy to ensure interrupted push or pull operations never leave partial files.
  • Immutable cache design: Files in the local cache are write-once and read-only, preventing accidental overwrites or data corruption.
  • Native Git visibility: Unlike git-annex or DVC, git-sfs uses plain relative symlinks so PR diffs clearly show which files were added or removed.
  • Concurrency-first architecture: The Go-based binary supports a configurable worker pool (n_jobs) to handle datasets containing millions of files.
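
The first three points above (hash verification, temp-file-plus-rename, and a write-once read-only cache) can be sketched together. This is a rough illustration under stated assumptions, not the tool's implementation: `store_verified` is a hypothetical name, the payload is taken as in-memory bytes for brevity, and cache entries are assumed to be named by their SHA-256 digest.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_verified(data: bytes, expected_sha256: str, cache_dir: Path) -> Path:
    """Write `data` into the cache only if its hash matches (hypothetical helper)."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        # Reject corrupted content before anything touches the cache.
        raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")

    cache_dir.mkdir(parents=True, exist_ok=True)
    final_path = cache_dir / expected_sha256
    if final_path.exists():
        # Write-once: an existing entry is never overwritten.
        return final_path

    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_name, 0o444)  # read-only before it becomes visible
        os.replace(tmp_name, final_path)
    except BaseException:
        # An interrupted write leaves no partial file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```

Readers of the cache either see the complete, verified file under its final name or see nothing at all; the temporary name is never referenced by a symlink.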

Working Examples

Standard workflow for initializing git-sfs and tracking a dataset directory.

git-sfs init
# edit .git-sfs/config.toml to set rclone backend
git-sfs setup
git-sfs add data/
git add .git-sfs/config.toml data/
git commit -m "track datasets"
git-sfs push

Configuring concurrency for high-volume file transfers.

[settings]
n_jobs = 8
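
To make the effect of n_jobs concrete, here is a minimal Python sketch of a bounded worker pool. It only illustrates the general pattern; git-sfs is a Go binary and its internals are not shown in the source. The names `run_pool` and `worker` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pool(items, worker, n_jobs=8):
    """Run `worker(item)` for every item with at most `n_jobs` in flight.

    Returns (done, failed): a list of items that succeeded and a dict
    mapping each failed item to the exception it raised.
    """
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()
                done.append(item)
            except Exception as exc:
                # One failed transfer should not abort the rest of the batch.
                failed[item] = exc
    return done, failed
```

Threads suit this kind of workload because transfers are I/O-bound; raising n_jobs helps until the network or the remote becomes the bottleneck.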

Partial pull command to materialize only a specific subset of the dataset.

git-sfs pull data/validation/

Practical Applications

  • CI/CD Pipeline Optimization: Use ‘git-sfs verify’ to perform fast presence checks on datasets without downloading full history. Pitfall: Neglecting to set up the rclone configuration on CI runners, leading to failed pull operations.
  • Large Dataset Versioning: Track model weights and training sets as symlinks to allow PR reviewers to see file changes in the native Git tree. Pitfall: Committing the actual large files instead of using ‘git-sfs add’, which bypasses the symbolic storage and bloats the repo.
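
Both pitfalls above can be caught by a simple pre-commit or CI check. The sketch below is independent of git-sfs and is not a description of what 'git-sfs verify' does; `audit_tree` and the 10 MiB threshold are assumptions chosen for illustration. It reports symlinks whose targets are missing locally and regular files large enough that they were probably meant to be tracked as symlinks.

```python
import os
from pathlib import Path


def audit_tree(root: Path, size_limit: int = 10 * 1024 * 1024):
    """Return (dangling_symlinks, oversized_regular_files) under `root`."""
    dangling, oversized = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                # exists() follows the link, so False means the target is missing.
                if not path.exists():
                    dangling.append(path)
            elif path.stat().st_size > size_limit:
                # A large regular file was likely added with plain `git add`.
                oversized.append(path)
    return sorted(dangling), sorted(oversized)
```

A dangling link usually means the data has not been pulled yet, while an oversized regular file suggests the large file itself is about to be committed.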
