A general purpose diffing tool that operates on folders of mixed text/binary files.
Get builds here for Windows (x86-64), Linux (x86-64), and macOS (Arm).
I like to keep frequent backups of my data. I'm also a bit of a data hoarder, and periodically take exports of my data from various services.
This leads to a problem when I go to upload these backups to my cloud storage: they're huge! My first attempt to solve this was by tarring the old and new folders with tar-sorted, and running the results through xdelta3.
This worked, but not well: it often produced unnecessarily large diffs, even once compressed. I knew we could do better, so I sat down and built it myself.
Foldiff: a diffing program that takes two very similar directories and makes a very efficient diff file that can convert the old folder into the new one.
This way I can compress and store the first backup, then I can just keep storing tiny diffs. Nice.
I hope you find this as useful as I have :)
Create a diff:
```sh
foldiff diff old-files new-files diff.fldf
```
Apply a diff:
```sh
foldiff apply old-files diff.fldf new-files
```
Check if two folders are the same:
```sh
foldiff verify old-files new-files
```
Check that the first folder is in the state the diff expects, and that the second folder is equivalent to what that diff would output:
```sh
foldiff verify old-files new-files diff.fldf
```
Symlinks are not supported. Empty folders are not stored.
The method of diffing is as follows:
- hash all files (both old and new)
- perform basic file type inference on all files (both old and new)
- files whose path and content are identical between old and new are noted down, but ignored from then on
- subsequent files with identical hashes (in both old and new) are noted as duplicates of the first
- when old and new structures have files that share hashes, store that as a rename/copy/move operation
- (store list of old and list of new files with that hash)
- for files without hash matches, where both folders have a file with that path:
- run the binary diffing algorithm (below) on that file, to generate a diff, and store that
- for files without hash matches, where only the old folder contains that path:
- store that path to be deleted
- for files without hash matches, where only the new folder contains that path:
- store that file as new, compressing it with zstd
- write the manifest, listing paths, hashes, etc., into the file
- sort the list of diffs by file type, then by the name-sorting algorithm (see below), and write them into the file
- sort the list of new files by type, then by name, and write them
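The classification steps above can be sketched roughly like this (an illustrative Python sketch, not foldiff's actual code; paths map to content hashes here, standing in for the real XXH64 hashes):

```python
def classify(old: dict[str, str], new: dict[str, str]) -> dict[str, list]:
    """Classify files for diffing. `old`/`new` map relative path -> content hash."""
    old_by_hash: dict[str, list[str]] = {}
    for path, h in old.items():
        old_by_hash.setdefault(h, []).append(path)
    new_hashes = set(new.values())

    ops: dict[str, list] = {
        "untouched": [], "copied": [], "patched": [], "added": [], "deleted": [],
    }
    for path, h in new.items():
        if old.get(path) == h:
            ops["untouched"].append(path)                    # identical path and content
        elif h in old_by_hash:
            ops["copied"].append((old_by_hash[h][0], path))  # rename/copy/move by hash
        elif path in old:
            ops["patched"].append(path)                      # same path, new content -> binary diff
        else:
            ops["added"].append(path)                        # brand-new file -> zstd-compress whole file
    for path, h in old.items():
        if path not in new and h not in new_hashes:
            ops["deleted"].append(path)                      # gone entirely -> record deletion
    return ops

old = {"a.txt": "h1", "b.txt": "h2", "gone.txt": "h3"}
new = {"a.txt": "h1", "moved.txt": "h2", "fresh.txt": "h4"}
print(classify(old, new)["copied"])  # [('b.txt', 'moved.txt')]
```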
Then, to apply:
- read the manifest from the stream
- for files that can simply be kept, or copied to a set of new locations (possibly deleting some old locations), apply that
- apply simple file deletions
- create newly added files
- apply all binary diffs (see the binary diffing algorithm below)
The name-sorting algorithm is:
- split the name by `/` slashes
- reverse the order of the segments
- sort by each segment in turn (i.e. equivalent to re-joining the reversed segments and sorting)
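A minimal Python sketch of that sort key (the key itself follows the description above; using it with the built-in sort is my own illustration):

```python
def name_sort_key(path: str) -> list[str]:
    """Sort key: compare path segments in reverse order (filename first)."""
    segments = path.split("/")
    segments.reverse()
    return segments

files = ["assets/b.png", "src/b.png", "src/a.rs"]
print(sorted(files, key=name_sort_key))
# ['src/a.rs', 'assets/b.png', 'src/b.png'] -- same-named files sort together
```

Sorting by filename before directory groups similarly named files next to each other, which presumably helps keep the compressed stream small.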
Binary diffing algorithm:
- Calculate the minimum number of chunks needed to split the old file into chunks of at most 2 GB each
- Split both the old and the new file evenly into that many chunks
- For each pair of chunks, compress the new chunk with zstd in long mode, using the old chunk as a dictionary
- Store the resulting zstd chunks
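The chunking arithmetic might look like this (a pure-Python sketch; the 2 GB cap comes from the description above, while the exact even-split boundary formula is my assumption about what "evenly" means):

```python
import math

MAX_CHUNK = 2 * 1024**3  # assumed 2 GiB cap per chunk of the *old* file

def chunk_count(old_size: int) -> int:
    """Minimum number of chunks so each old-file chunk is <= MAX_CHUNK."""
    return max(1, math.ceil(old_size / MAX_CHUNK))

def even_split(size: int, n: int) -> list[tuple[int, int]]:
    """(start, end) byte ranges splitting `size` bytes evenly into n chunks."""
    return [(i * size // n, (i + 1) * size // n) for i in range(n)]

# Both files are split into the same *number* of chunks, sized for each file:
n = chunk_count(5 * 1024**3)           # old file of 5 GiB -> 3 chunks
old_ranges = even_split(5 * 1024**3, n)
new_ranges = even_split(6 * 1024**3, n)
```

Each new chunk is then zstd-compressed against the corresponding old chunk as a dictionary; applying reverses this with the same split and the same dictionaries.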
To apply the binary diff:
- Split the old file into the same chunks
- Decompress each diff using the old chunk as the dictionary with zstd
- Concatenate the decompressed chunks
All numbers are stored big-endian, because it is the correct choice :)
Fields marked "(>100r)" exist only in versions after fldf 1.0.0-r; fields marked "(100r)" exist only in fldf 1.0.0-r and were removed after that.
- magic bytes, ASCII 'FLDF'
- (>100r) a null byte, then a three-byte version number, e.g. [0, 1, 1, 0]
- (>100r) u64 byte length of compressed manifest
- A messagepack object, raw if (100r), zstd-compressed if (>100r)
- (100r) version: `[0x1, 0x0, 0x0, 0x72]`, absent on newer versions
- untouched files (list of following:)
- path
- XXH64 hash
- delete files (list of following:)
- XXH64 hash
- path in old folder
- new files (list of following:)
- new XXH3 hash
- u64 index into new array
- path
- duplicated files (list of following:)
- XXH64 hash
- u64 index into new array, u64::MAX if not necessary
- list of paths in old folder
- list of paths in new folder
- patch files (list of following:)
- old XXH64 hash
- new XXH64 hash
- u64 index into patch array
- path
- (100r) version:
- new files:
- u64 number of elements
- repetition of:
- u64 size of blob
- binary blob of compressed zstd data
- patch files:
- u64 number of diffs
- repetition of:
- u64 number of chunks in this diff
- repetition of:
- u64 length of diff
- binary blob of compressed diff data
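As an illustration, the fixed (>100r) header fields listed above (magic, version, manifest length) could be parsed like so (a sketch under the layout described here, not foldiff's actual reader):

```python
import struct

def parse_header(buf: bytes) -> tuple[tuple[int, ...], int]:
    """Parse the fixed (>100r) fldf header: magic, version, manifest length."""
    if buf[:4] != b"FLDF":
        raise ValueError("not an fldf file")
    if buf[4] != 0:
        raise ValueError("expected null byte before version number")
    version = tuple(buf[5:8])                         # three-byte version number
    (manifest_len,) = struct.unpack(">Q", buf[8:16])  # big-endian u64
    return version, manifest_len

header = b"FLDF" + bytes([0, 1, 1, 0]) + (4096).to_bytes(8, "big")
print(parse_header(header))  # ((1, 1, 0), 4096)
```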
- Diffing
- Working diff generator
- Does not keep blobs in memory
- Multi-threaded (zstd itself is multi-threaded, but scanning, diffing, and compression are not yet)
- Applying
- Working application
- Does not keep blobs in memory
- Multi-threaded
- Verifying
- Folder equality
- Diff checking
- Checks for unexpected files