Removing Large Files from Git

When I was working on the cryptic crossword solver, I explored a variety of caching strategies for precomputing things like lists of synonyms and anagrams of words in my dictionary. To save myself some time rebuilding those caches, I ended up pushing them into my git repository. Later on, when my caching strategy changed, I deleted most of those files, but they lived on in the repository’s history and bloated an initial clone to over 300 MB. I guess I’ve learned my lesson about committing large autogenerated files, but I still needed to get these files out of the repository’s history.

Fortunately, git lets us rewrite history, as long as we’re careful about it. I mostly used these instructions: https://help.github.com/articles/remove-sensitive-data

Initially, this didn’t remove the files from all of the other branches. I had to use this script, taken from http://stackoverflow.com/a/4754797:

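#!/bin/bash
# Create a local tracking branch for each remote branch (skipping HEAD and master).
# Note: ${branch##*/} keeps only the text after the last slash of the remote ref.
for branch in $(git branch -a | grep remotes | grep -v HEAD | grep -v master); do
    git branch --track "${branch##*/}" "$branch"
done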

to get all the branches locally, then run:

git filter-branch -f --index-filter 'git rm --cached --ignore-unmatch data/*ngrams.*' --prune-empty --tag-name-filter cat -- --all

to remove the offending files (in this case, data/ngrams.pck, data/ngrams.json, and data/initial_ngrams.pck), and so on for all the other files.
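
A quick way to confirm that a pattern is really gone is to ask git which matching paths still show up in the history of any local branch or tag. The sketch below is not part of the instructions linked above. The helper name still_in_history is hypothetical, and it deliberately looks only at branches and tags, because git filter-branch keeps backup refs to the old commits under refs/original/.

```shell
#!/bin/bash
# Print every path matching the given pathspecs that still appears in the
# history of a local branch or tag. Empty output means nothing is left.
# (still_in_history is a hypothetical helper name, not a git command.)
still_in_history() {
    git log --branches --tags --pretty=format: --name-only -- "$@" |
        sed '/^$/d' |   # drop the blank separator lines between commits
        sort -u         # report each path once
}
```

After the filter-branch run above, still_in_history 'data/*ngrams.*' should print nothing. The single quotes keep the shell from expanding the glob, so git does the matching itself.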

To push those changes to GitHub, I did:

git push --force origin --all

This successfully cut the size of an initial clone from ~300 MB to ~100 MB. I may need to track down some more files eventually, but that’s pretty good.
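
To track down further candidates, it helps to list the largest blobs anywhere in the history, including files that have since been deleted. This is a minimal sketch using standard git plumbing. The function name largest_blobs and the default of ten entries are illustrative choices, not something the workflow above used.

```shell
#!/bin/bash
# List the N largest blobs reachable from any ref, largest first.
# Each output line is: <size in bytes> <path recorded in history>.
# (largest_blobs and its default of 10 are illustrative assumptions.)
largest_blobs() {
    local count="${1:-10}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        grep '^blob ' |       # ignore commits and trees
        cut -d' ' -f2- |      # keep "<size> <path>"
        sort -k1,1nr |        # numeric sort on size, descending
        head -n "$count"
}
```

Running largest_blobs 20 at the top of the repository prints one line per blob. Any large path that no longer exists in the working tree is a candidate for another filter-branch pass.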

Of course, the caveat of rewriting history in this way is that the new history is inconsistent with any existing clones of the repository. This meant that I had to do a fresh clone of the repo onto each machine I was working from, and that any collaborators I might have had (I was working mostly alone at that point) would have had to do the same. Still, in this case it was much better than living with a huge initial clone or throwing away my history by starting a new repository.
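
For a clone with no unpushed work, re-cloning is not the only option: fetching the rewritten history and hard-resetting the local branch onto it brings the clone in line with the remote. The following is a sketch of that alternative rather than what was done here. The name resync_clone is hypothetical, and the defaults of origin and master are assumptions.

```shell
#!/bin/bash
# Make a local branch match the rewritten branch on the remote.
# WARNING: discards local commits and uncommitted changes on that branch.
# (resync_clone is a hypothetical helper; defaults are assumptions.)
resync_clone() {
    local remote="${1:-origin}"
    local branch="${2:-master}"
    git fetch "$remote" &&
        git checkout "$branch" &&
        git reset --hard "$remote/$branch"
}
```

The old objects stay in the clone’s object store until git garbage-collects them, so a fresh clone is still the simplest way to get a small repository on disk.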

Robin Deits 21 June 2013