Removing checked-in client data from a git repository - permanently

During a recent migration, we checked in some client data to the sites/default git repository. This was a mistake: not least because the client’s data was some four or five times the size of the rest of the codebase, but also because some of their filenames contained non-ASCII characters (“Pannetoné”, anyone?). These filenames played havoc on folders shared between a Linux Vagrant box and a Mac OS X host.

Removing files permanently from version control means removing them not just from the current revision, but also rewriting every revision in the repository so they’re not there either (otherwise Git keeps them hanging around in its object store, in case those old revisions are ever checked out again). While there are lots of fragments of solutions online, very few of them encompassed the full complexity of our requirements: many branches and many tags, all needing to be preserved, along with trialling the changes by pushing them all to a temporary remote repository.
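To make "rewriting every revision" concrete, here is a minimal sketch of the general technique using git filter-branch. It is an illustration only, not the full procedure this post goes on to describe: the function name purge_path and the example path are hypothetical, and it assumes a clean working tree in a throwaway clone. Recent Git releases print a warning recommending the separate git filter-repo tool instead, so treat this as a way of seeing the moving parts rather than a recommendation.

```shell
# Hypothetical helper, for illustration: strip one path from every commit
# reachable from any branch or tag, then discard the old objects.
# Run it only in a disposable clone with a clean working tree.
purge_path() {
    target="$1"

    # Rewrite all refs (--all). The index filter removes the path from each
    # commit's tree; --ignore-unmatch keeps going on commits that never
    # contained it. --tag-name-filter cat re-points tags at rewritten commits.
    # Note: the path is interpolated into a shell command, so this sketch
    # assumes it contains no single quotes.
    git filter-branch --force --prune-empty \
        --index-filter "git rm -r --cached --quiet --ignore-unmatch -- '$target'" \
        --tag-name-filter cat -- --all || return 1

    # filter-branch keeps backups of the old refs under refs/original/;
    # delete them so nothing still points at the unwanted objects.
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref" || return 1
        done

    # Expire the reflog and garbage-collect so the objects are really gone.
    git reflog expire --expire=now --all &&
        git gc --prune=now --aggressive --quiet
}
```

Calling it as, say, purge_path sites/default/files/client-data (a made-up path) rewrites the local clone only. Nothing changes on a remote until the rewritten branches and tags are force-pushed, which is why trialling against a temporary remote first is worthwhile.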

Our solution to this problem has ended up a little more long-winded than some, but it’s safer and works harder to preserve the integrity of the version history; when we work on behalf of clients, that’s our biggest priority. And applying this to a 27MB repository clone reduced its size to under 6MB!