[Git Tips] How to Shrink a Git Repository

wutiejun commented 8 years ago

How to Shrink a Git Repository

http://stevelorek.com/how-to-shrink-a-git-repository.html

Our main Git repository had suddenly ballooned in size. It had grown overnight to 180MB (compressed) and was taking forever to clone.

The reason was obvious: somebody, somewhere, somewhen, somehow, had committed some massive files. But we had no idea what those files were.

After a few hours of trial, error and research, I was able to nail down a process to:

Discover the large files
Clean them from the repository
Modify the remote (GitHub) repository so that the files are never downloaded again

This process should never be attempted unless you can guarantee that all team members can produce a fresh clone (that is, discard their existing working copies). It rewrites history and requires everyone who contributes to the repository to pull down the newly cleaned repository before they push anything to it.

Deep Clone the Repository

If you don't already have a local clone of the repository in question, create one now:

$ git clone remote-url

You may have cloned the repository, but you don't yet have local copies of all the remote branches. Having them is imperative to ensure a proper 'deep clean'. To get them, we'll need a little Bash script:

#!/bin/bash
for branch in `git branch -a | grep remotes | grep -v HEAD | grep -v master`; do
    git branch --track ${branch##*/} $branch
done

Thanks to bigfish on StackOverflow for this script, which is copied verbatim.

Copy this code into a file, run chmod +x filename.sh, and then execute it with ./filename.sh. You will now have all of the remote branches as local branches too (it's a shame Git doesn't provide this functionality).
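The script above drops everything before the last slash in a branch name and skips any branch whose name contains "master". As a hedged alternative, here is a minimal sketch that reads the remote-tracking refs with git for-each-ref instead of parsing git branch -a. The function name track_all_remote_branches is hypothetical (it is not a Git command), and the sketch assumes the remote is called origin unless another name is passed.

```bash
#!/bin/bash
# Sketch only: create a local tracking branch for every branch on a remote.
# track_all_remote_branches is a hypothetical helper name, not part of Git.
track_all_remote_branches() {
    local remote="${1:-origin}" ref branch
    git for-each-ref --format='%(refname)' "refs/remotes/$remote" |
    while read -r ref; do
        # refs/remotes/origin/feature/x -> feature/x (slashes are preserved)
        branch="${ref#"refs/remotes/$remote/"}"
        # Skip the symbolic origin/HEAD entry
        [ "$branch" = "HEAD" ] && continue
        # Skip branches that already exist locally (usually the default branch)
        git show-ref --verify --quiet "refs/heads/$branch" && continue
        git branch --track "$branch" "$ref"
    done
}

# Usage, from inside the clone:
#   track_all_remote_branches origin
```

Because existing local branches are skipped, the sketch can be run more than once without errors.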

Discovering the large files

Credit is due to Antony Stubbs here. His Bash script identifies the largest files in a local Git repository, and is reproduced verbatim below:

#!/bin/bash
#set -x 

# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs

# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';

# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`

echo "All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file."

output="size,pack,SHA,location"
for y in $objects
do
    # extract the size in bytes
    size=$((`echo $y | cut -f 5 -d ' '`/1024))
    # extract the compressed size in bytes
    compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
    # extract the SHA
    sha=`echo $y | cut -f 1 -d ' '`
    # find the objects location in the repository tree
    other=`git rev-list --all --objects | grep $sha`
    #lineBreak=`echo -e "\n"`
    output="${output}\n${size},${compressedSize},${other}"
done

echo -e $output | column -t -s ', '

Execute this script as before, and you'll see output similar to the following:

All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file.

size     pack    SHA                                       location
1111686  132987  a561d25105c79aa4921fb742745de0e791483afa  08-05-2012.sql
5002     392     e501b79448b9e970ab89b048b3218c2853fdfc88  foo.sql
266      249     73fa731bb90b04dcf79eeea8fdd637ba7df4c089  app/assets/images/fw/iphone.fw.png
265      43      939b31c563bd40b1ca70e4f4a9f7d67c27c936c0  doc/models_complete.svg
247      39      03514d9e84418573f26b205bae7e4e57057c036f  unprocessed_email_replies.sql
193      49      6e601c4067aaddb26991c4bd5fbddef003800e70  public/assets/jquery-ui.min-0424e108178defa1cc794ee24fc92d24.js
178      30      c014b20b6fed9f17a0b2809ac410d74f291da26e  foo.sql
158      158     15f9e56bc0865f4f303deff053e21909661a716b  app/assets/images/iphone.png
103      36      3135e15c5cec75a4c85a0636b154b83221020c97  public/assets/application-c65733a4a64a1a885b1c32694574b12a.js
99       85      c1c80bc4c09e692d5e2127e39c87ecacdb1e816f  app/assets/images/fw/lovethis_logo_sprint.fw.png

Yep, it looks like someone has been pushing some rather unnecessary files, including a lovely 1.1GB present in the form of a SQL dump file.
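A similar listing can be produced without parsing the pack index, by piping git rev-list --objects --all into git cat-file --batch-check. The sketch below is an alternative to the script above, not a reproduction of it: it reports uncompressed blob sizes in bytes rather than packed sizes in kB, and the function name largest_blobs is hypothetical.

```bash
#!/bin/bash
# Sketch only: print "size-in-bytes path" for the N largest blobs reachable
# from any ref. largest_blobs is a hypothetical helper name.
largest_blobs() {
    local count="${1:-10}"
    # rev-list prints "<sha> <path>"; cat-file passes the path through as %(rest)
    git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    # keep blobs only, then strip the first three fields so that paths
    # containing spaces survive intact
    awk '$1 == "blob" { size = $3; sub(/^[^ ]+ [^ ]+ [^ ]+ /, ""); print size, $0 }' |
    sort -k1,1nr |
    head -n "$count"
}

# Usage, from inside the repository:
#   largest_blobs 10
```

Sizes here are the uncompressed object sizes, so they will not match the pack column in the table above.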

Cleaning the files

Cleaning the files will take a while, depending on how busy your repository has been. You just need one command to begin the process (replace filename with the path of the file to remove):

$ git filter-branch --tag-name-filter cat --index-filter 'git rm -r --cached --ignore-unmatch filename' --prune-empty -f -- --all

This command is adapted from other sources; the principal addition is --tag-name-filter cat, which ensures tags are rewritten as well.

After this command has finished executing, your repository should be cleaned, with all branches and tags intact.
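One way to check the result is to ask git log whether any commit on a local branch or tag still touches the path. This is a minimal sketch, not part of the original procedure; path_in_history is a hypothetical helper name. It deliberately looks only at branches and tags, because the backup refs that filter-branch leaves under refs/original/ still point at the old history.

```bash
#!/bin/bash
# Sketch only: succeed if any commit reachable from a local branch or tag
# touches the given path. path_in_history is a hypothetical helper name.
path_in_history() {
    [ -n "$(git log --branches --tags --format=%H -n 1 -- "$1")" ]
}

# Usage, from inside the repository:
#   if path_in_history filename; then
#       echo "filename is still present in history"
#   fi
```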

Reclaim space

While we may have rewritten the history of the repository, those files still exist in there, stealing disk space and generally making a nuisance of themselves. Let's nuke the bastards:

$ rm -rf .git/refs/original/
$ git reflog expire --expire=now --all
$ git gc --prune=now
$ git gc --aggressive --prune=now

Now we have a fresh, clean repository. In my case, it went from 180MB to 7MB.
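To measure the effect yourself, git count-objects -v reports the total pack size in KiB on its size-pack line. The helper below is a small sketch with a hypothetical name (pack_size_kib); run it before and after the cleanup commands and compare the two numbers.

```bash
#!/bin/bash
# Sketch only: print the total size of the repository's pack files in KiB.
# pack_size_kib is a hypothetical helper name.
pack_size_kib() {
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Usage, from inside the repository:
#   before=$(pack_size_kib)
#   ... run the cleanup commands ...
#   after=$(pack_size_kib)
#   echo "pack size: ${before} KiB -> ${after} KiB"
```

Loose objects are not counted in size-pack, so run git gc first if you want a figure that covers everything.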

Push the cleaned repository

Now we need to push the changes back to the remote repository, so that nobody else will suffer the pain of a 180MB download.

$ git push origin --force --all

The --all argument pushes all your branches as well. That's why we needed to create local copies of them at the start of the process.

Then push the newly rewritten tags:

$ git push origin --force --tags

Tell your teammates

Anyone else with a local clone of the repository will need to either use git rebase or create a fresh clone. Otherwise, when they push again, those files will be pushed along with their changes and the repository will return to the state it was in before.
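For a teammate who has no unpushed work on a branch, a third option is to fetch the rewritten history and hard-reset the local branch onto it. This is a swapped-in technique, not the rebase or re-clone described above, and it throws away any local commits on that branch. The function name resync_branch is hypothetical, and the remote is assumed to be called origin unless another name is given.

```bash
#!/bin/bash
# Sketch only: move a local branch onto the rewritten remote history.
# WARNING: discards local commits and uncommitted changes on that branch.
# resync_branch is a hypothetical helper name.
resync_branch() {
    local branch="$1" remote="${2:-origin}"
    # Fetch the force-pushed refs; the default refspec accepts rewritten history
    git fetch -q --prune "$remote" || return 1
    git checkout -q "$branch" || return 1
    git reset -q --hard "$remote/$branch"
}

# Usage, from inside an existing clone:
#   resync_branch master origin
```

The old objects stay in the teammate's local object store until their reflog entries expire and garbage collection runs, but they are no longer reachable from the branch and will not be pushed with it.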

wutiejun commented 8 years ago

10.7 Git Internals - Maintenance and Data Recovery

https://git-scm.com/book/en/v2/Git-Internals-Maintenance-and-Data-Recovery

6.4 Git Tools - Rewriting History

https://git-scm.com/book/zh/v1/Git-%E5%B7%A5%E5%85%B7-%E9%87%8D%E5%86%99%E5%8E%86%E5%8F%B2

wutiejun commented 8 years ago

This command takes a little effort to understand:

$ git filter-branch --tag-name-filter cat --index-filter 'git rm -r --cached --ignore-unmatch filename' --prune-empty -f -- --all

I checked the manual:

Checklist for Shrinking a Repository

git-filter-branch is often used to get rid of a subset of files, usually with some combination of --index-filter and --subdirectory-filter. People expect the resulting repository to be smaller than the original, but you need a few more steps to actually make it smaller, because git tries hard not to lose your objects until you tell it to. First make sure that:

You really removed all variants of a filename, if a blob was moved over its lifetime. git log --name-only --follow --all -- filename can help you find renames.
You really filtered all refs: use --tag-name-filter cat -- --all when calling git-filter-branch.

Then there are two ways to get a smaller repository. A safer way is to clone, that keeps your original intact.

Clone it with git clone file:///path/to/repo. The clone will not have the removed objects. See Section G.3.21, “git-clone(1)”. (Note that cloning with a plain path just hardlinks everything!)

If you really don't want to clone it, for whatever reasons, check the following points instead (in this order). This is a very destructive approach, so make a backup or go back to cloning it. You have been warned.

Remove the original refs backed up by git-filter-branch: say git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d.
Expire all reflogs with git reflog expire --expire=now --all.
Garbage collect all unreferenced objects with git gc --prune=now (or if your git-gc is not new enough to support arguments to --prune, use git repack -ad; git prune instead).
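The safer, clone-based route from the checklist can be sketched as a small helper. The name shrink_by_clone is hypothetical. The point of the sketch is the file:// URL: it makes git clone transfer only objects reachable from branches and tags, rather than hardlinking the whole object store as a plain path would.

```bash
#!/bin/bash
# Sketch only: make a smaller copy of a filtered repository by cloning it
# through a file:// URL. shrink_by_clone is a hypothetical helper name.
shrink_by_clone() {
    local src dest="$2"
    # file:// URLs need an absolute path
    src="$(cd "$1" && pwd)" || return 1
    git clone -q "file://$src" "$dest"
}

# Usage:
#   shrink_by_clone /path/to/repo /path/to/smaller-copy
```

The new clone has only the default branch checked out locally; other branches appear as remote-tracking branches of the source repository, and the clone's origin points at the local source rather than at the real remote.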
