alex-harvey-z3q closed this issue 6 years ago
Next time it happens I will make sure I save a copy of the corrupted cache.
The issue may be that we are simply allowing g10k to fail and give up here: https://github.com/xorpaul/g10k/blob/de149b1af11dbb853d911aad2cfb980fa13dfb14/helper.go#L164-L172
Really, we can't allow this tool to ever just give up and fail in production - especially if the only problem is a corrupted Git cache. It should just delete the problematic cache and try again.
An earlier instance of the output when this failed:
executeCommand(): git command failed: git --git-dir /tmp/g10k/git@git.example.com-PUP_puppet-jenkins.git remote update --prune exit status 1
Output: Fetching origin
error: refs/merge-requests/204/head does not point to a valid object!
error: refs/merge-requests/204/head does not point to a valid object!
error: refs/merge-requests/204/head does not point to a valid object!
error: Could not read 9665485307d333eb3faa0e1367c969f5a9adf4c1
error: refs/merge-requests/204/head does not point to a valid object!
From git.example.com:PUP/puppet-jenkins
e5c1923..9513e8a master -> master
error: unable to find 9665485307d333eb3faa0e1367c969f5a9adf4c1
fatal: object 9665485307d333eb3faa0e1367c969f5a9adf4c1 not found
error: Could not fetch origin
Output: fatal: Not a git repository: '/tmp/g10k/git@git.example.com-PUP_puppet-fstab.git' could mean that the initial git pull of this repository failed, but the current g10k version does not create the cache directory in this case. Maybe you are using an older version?
I tried this Puppetfile
mod 'firewall',
:git => 'https://gthub.com/puppetlabs/puppetlabs-firewall.git',
:branch => 'master'
If only a handful of your Puppet modules are hosted on an unreliable Git server, you can set ignore_unreachable directly on each of those modules:
mod 'firewall',
:git => 'https://github.com/puppetlabs/puppetlabs-firewall.git',
:branch => 'master',
:ignore_unreachable => 'true'
Or you can add a global setting in your g10k config to allow all your Git modules to fail and your g10k run to continue. https://github.com/xorpaul/g10k/issues/57#issuecomment-301075218
Really, we can't allow this tool to ever just give up and fail in production
I don't agree with this. In my setup I want g10k to fail if anything is unreachable, because I only sync the g10k-populated environments to my Puppet servers if g10k ran successfully. I'd rather have an older working Puppet environment than a corrupted, half-populated environment in production.
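The "only sync on success" workflow described here could be sketched roughly as follows. This is a minimal illustration, not part of g10k: the `deploy` function, the `sync` callable and the injectable `run` parameter are all hypothetical names chosen for this example.

```python
import subprocess


def deploy(g10k_cmd, sync, run=subprocess.run):
    """Run g10k and call sync() only if it exited with status 0.

    `sync` is a hypothetical callable that copies the populated
    environments to the Puppet servers; it is an assumption of this sketch.
    """
    result = run(g10k_cmd)
    if result.returncode != 0:
        # g10k failed: leave the previously synced environments in place.
        return False
    sync()
    return True
```

A caller might invoke it as `deploy(["g10k", "-puppetfile"], sync=my_sync)`, where `my_sync` is whatever site-specific copy step is in use.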
It should just delete the problematic cache and try again.
Checking the local Git repository first, then clearing it and retrying could be a solution, but how often should g10k try this? What should it do if the Git repository is completely unreachable? I'd rather fail fast and let the user retry the g10k run.
What we can agree on is that the cached git repository should never be empty or corrupted.
It would greatly help if we could find out how it ended up corrupted and fix that.
Otherwise I could add a g10k config setting that always checks the Git repository first with git fsck or something similar, then clears it and retries.
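The "check with git fsck first, then clear" idea might look something like the sketch below. It assumes that `git fsck` exits with a non-zero status when it finds errors; the function names and the injectable `run` parameter are hypothetical and do not reflect how g10k is implemented.

```python
import shutil
import subprocess
from pathlib import Path


def cache_is_healthy(git_dir, run=subprocess.run):
    """Return True if `git fsck --full` exits 0 for the repository at git_dir."""
    result = run(
        ["git", "--git-dir", str(git_dir), "fsck", "--full"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def clear_if_corrupted(git_dir, run=subprocess.run):
    """Delete the cached repository when fsck reports a problem.

    Returns True if the cache was removed, so the caller knows to re-clone.
    """
    git_dir = Path(git_dir)
    if git_dir.exists() and not cache_is_healthy(git_dir, run=run):
        shutil.rmtree(git_dir)
        return True
    return False
```

Running fsck on every cached repository adds time to each run, which is one reason to make such a check opt-in.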
Yes. I think the basic principle is that the cached Git repository should never be empty or corrupted, whereas I am seeing caches corrupted quite often. I estimate g10k is being called dozens of times per day in about 50 AWS accounts at my site, and I'm getting a corrupted cache maybe once a fortnight. I can confirm that each time I have seen the cache corrupted, g10k failed repeatedly until I deleted the cache, at which point it always succeeded.
I guess the next thing to do is for me to wait until this happens again, and make a copy of the corrupted cache.
I take it you're saying you haven't actually seen this before?
I have an example of one of the problematic g10k caches saved now.
Here's the problem:
$ git --git-dir /var/tmp/g10k/git@git.example.com-FOO_puppet-tenant_profile.git remote update --prune
Fetching origin
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
Warning: Permanently added 'git.example.com,10.0.0.10' (RSA) to the list of known hosts.
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: Could not read 5a16bc06490229d809f3b217a8ad3b6db2054355
error: Could not read bdcd583320d464836a146f4d7122453bcb225069
remote: Counting objects: 2, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (2/2), done.
fatal: bad object 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
error: git.example.com:FOO/puppet-tenant_profile.git did not send all necessary objects
error: Could not fetch origin
I'll see what else I can glean from this tarball.
Reminder to me: I have this saved as a tarball at /var/tmp/tp.tgz on my laptop.
Running git fsck --full results in:
$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (4343/4343), done.
error: refs/heads/feature/fitness6-jarfile: invalid sha1 pointer 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
error: refs/heads/feature/talendmetadata: invalid sha1 pointer bdcd583320d464836a146f4d7122453bcb225069
error: refs/merge-requests/30/head: invalid sha1 pointer 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
error: refs/merge-requests/31/head: invalid sha1 pointer bdcd583320d464836a146f4d7122453bcb225069
broken link from commit 7c52cc078733ce8191867549e40b6870488cf6c8
to tree e219e5806d873ee8f21e37f731f84eb369ea03ee
broken link from commit 7c52cc078733ce8191867549e40b6870488cf6c8
to commit 5a16bc06490229d809f3b217a8ad3b6db2054355
broken link from commit 7c52cc078733ce8191867549e40b6870488cf6c8
to commit bdcd583320d464836a146f4d7122453bcb225069
broken link from tree 0d2bff1d856e679e4e6176bf9adcb235d75abafe
to tree 1af132babe82801b3b1816e84cb5addc958e5956
broken link from tree bb3b9aa85c640065f022f40f5ca92b437fbee9d6
to tree 0f58444d1b3ff07465f57c03300bfc3ea48536ba
broken link from tree bb3b9aa85c640065f022f40f5ca92b437fbee9d6
to tree 3007b6433279e2e0ee15489077db6ae97b69efa4
broken link from tree 01f5f38b3de8ba060ea7d4d9f5e49d2cfab3b186
to blob 7f430468dd7c3f22a00621b1a2f2ade211ca77ec
missing blob 7f430468dd7c3f22a00621b1a2f2ade211ca77ec
dangling commit eec3ce2ee3120133c1f95b6a9b960c3dc93f8452
missing tree 3007b6433279e2e0ee15489077db6ae97b69efa4
dangling commit cfa76e03fb97179326076300889b1d830404f5ce
missing commit bdcd583320d464836a146f4d7122453bcb225069
missing tree 1af132babe82801b3b1816e84cb5addc958e5956
dangling tag 2695e2f30947085baf77af21e16b3a01c4ee19cb
missing commit 5a16bc06490229d809f3b217a8ad3b6db2054355
missing tree 0f58444d1b3ff07465f57c03300bfc3ea48536ba
missing tree e219e5806d873ee8f21e37f731f84eb369ea03ee
On the other hand if I clone the upstream repo again and run the fsck command:
$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (2375/2375), done.
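To pull the affected refs out of fsck output like the corrupted example above, a small parser could be used. This is only a sketch: it assumes the "invalid sha1 pointer" wording shown in this thread (Git 1.7.1), which other Git versions may phrase differently, and `broken_refs` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   error: refs/heads/foo: invalid sha1 pointer <40 hex chars>
INVALID_REF = re.compile(r"^error: (\S+): invalid sha1 pointer ([0-9a-f]{40})$")


def broken_refs(fsck_output):
    """Map each ref that fsck flags as invalid to the object ID it points at."""
    refs = {}
    for line in fsck_output.splitlines():
        match = INVALID_REF.match(line.strip())
        if match:
            refs[match.group(1)] = match.group(2)
    return refs
```

Applied to the output above, this would list the two feature branches and the two merge-request refs, which point at just two missing commits between them.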
See this Stack Overflow post here, which seems to describe the same problem for others: https://stackoverflow.com/questions/30356012/git-gc-displays-error-could-not-read-commit
Even after running the fsck command above, the remote update --prune command still fails.
@xorpaul, I think if git remote update --prune returns a non-zero exit status, g10k should delete that clone, clone it again, and try again. Only if it still fails should it give up and abort. Otherwise, we just can't use this in production. Thoughts?
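The proposed behaviour (update, and on failure delete the clone, re-clone once, then give up) could be sketched like this. It is an illustration of the proposal only, not g10k's code; `update_cache` and the injectable `run` parameter are hypothetical names.

```python
import shutil
import subprocess
from pathlib import Path


def update_cache(remote_url, git_dir, run=subprocess.run):
    """Update a mirror clone; on failure, delete it and re-clone once."""
    git_dir = Path(git_dir)
    update = ["git", "--git-dir", str(git_dir), "remote", "update", "--prune"]
    clone = ["git", "clone", "--mirror", remote_url, str(git_dir)]

    if not git_dir.exists():
        # No cache yet: a single clone attempt, nothing to fall back on.
        run(clone, check=True)
        return "cloned"

    if run(update).returncode == 0:
        return "updated"

    # The update failed, possibly because the cache is corrupted.
    # Throw the cache away and start again from a fresh mirror clone.
    shutil.rmtree(git_dir, ignore_errors=True)
    # With subprocess.run, check=True raises CalledProcessError if this
    # second attempt also fails (e.g. the server is unreachable).
    run(clone, check=True)
    return "recloned"
```

Retrying exactly once keeps the fail-fast behaviour for a server that is genuinely unreachable, while still recovering from a cache that is merely corrupted.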
@xorpaul Also, if you would like a tarball of the corrupted Git repo I saved, and a copy of the same repo after cloning it fresh, let me know where I can send them.
@alexharv074 Thanks for the debug info.
g10k just calls the git binary to clone and update the local Git repository. If the remote Git server is unable to respond appropriately or sends a corrupted state of the repository, the only thing g10k can do is retry the checkout.
What Git server are you using? Is it running on a VM or hardware? You should open a ticket at this Git server project with this information (cloning and updating multiple repositories at the same time, probably overloading the Git server, so that it sends invalid responses). Maybe you can adjust some settings (worker processes, web server processes) so that g10k doesn't overload your server.
I'll have a look at the git clone retry mechanism.
In the meantime you could try limiting the number of parallel checkouts and pulls with the -maxworker parameter.
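The general idea behind capping parallel checkouts, so that only a fixed number of Git operations hit the server at once, can be illustrated with a bounded worker pool. This is a generic sketch of the technique, not g10k's implementation; `run_limited` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def run_limited(tasks, max_workers):
    """Run zero-argument callables with at most max_workers in flight.

    Results are returned in the same order as the tasks were given.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
```

A lower limit trades total run time for less load on the Git server.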
@xorpaul
The Git server is GitLab 8.16.4, running on a RHEL 6 EC2 instance, and the Git client version is 1.7.1.
In any case, a clean & retry mechanism makes a lot of sense to me, whatever the root cause is here. Whether it's the Git server's fault, or whether it's just a random corruption of a cloned Git repo, I still would not expect the tool to give up in production if the problem is that it has corrupted data in its cache.
Not sure how hard it is to implement the feature I proposed of course. I would send a PR if only I knew Golang.
Try out the new v0.4 release:
https://github.com/xorpaul/g10k/releases/tag/v0.4
You can limit the number of Goroutines with the -maxextractworker parameter or with the maxextractworker: <INT> g10k config setting.
Now you can also retry failed Git commands with 0.4.1
https://github.com/xorpaul/g10k/releases/tag/v0.4.1
Either use the -retrygitcommands CLI parameter or the retry_git_commands g10k config setting.
---
:cachedir: '/tmp/g10k'
retry_git_commands: true
sources:
  example:
    remote: 'https://github.com/xorpaul/g10k-environment.git'
    basedir: '/tmp/example/'
If you then call g10k with this config file and have a corrupted local Git repository, g10k deletes the local cache and retries the Git clone command once:
WARN: git command failed: git --git-dir /tmp/g10k/modules/https-__github.com_puppetlabs_puppetlabs-firewall.git remote update --prune deleting local cached repository and retrying...
Hi @xorpaul
Thanks very much for implementing the feature.
However, it does not seem to be working in the expected way:
-bash-4.1$ g10k -version
g10k Version 0.4 Build time: 2017-11-08_15:22:31 UTC
-bash-4.1$ g10k -puppetfile -retrygitcommands
executeCommand(): git command failed: git --git-dir /tmp/g10k/git@git.example.com-FOO_puppet-packer.git remote update --prune exit status 128
Output: fatal: Not a git repository: '/tmp/g10k/git@git.example.com-FOO_puppet-packer.git'
If you are using GitLab be sure that you added your deploy key to your repository
-bash-4.1$ ls -ld /tmp/g10k/git\@git.example.com-FOO_puppet-packer.git/
drwxrwxr-x. 7 jenkins jenkins 140 Oct 27 08:41 /tmp/g10k/git@git.example.com-FOO_puppet-packer.git/
Hi @alexharv074,
ah, sorry forgot to add the new CLI parameter to the Puppetfile mode. Fixed. https://github.com/xorpaul/g10k/commit/e323b65ecb4e29628cb866123de428d8ddda713f
Please try: https://github.com/xorpaul/g10k/releases/tag/v0.4.2
$ ./g10k -puppetfile -verbose -retrygitcommands
2017/11/09 11:00:34 Executing git clone --mirror https://github.com/puppetlabs/puppetlabs-apache.git /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git took 2.70532s
2017/11/09 11:00:34 Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git rev-parse --verify 'master' took 0.00243s
Need to sync .//modules/apache/ 2017/11/09 11:00:34 syncToModuleDir(): Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git archive master took 0.04980s
2017/11/09 11:00:34 Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git rev-parse --verify 'master' took 0.00224s
Synced ./Puppetfile with 1 git repositories and 0 Forge modules in 2.8s with git (2.7s sync, I/O 0.0s) and Forge (0.0s query+download, I/O 0.0s) using 50 resolv and 20 extract workers
$ rm -rf /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git/*
$ rm modules/apache/
rm: cannot remove 'modules/apache/': Is a directory
$ ./g10k -puppetfile -verbose -retrygitcommands
2017/11/09 11:00:47 Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git remote update --prune took 0.00189s
WARN: git repository https://github.com/puppetlabs/puppetlabs-apache.git does not exist or is unreachable at this moment!
WARN: git command failed: git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git remote update --prune deleting local cached repository and retrying...
2017/11/09 11:00:49 Executing git clone --mirror https://github.com/puppetlabs/puppetlabs-apache.git /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git took 2.46786s
2017/11/09 11:00:49 Executing git --git-dir /tmp/g10k/https-__github.com_puppetlabs_puppetlabs-apache.git rev-parse --verify 'master' took 0.00244s
Synced ./Puppetfile with 1 git repositories and 0 Forge modules in 2.5s with git (2.5s sync, I/O 0.0s) and Forge (0.0s query+download, I/O 0.0s) using 50 resolv and 20 extract workers
@xorpaul
I am very happy to say it's working!
Before:
-bash-4.1$ g10k -puppetfile
Resolving Git modules (34/52) 2.303s [===========================================>------------------------] 65%
executeCommand(): git command failed: git --git-dir /tmp/g10k/git@git.example.com-FOO_puppet-tenant_profile.git remote update --prune exit status 1
Output: Fetching origin
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: refs/heads/feature/fitness6-jarfile does not point to a valid object!
error: refs/heads/feature/talendmetadata does not point to a valid object!
error: refs/merge-requests/30/head does not point to a valid object!
error: refs/merge-requests/31/head does not point to a valid object!
error: Could not read 5a16bc06490229d809f3b217a8ad3b6db2054355
error: Could not read bdcd583320d464836a146f4d7122453bcb225069
error: unable to find 57a1d82e3d5ff67bb774fe40f5719323a64d6b03
fatal: object 57a1d82e3d5ff67bb774fe40f5719323a64d6b03 not found
error: Could not fetch origin
If you are using GitLab be sure that you added your deploy key to your repository
Install new version:
[ec2-user@jenkins ~]$ g10k -version
g10k version 0.4.2 Build time: 2017-11-08_16:01:31 UTC
After:
-bash-4.1$ g10k -puppetfile -retrygitcommands
Resolving Git modules (43/52) 3.605s [=======================================================>------------] 83%
WARN: git repository git@git.example.com:FOO/puppet-tenant_profile.git does not exist or is unreachable at this moment!
Resolving Git modules (52/52) 4.542s [====================================================================] 100%
Synced ./Puppetfile with 52 git repositories and 0 Forge modules in 7.9s with git (7.6s sync, I/O 0.0s) and Forge (0.0s query+download, I/O 0.0s) using 50 resolv and 20 extract workers
And corrupted cache and all, it still copied 52 modules in 7.9 seconds!
Thanks so much Andreas, there will be many happy customers at my site, and best of all, I feel confident to roll out g10k at my next place!
Glad I could help!
Did you censor your output? You should've gotten a warning like:
WARN: git command failed: git --git-dir /tmp/g10k/git@git.example.com:FOO/puppet-tenant_profile.git remote update --prune deleting local cached repository and retrying...
No, I did redact sensitive information using search & replace to update the Git server address, and site-specific info in the Git URL, but the output I showed is otherwise unchanged.
To be honest, I was about to see if I could send in a pull request to improve the wording of the error message, but sounds like maybe it's still not behaving the way you expected it to?
g10k should print a warning that the git command failed and that it retries the git clone command:
https://github.com/xorpaul/g10k/blob/master/git.go#L117
Maybe the progress bar at the default verbosity level caused this line to be skipped for you.
Can you retry using the -info verbosity level?
You are correct:
-bash-4.1$ g10k -puppetfile -info -retrygitcommands
WARN: git repository git@git.example.com:FOO/puppet-tenant_profile.git does not exist or is unreachable at this moment!
WARN: git command failed: git --git-dir /tmp/g10k/git@git.example.com-FOO_puppet-tenant_profile.git remote update --prune deleting local cached repository and retrying...
Synced ./Puppetfile with 52 git repositories and 0 Forge modules in 7.7s with git (7.4s sync, I/O 0.0s) and Forge (0.0s query+download, I/O 0.0s) using 50 resolv and 20 extract workers
Alright then.
I'll update the output in the next release, so that only the retrying line gets printed when -retrygitcommands is set.
From time to time I find that the g10k cache becomes corrupted and I am forced to delete it. This is a big problem in production and ultimately may mean I can't use g10k in production. A recent example was a failure like this: