This looks like a race condition during the extraction of multiple Forge archives.
Can you provide the Puppetfile with only the Forge modules, or at least tell me how many Forge modules this Puppetfile is using?
I'm actually not using any Forge modules; all modules are hosted in a self-hosted GitHub Enterprise instance. Other data points that might be useful:
- Git URLs are all HTTPS
- all modules use either :branch => or :tag =>
Having the same problem. We're also using Git only; some modules are fetched via HTTPS and some via SSH. I've added some really simple debug code:
mbaur@mbaur-g10k:~/g10k-source$ git diff
diff --git a/forge.go b/forge.go
index a9eba6a..b6cb9d6 100644
--- a/forge.go
+++ b/forge.go
@@ -303,7 +303,7 @@ func unTar(r io.Reader, targetBaseDir string) {
if err == io.EOF {
break
}
- Fatalf(funcName + "(): error while tar reader.Next() for io.Reader " + err.Error())
+ Fatalf(funcName + "(): error while tar reader.Next() for io.Reader " + err.Error() + targetBaseDir)
}
// get the individual filename and extract to the current directory
The targetBaseDir path that is shown when the error occurs is the path to a Git repository, so there is definitely something wrong here.
If I can help to debug this further, please let me know.
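For reference, the read loop that throws this error has roughly the following shape (a minimal, runnable sketch modelled on the snippet above, not g10k's actual unTar() code; in g10k the io.Reader is fed by a git archive subprocess):

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
)

// unTarSketch reads tar entries from r and reports where each one would be
// extracted below targetBaseDir.
func unTarSketch(r io.Reader, targetBaseDir string) {
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break // clean end of archive
		}
		if err != nil {
			// "unexpected EOF" lands here when whatever feeds the tar stream
			// (in g10k, a `git archive` subprocess) stops before the archive
			// is complete.
			log.Fatalf("unTar(): error while tar reader.Next() for io.Reader %s %s",
				err.Error(), targetBaseDir)
		}
		fmt.Println("would extract", hdr.Name, "below", targetBaseDir)
	}
}

func main() {
	// Build a tiny in-memory tar archive so the sketch runs standalone.
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	content := []byte("class dummy {}\n")
	tw.WriteHeader(&tar.Header{Name: "manifests/init.pp", Mode: 0644, Size: int64(len(content))})
	tw.Write(content)
	tw.Close()
	data := buf.Bytes()

	unTarSketch(bytes.NewReader(data), "/tmp/example-module")

	// Feeding a truncated stream instead reproduces the "unexpected EOF":
	// unTarSketch(bytes.NewReader(data[:20]), "/tmp/example-module")
}
```

Uncommenting the last call prints the same "unexpected EOF" message, because the tar reader runs out of input in the middle of a header block.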
@baurmatt I also added the targetDir to the error message locally.
I still suspect this is some kind of race condition, but the only thing that would help me debug this is if I can reproduce the bug or someone could give me the debug output of the failed g10k run.
@xorpaul Thanks I'll try to capture a failed run later today
@andrewfraley Could you reproduce the issue in your setup?
Some information about our setup:
Still trying to get the debug log through our security team.
I was finally able to reliably reproduce this. The problem seems to be caused by some binary files (please don't ask...) that we have in our repository.
mbaur@mbaur-g10k:~$ for i in {1..10}; do ./g10k -config /home/mbaur/g10k.yaml; rm -rf cache/* environments/*; done
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
I've uploaded all data into https://github.com/baurmatt/g10k-77-control and https://github.com/baurmatt/g10k-77-other. Hope this helps to finally find the issue.
I'm having trouble reproducing the error while not on our production Puppet servers, but I know I've seen it happen when I was testing in a Vagrant VM. I'll keep trying. I don't think we have any repos with binary data (at least not larger than a few KB).
OK, I'm able to reproduce the error, but it never happens on the same module, so I think this is some sort of race condition based on the performance of the underlying disks and the remote Git server. To reliably make it happen I have to remove the cache directory and the target modules directory, so it's a full sync from scratch. I can only make it happen on one of my real servers, which has a much faster connection to the Git server but much slower disks, versus my workstation inside a Vagrant instance (the workstation has a slow connection to the Git server but fast SSD storage; the server has a fast connection to the Git server but pretty slow disks).
2017/11/02 17:55:39 DEBUG executeCommand(): Executing git --git-dir /tmp/g10k-test/modules/https-__redacted.github.enterprise.server_redacted_git_repo.git rev-parse --verify 'v1.1.0' unTar(): error while tar reader.Next() for io.Reader read |0: file already closed
I should add that the total disk space used by all modules in all environments is 650MB, so we are talking about writing a huge number of small files in a very short amount of time.
Hope this helps.
Alright, I've managed to reliably reproduce the error in a VirtualBox VM by restricting the disk bandwidth.
Steps to limit bandwidth on a VirtualBox disk:
VBoxManage bandwidthctl "vmname" add "Limit" --type disk --limit 10
VBoxManage storageattach "vmname" --storagectl "SATA Controller" --port 0 --device 0 --type hdd --medium "name_of_disk_file.vmdk" --bandwidthgroup Limit
You can change the limit while the machine is running with:
VBoxManage bandwidthctl "vmname" set Limit --limit 20M
For me, the errors start appearing at a 50M limit.
More info here: https://www.virtualbox.org/manual/ch05.html#storage-bandwidth-limit
Interesting. Are you running g10k inside this restricted VM, or the test Git server?
As in https://github.com/xorpaul/g10k/issues/76#issuecomment-341330157, you can also try limiting the number of parallel checkouts and pulls with the -maxworker parameter.
We're running g10k in a Virtuozzo container with Ubuntu 16.04 on an HP server with SAS HDDs. I just tried lowering -maxworker to 1, but it didn't help at all :/
mbaur@mbaur-g10k:~$ ./g10k -maxworker 1 -config /home/mbaur/g10k.yaml
Resolving Git modules (1/1) 0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
The repository checkout seems to be fine:
mbaur@mbaur-g10k:~/cache/r10k/modules/git@github.com-baurmatt_g10k-77-other.git$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (18/18), done.
Yes, g10k is running inside the restricted VM.
Is -maxworker supposed to restrict the number of concurrent git commands? Even with -maxworker 1 I still see hundreds of active git processes while g10k is running.
I think I've solved it, though: I'm actually bumping up against the nproc limit. If I restrict the available disk bandwidth down to 5 MB/s, I start getting g10k crashes with "resource not available" errors, so half the time I get a g10k crash with a big Go dump, and the other half I get the unTar error. I think it's just random whether it's git or g10k itself that can't create a new thread.
After increasing the nproc limit for the user to 9000 the errors stopped happening, even with the disk restricted to 5 MB/s.
On CentOS for a user "puppet", add to /etc/security/limits.d/90-nproc.conf:
puppet soft nproc 9000
(you have to log out and back in for it to take effect)
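If it helps anyone hitting the same wall: here is a small standalone Go program (my own sketch, not part of g10k; uses golang.org/x/sys/unix and is Linux-only) that prints the nproc and nofile limits the current process actually inherits:

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

// printLimit prints the soft and hard value of one resource limit.
func printLimit(name string, resource int) {
	var lim unix.Rlimit
	if err := unix.Getrlimit(resource, &lim); err != nil {
		log.Fatalf("Getrlimit(%s): %v", name, err)
	}
	fmt.Printf("%-6s soft=%d hard=%d\n", name, lim.Cur, lim.Max)
}

func main() {
	printLimit("nproc", unix.RLIMIT_NPROC)   // max user processes/threads
	printLimit("nofile", unix.RLIMIT_NOFILE) // max open file descriptors
}
```

If these numbers are lower than what ulimit reports in your interactive shell, the limits are being overridden by the service manager or container configuration.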
@baurmatt Are you using the latest g10k release? Which Git version are you using? Ubuntu 16.04 should ship a reasonably recent version of Git, though.
I can't reproduce the issue with your test control repo, but I can only test it on my workstation with plenty of RAM, a fast CPU and an SSD. If you could switch all the GitHub modules in your Puppetfile to use https://github.com/baurmatt/g10k-77-other.git instead of git@github.com:baurmatt/g10k-77-other.git, then I can also test it on a slower VM.
The problem is that with that test setup you're triggering about 70 git archive processes in parallel that pump out a lot of data to unTar(). And as @andrewfraley already found out, the -maxworker setting was always meant to limit the number of Git and Forge module resolving processes, not the number of local git archive or Forge archive extracting processes.
The way g10k resolves Puppet environments in parallel is a guaranteed way to find the next bottleneck, which for me was only the number of open file handles that the g10k user is allowed to have.
I'll add a section to the readme that suggests increasing the default limits (nofile, nproc) before running g10k with a large Puppet environment.
Would a parameter that limits the local extracting processes help, something like -maxextractingworkers?
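The rough idea would be something like the following (a minimal sketch of capping concurrent extractions with a buffered channel used as a semaphore; the names and the module list are placeholders, not the final implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// extractModule stands in for spawning `git archive` and piping its output
// into unTar() for one module.
func extractModule(name string) {
	time.Sleep(100 * time.Millisecond)
	fmt.Println("extracted", name)
}

func main() {
	maxExtractWorker := 4 // what a -maxextractworker style flag would control
	sem := make(chan struct{}, maxExtractWorker)
	var wg sync.WaitGroup

	modules := []string{"apache", "ntp", "stdlib", "firewall", "concat", "mysql"}
	for _, m := range modules {
		wg.Add(1)
		sem <- struct{}{} // blocks once maxExtractWorker extractions are in flight
		go func(mod string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot for the next extraction
			extractModule(mod)
		}(m)
	}
	wg.Wait()
}
```

That way the Git/Forge resolving can stay as parallel as it is today, while the number of simultaneous git archive | unTar() pipelines stays bounded.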
A parameter to limit extracting processes would definitely be helpful; I think it would also help solve another issue I have with g10k pegging the CPUs while running. I don't mind if g10k takes 30s to finish instead of 3s, I just want it to finish within 60s. Thanks!
I'm using g10k 0.3.14 and git 2.7.4.
As you suggested, I just switched the module URLs to HTTPS. For me the error still occurs.
I've also just migrated my Virtuozzo container to a hardware server with SSD storage; the error occurs there as well.
Have you tried increasing the security limits nproc and nofile for your g10k user?
If I understand it correctly, my limits are already really high:
mbaur@mbaur-g10k:~$ ulimit -Sn
1024000
mbaur@mbaur-g10k:~$ ulimit -Hn
1024000
mbaur@mbaur-g10k:~$ ulimit -Su
514893
mbaur@mbaur-g10k:~$ ulimit -Hu
514893
I've found the issue...
Nov 3 15:59:05 server.example.org kernel: [821998.807769] Out of memory in UB 290422522: OOM killed process 331832 (git) score 0 vm:53000kB, rss:32224kB, swap:0kB
Upgrading my container from 2 to 4 GB solved the problem and g10k is running fine. So I guess the -maxextractingworkers parameter would be nice to reduce the RAM/CPU consumption.
@baurmatt :smirk: That's exactly what I meant by "The way g10k resolves Puppet environments in parallel is a guaranteed way to find the next bottleneck".
I'm currently working on that -maxextractworker parameter.
Try out the new v0.4 release:
https://github.com/xorpaul/g10k/releases/tag/v0.4
You can limit the number of extraction Goroutines with the -maxextractworker parameter or with the maxextractworker: <INT> g10k config setting.
Works great, thanks! :) As I can't reproduce the error anymore, I think this issue can be closed.
Glad to hear! 😏
@xorpaul Working great for me as well. Dramatically reduced CPU load and no more errors. Using -maxextractworker 10 -maxworker 10 on an 8-core machine. Thanks!
I sometimes see this issue:
unTar(): error while tar reader.Next() for io.Reader read |0: file already closed
The script that runs g10k for me clears the cache whenever the previous run exited with an error, so the next run is usually OK. I only see this happen after changing the Puppetfile.