xorpaul / g10k

my r10k fork in Go

Occasional issues with file already closed errors #77

Closed. andrewfraley closed this issue 6 years ago.

andrewfraley commented 7 years ago

I sometimes see this issue:

unTar(): error while tar reader.Next() for io.Reader read |0: file already closed

My script that runs g10k automatically clears the cache when the previous run exited with an error, so the next run is usually OK. I only see this happen after changing the Puppetfile.

xorpaul commented 7 years ago

This looks like a race condition during the extraction of multiple Forge archives.

Can you provide the Puppetfile (with only the Forge modules), or at least tell me how many Forge modules this Puppetfile is using?
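
For illustration only (this is not g10k's code, just a hypothetical Go sketch): a tar reader fed by a process pipe reports exactly this kind of "read |0: file already closed" error when the pipe is closed, or its producing process dies, before the archive has been fully read. The git invocation below is just a stand-in for whatever produces the archive stream.

package main

import (
	"archive/tar"
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Stream a git archive into a tar reader, similar to how module archives are consumed.
	cmd := exec.Command("git", "archive", "HEAD")
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(stdout)

	// Simulate the race: the read side of the pipe is closed before the archive
	// is fully consumed, e.g. by another goroutine or because the producer died.
	stdout.Close()

	if _, err := tr.Next(); err != nil {
		fmt.Println("unTar-style failure:", err) // e.g. "read |0: file already closed"
	}
	_ = cmd.Wait()
}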

andrewfraley commented 7 years ago

I'm actually not using any Forge modules; all modules are hosted on a self-hosted GitHub Enterprise instance. Other data points that might be useful:

- git URLs are all HTTPS
- all modules use either :branch => or :tag =>

baurmatt commented 7 years ago

Having the same problem. We're also using Git only, though some modules are fetched via HTTPS and some via SSH. I've added some really simple debug code:

mbaur@mbaur-g10k:~/g10k-source$ git diff
diff --git a/forge.go b/forge.go
index a9eba6a..b6cb9d6 100644
--- a/forge.go
+++ b/forge.go
@@ -303,7 +303,7 @@ func unTar(r io.Reader, targetBaseDir string) {
                        if err == io.EOF {
                                break
                        }
-                       Fatalf(funcName + "(): error while tar reader.Next() for io.Reader " + err.Error())
+                       Fatalf(funcName + "(): error while tar reader.Next() for io.Reader " + err.Error() + targetBaseDir)
                }

                // get the individual filename and extract to the current directory

The targetBaseDir path that is shown when the error occurs is the path to a git repository, so there is definitely something wrong here.
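
For context, here is a rough, self-contained sketch of the loop the diff above touches (not the real forge.go, which does much more): reader.Next() is called until io.EOF, and the patched error message appends targetBaseDir so the affected module path shows up. The archive path and target directory below are purely illustrative.

package main

import (
	"archive/tar"
	"fmt"
	"io"
	"log"
	"os"
)

func unTarSketch(r io.Reader, targetBaseDir string) {
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break // end of archive
		}
		if err != nil {
			log.Fatalf("unTar(): error while tar reader.Next() for io.Reader %s %s", err, targetBaseDir)
		}
		fmt.Println("would extract:", hdr.Name)
	}
}

func main() {
	f, err := os.Open("module.tar") // placeholder archive path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	unTarSketch(f, "/tmp/environments/production/modules/example")
}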

If I can help debug this further, please let me know.

xorpaul commented 7 years ago

@baurmatt I also added the targetDir to the error message locally. I still suspect this is some kind of race condition, but the only thing that would help me debug this is if I could reproduce the bug myself or someone could give me the debug output of a failed g10k run.

andrewfraley commented 7 years ago

@xorpaul Thanks, I'll try to capture a failed run later today.

baurmatt commented 7 years ago

@andrewfraley Could you reproduce the issue in your setup?

Some information about our setup:

Still trying to get the debug log through our security team.

baurmatt commented 7 years ago

I was finally able to reliably reproduce this. The problem seems to be some binary files (please don't ask...) that we have in our repository.

mbaur@mbaur-g10k:~$ for i in {1..10}; do ./g10k -config /home/mbaur/g10k.yaml; rm -rf cache/* environments/*; done
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF

I've uploaded all the data to https://github.com/baurmatt/g10k-77-control and https://github.com/baurmatt/g10k-77-other. Hope this helps to finally find the issue.

andrewfraley commented 7 years ago

I'm having trouble reproducing the error when not on our production Puppet servers, but I know I've seen it happen while testing in a Vagrant VM. I'll keep trying. I don't think we have any repos with binary data (at least nothing larger than a few KB).

andrewfraley commented 7 years ago

OK, I'm able to reproduce the error, but it never happens on the same module, so I think this is some sort of race condition driven by the performance of the underlying disks and the remote git server. To reliably make it happen I have to remove both the cache directory and the target modules directory, so it's a full sync from scratch. I can only make it happen on one of my real servers, which has a much faster connection to the git server but much slower disks than my workstation inside a Vagrant instance (the workstation has a slow connection to the git server but fast SSD storage; the server has a fast connection to the git server but pretty slow disks).

2017/11/02 17:55:39 DEBUG executeCommand(): Executing git --git-dir /tmp/g10k-test/modules/https-__redacted.github.enterprise.server_redacted_git_repo.git rev-parse --verify 'v1.1.0' unTar(): error while tar reader.Next() for io.Reader read |0: file already closed

I should add that the total disk space used by all modules in all environments is 650MB, so we are talking about writing a huge number of small files in a very short amount of time.

Hope this helps.

andrewfraley commented 7 years ago

Alright, I've managed to reliably reproduce the error in a VirtualBox VM by restricting the disk bandwidth.

Steps to limit bandwidth on a VirtualBox disk:

VBoxManage bandwidthctl "vmname" add "Limit" --type disk --limit 10
VBoxManage storageattach "vmname" --storagectl "SATA Controller" --port 0 --device 0 --type hdd --medium "name_of_disk_file.vmdk" --bandwidthgroup Limit

You can change the limit while the machine is running with:

VBoxManage bandwidthctl "vmname" set Limit --limit 20M

For me the errors start at a 50M limit.

More info here: https://www.virtualbox.org/manual/ch05.html#storage-bandwidth-limit

xorpaul commented 7 years ago

Interesting.

Are you running g10k inside this restricted VM or the test Git server?

As in https://github.com/xorpaul/g10k/issues/76#issuecomment-341330157, you can also try limiting the number of parallel checkouts and pulls with the -maxworker parameter.

baurmatt commented 7 years ago

We're running g10k in a Virtuozzo container with Ubuntu 16.04 on an HP server with SAS HDDs. I just tried lowering maxworker to 1, but it didn't help at all :/

mbaur@mbaur-g10k:~$ ./g10k -maxworker 1 -config /home/mbaur/g10k.yaml
Resolving Git modules (1/1)    0s [====================================================================] 100%
unTar(): error while tar reader.Next() for io.Reader unexpected EOF

The repository checkout seems to be fine:

mbaur@mbaur-g10k:~/cache/r10k/modules/git@github.com-baurmatt_g10k-77-other.git$ git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (18/18), done.

andrewfraley commented 7 years ago

Yes, g10k is running inside the restricted VM.

Is maxworker supposed to restrict the number of concurrent git commands? Even with -maxworker 1 I still see hundreds of active git processes while g10k is running.

I think I've solved it, though: I'm actually bumping up against the nproc limit. If I restrict the available disk bandwidth down to 5 MB/s, I start getting g10k crashes with errors about a resource not being available, so half the time I get a g10k crash with a big Go dump and the other half I get the unTar error. I think it's just random whether it's git or g10k itself that can't create a new thread.

After increasing the nproc limit for the user to 9000 the errors stopped happening, even with the disk restricted to 5MB/s.

On CentOS, for a user "puppet", add the following to /etc/security/limits.d/90-nproc.conf (you have to log out and back in for it to take effect):

puppet soft nproc 9000

xorpaul commented 7 years ago

@baurmatt Are you using the latest g10k release? Which git version are you using? Ubuntu 16.04 should be using a somewhat recent version of git though.

I can't reproduce the issue with your test control repo, but I can only test it on my workstation with plenty of RAM, a fast CPU and an SSD. If you could switch all your Puppetfile GitHub modules to use https://github.com/baurmatt/g10k-77-other.git instead of git@github.com:baurmatt/g10k-77-other.git, then I can also test it on a slower VM.

The problem is that with that test setup you're triggering about 70 git archive processes in parallel that pump out a lot of data to unTar().

And as @andrewfraley already found out: the -maxworker setting was always meant to limit the number of Git and Forge module resolving processes, not the number of local git archive or Forge archive extracting processes.

The way g10k resolves Puppet environments in parallel is a guaranteed way to find the next bottleneck, which for me was the number of open file handles the g10k user is allowed to have.

I'll add a section to the readme that suggests increasing the default limits (nofile, nproc) before running g10k with a large Puppet environment.
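
(For reference, a hypothetical Go sketch, not part of g10k, of checking the open-file limit before fanning out parallel work. The standard syscall package only exposes RLIMIT_NOFILE; checking nproc would need golang.org/x/sys/unix. The threshold below is purely illustrative.)

package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("open file handles: soft=%d hard=%d\n", lim.Cur, lim.Max)
	// 4096 is only an illustrative threshold, not a value from this thread.
	if lim.Cur < 4096 {
		fmt.Println("warning: low nofile limit; consider raising nofile/nproc before a large run")
	}
}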

Would a parameter that limits the number of local extracting processes help, something like -maxextractingworkers?

andrewfraley commented 7 years ago

A parameter to limit extracting processes would definitely be helpful; I think it would also help solve another issue I have with g10k pegging the CPUs while it runs. I don't mind if g10k takes 30s to finish instead of 3s, I just want it to finish within 60s. Thanks!

baurmatt commented 7 years ago

I'm using g10k 0.3.14 and git 2.7.4.

As you suggested, I just switched the module URLs to HTTPS. For me the error still exists.

I've also just migrated my Virtuozzo container to a hardware server with SSD storage; the error exists there as well.

xorpaul commented 7 years ago

Have you tried increasing the security limits nproc and nofile for your g10k user?

baurmatt commented 7 years ago

If I understand it correctly, my limits are already really high:

mbaur@mbaur-g10k:~$ ulimit -Sn
1024000
mbaur@mbaur-g10k:~$ ulimit -Hn
1024000
mbaur@mbaur-g10k:~$ ulimit -Su
514893
mbaur@mbaur-g10k:~$ ulimit -Hu
514893

baurmatt commented 7 years ago

I've found the issue...

Nov  3 15:59:05 server.example.org kernel: [821998.807769] Out of memory in UB 290422522: OOM killed process 331832 (git) score 0 vm:53000kB, rss:32224kB, swap:0kB

Upgrading my container from 2 GB to 4 GB of RAM solved the problem, and g10k is running fine. So I guess the -maxextractingworkers parameter would still be nice to reduce the RAM/CPU consumption.

xorpaul commented 7 years ago

@baurmatt :smirk: That's exactly what I meant with

The way g10k resolves Puppet environments in parallel is a guaranteed way to find the next bottle-neck

I'm currently working on that -maxextractworker parameter.

xorpaul commented 7 years ago

Try out the new v0.4 release:

https://github.com/xorpaul/g10k/releases/tag/v0.4

You can limit the number of Goroutines with the -maxextractworker parameter or with the maxextractworker: <INT> g10k config setting.
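
(Not necessarily how g10k implements it internally, but for anyone curious, the usual Go pattern for capping concurrent goroutines is a buffered-channel semaphore. The cap value and module names below are purely illustrative.)

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	maxExtractWorker := 4 // illustrative cap, analogous to -maxextractworker
	sem := make(chan struct{}, maxExtractWorker)
	var wg sync.WaitGroup

	modules := []string{"stdlib", "concat", "apt", "firewall", "ntp", "apache"}
	for _, m := range modules {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			sem <- struct{}{}        // blocks while maxExtractWorker extractions are already running
			defer func() { <-sem }() // free the slot when this extraction finishes
			time.Sleep(100 * time.Millisecond) // stand-in for the actual archive extraction work
			fmt.Println("extracted", name)
		}(m)
	}
	wg.Wait()
}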

baurmatt commented 6 years ago

Works great, thanks! :) As I can't reproduce the error any more, I think this issue can be closed.

xorpaul commented 6 years ago

Glad to hear! 😏

andrewfraley commented 6 years ago

@xorpaul working great for me as well. Dramatically reduced CPU load and no more errors. Using -maxextractworker 10 -maxworker 10 on an 8 core machine. Thanks!