yarnpkg / yarn

The 1.x line is frozen - features and bugfixes now happen on https://github.com/yarnpkg/berry
https://classic.yarnpkg.com

Yarn fails with ESOCKETTIMEDOUT while installing a large package on a slow disk #8242

Closed: darkk closed this issue 3 years ago

darkk commented 4 years ago

Bug description

The Windows build of our electron app is consistently failing. yarn install --frozen-lockfile fails to download https://registry.yarnpkg.com/date-fns/-/date-fns-2.12.0.tgz and https://registry.yarnpkg.com/@material-ui/icons/-/icons-4.9.1.tgz. The failure to download @material-ui/icons was reported as ESOCKETTIMEDOUT, even though I expected the buildhost to have a well-provisioned network, as it was a GitHub Actions runner. The Linux build was working fine.

I assumed that high-latency disk IO might be the reason and managed to produce a test-case that reproduces the issue reliably: ESOCKETTIMEDOUT is consistently triggered on Linux when a small, realistic delay (8ms) is injected into disk IO system calls.

ESOCKETTIMEDOUT being reported because of slow disk IO is very confusing behavior: it sounds like a temporary network error while the root cause is something else entirely. It does not match my understanding of the "Developing Javascript projects shouldn't leave the door open to surprises" motto, so I'm reporting this test-case as a separate issue despite possible duplicates in the issue tracker. :slightly_smiling_face:

Command

ubuntu $ docker run --privileged --rm -ti --net=host node:12-buster /bin/bash
docker # apt update && apt install -y strace
docker # mkdir ~/prj && cd ~/prj
docker # strace -f -o /dev/null -e trace=stat -e inject=stat:delay_exit=8000 yarn add @material-ui/icons@^4.5.1

What is the current behavior?

yarn add v1.22.4
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
info There appears to be trouble with your network connection. Retrying...
error An unexpected error occurred: "https://registry.yarnpkg.com/@material-ui/icons/-/icons-4.9.1.tgz: ESOCKETTIMEDOUT".
info If you think this is a bug, please open a bug report with the information provided in "/root/prj/yarn-error.log".
info Visit https://yarnpkg.com/en/docs/cli/add for documentation about this command.

What is the expected behavior?

If I run exactly the same command with delay_exit=1 (0.001ms) instead of delay_exit=8000 (8ms), I get the expected behavior:

yarn add v1.22.4
info No lockfile found.
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
warning " > @material-ui/icons@4.9.1" has unmet peer dependency "@material-ui/core@^4.0.0".
warning " > @material-ui/icons@4.9.1" has unmet peer dependency "react@^16.8.0".
warning " > @material-ui/icons@4.9.1" has unmet peer dependency "react-dom@^16.8.0".
[4/4] Building fresh packages...
success Saved lockfile.
success Saved 3 new dependencies.
info Direct dependencies
└─ @material-ui/icons@4.9.1
info All dependencies
├─ @babel/runtime@7.10.5
├─ @material-ui/icons@4.9.1
└─ regenerator-runtime@0.13.5
Done in 53.01s.

Steps to Reproduce

  1. See the "Command" paragraph above. The test-case may need some small changes depending on the environment, as detailed below.

First, strace adds some overhead of its own, and that may affect reproducibility. E.g. yarn add @material-ui/icons@^4.5.1 is "Done in 5.76s." in the very same environment without the strace wrapper. That's why I compare strace-with-delay to strace-without-delay and don't compare either to a "clean" run.

Second, I've taken the stat() call from the following:

Third, I chose an 8ms delay assuming that there is a single stat() system call per unpacked file, to emulate an HDD-based system with ~125 IOPS (1000 ms / 125 ≈ 8 ms per operation). It's all a ballpark estimate: a 1ms delay works on my system, 2ms fails with ESOCKETTIMEDOUT once but manages to install the package after a retry, and 4ms and 8ms fail reliably.

Fourth, since TCP buffering is involved (see the comment on TCP ZeroWindow below), the available network bandwidth and the socket buffer size may also play a role in how reproducible the bug is. I've reproduced the bug with these exact values on an Ubuntu 16.04 laptop connected over a 100 Mbit/s link in St. Petersburg, Russia, and on a Linode VM in Newark (see below).

Fifth, your node build may interact with the OS kernel a bit differently, e.g. it may use open() instead of openat(). So if the test-case does not reproduce the bug for you, try increasing the injected latency or picking a different disk-related system call as the injection point (see the variant command below). I reproduced the issue on an Ubuntu 18.04 VM in the Linode Newark availability zone, but I had to use openat instead of stat as the latency-injection point; 4 statx() and 3 openat() syscalls were made for the aforementioned filename on that VM.
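
For example, a variant of the command above with the delay injected into openat() instead of stat() would look like this (the exact delay value may still need tuning for your environment):

docker # strace -f -o /dev/null -e trace=openat -e inject=openat:delay_exit=8000 yarn add @material-ui/icons@^4.5.1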

Comments and assumptions

SQLite has "faster than FS" benchmarks showing that Windows has pretty bad performance (compared to Linux) when operating on lots of small files. Both date-fns and @material-ui/icons contain thousands of files, as do the packages mentioned in the "Possible duplicates" section. That explains why Windows users suffer far more from ESOCKETTIMEDOUT while installing packages with thousands of files.

@FredyC came to the same conclusion, that a high-latency HDD used instead of a low-latency SSD triggers the ESOCKETTIMEDOUT, in https://github.com/yarnpkg/yarn/issues/6221#issuecomment-412844630

@Hinaser made an excellent comment describing a packet capture in https://github.com/yarnpkg/yarn/issues/5259#issuecomment-389096161: yarn probably stops reading from the socket (so the client OS sends TCP ZeroWindow) and eventually closes the socket from the client side.

I assume that node or yarn is busy unpacking a well-compressed tarball full of small files and does not resume reading from the socket for long enough that ESOCKETTIMEDOUT is triggered. I also assume that the code does not disable the socket timeout while the stream is in the paused state.
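
To illustrate that assumption, here is a minimal sketch (not yarn's actual code; the URL and the 30-second timeout are just the values from this report) showing that pausing a response stream for longer than the socket timeout is enough to make the 'timeout' event fire:

const https = require('https');

const url = 'https://registry.yarnpkg.com/@material-ui/icons/-/icons-4.9.1.tgz';
const req = https.get(url, (res) => {
  res.once('data', () => {
    // Simulate back-pressure from slow unpacking: stop reading for 40s,
    // i.e. longer than the 30s socket timeout configured below.
    res.pause();
    setTimeout(() => res.resume(), 40000);
  });
  res.on('data', () => {});                          // drain whatever does arrive
  res.on('end', () => console.log('download finished'));
});

// The inactivity timer keeps ticking while the stream is paused, so once the
// kernel buffers fill up and no bytes move for 30s, 'timeout' fires; yarn's
// HTTP layer surfaces this as ESOCKETTIMEDOUT.
req.setTimeout(30000, () => {
  console.error('socket timeout');
  req.destroy();
});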

I assume a possible fix is to download the .tgz to a temporary file, with timeouts applied to the network interaction only, and then unpack it without any timeouts, since the disk can't write any faster anyway. Unfortunately, I'm not familiar enough with the yarn codebase to provide a good PR.
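
A rough sketch of that idea (not actual yarn code; the function name, the 30s value and the use of the node-tar package are illustrative assumptions):

const https = require('https');
const fs = require('fs');
const os = require('os');
const path = require('path');
const tar = require('tar');   // node-tar, pulled in here only for illustration

function fetchThenUnpack(url, destDir) {
  const tmpFile = path.join(os.tmpdir(), `pkg-${Date.now()}.tgz`);
  const download = new Promise((resolve, reject) => {
    const req = https.get(url, (res) => {
      const out = fs.createWriteStream(tmpFile);
      res.pipe(out);                 // a single sequential write, fast even on an HDD
      out.on('finish', resolve);
      out.on('error', reject);
    });
    // The network timeout covers the download only.
    req.setTimeout(30000, () => req.destroy(new Error('ESOCKETTIMEDOUT')));
    req.on('error', reject);
  });
  // The slow, small-file-heavy unpacking starts only after the download has
  // finished, so no network timeout can fire during it.
  return download.then(() => tar.x({ file: tmpFile, cwd: destDir }));
}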

Environment

yarn-error.log contains the following:

Arguments: 
  /usr/local/bin/node /opt/yarn-v1.22.4/bin/yarn.js add @material-ui/icons@^4.5.1

PATH: 
  /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

Yarn version: 
  1.22.4

Node version: 
  12.18.2

Platform: 
  linux x64

Trace: 
  Error: https://registry.yarnpkg.com/@material-ui/icons/-/icons-4.9.1.tgz: ESOCKETTIMEDOUT
      at ClientRequest.<anonymous> (/opt/yarn-v1.22.4/lib/cli.js:141375:19)
      at Object.onceWrapper (events.js:421:28)
      at ClientRequest.emit (events.js:315:20)
      at TLSSocket.emitRequestTimeout (_http_client.js:709:9)
      at Object.onceWrapper (events.js:421:28)
      at TLSSocket.emit (events.js:327:22)
      at TLSSocket.Socket._onTimeout (net.js:481:8)
      at listOnTimeout (internal/timers.js:549:17)
      at processTimers (internal/timers.js:492:7)

npm manifest: 
  No manifest

yarn manifest: 
  No manifest

Lockfile: 
  No lockfile

Possible duplicates:

ievgennaida commented 4 years ago

Windows 10. yarn v1.22.4 node v10.19.0 WSL

Executed: npm init react-app my-ap

Failed: [2/4] Fetching packages... error An unexpected error occurred: "https://registry.yarnpkg.com/rxjs/-/rxjs-6.5.4.tgz: ESOCKETTIMEDOUT".

AmirTugi commented 4 years ago

I'm experiencing the exact same issue. It's a massive problem for us since it's blocking deploys to production. Does anyone have any idea how to solve this? What changed since last week?

darkk commented 4 years ago

@AmirTugi the workaround that works for me is yarn config set network-timeout 300000, which raises the timeout from 30s to 5 minutes.

However, that's just a workaround, not a fix.
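
For reference, the same timeout can also be passed per command or checked into the project's .yarnrc (to the best of my knowledge yarn v1 supports both forms; 600000 below is just an example value):

yarn install --frozen-lockfile --network-timeout 600000

# or, equivalently, a line in the project's .yarnrc:
network-timeout 600000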

AmirTugi commented 4 years ago

Right, I tried raising it to 1000000 and it didn't work. So that's an arbitrary workaround :)

m0hamm3d-ac commented 4 years ago

We are facing this issue on our pipeline servers too. Has anything changed in yarn?

AlanKnightly commented 4 years ago

Me too. I have tried more than 30 times these days and always got a timeout, which is really annoying.

ankasani commented 4 years ago

I see the same problem in the App Center build services. Can anyone please look into this issue?

Is there any yarn status page available?

m0hamm3d-ac commented 4 years ago

> I see the same problem in the App Center build services. Can anyone please look into this issue?
>
> Is there any yarn status page available?

The previous failures I observed corresponded to npm outages shown on this page - https://status.npmjs.org/

anhvut commented 4 years ago

@darkk gave a wonderful description of the bug. I've proposed a PR with his suggested fix in mind:

toby-griffiths commented 4 years ago

I'm also seeing this on my builds on a Digital Ocean (SSD) build server for the last couple of days (since setting the build server up).

marikaner commented 3 years ago

This is happening on GH actions for us as well. Every day a few of our checks fail because of it. Current solution: rerun...

mikehardy commented 3 years ago

For GitHub Actions failures you might like https://github.com/nick-invision/retry/

merceyz commented 3 years ago

Closing as fixed in v2, where the timeout logic is less susceptible to this sort of issue:

https://yarnpkg.com/getting-started/migration
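
If I recall correctly, the switch itself boils down to running the following inside the project on a recent 1.x release:

yarn set version berry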

jjangga0214 commented 3 years ago

@merceyz

I have a few questions.

  1. Is this not supposed to be fixed in v1?
  2. Does this happen only on a specific OS (e.g. macOS)?
  3. Why does this happen and how is this fixed in v2?

I would appreciate it if you could share some information.

Thanks!

darkk commented 3 years ago

@jjangga0214 WRT Q#2: it might happen on any OS. It's just more likely to trigger on macOS and Windows due to the performance characteristics of their filesystems. An HDD (or any other high-latency medium) instead of an SSD also increases the probability.

ben1one commented 3 years ago

thx so much!!!!