webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO
https://pypi.python.org/pypi/warcio
Apache License 2.0
387 stars 58 forks source link

CLI Indexer: silently ignore brokenpipe signal #136

Open sebastian-nagel opened 3 years ago

sebastian-nagel commented 3 years ago

This PR makes the cli indexer silently ignore the broken pipe signal (SIGPIPE) if the output is sent to a Unix pipe and not entirely consumed, e.g.

$> warcio index not-too-small.warc.gz | head
...
Traceback (most recent call last):
  File ".../bin/warcio", line 33, in <module>
    sys.exit(load_entry_point('warcio==1.7.4', 'console_scripts', 'warcio')())
  File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/cli.py", line 55, in main
  File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/cli.py", line 68, in indexer
  File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/indexer.py", line 33, in process_all
  File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/indexer.py", line 41, in process_one
  File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/indexer.py", line 53, in process_index_entry
  File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/indexer.py", line 87, in _write_line
BrokenPipeError: [Errno 32] Broken pipe

# alternatively
$> warcio index -o >(head) not-too-small.warc.gz
wumpus commented 3 years ago

Sebastian, most Unix things do print the EPIPE, and it's sometimes very helpful when the indexing process has a separate error logfile from the other end of the pipe.

What situation do you have in mind for making it a silent failure?

sebastian-nagel commented 3 years ago

What situation do you have in mind for making it a silent failure?

Using the indexer output as input for head or grep.

Of course, the indexer could show a short message about the "end of pipe" signal. But a stack trace isn't really useful, a broken pipe signal is hardly an issue of the indexer.

most Unix things do print the EPIPE

Good point. Yes, should exit with EPIPE. I'll update the PR. Thanks!

Usually (unless configured otherwise, eg. set -o pipefail for the bash), the shell ignores this error code for commands in the pipeline except the last one. But I'm not aware of any Unix tool (cat, sort, grep, etc.) which "prints" a message that it received a SIGPIPE. In any case, a stack trace is hardly useful as the reason for the SIGPIPE is outside warcio.

wumpus commented 2 years ago

I think I'm remembering EPIPE being useful for things like scp and other network tools. But you're right, normal on-node tools like head have always been silent about EPIPE.

This is kind of similar to the situation of ^C being a silent situation for everything but Python.

sebastian-nagel commented 2 years ago

Yes, Python is special here:

But also Unix tools handling report the SIGPIPE in their exit value as 128 + 13 = 141 (SIGPIPE is defined as 13 in /usr/include/asm-generic/signal.h):

bash -c '/bin/seq 50000 | head >/dev/null; echo ${PIPESTATUS[@]} "=>" $?'
141 0 => 0

If the shell is configured with pipefail, the entire pipeline fails:

bash -c 'set -o pipefail; /bin/seq 50000 | head >/dev/null; echo ${PIPESTATUS[@]} "=>" $?'
141 0 => 141

But I'm still not sure whether the exit value 141 is appropriate for a Python program - that was the last point I thought about. Btw., the post effective handling of the SIGPIPE signal provides a good summary of the problem.

sebastian-nagel commented 2 years ago

Note: I'm also fine with a short error message BrokenPipeError: [Errno 32] Broken pipe but without the long stacktrace.