wiseio / paratext

A library for reading text files over multiple cores.
Apache License 2.0

Paratext fails to parse test file(s) #35

Closed: ehiggs closed this issue 7 years ago

ehiggs commented 7 years ago

Hi there. I have a csv-game on bitbucket. I ran the test file through paratext and it failed. The test file is generated with this script:

#!/bin/bash

# Simple csv file which should flex escaping a little.
for i in $(seq 1 1000000); 
  do echo 'hello,","," ",world,"!"'; 
done > /tmp/hello.csv

# Test for 'hello world'
touch /tmp/empty.csv
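
For reference, this is what one line of that file should parse to. It is a minimal sketch using Python 3's standard `csv` module rather than paratext, so it only shows the expected fields under the default comma-and-double-quote dialect; the variable names are illustrative.

```python
import csv
import io

# One line of /tmp/hello.csv as produced by the generator script above.
line = 'hello,","," ",world,"!"\n'

# Parse it with the standard library reader (default dialect: comma
# delimiter, double-quote quoting).
fields = next(csv.reader(io.StringIO(line)))
print(fields)
```

Each row should therefore yield five fields: `hello`, a single comma, a single space, `world`, and `!`.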

The code I use is as follows:

#!/usr/bin/env python
import paratext
print sum(map(lambda x: len(x[1]), paratext.load_raw_csv("/dev/stdin",
    no_header=True, allow_quoted_newlines=True)))

$ python2/csvreader-paratext.py < /tmp/hello.csv
Traceback (most recent call last):
  File "python2/csvreader-paratext.py", line 4, in <module>
    allow_quoted_newlines=True)])
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 271, in load_raw_csv
    loader = internal_create_csv_loader(filename, *args, **kwargs)
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext/core.py", line 161, in internal_create_csv_loader
    loader.load(_make_posix_filename(filename), params)
  File "/home/ehiggs/.virtualenvs/paratext/lib/python2.7/site-packages/paratext_internal.py", line 414, in load
    return _paratext_internal.ColBasedLoader_load(self, filename, params)
RuntimeError: The file ends with an open quote (4506147)

Changing to num_threads=1 fixes this, but obviously it's a racy bug. There are a few other associated bugs.

To get an idea of the baseline performance, I also parse an empty file. paratext segfaults:

$ cat /tmp/empty.csv
$ python2/csvreader-paratext.py < /tmp/empty.csv 
Segmentation fault: 11

Normally, to test how we process a CSV file, we should be able to use basic command-line tools to subsample the file and pipe it to the CSV reader. This fails with paratext: the following hangs at 100% CPU and doesn't respond to SIGINT (i.e. I can't use Ctrl-C and must suspend it with SIGTSTP (Ctrl-Z) and then run kill %1):

$ head -5 /tmp/hello.csv  | time python2/csvreader-paratext.py

deads commented 7 years ago

Sorry for the delay in responding to your issue. It has been a busy summer at work.

It was my intention to get a fairly comprehensive test suite working before fixing bugs that the community reports. This way we can ensure that one fix does not break earlier fixes. The issue you were experiencing was a bug in the algorithm for dividing the file. In the latest master, the example you gave above seems to work for any number of threads. This is not surprising, since the latest master incorporates quite a bit of fuzz-like testing on adversarial input.
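
To illustrate why dividing a quoted CSV file among threads is delicate, the Python 3 sketch below checks whether an offset lies inside a quoted field by counting the quote characters before it. This is not paratext's actual algorithm, and the helper name `inside_quotes` is hypothetical. It assumes RFC 4180-style quoting, where a literal quote is written as two quotes.

```python
def inside_quotes(data: str, offset: int) -> bool:
    """Return True if `offset` falls inside a double-quoted field.

    Assumes RFC 4180-style quoting (a literal quote is written as ""),
    so an odd number of quote characters before `offset` means a quoted
    field is still open at that position.
    """
    return data.count('"', 0, offset) % 2 == 1


line = 'hello,","," ",world,"!"\n'

# Index 7 is the comma inside the quoted second field: not a delimiter.
print(inside_quotes(line, 7))
# Index 9 is the comma after that field's closing quote: a real delimiter.
print(inside_quotes(line, 9))
```

The count needs every character before the offset, which is exactly the context a worker starting in the middle of the file does not have. A chunk boundary chosen without that context can mistake a quoted comma or newline for a real one.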

The paratext package provides several helper functions for generating arbitrary data frames for the purpose of testing. One of these functions is called generate_hell_frame.

df=paratext.testing.generate_hell_frame(1000, 5, fmt="mixed")

In this frame, there are UTF-8 columns, arbitrary byte sequences, 7-bit ASCII strings, and printable ASCII strings. The data of these columns will contain arbitrary punctuation, double quoting, newlines, and escape characters as well as non-UTF-8 and non-ASCII data.

There is another utility function that writes the data to a file:

paratext.serial.save_frame("myfile.csv", df, encoding=encoding)

where encoding can be utf-8, ascii, printable-ascii, or arbitrary. In each case, if a sequence outside the encoding is encountered, it is properly escaped. This enables a data frame with Unicode columns, byte-buffer columns, and strings to be written to printable ASCII and read back losslessly.
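
The idea of a lossless round trip through printable ASCII can be illustrated with Python 3's built-in `unicode_escape` codec. This is only an analogy under that assumption; it is not the escaping scheme paratext itself uses.

```python
# A string mixing non-ASCII text, a newline, quotes, and a backslash.
original = 'caf\u00e9,"line1\nline2",back\\slash'

# Encode to bytes: anything outside printable ASCII becomes an escape
# sequence, and literal backslashes are doubled.
escaped = original.encode("unicode_escape")

# Every byte of the result is printable 7-bit ASCII.
print(all(32 <= b < 127 for b in escaped))

# Decoding the escapes recovers the original string exactly.
restored = escaped.decode("unicode_escape")
print(restored == original)
```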

ehiggs commented 7 years ago

FWIW, this is still broken.

deads commented 7 years ago

Are you sure? It works for me. I tried:

it=paratext.load_raw_csv("/tmp/hello.csv", no_header=True, allow_quoted_newlines=True)

In [2]: it.next()
Out[2]: 
(u'col0',
 array([0,

Also, if I try:

In [3]: paratext.load_csv_to_pandas("/tmp/hello.csv", no_header=True).head()
Out[3]: 
    col0 col1 col2   col3 col4
0  hello    ,       world    !
1  hello    ,       world    !
2  hello    ,       world    !
3  hello    ,       world    !
4  hello    ,       world    !

it works.

Perhaps you do not have the latest source or did not properly rebuild.

Try doing a git pull and removing the build/ directory:


git pull
rm -rf build

ehiggs commented 7 years ago

I found that I couldn't reproduce it when writing a test in tests/test_paratext.py, but it fails when reading from stdin. IIRC you use mmap when reading the file, so stdin won't work, and it certainly wouldn't make sense to read a stream with multiple threads, so maybe it's moot. Failing on stdin is fine, but crashing with mysterious errors is not a nice UX.
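
One possible caller-side workaround, sketched in Python 3, is to copy the stream to a regular file first and to refuse inputs that are empty or not regular files before handing the path to the loader. The helper names `spool_to_tempfile` and `is_nonempty_regular_file` are hypothetical, and the assumption that the loader needs a regular, non-empty file comes from the mmap recollection and the empty-file segfault reported above, not from paratext's documentation.

```python
import io
import os
import shutil
import stat
import tempfile


def spool_to_tempfile(stream):
    """Copy a binary stream to a temporary file and return the file's path.

    The caller is responsible for deleting the file afterwards.
    """
    with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
        shutil.copyfileobj(stream, tmp)
        return tmp.name


def is_nonempty_regular_file(path):
    """Return True if `path` is a regular file containing at least one byte."""
    st = os.stat(path)
    return stat.S_ISREG(st.st_mode) and st.st_size > 0


# Demonstration with an in-memory stream; a script reading a pipe would
# pass sys.stdin.buffer instead.
path = spool_to_tempfile(io.BytesIO(b'hello,","," ",world,"!"\n'))
try:
    if is_nonempty_regular_file(path):
        # Hand `path` to the CSV loader at this point.
        print("ready to parse", path)
    else:
        print("refusing to parse an empty or non-regular file")
finally:
    os.remove(path)
```

A check like `is_nonempty_regular_file` turns a pipe or an empty file into a clear message instead of a hang or a segfault, at the cost of one extra copy of the input.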

ehiggs commented 7 years ago

I was able to get paratext added to the game in the end: https://bitbucket.org/ewanhiggs/csv-game

ehiggs commented 7 years ago

As this is closed, I entered #62 to handle the stdin issue. Thanks!