Closed: ehiggs closed this 7 years ago
Sorry for the delay in responding to your issue. It has been a busy summer at work.
It was my intention to get a fairly comprehensive test suite working before fixing bugs that the community reports. This way we can ensure that one fix does not break earlier fixes. The issue you were experiencing was a bug in the algorithm for dividing the file. In the latest master, the example you gave above seems to work for any number of threads. This is not surprising, since the latest master incorporates quite a bit of fuzz-like testing on adversarial input.
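To make "dividing the file" concrete, here is a minimal sketch of one common way to split a buffer into per-thread byte ranges that end on line boundaries. This is an illustration only, not paratext's actual algorithm: the function name `newline_aligned_chunks` is hypothetical, and the sketch deliberately ignores newlines inside quoted fields, which a real CSV splitter has to handle.

```python
def newline_aligned_chunks(data, num_chunks):
    """Split `data` (bytes) into at most `num_chunks` (start, end) ranges.

    Every range ends just after a newline, or at the end of the data.
    Hypothetical helper for illustration; quoted newlines are NOT handled.
    """
    size = len(data)
    if size == 0:
        return []  # nothing to divide, so no thread gets work
    boundaries = [0]
    for i in range(1, num_chunks):
        # Start from an even split point, never before the previous cut.
        guess = max(i * size // num_chunks, boundaries[-1])
        newline = data.find(b"\n", guess)
        cut = size if newline == -1 else newline + 1
        # Skip cuts that would produce an empty or duplicate chunk.
        if boundaries[-1] < cut < size:
            boundaries.append(cut)
    boundaries.append(size)
    return list(zip(boundaries[:-1], boundaries[1:]))


if __name__ == "__main__":
    sample = b"hello,world\n" * 5
    for start, end in newline_aligned_chunks(sample, 3):
        print(start, end, sample[start:end])
```

Checking that the chunks reassemble to the original input for every thread count is the kind of property a fuzz-style test can assert.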
The paratext package provides several helper functions for generating arbitrary data frames for testing. One of these functions is called generate_hell_frame.
df=paratext.testing.generate_hell_frame(1000, 5, fmt="mixed")
In this frame, there are UTF-8 columns, arbitrary byte sequences, 7-bit ASCII strings, and printable ASCII strings. The data of these columns will contain arbitrary punctuation, double quoting, newlines, and escape characters as well as non-UTF-8 and non-ASCII data.
There is another utility function that writes the data to a file:
paratext.serial.save_frame("myfile.csv", df, encoding=encoding)
where encoding can be utf-8, ascii, printable-ascii, or arbitrary. In each case, if a sequence is encountered outside the encoding, it is properly escaped. This enables a data frame containing Unicode columns, byte-buffer columns, and string columns to be written as printable ASCII and read back losslessly.
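As an illustration of what a lossless round trip through printable ASCII involves, here is a minimal standalone sketch. It is not paratext's escaping scheme: the function names and the `\xNN` convention are assumptions chosen for the example.

```python
def escape_to_printable_ascii(data):
    """Encode arbitrary bytes as a printable-ASCII string (illustrative scheme)."""
    out = []
    for b in data:
        if b == 0x5C:                 # backslash must itself be escaped
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:       # printable ASCII passes through
            out.append(chr(b))
        else:                         # everything else becomes \xNN
            out.append("\\x%02x" % b)
    return "".join(out)


def unescape_from_printable_ascii(text):
    """Invert escape_to_printable_ascii, returning the original bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(ord(text[i]))
            i += 1
        elif text[i + 1:i + 2] == "\\":
            out.append(0x5C)
            i += 2
        elif text[i + 1:i + 2] == "x":
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            raise ValueError("bad escape at offset %d" % i)
    return bytes(out)


if __name__ == "__main__":
    raw = b"caf\xc3\xa9,\"quoted\"\n\xff\\"
    escaped = escape_to_printable_ascii(raw)
    assert unescape_from_printable_ascii(escaped) == raw
    print(escaped)
```

Because the backslash is always escaped, the mapping is unambiguous, which is what makes the round trip lossless.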
FWIW, this is still broken.
Are you sure? It works for me. I tried:
it=paratext.load_raw_csv("/tmp/hello.csv", no_header=True, allow_quoted_newlines=True)
In [2]: it.next()
Out[2]:
(u'col0',
array([0,
Also, if I try:
In [3]: paratext.load_csv_to_pandas("/tmp/hello.csv", no_header=True).head()
Out[3]:
col0 col1 col2 col3 col4
0 hello , world !
1 hello , world !
2 hello , world !
3 hello , world !
4 hello , world !
it works.
Perhaps you do not have the latest source, or did not properly rebuild.
Try doing a git pull and removing the build/ directory before rebuilding:
git pull
rm -rf build
I found that I couldn't reproduce it when writing a test in tests/test_paratext.py. But it fails when reading from stdin. IIRC you use mmap when reading the file, so that won't work, and it certainly wouldn't make sense to do this with multiple threads, so maybe it's moot. It's fine for this case to fail, but crashing with mysterious errors is not a nice UX.
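For what it's worth, a caller can guard against this before handing a path to an mmap-based reader. The sketch below uses only the Python standard library. The helper names `is_mappable` and `spool_to_tempfile` are hypothetical and not part of paratext. It relies on the fact that mmap needs a regular, non-empty file, so pipes such as stdin are either rejected with a clear message or copied to a temporary file first.

```python
import os
import shutil
import stat
import tempfile


def is_mappable(path):
    """Return True if `path` is a regular, non-empty file (what mmap needs)."""
    try:
        st = os.stat(path)
    except OSError:
        return False
    return stat.S_ISREG(st.st_mode) and st.st_size > 0


def spool_to_tempfile(stream):
    """Copy a binary stream (e.g. sys.stdin.buffer) to a temporary file.

    Returns the temporary file's path; the caller is responsible for
    deleting it afterwards.
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
        shutil.copyfileobj(stream, tmp)
        return tmp.name


if __name__ == "__main__":
    import sys

    path = sys.argv[1] if len(sys.argv) > 1 else "/dev/stdin"
    if not is_mappable(path):
        # Pipes, terminals, and empty files cannot be memory-mapped.
        sys.exit("error: %s is not a regular, non-empty file" % path)
    print("ok to mmap:", path)
```

Failing early with a message like this, or spooling stdin to disk, avoids the mysterious crash at the cost of one extra check or copy.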
I was able to get paratext added to the game in the end: https://bitbucket.org/ewanhiggs/csv-game
As this is closed, I entered #62 to handle the stdin issue. Thanks!
Hi there. I have a csv-game on bitbucket. I ran the test file through paratext and it failed. The test file is generated with this script:
The code I use is as follows:
Changing to num_threads=1 fixes this, but obviously it's a racy bug. There are a few other associated bugs.
To get an idea of the baseline performance, I also parse an empty file. paratext segfaults:
Normally, to test how we process a csv file, we should be able to use basic command-line tools to subsample the file and pass it to the csv reader. This seems to fail with paratext. The following hangs at 100% CPU and doesn't respond to SIGINT (iow I can't use Ctrl-C and must use SIGTSTP (Ctrl-Z) and then kill %1):