wiseio / paratext

A library for reading text files over multiple cores.
Apache License 2.0

perf problems #58

Closed: cottrell closed this issue 7 years ago

cottrell commented 7 years ago

I'm getting consistently better timings with pandas.read_csv than with paratext. Is there perhaps a build problem on OS X? For example, this is typical on a smallish dataset (1 million rows):

In [8]: %time d = pd.read_csv('j.csv', header=None, dtype=str)
CPU times: user 4.15 s, sys: 422 ms, total: 4.57 s
Wall time: 4.57 s

In [9]: %time df = paratext.load_csv_to_dict('j.csv', no_header=True)
CPU times: user 14.5 s, sys: 914 ms, total: 15.4 s
Wall time: 6.45 s

In [12]: paratext.__version__
Out[12]: '0.3.1rc1'

$ python --version
Python 3.5.3 :: Anaconda custom (x86_64)

deads commented 7 years ago

All of our public benchmarks were on Linux machines with multiple SSDs and at least 16 cores. Most of our internal use of paratext is in a server environment. As such, we have not extensively optimized paratext performance for Mac OS X. However, other factors could also explain the difference.

Can you give more details about the file you are trying to load and the specs of your machine?

cottrell commented 7 years ago

Quite likely the number of cores, then. The file is a sample from the UK Land Registry dataset. I wasn't careful with the file cache, but I always ran pandas first, which I would think makes the second (paratext) run faster. If the benchmarks are only supposed to look good from a cold state, that could be it.

It would be good to know exactly in which situations one should look into this library. The 10x perf is quite attractive. I'm typically on 4-12 core Linux machines on a cluster at work, but ... NFS.

$ wc j.csv
 1000000 5323110 175040402 j.csv
$ du -sh j.csv
167M    j.csv
$ head j.csv
"{61D50B1A-FBBB-43B9-BFB3-10794185519D}","41950","1995-10-20 00:00","CO4 3FS","S","N","F","14","","TURNSTONE END","COLCHESTER","COLCHESTER","COLCHESTER","ESSEX","A","A"
"{7A9B0334-22C7-4F3B-BC5C-095E19C38C75}","96500","1995-09-22 00:00","RM4 1PX","S","N","F","FAIRWAY","","NORTH ROAD","HAVERING-ATTE-BOWER","ROMFORD","HAVERING","GREATER LONDON","A","A"
"{E585DFCF-8323-4C8D-A015-095E1E1272D0}","27500","1995-08-08 00:00","PO6 3RR","T","N","F","18","","HARLESTON ROAD","PORTSMOUTH","PORTSMOUTH","PORTSMOUTH","PORTSMOUTH","A","A"
"{4E698F7E-AB41-4722-800B-095E373063A2}","53000","1995-12-11 00:00","TS16 9EB","S","N","F","7","","WENTWORTH WAY","EAGLESCLIFFE","STOCKTON-ON-TEES","STOCKTON-ON-TEES","STOCKTON-ON-TEES","A","A"
"{2701C6AF-88A1-44BD-B108-0CE7CFAE171F}","75000","1995-03-30 00:00","IG7 6ET","D","N","F","34","","LAMBOURNE ROAD","CHIGWELL","CHIGWELL","EPPING FOREST","ESSEX","A","A"
"{A2B0C762-9DE8-4FC2-8350-0CE7FD7F0C46}","98000","1995-08-25 00:00","NW8 6ER","F","Y","L","61","FLAT 1","QUEENS GROVE","LONDON","LONDON","CITY OF WESTMINSTER","GREATER LONDON","A","A"
"{98666A50-C668-4369-8679-0CE80094AB81}","20000","1995-11-10 00:00","BB1 1SP","T","N","L","56","","NOTTINGHAM STREET","BLACKBURN","BLACKBURN","BLACKBURN","LANCASHIRE","A","A"
"{54F5CCDB-BB36-4305-9AB9-0CE805F2AAE9}","73500","1995-09-08 00:00","SE3 7TW","F","N","L","WYCOMBE COURT","15","ST JOHNS PARK","LONDON","LONDON","GREENWICH","GREATER LONDON","A","A"
"{62D37651-D34F-4AE8-87C9-141AFA53830F}","58400","1995-02-24 00:00","SG2 7DF","T","N","F","53","","THE PASTURES","STEVENAGE","STEVENAGE","STEVENAGE","HERTFORDSHIRE","A","A"
"{E4E27C38-F733-4383-B098-141AFBF0AA4D}","79000","1995-11-09 00:00","RH17 5BL","T","N","F","HARRADINES COTTAGES","2","LONDON LANE","CUCKFIELD","HAYWARDS HEATH","MID SUSSEX","WEST SUSSEX","A","A"

$ system_profiler SPHardwareDataType
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro8,1
      Processor Name: Intel Core i5
      Processor Speed: 2.3 GHz
      Number of Processors: 1
      Total Number of Cores: 2
      L2 Cache (per Core): 256 KB
      L3 Cache: 3 MB
      Memory: 16 GB

deads commented 7 years ago

First, there needs to be enough I/O bandwidth for a multi-core, multi-threaded approach to pay off. This can be achieved in a number of ways, one of which is to combine several SSDs in a RAID configuration. With one thread, the workload is CPU-bound. As you add threads, more of the available I/O bandwidth is consumed, up to the point where either all the cores are saturated or most of the I/O bandwidth is used. Mac laptops are pretty limited in parallelism and SSD throughput.
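To get a rough feel for how much I/O bandwidth is available, one option is to time a plain sequential read of the file with no parsing at all. The sketch below uses only the Python standard library; the function name `read_throughput_mb_s` is made up for illustration and is not part of paratext. Note that a second run on the same file mostly measures the OS page cache rather than the disk.

```python
import time


def read_throughput_mb_s(path, chunk_size=1 << 20):
    """Read a file sequentially in binary mode and return MB/s.

    Hypothetical helper, not part of paratext. No parsing is done,
    so the result is an upper bound on what any CSV reader can reach
    from this storage in the current cache state.
    """
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 avoids an extra copy through Python's own buffer.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)


# Example (file name taken from the report above):
# print("%.1f MB/s" % read_throughput_mb_s("j.csv"))
```

If the number this reports is not much higher than the file size divided by the single-threaded parse time, adding threads is unlikely to help much.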

Second, paratext has a reduce step that Pandas does not need, so with a small number of cores, the cost of reducing cannot easily be made up by increases in throughput.
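As a rough illustration of where that extra cost comes from, here is a toy map/reduce CSV load in pure Python. It is only a sketch of the general shape, not paratext's actual implementation: the function names are invented, chunks are split on line boundaries (which assumes no quoted field contains a newline), and CPython threads will not actually parse in parallel because of the GIL.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines):
    """Map step: parse one chunk of CSV lines into per-column lists."""
    columns = None
    for row in csv.reader(lines):
        if columns is None:
            columns = [[] for _ in row]
        for column, value in zip(columns, row):
            column.append(value)
    return columns or []


def reduce_chunks(chunk_results):
    """Reduce step: concatenate per-chunk columns into whole columns.

    A single-threaded reader never has to do this copy.
    """
    merged = []
    for columns in chunk_results:
        if not merged:
            merged = [[] for _ in columns]
        for whole, part in zip(merged, columns):
            whole.extend(part)
    return merged


def load_chunked(lines, num_chunks=2):
    """Split lines into chunks, parse each chunk, then merge."""
    size = max(1, -(-len(lines) // num_chunks))  # ceiling division
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        results = list(pool.map(parse_chunk, chunks))
    return reduce_chunks(results)
```

With many cores the map step shrinks while the reduce step stays roughly the same size, so the merge is cheap relative to the gain. With two cores there is much less gain to pay for it.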

Third, there is a lot of text data in that file. Since we internally use paratext for ML datasets, it treats string fields as categorical by default. paratext maintains a hash table mapping unique strings to integers for each categorical column. This can slow things down if a large column consists mostly of unique strings. You can override this setting by using the text_names parameter:

paratext.load_csv_to_dict("j.csv", text_names=["col0","col1","col2",...])

Alternatively, you can set the maximum number of levels in a categorical column with the max_levels keyword argument:

paratext.load_csv_to_dict("j.csv", max_levels=0)
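To make the categorical cost concrete, here is a minimal pure-Python sketch of string-to-integer encoding with a per-column hash table. It illustrates the general technique only and is not paratext's code; `encode_categorical` and the sample values are made up for the example.

```python
def encode_categorical(values):
    """Map each distinct string to an integer code, in first-seen order.

    Hypothetical helper for illustration; not part of paratext.
    """
    levels = {}  # string -> integer code (the per-column hash table)
    codes = []
    for value in values:
        code = levels.get(value)
        if code is None:
            code = len(levels)
            levels[value] = code
        codes.append(code)
    return codes, levels


# Low-cardinality column (like a county field): the table stays small.
county_codes, county_levels = encode_categorical(
    ["ESSEX", "GREATER LONDON", "ESSEX", "GREATER LONDON"])

# High-cardinality column (like a unique ID): one table entry per row,
# so the table grows as large as the column itself.
id_codes, id_levels = encode_categorical(["id-%d" % i for i in range(4)])

print(county_codes, len(county_levels))  # [0, 1, 0, 1] 2
print(id_codes, len(id_levels))          # [0, 1, 2, 3] 4
```

For a column where nearly every value is distinct, every row pays for a hash insert and nothing is saved, which is why treating such columns as plain text can be faster.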

Fourth, as the benchmarks show, string creation is slow, which affects throughput relative to the I/O bandwidth for text data.
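One way to see the string-creation cost from the Python side is a small microbenchmark that compares scanning byte fields with materializing each one as a `str` object. This is a rough sketch with made-up helper names; it measures CPython object creation only, not paratext's internal parsing path, and the absolute numbers will vary by machine.

```python
import timeit

# 100,000 raw byte fields, reusing place names from the sample rows.
raw_fields = [b"COLCHESTER", b"ROMFORD", b"PORTSMOUTH", b"STEVENAGE"] * 25000


def decode_all(fields):
    """Create one Python str object per field."""
    return [f.decode("utf-8") for f in fields]


def lengths_only(fields):
    """Touch every field without creating any new string objects."""
    return [len(f) for f in fields]


decode_seconds = timeit.timeit(lambda: decode_all(raw_fields), number=5)
scan_seconds = timeit.timeit(lambda: lengths_only(raw_fields), number=5)

print("decode to str: %.3f s" % decode_seconds)
print("scan only:     %.3f s" % scan_seconds)
```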

Fifth, regarding NFS: I have not used it for a very long time. It all depends on how the NFS is tuned, the network hardware and configuration, and the workload. NFS is usually tuned for cumulative throughput across all users, but read throughput from a single workstation may be rather limited.