wiseio / paratext

A library for reading text files over multiple cores.
Apache License 2.0

Implement file chunk-based reading (#45). #68

Open tdenniston opened 6 years ago

tdenniston commented 6 years ago

Hello,

We needed to parse larger-than-memory CSV files, so this PR is my attempt at implementing chunk-based file reading (issue #45). Usage looks something like this:

ParaText::CSV::ColBasedLoader loader;
ParaText::ParseParams params;
params.num_threads = 4;
params.chunked_file_reading = true;
params.file_chunk_size = 1024 * 1024; // Approximate number of bytes to read from the input file per chunk

loader.load(inputfile, params);
do {
  std::vector<float> col0vals;
  auto inserter = std::back_inserter(col0vals);
  loader.copy_column<decltype(inserter), size_t>(0, inserter);
} while (loader.load_next());

I'm grateful for any feedback on this, and I'd be happy to make any changes you guys may want.
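
For readers unfamiliar with the load / load_next pattern, here is a minimal standalone sketch of chunk-based reading. It is not the code in this PR and does not use ParaText: `ChunkReader`, `ColumnSummary`, and `summarize_first_column` are hypothetical names made up for illustration. The sketch assumes that a chunk is extended to the next newline so that no row is split across chunks, which is one plausible reason a chunk size would be only approximate. It also assumes a headerless CSV whose first column is numeric.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of ParaText): reads about `chunk_size` bytes
// at a time, then extends the chunk to the next newline so rows stay whole.
class ChunkReader {
public:
  ChunkReader(std::istream &in, std::size_t chunk_size)
      : in_(in), chunk_size_(chunk_size) {
    assert(chunk_size > 0 && "chunk size must be positive");
  }

  // Loads the next chunk; returns false once the input is exhausted.
  bool load_next() {
    chunk_.assign(chunk_size_, '\0');
    in_.read(&chunk_[0], static_cast<std::streamsize>(chunk_size_));
    chunk_.resize(static_cast<std::size_t>(in_.gcount()));
    if (chunk_.empty()) return false;
    // Finish the partially read row. If the read above hit end-of-file,
    // the stream is already in a failed state and get() does nothing.
    char c;
    while (chunk_.back() != '\n' && in_.get(c)) chunk_.push_back(c);
    return true;
  }

  const std::string &chunk() const { return chunk_; }

private:
  std::istream &in_;
  std::size_t chunk_size_;
  std::string chunk_;
};

// Running totals carried across chunks.
struct ColumnSummary {
  std::size_t rows = 0;
  double sum = 0.0;
};

// Sums the first column one chunk at a time, so memory use is bounded by
// roughly one chunk plus one row rather than by the size of the file.
ColumnSummary summarize_first_column(std::istream &in, std::size_t chunk_size) {
  ChunkReader reader(in, chunk_size);
  ColumnSummary summary;
  while (reader.load_next()) {
    std::istringstream lines(reader.chunk());
    std::string line;
    while (std::getline(lines, line)) {
      if (line.empty()) continue;
      // Text before the first comma (or the whole line if there is none).
      summary.sum += std::stod(line.substr(0, line.find(',')));
      ++summary.rows;
    }
  }
  return summary;
}
```

Calling `summarize_first_column` on the same input with a chunk size of 4 bytes or 1024 bytes should give the same row count and sum, because only the running totals live longer than a single chunk. The loop in the snippet above has the same shape: each pass sees one chunk's worth of column values.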