zemirco / json2csv-stream

Transform stream from json to csv

Missing records while converting a 10k records using json2csv-stream module #10

Closed vismayshah90 closed 5 years ago

vismayshah90 commented 8 years ago

Hi Team,

I'm trying to use the json2csv-stream module to convert a JSON object to CSV. I have observed that when I increase the size of the JSON to around 10k records, some of the records go missing from the CSV output without any error being thrown. To cross-verify, I ran the JSON through JSLint and it is valid JSON.

Sample JSON: [{"name":"xyz","subject":"CS"}, {"name":"abc","subject":"Maths"}, .... 10k records]

Sample Code:

```js
var fs = require('fs');
const jsonStream = require('json2csv-stream');

const parser = new jsonStream();

const file = fs.createWriteStream('Demo.csv')
  .on('close', () => {
    console.log('File Done');
    return;
  })
  .on('error', (err) => {
    console.log('File NOT Done');
    return;
  });

fs.createReadStream('Demo.json').pipe(parser).pipe(file);
```

@knownasilya, please suggest a solution or point out what I'm doing wrong.

jstephens7 commented 7 years ago

I'm experiencing similar issues while converting a stream with just under 45k records. Approximately 0.1% of the records are not coming back (about 45 missing).

jstephens7 commented 7 years ago

In addition, it should be noted that the problem does not appear to lie with any individual records; something is skipping them occasionally. I can run the parser on the same file multiple times and get different results: records skipped in one run are present in another.

vismayshah90 commented 7 years ago

There is a bug in the module, but I haven't tested it recently.

knownasilya commented 7 years ago

If someone can provide a sample dataset that encounters the issue, I could have a look.

jstephens7 commented 7 years ago

I'll send a mock data file that's close to what I'm using. (Still getting the errors with it)

jstephens7 commented 7 years ago

@knownasilya Did you receive the file? It was too large to attach.

knownasilya commented 7 years ago

No. Do you think you could paste it into a gist?

jstephens7 commented 7 years ago

Any update?

jstephens7 commented 7 years ago

I haven't worked with streams much, but I found the problem and a potential (not great) solution. The problem is that the objects are getting split between chunks.

I saw there's an objectMode option (not sure what all the effects of this would be) which could probably be used: https://nodejs.org/api/stream.html#stream_object_mode

The other solution I had was to keep a variable that stores the remnants of the last chunk and appends them to the new chunk before trying to match. I'll create a fork and link the diff here.

Edit: Attached link to diff to show a solution for the problem https://github.com/jstephens7/json2csv-stream/commit/14ba745a3b24dff581a25eb0d095c2bf8836b84b
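A minimal sketch of that carry-over idea (not the linked diff itself; the class name and regex are illustrative, and it assumes flat objects with no nested braces or braces inside string values):

```js
const { Transform } = require('stream');

// Sketch of the carry-over idea: keep the tail of the previous chunk that did
// not end on a complete object and prepend it to the next chunk before matching.
class NaiveObjectSplitter extends Transform {
  constructor() {
    super();
    this._remainder = '';
  }

  _transform(chunk, encoding, callback) {
    const data = this._remainder + chunk.toString('utf8');
    const objectPattern = /\{[^{}]*\}/g;

    let lastIndex = 0;
    let match;
    while ((match = objectPattern.exec(data)) !== null) {
      this.push(match[0] + '\n');        // emit one complete object per line
      lastIndex = objectPattern.lastIndex;
    }

    // Whatever follows the last complete object is carried into the next chunk.
    this._remainder = data.slice(lastIndex);
    callback();
  }

  _flush(callback) {
    // Any trailing partial data here means the input was truncated or malformed.
    this._remainder = '';
    callback();
  }
}
```

The important part is `this._remainder`: any bytes after the last complete object are held back and prepended to the next chunk, so an object split across two chunks is still matched instead of silently dropped.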

jstephens7 commented 7 years ago

I'm sure there's a better way to do the same thing using a similar pattern with that._data but as I'm unfamiliar with the code I figured I would just show a proof of concept and have someone more familiar clean it up.

knownasilya commented 7 years ago

I think the solution would be to use another module (https://www.npmjs.com/package/JSONStream) that handles JSON streams, instead of the home-baked regex solution.
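Roughly what that could look like, as a sketch only (the column list, escaping, and file names are assumptions, not part of this module):

```js
const fs = require('fs');
const JSONStream = require('JSONStream');
const { Transform } = require('stream');

// Let JSONStream emit one parsed object per array element, then serialize
// each object to a CSV row. Columns and quoting here are illustrative.
const columns = ['name', 'subject'];

const toCsv = new Transform({
  writableObjectMode: true,
  transform(record, encoding, callback) {
    const row = columns
      .map((key) => `"${String(record[key] ?? '').replace(/"/g, '""')}"`)
      .join(',');
    callback(null, row + '\n');
  },
});

fs.createReadStream('Demo.json')
  .pipe(JSONStream.parse('*'))   // emits each element of the top-level array
  .pipe(toCsv)
  .pipe(fs.createWriteStream('Demo.csv'));
```

`JSONStream.parse('*')` hands over each element of the top-level array as an already-parsed object, so chunk boundaries are no longer the caller's problem.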

jstephens7 commented 7 years ago

Like I said, that solution works, although it's not ideal. It just illustrates the problem (JSON objects getting caught in multiple chunks). I can work on another solution tonight once I'm home and make a PR.

knownasilya commented 7 years ago

Please use the above module when you work on your PR. I think it should solve the issue that you reproduced.

jstephens7 commented 7 years ago

Do we know for sure whether that module solves the problem we're looking at?

knownasilya commented 7 years ago

That module is battle-tested, and its one purpose is to parse JSON from a stream, so it should solve the issue.

jstephens7 commented 7 years ago

Alright, I'll leave it to you then.

kaue commented 7 years ago

@jstephens7 @knownasilya I just added support for streams in jsonexport v2.0.0: https://github.com/kauegimenes/jsonexport. I made some tests and it looks like it's working well with big collections. Can you give it a try?
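The stream usage looks roughly like this (based on the jsonexport README at the time; file names are placeholders):

```js
const fs = require('fs');
const jsonexport = require('jsonexport');

// Pipe a stream of JSON through jsonexport to get CSV out.
fs.createReadStream('Demo.json')
  .pipe(jsonexport())
  .pipe(fs.createWriteStream('Demo.csv'));
```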

jstephens7 commented 7 years ago

Thanks @kauegimenes, I'll take a look later on today.

jstephens7 commented 7 years ago

It seems quite a bit heavier and less efficient than the current one (even with my bad fix). My CPU was churning on my stream (approximately 45k records) for over 20 seconds before I stopped it.

kaue commented 7 years ago

@jstephens7 I just made some changes for better performance. Benchmark using a 50k collection and jsonexport v2.0.1:

```
Executed benchmark against node module: "json2csv-stream"
Count (1), Cycles (1), Elapsed (7.024 sec), Hz (2.8066241073013356 ops/sec)

Executed benchmark against node module: "json2csv"
Count (1), Cycles (1), Elapsed (6.117 sec), Hz (5.5762221530032985 ops/sec)

Executed benchmark against node module: "jsonexport"
Count (1), Cycles (1), Elapsed (6.309 sec), Hz (3.461989382421796 ops/sec)

Executed benchmark against node module: "jsonexport-stream"
Count (1), Cycles (1), Elapsed (7.411 sec), Hz (1.93111416159437 ops/sec)
```

jstephens7 commented 7 years ago

I'm guessing the churning is due to the fact that my data is not proper JSON. Instead of an array it is more of a stream of objects (separated by spaces). I'll try formatting it tomorrow and see if that fixes it.

kaue commented 7 years ago

@jstephens7 That would explain it, but it should give up if it's not able to parse a single object after 3 chunks of data from the stream.

kaue commented 7 years ago

@jstephens7 I had the same issue you had using mongoexport; in my case the problem was the _id: ObjectId("..."). I updated jsonexport to print the error instead of appearing stuck.

jstephens7 commented 7 years ago

Ahh, thanks. I'll look again.

knownasilya commented 5 years ago

https://github.com/zemirco/json2csv now has a streams API; please use that module, as this one is deprecated/unmaintained.
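A minimal sketch of that migration, assuming the json2csv Transform stream API (the fields option and file names are illustrative and may vary by version):

```js
const fs = require('fs');
const { Transform } = require('json2csv');

// Stream JSON in, CSV out, using the json2csv Transform stream.
const json2csv = new Transform({ fields: ['name', 'subject'] });

fs.createReadStream('Demo.json')
  .pipe(json2csv)
  .pipe(fs.createWriteStream('Demo.csv'));
```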