zazuko / data-cube-curation

MIT License
1 stars 0 forks source link

Coma-separated file cannot be uploaded (semicolon works) #116

Closed tpluscode closed 4 years ago

tpluscode commented 4 years ago

For some arbitrary reason, uploading a source CSV file tries to parse it wit ; as the separator. The CSV type however clearly require the use of ,, as implied by the very acronym.

Should we be configurable somehow? Or maybe change to commas and reject semicolon?

@l00mi @ktk what to you guys think?

Current behavior

Only semicolon-separate file will be parsed correctly

Expected behavior

Accept only comma?

Requirements

ktk commented 4 years ago

@tpluscode we can fix that to comma for the moment but ideally this has to be configurable. I've seen everything from commas to tildes to semicolons and no one ever bothers to change the extension

martinmaillard commented 4 years ago

When uploading the CSV file, we could add a form where users can define the "separator" and "escape char" (any maybe "encoding"?) of their CSV.

In a future version, we may offer a live pre-visualization of the file where they can see if it's parsed properly with the params they provided, like Excel does. But I think it's not a priority.

tpluscode commented 4 years ago

Yes, we talked about the multi-step upload but this would greatly affect the entire process.

I agree about the possible separator option. It could be part of the request, more orthodox solution would probably be a multiple part body but I think it may be difficult due to hydra-box's pipeline limitation currently.

On the other hand we might just accept a query param solution.

POST /sources?separator=;
Content-Type: text/csv

... body
l00mi commented 4 years ago

Another approach might be something like: https://www.npmjs.com/package/detect-csv or https://www.npmjs.com/package/csv-sniffer

Because I do not think, that users of this interface will know about their format. So like this we cover 90% > cases. And the rest will need an interaction anyway.

martinmaillard commented 4 years ago

Auto-detection seems like a good option, but it still requires adapting the API. And I think we'd still want users to be able to change what was detected. (I guess you meant server-side, sorry)

@tpluscode multi-part body is not an option with hydra-box? If that's the case, I'm fine with query params for now...

tpluscode commented 4 years ago

multi-part body is not an option with hydra-box?

the same problem we have currently with upload limit - hydra middleware greedily parser the entire body into a string. that would likely rule out using another middleware like multer. but maybe not if we can wire up express specifically to order them in the right way

Another approach might be something like: detect-csv or csv-sniffer

AHA! by all means let's do that. I was thinking about auto-detection but somehow did not occur to me that libraries likely already exits 😅