xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0
1.27k stars 293 forks

How to check file size after each row? #364

Closed saartamir closed 3 years ago

saartamir commented 3 years ago

I'd like to limit my parquet files to a specific size. I'm trying to use writer.Size and writer.ObjsSize, but whether I check Size + ObjsSize or just Size against fileLimit, the resulting files are always about twice the size I configure. Is there an accurate way to do this?

hangxie commented 3 years ago

I use Offset for a similar requirement, but I treat it as a hint rather than a hard limit: WriteStop() writes extra data such as the column index, so the file is always a bit larger than the limit. In most cases the file is about 10% larger than the limit (Offset), but I can live with that.

Size is the size of the current row group. RowGroupSize would be a better name, but that was already taken by the row group size limit ...

saartamir commented 3 years ago

@hangxie so do you accumulate the offsets of all messages until you reach the file limit?

hangxie commented 3 years ago

do you accumulate the offsets

I check Offset after each write; again, the size limit is a hint, not a hard limit.
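For illustration, here is a minimal sketch of that pattern in Go. It does not use parquet-go: `fakeWriter`, `footerLen` and `splitBySoftLimit` are hypothetical stand-ins, and the row and footer sizes are made up. It only shows the shape of the loop: check the offset after each write, and start a new file once the limit is reached.

```go
package main

import "fmt"

const (
	magicLen  = 4  // a parquet file starts with the 4-byte magic "PAR1"
	footerLen = 10 // made-up size of the data added when the file is finalized
)

// fakeWriter is a hypothetical stand-in for a parquet writer, not the
// parquet-go API. It only tracks how many bytes have been written so far.
type fakeWriter struct {
	offset int64
}

func newFakeWriter() *fakeWriter { return &fakeWriter{offset: magicLen} }

// Write appends one row and advances the offset by the row length.
func (w *fakeWriter) Write(row string) { w.offset += int64(len(row)) }

// Stop finalizes the file and returns its final size, footer included.
func (w *fakeWriter) Stop() int64 { return w.offset + footerLen }

// splitBySoftLimit writes rows and starts a new file once the offset
// reaches limit. It returns the final size of every file produced.
func splitBySoftLimit(rows []string, limit int64) []int64 {
	var sizes []int64
	w := newFakeWriter()
	pending := false // true when the current file holds at least one row
	for _, row := range rows {
		w.Write(row)
		pending = true
		// The limit is a hint: the finished file exceeds it by whatever
		// the last row and the footer add.
		if w.offset >= limit {
			sizes = append(sizes, w.Stop())
			w = newFakeWriter()
			pending = false
		}
	}
	if pending {
		sizes = append(sizes, w.Stop())
	}
	return sizes
}

func main() {
	rows := make([]string, 25)
	for i := range rows {
		rows[i] = "0123456789" // 10 bytes per row
	}
	fmt.Println(splitBySoftLimit(rows, 100)) // prints [114 114 64]
}
```

With a limit of 100, the full files end up at 114 bytes, which is the "a bit larger than the limit" effect described above.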

xitongsys commented 3 years ago

Hi @saartamir, sorry for the late response.

It's hard to know the exact size because parquet compresses the data. parquet-go has an internal buffer, which it compresses and flushes once it is full.

As @hangxie said, the Offset may be the most accurate value.
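To make the buffering point concrete, here is a rough model in Go. It is only a sketch under stated assumptions, not the parquet-go implementation: `bufferedWriter`, `bufLimit` and the fixed 2:1 compression ratio are all hypothetical. It shows why a flushed-bytes counter can lag behind the amount of data already handed to the writer.

```go
package main

import "fmt"

const magicLen = 4 // a parquet file starts with the 4-byte magic "PAR1"

// bufferedWriter is a hypothetical model of a writer with an internal
// buffer. It is not the parquet-go API.
type bufferedWriter struct {
	offset   int64 // bytes already flushed to the file
	buffered int64 // uncompressed bytes still waiting in memory
	bufLimit int64 // buffer size that triggers a flush
}

func newBufferedWriter(bufLimit int64) *bufferedWriter {
	return &bufferedWriter{offset: magicLen, bufLimit: bufLimit}
}

// Write buffers the row and flushes once the buffer reaches bufLimit.
func (w *bufferedWriter) Write(row string) {
	w.buffered += int64(len(row))
	if w.buffered >= w.bufLimit {
		w.Flush()
	}
}

// Flush moves the buffered bytes to the file. Compression is modeled as
// a made-up fixed 2:1 ratio, purely for illustration.
func (w *bufferedWriter) Flush() {
	w.offset += w.buffered / 2
	w.buffered = 0
}

func main() {
	w := newBufferedWriter(50)
	for i := 1; i <= 6; i++ {
		w.Write("0123456789") // 10 bytes per row
		fmt.Printf("row %d: offset=%d buffered=%d\n", i, w.offset, w.buffered)
	}
}
```

In this model the offset stays at 4 for the first four rows and only jumps (to 29) when the fifth row fills the buffer, so the offset is accurate only for data that has already been flushed.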

saartamir commented 3 years ago

@xitongsys @hangxie sorry for the late reply. When I try to use Offset, it keeps giving me a constant value (4), which doesn't help me here. How can I use Offset, as you said, to evaluate the file size?