Closed saartamir closed 3 years ago
I use `Offset` for a similar requirement; however, I treat it as a hint rather than a hard limit, because `WriteStop()` writes things like the column index, so the file will always be a bit larger than the limit. In most cases the file is 10% larger than the limit (`Offset`), but I can live with it.
`Size` is the size of the current row group; `RowGroupSize` would be a better name, but that was taken by the row group size limit ...
@hangxie so do you accumulate the offsets of all messages until you reach the file limit?
> do you accumulate the offsets

I check `Offset` after each write; again, the size limit is a hint, not a hard limit.
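For illustration, here is a minimal Go sketch of the "check the offset after each write" approach described above. It is not parquet-go code: the `rowWriter` interface, the `fakeWriter` type, and `writeWithSoftLimit` are hypothetical names made up for this example. With the real library you would presumably read the writer's `Offset` field and call `WriteStop()` in the corresponding places. The fake writer only advances its offset when its buffer is flushed, which is an assumption meant to mimic a writer that flushes whole row groups.

```go
package main

import "fmt"

// rowWriter is a hypothetical stand-in for a Parquet writer.
// Offset reports how many bytes have reached the file so far.
type rowWriter interface {
	Write(row string) error
	Offset() int64
	Stop() error // finalizes the file, similar in spirit to WriteStop()
}

// fakeWriter is a toy implementation (an assumption, not library code):
// rows are buffered and only count towards the offset once the buffer
// holds at least flushAt bytes.
type fakeWriter struct {
	buffered int64
	offset   int64
	flushAt  int64
}

func (w *fakeWriter) Write(row string) error {
	w.buffered += int64(len(row))
	if w.buffered >= w.flushAt {
		w.offset += w.buffered
		w.buffered = 0
	}
	return nil
}

func (w *fakeWriter) Offset() int64 { return w.offset }

// Stop flushes whatever is still buffered.
func (w *fakeWriter) Stop() error {
	w.offset += w.buffered
	w.buffered = 0
	return nil
}

// writeWithSoftLimit writes rows and starts a new file whenever the current
// writer's offset reaches softLimit. The limit is checked after each write,
// so it is a hint: a file can end up larger than softLimit.
// It returns the final size of each file produced.
func writeWithSoftLimit(rows []string, softLimit int64, newWriter func() rowWriter) ([]int64, error) {
	var sizes []int64
	w := newWriter()
	pending := false // true while the current writer holds unfinalized rows
	for _, row := range rows {
		if err := w.Write(row); err != nil {
			return nil, fmt.Errorf("write row: %w", err)
		}
		pending = true
		if w.Offset() >= softLimit {
			if err := w.Stop(); err != nil {
				return nil, fmt.Errorf("stop writer: %w", err)
			}
			sizes = append(sizes, w.Offset())
			w = newWriter()
			pending = false
		}
	}
	if pending {
		if err := w.Stop(); err != nil {
			return nil, fmt.Errorf("stop writer: %w", err)
		}
		sizes = append(sizes, w.Offset())
	}
	return sizes, nil
}
```

With ten 10-byte rows, a fake flush threshold of 30 bytes, and a soft limit of 50 bytes, this sketch produces two files of 60 and 40 bytes. The first file overshoots the limit because the offset only moves when a buffer is flushed, which is why the limit works as a hint rather than a hard cap.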
Hi @saartamir, sorry for the late response.
It's hard to know the exact size because Parquet compresses the data.
`parquet-go` has an internal buffer, and it compresses and flushes the content once the buffer is full.
As @hangxie said, `Offset` may be the most accurate value.
@xitongsys @hangxie sorry for the late reply. When I try to use `Offset`, it keeps giving me a constant value (4), which doesn't help me here. How can I use `Offset`, as you said, to evaluate the file size?
I'd like to be able to limit my Parquet files to a specific size. I'm trying to use `writer.Size` and `writer.ObjsSize`, but the files always come out twice the size I configure, whether I check `Size + ObjsSize > fileLimit` or just `Size > fileLimit`. Is there an accurate way to do this?