Saving and then loading a JDF 'breaks' pooledarrays

kafisatz commented 4 years ago

Saving and then loading a JDF 'breaks' PooledArrays; see the example below.

I understand that you compress the data on disk anyway. But because the data is no longer pooled, the memory footprint of the DataFrame is considerably larger after loadjdf.

Is this something that you can (easily) improve? You mention that you store some metadata already, so it should be fairly simple to re-pool the data (maybe with a keyword argument?)

using JDF 
using CSV
using DataFrames 

#generate data 
n=20_000
df0=DataFrame(v=repeat(vcat("abc"),n));
allowmissing!(df0) # in-place version; plain allowmissing returns a new DataFrame that was being discarded
df0.v = convert(Vector{Union{Missing,String}},df0.v)
df0.v[end] = missing

#write file 
csvfile = raw"C:\temp\afile.csv"
CSV.write(csvfile, df0)

#read file 
csvSep=','
df =  CSV.read(csvfile, DataFrame,threaded=true, delim=csvSep, pool=0.05,strict=true, lazystrings=false);

#save jdf 
jdffi = raw"C:\temp\df.jdf"
jdffile = JDF.savejdf(jdffi, df)

#load jdf 
dfloaded = JDF.loadjdf(jdffi)

df.v #<- this one is pooled as expected
dfloaded.v #<- not pooled anymore 

Base.summarysize(df)/1024/1024
Base.summarysize(dfloaded)/1024/1024
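
Until this is handled inside JDF, one possible workaround is to re-pool the affected column by hand after loading. The following is only a minimal sketch, assuming PooledArrays.jl is installed; the DataFrame built here is a stand-in for what loadjdf returns (a plain Vector column), so JDF itself is not exercised.

```julia
using DataFrames
using PooledArrays

# Stand-in for a column that came back from disk as a plain Vector
# (assumption: this mimics the un-pooled result of loadjdf).
n = 20_000
v = Vector{Union{Missing,String}}(fill("abc", n))
v[end] = missing
dfloaded = DataFrame(v = v)

size_before = Base.summarysize(dfloaded)

# Re-pool the column manually; values and missings are preserved.
dfloaded.v = PooledArray(dfloaded.v)

size_after = Base.summarysize(dfloaded)

# Compare the footprint in MiB before and after re-pooling.
println(size_before / 1024 / 1024)
println(size_after / 1024 / 1024)
```

This has to be repeated for every column that was pooled before saving, which is why having JDF restore the pooling itself (for example via a keyword argument) would be more convenient.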

xiaodaigh commented 4 years ago

I see. I am surprised that CSV.jl uses PooledArrays.jl instead of CategoricalArrays.jl.

I will extend support to PooledArrays.jl

xiaodaigh commented 4 years ago

Fixed in 0.2.18, to be released soon.

xiaodaigh / JDF.jl

Saving and then loading a JDF 'breaks' pooledarrays #45