Saving and then loading a JDF 'breaks' pooled arrays
See the example below.
I understand that you compress the data on disk anyway.
But because the columns are no longer pooled after loadjdf, the in-memory footprint of the DataFrame is considerably larger than before saving.
Is this something that you can (easily) improve?
You mention that you already store some metadata, so it should be fairly simple to re-pool the data on load (perhaps behind a keyword argument?)
using JDF
using CSV
using DataFrames
#generate data
n=20_000
df0 = DataFrame(v = repeat(["abc"], n));
allowmissing!(df0)  # note the bang: allowmissing(df0) returns a copy and does not mutate df0
df0.v[end] = missing
#write file
csvfile = raw"C:\temp\afile.csv"
CSV.write(csvfile, df0)
#read file
csvSep=','
df = CSV.read(csvfile, DataFrame; threaded=true, delim=csvSep, pool=0.05, strict=true, lazystrings=false);
#save jdf
jdffi = raw"C:\temp\df.jdf"
jdffile = JDF.savejdf(jdffi, df)
#load jdf
dfloaded = JDF.loadjdf(jdffi)
df.v #<- this one is pooled as expected
dfloaded.v #<- not pooled anymore
Base.summarysize(df)/1024/1024       # memory in MiB
Base.summarysize(dfloaded)/1024/1024 # considerably larger
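Until JDF supports this natively, one workaround is to re-pool low-cardinality columns manually after loadjdf using PooledArrays. This is only a sketch; repool! and its threshold keyword are names I made up for illustration, not part of JDF's API:

```julia
using DataFrames
using PooledArrays

# Hypothetical helper: re-pool every column whose ratio of unique
# values to rows is at most `threshold` (mirroring CSV.read's pool=0.05).
function repool!(df::DataFrame; threshold::Float64 = 0.05)
    for name in names(df)
        col = df[!, name]
        if length(unique(col)) <= threshold * length(col)
            df[!, name] = PooledArray(col)
        end
    end
    return df
end

dfloaded = DataFrame(v = repeat(["abc"], 20_000))  # stand-in for the loadjdf result
repool!(dfloaded)
dfloaded.v isa PooledArray  # true
```

PooledArray also accepts columns with missing values, so the Union{Missing,String} column in the example above would be pooled the same way.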