xitongsys / parquet-go

pure golang library for reading/writing parquet file
Apache License 2.0

Speedup marshal.Unmarshal #400

Closed by dzbarsky 3 years ago

dzbarsky commented 3 years ago

This contains the following changes:

  1. (Minor cleanup) Switch from an if-ladder to a switch statement when evaluating the Type's Kind
  2. (Minor cleanup) Collapse the two cases that handle lists
  3. (Perf) Avoid converting value when the kind already matches (this allocates!)
  4. (Perf) Only compute po.Type() once per po
  5. (Perf) Avoid repeated map lookups for list and map handling. Look up once, then reuse the pointer
  6. (Perf) When possible, reuse previous field index instead of looking up by name every time (this allocates!)
  7. (Perf) When possible, reuse the previously-seen slice's SliceRecord to avoid the map lookup.
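
A minimal sketch of the ideas behind (3) and (6), not the code in this PR. `Row`, `fieldCache`, `setValue`, and `decodeRows` are hypothetical names made up for illustration. The sketch compares full types rather than kinds before skipping `Convert`, which is an assumption made here so that `Set` stays safe for named types.

```go
package main

import (
	"fmt"
	"reflect"
)

// Row is a hypothetical target struct for this sketch.
type Row struct {
	ID    int64
	Score float32
}

// fieldCache remembers the index of the most recently resolved field so that
// repeated lookups of the same name on the same type can skip FieldByName.
type fieldCache struct {
	typ   reflect.Type
	name  string
	index int
}

// field returns the field called name from the struct value v.
func (c *fieldCache) field(v reflect.Value, name string) reflect.Value {
	t := v.Type()
	if c.typ == t && c.name == name {
		return v.Field(c.index) // fast path: no lookup by name
	}
	sf, ok := t.FieldByName(name)
	if !ok {
		panic("no such field: " + name)
	}
	if len(sf.Index) != 1 {
		return v.FieldByIndex(sf.Index) // promoted field: not cached here
	}
	c.typ, c.name, c.index = t, name, sf.Index[0]
	return v.Field(c.index)
}

// setValue stores src in dst and calls Convert only when the types differ.
func setValue(dst reflect.Value, src interface{}) {
	sv := reflect.ValueOf(src)
	if sv.Type() == dst.Type() {
		dst.Set(sv) // types already match: skip Convert
		return
	}
	dst.Set(sv.Convert(dst.Type()))
}

// decodeRows fills one Row per entry; scores must be at least as long as ids.
func decodeRows(ids []int32, scores []float32) []Row {
	rows := make([]Row, len(ids))
	var idCache, scoreCache fieldCache
	for i := range rows {
		v := reflect.ValueOf(&rows[i]).Elem()
		setValue(idCache.field(v, "ID"), ids[i])          // int32 -> int64: Convert
		setValue(scoreCache.field(v, "Score"), scores[i]) // float32 -> float32: direct Set
	}
	return rows
}

func main() {
	rows := decodeRows([]int32{1, 2, 3}, []float32{0.5, 1.5, 2.5})
	fmt.Printf("%+v\n", rows)
}
```

Only the first row pays for `FieldByName`. Every later row takes the cached-index path because the struct type and field name repeat.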

(6) and (7) in particular almost always hit because Parquet data is laid out in columns (i.e. the same field structure is processed many times in a row) :)
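
To illustrate why a "remember the last one" cache pays off on runs of identical keys, here is a small sketch. It is not the library's code: `record`, `recordTable`, and the integer slice IDs are hypothetical stand-ins for the SliceRecord bookkeeping mentioned in (7).

```go
package main

import "fmt"

// record stands in for per-slice bookkeeping; the type is hypothetical.
type record struct {
	values []float32
}

// recordTable maps a slice ID to its record and remembers the last one used.
type recordTable struct {
	byID       map[int]*record
	lastID     int
	last       *record
	mapLookups int // how often the map was actually consulted
}

func newRecordTable() *recordTable {
	return &recordTable{byID: make(map[int]*record)}
}

// get returns the record for id, creating it on first use.
func (t *recordTable) get(id int) *record {
	if t.last != nil && t.lastID == id {
		return t.last // same slice as the previous call: skip the map
	}
	t.mapLookups++
	r, ok := t.byID[id]
	if !ok {
		r = &record{}
		t.byID[id] = r
	}
	t.lastID, t.last = id, r
	return r
}

// appendAll feeds (id, value) pairs into a fresh table; vals must be at
// least as long as ids.
func appendAll(ids []int, vals []float32) *recordTable {
	t := newRecordTable()
	for i, id := range ids {
		r := t.get(id)
		r.values = append(r.values, vals[i])
	}
	return t
}

func main() {
	ids := []int{7, 7, 7, 7, 9, 9}
	vals := []float32{1, 2, 3, 4, 5, 6}
	t := appendAll(ids, vals)
	// prints: 4 2 2
	fmt.Println(len(t.byID[7].values), len(t.byID[9].values), t.mapLookups)
}
```

Six values arrive but the map is consulted only twice, once per run of identical IDs.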

(1)-(6) substantially speed up processing of a file I am working with that is full of lists of float32s. The flamegraphs below show Unmarshal time going from 28s -> 10s. There's also another 15s win due to decreased GC pressure. (7) on top of that decreases the time by another 4 seconds.

Before flamegraph:

image

After flamegraph:

image

dzbarsky commented 3 years ago

I didn't see any benchmarks in this repo. How are you testing performance?