- `sudo apt install pkg-config`
- Installed VTune using `apt`, following these instructions.
- Set these files to `0` (I used `sudo emacs <filename>`; see the quick check below this list):
  - `/proc/sys/kernel/yama/ptrace_scope`
  - `/proc/sys/kernel/perf_event_paranoid` (from here)
  - `/proc/sys/kernel/kptr_restrict`
- `source /opt/intel/oneapi/vtune/latest/env/vars.sh`
- `vtune-self-checker.sh`
- `vtune-gui`
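To double-check that those three settings actually took effect, something like this should print `0` for each file:

```python
from pathlib import Path

# Read back the three kernel settings VTune needs; each should print 0.
for f in (
    "/proc/sys/kernel/yama/ptrace_scope",
    "/proc/sys/kernel/perf_event_paranoid",
    "/proc/sys/kernel/kptr_restrict",
):
    print(f, "=", Path(f).read_text().strip())
```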
To run `zarr-benchmark` within VTune, I created a very simple shell script which activates the Python venv and runs the benchmark:
```bash
#!/bin/bash
# Activate venv:
source /home/jack/python_venvs/perfcapture/bin/activate
# Run zarr-benchmark:
/home/jack/python_venvs/perfcapture/bin/python \
    /home/jack/dev/zarr/perfcapture/scripts/cli.py \
    --data-path /home/jack/temp/perfcapture_data_path \
    --recipe-path /home/jack/dev/zarr/zarr-benchmark/recipes
```
Then run that shell script from VTune.
Here's the code. It's very simple!
`SlowMemcpyWorkload` takes 5.4 seconds to run on my machine. Here's the VTune Hotspots Summary:
VTune shows that LZ4 decompression takes the most time (not surprising). What is perhaps more surprising - and is consistent with Vincent's observations - is that the second-longest-running function is `__memmove_avx_unaligned_erms` (ERMS is a CPUID feature which means "Enhanced REP MOVSB" (source). I think `clear_page_erms` is this 4-line ASM function.):
("CPI rate" is "Cycles Per Instruction retired". Smaller is better. The best a modern CPU can do is about 0.25.)
(This type of profiling slows things down quite a lot)
The bottom sub-plot in the Figure below is the memory bandwidth (y-axis, in GB/sec) over time. It's interesting that the code rarely maxes out the memory bandwidth (although I think VTune is slowing the code down quite a lot, here):
In last week's meeting, we discussed the hypothesis that, because the code uses a memmove function with "unaligned" in its name, the data must be unaligned in memory, and hence the system was forced to use a slow "unaligned" memmove function. The hypothesis was that we could speed things up by aligning the data in memory.
After reading more about memmove, I no longer think this hypothesis is correct. My understanding is that these `memmove_unaligned` functions spend the majority of their time moving data very efficiently using SIMD instructions. The "unaligned" in the name just means that the function handles the "ragged ends" at the start and/or end of the buffer. But, once those "ragged ends" are handled, the function powers through the bytes very quickly using aligned SIMD.
Lots of good info here: https://squadrick.dev/journal/going-faster-than-memcpy.html
So I can't immediately see any quick wins for Zarr Python: it has to copy data from the uncompressed chunk buffer into the final array. I'm not sure Dask will help, but I'll benchmark Dask too.
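To make that copy concrete, here's a simplified sketch (with made-up names, not Zarr Python's actual code) of the per-chunk operation that ends up in `__memmove_avx_unaligned_erms`:

```python
import numpy as np

# Simplified sketch (not Zarr's actual code): copy one decompressed chunk
# into its slot in the final array. The assignment on the last line is the
# big memory copy that VTune attributes to __memmove_avx_unaligned_erms.
def copy_chunk_into_output(decompressed: bytes,
                           out: np.ndarray,
                           chunk_slices: tuple,
                           chunk_shape: tuple) -> None:
    chunk = np.frombuffer(decompressed, dtype=out.dtype).reshape(chunk_shape)
    out[chunk_slices] = chunk  # <- this memcpy is the hotspot
```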
In a low-level compiled language we could use multiple threads, one per CPU core, to copy each decompressed chunk into the final array while that chunk is still in the CPU's cache after decompression.
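For what it's worth, here's a rough sketch of that idea. It's hypothetical (and in Python rather than a compiled language, so it only runs in parallel to the extent that the decompressor and numpy release the GIL during decompression and the copy):

```python
import concurrent.futures as cf
import os

import numpy as np

# Hypothetical sketch: one thread per CPU core decompresses a chunk and
# immediately copies it into the output array, so the decompressed bytes
# are still hot in that core's cache when the copy happens.
def load_chunks(compressed_chunks, chunk_slices, chunk_shape, out, decompress):
    def decompress_and_copy(args):
        compressed, slices = args
        raw = decompress(compressed)  # bytes of one decompressed chunk
        chunk = np.frombuffer(raw, dtype=out.dtype).reshape(chunk_shape)
        out[slices] = chunk  # copy while the chunk is still in cache

    with cf.ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        # Consume the iterator so any exceptions propagate.
        list(pool.map(decompress_and_copy, zip(compressed_chunks, chunk_slices)))
    return out
```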
zref: #5