zarr-developers / zarr-benchmark

Benchmarking the performance of various Zarr implementations, using our perfcapture framework.
MIT License

Run Intel VTune against `SlowMemcpyWorkload` #6

Closed JackKelly closed 1 year ago

JackKelly commented 1 year ago

zref: #5

JackKelly commented 1 year ago

Installing Intel VTune on Ubuntu:

  1. `sudo apt install pkg-config`
  2. Install VTune using `apt`, following these instructions.
  3. Set all of these to 0 (I used `sudo emacs <filename>`):
     1. `/proc/sys/kernel/yama/ptrace_scope`
     2. `/proc/sys/kernel/perf_event_paranoid` (from here)
     3. `/proc/sys/kernel/kptr_restrict`
  4. Follow Intel's post-install instructions:
     1. `source /opt/intel/oneapi/vtune/latest/env/vars.sh`
     2. `vtune-self-checker.sh`
  5. `vtune-gui`

Running zarr-benchmark within VTune

I created a very simple shell script which activates the Python venv and runs the benchmark:

```bash
#!/bin/bash

# Activate venv:
source /home/jack/python_venvs/perfcapture/bin/activate

# Run zarr-benchmark:
/home/jack/python_venvs/perfcapture/bin/python \
    /home/jack/dev/zarr/perfcapture/scripts/cli.py \
    --data-path /home/jack/temp/perfcapture_data_path \
    --recipe-path /home/jack/dev/zarr/zarr-benchmark/recipes
```

Then run that shell script from VTune.

JackKelly commented 1 year ago

What does this benchmark do?

Here's the code. It's very simple!

Results

`SlowMemcpyWorkload` takes 5.4 seconds to run on my machine. Here's the VTune Hotspots Summary:

*(screenshot: VTune Hotspots Summary)*

VTune shows that LZ4 decompression takes the most time (not surprising). What is perhaps more surprising - and is consistent with Vincent's observations - is that the second-longest-running function is `__memmove_avx_unaligned_erms`. (ERMS is a CPUID feature flag meaning "Enhanced REP MOVSB" (source). I think `clear_page_erms` is this 4-line ASM function.)

*(screenshot: VTune hotspots list showing `__memmove_avx_unaligned_erms`)*

("CPI rate" is "Cycles Per Instruction retired". Smaller is better. The best a modern CPU can do is about 0.25.)

*(screenshot: CPI rates for the hotspot functions)*
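To make the CPI arithmetic above concrete, here is the calculation with hypothetical counter values (the numbers are illustrative, not from this VTune run):

```python
# CPI (Cycles Per Instruction retired) = cycles / instructions retired.
# Hypothetical counter values for illustration only.
cycles = 500_000_000
instructions = 2_000_000_000
cpi = cycles / instructions
print(cpi)  # 0.25 - roughly the best a modern wide out-of-order core can sustain
```

A CPI well above 1 for a memory-moving function usually means the core is stalled waiting on memory, not short of instruction throughput.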

Microarchitecture Exploration

(This type of profiling slows things down quite a lot)

*(screenshots: Microarchitecture Exploration results)*

Memory usage

*(screenshots: VTune memory usage analysis)*

The bottom sub-plot in the figure below is the memory bandwidth (y-axis, in GB/sec) over time. It's interesting that the code rarely maxes out the memory bandwidth (although I think VTune is slowing the code down quite a lot here):

*(screenshot: memory bandwidth over time)*
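For context on what "maxing out" would mean: theoretical peak DRAM bandwidth is channels × transfers/sec × 8 bytes per 64-bit transfer. The machine specs below are a hypothetical example, not a measurement of my machine:

```python
# Theoretical peak DRAM bandwidth = channels * transfers/sec * bus width (bytes).
# Hypothetical example: dual-channel DDR4-3200 (3200 MT/s, 64-bit bus per channel).
channels = 2
transfers_per_sec = 3200e6
bytes_per_transfer = 8  # 64-bit bus width
peak_gb_per_sec = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(peak_gb_per_sec)  # 51.2
```

Comparing the VTune bandwidth trace against a number like this shows how much headroom the workload is leaving on the table.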

JackKelly commented 1 year ago

In last week's meeting, we discussed the hypothesis that, because the code is using a memmove function with "unaligned" in the name, the data must be unaligned in memory, forcing the system to use a slow "unaligned" memmove function. The hypothesis was that we could speed things up by aligning the data in memory.

After reading more about memmove, I no longer think this hypothesis is correct. My understanding is that these `memmove_unaligned` functions spend the majority of their time moving data very efficiently using SIMD instructions. The "unaligned" in the name just means that the function handles the "ragged ends" at the start and/or end of the buffer. Once those ragged ends are handled, the function powers through the remaining bytes very quickly using aligned SIMD.
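The head/body/tail split described above can be sketched in pure Python (a toy illustration of the structure, not real SIMD; `toy_unaligned_copy` and the 16-byte block size are my own illustrative choices):

```python
# Toy illustration of how an "unaligned" memcpy-style routine is structured:
# copy a ragged head until the destination offset is 16-byte aligned,
# stream full 16-byte blocks (the part real implementations do with wide
# aligned SIMD stores), then copy the ragged tail.
def toy_unaligned_copy(dst: bytearray, dst_off: int, src: bytes) -> None:
    BLOCK = 16
    n = len(src)
    head = min((-dst_off) % BLOCK, n)   # bytes until dst_off is aligned
    dst[dst_off:dst_off + head] = src[:head]
    body_end = head + ((n - head) // BLOCK) * BLOCK
    # In a real routine this middle section is the fast aligned SIMD loop:
    dst[dst_off + head:dst_off + body_end] = src[head:body_end]
    dst[dst_off + body_end:dst_off + n] = src[body_end:]

buf = bytearray(64)
toy_unaligned_copy(buf, 3, b"abcdefghijklmnopqrstuvwxyz")
print(bytes(buf[3:29]))  # b'abcdefghijklmnopqrstuvwxyz'
```

The point is that only the head and tail pay the "unaligned" cost; the bulk of the bytes move through the aligned fast path regardless of the function's name.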

Lots of good info here: https://squadrick.dev/journal/going-faster-than-memcpy.html

So I can't immediately see any quick wins for Zarr Python: it has to copy data from the uncompressed chunk buffer into the final array. I'm not sure Dask will help, but I'll benchmark Dask too.
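The unavoidable copy described above looks roughly like this (a stand-in sketch, not Zarr's actual code: `zlib` substitutes for LZ4 to stay stdlib-only, and the chunk layout is invented for illustration):

```python
import zlib

# Per-chunk pattern: decompress into a temporary buffer, then copy that
# buffer into its slot in the final array. The second step is the extra
# memmove that shows up in the VTune profile.
CHUNK = 1024
final = bytearray(4 * CHUNK)   # stands in for the final output array
compressed_chunks = [zlib.compress(bytes([i]) * CHUNK) for i in range(4)]

for i, compressed in enumerate(compressed_chunks):
    decompressed = zlib.decompress(compressed)        # temporary chunk buffer
    final[i * CHUNK:(i + 1) * CHUNK] = decompressed   # copy into final array

print(len(final))
```

Eliminating that second step would require decompressing directly into the destination slice, which the compression codec's API has to support.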

In a low-level compiled language we could use multiple threads, one per CPU core, to copy each uncompressed chunk into the final array while that chunk is still in CPU cache after decompression.
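The shape of that idea can be sketched in Python, with the caveat that in CPython these copies mostly serialize on the GIL; real per-core parallelism is why a compiled language is the actual proposal here (the names and chunk layout are illustrative):

```python
# Sketch: one worker per chunk, each copying its decompressed chunk into a
# preallocated output buffer. Disjoint destination slices mean no locking
# is needed for the writes themselves.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1024
out = bytearray(4 * CHUNK)
view = memoryview(out)
decompressed_chunks = [bytes([i]) * CHUNK for i in range(4)]

def copy_chunk(i: int) -> None:
    view[i * CHUNK:(i + 1) * CHUNK] = decompressed_chunks[i]

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(copy_chunk, range(4)))

print(out[0], out[-1])  # 0 3
```

In a compiled implementation the same structure would also let each thread do the decompress-then-copy pair back to back, keeping the chunk hot in that core's cache between the two steps.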