yongwonshin / PIMFlow

Apache License 2.0

Running PIMFlow_ramulator #4

Open tanvisharma opened 1 year ago

tanvisharma commented 1 year ago

Hi,

Thank you for doing such thorough work and providing it open-source to the community. I was able to install the framework on my machine, but I have a few questions.

  1. Is it possible to use PIMFlow_ramulator independently with the Newton configuration? If so, could you tell me whether there is a separate config file for running it? I see there is a dram_pim.trace for inputs and a bunch of traces in the Newton_trace directory.

  2. Is it necessary to build TVM with USE_CUDNN=ON? I could build it successfully with USE_CUDNN=OFF, along with the other simulators (Accel-Sim, Ramulator, and PIMFlow), but I am not able to run the main file (./pimflow).

  3. Where can I get the mobilenet traces? I don't see a ./data directory in my filesystem after following the steps in install.sh.

Thanks again.

sbird-kick commented 1 year ago

Hey, I can't help with the first two, but I can with the third one: https://github.com/yongwonshin/PIMFlow/issues/2

tanvisharma commented 1 year ago

hey @sbird-kick, thanks for directing me to the link!

sbird-kick commented 1 year ago

@tanvisharma What exactly is the error that you face when you try running ./pimflow? Would you have an error log?

yongwonshin commented 1 year ago

Thank you for your interest in our work. Below are the replies to your questions.

  1. Yes. If you mean memory configs, you can use PIMFlow_ramulator/configs/GDDR6-config.cfg or PIMFlow_ramulator/configs/HBM-config.cfg. If you want to use another memory config, you should modify the corresponding memory controller sources; see PIMFlow_ramulator/src/GDDR6.cpp for the reference implementation. If you mean PIM command files, you can find the PIM_trace_partition_*.pim files. The command files in Newton_trace are simple example files for internal testing.
  2. Yes, it is necessary to build TVM with USE_CUDNN=ON, since we evaluate using GPU kernels. It may be possible to turn the option off (I'm not sure), but you would need to remove the code in the scripts that uses the GPU. Moreover, a GPU with the Turing architecture is recommended; newer architectures may cause bugs since NVBit does not support them.
  3. @sbird-kick already replied to you. Thank you!

I hope the above answers help you.
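As an aside, for anyone inspecting the .pim files directly, here is a minimal Python sketch that tallies Newton-style commands in a trace. The one-mnemonic-per-line format assumed here is purely hypothetical, for illustration; check the files produced by profiling for the real format.

```python
from collections import Counter

def count_pim_commands(lines):
    """Tally Newton-style PIM commands (GWRITE, G_ACT*, COMP, READRES).

    Assumes one command mnemonic per line, optionally followed by an
    address -- a hypothetical format for illustration only.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        counts[line.split()[0]] += 1
    return counts

# Tiny made-up trace for demonstration
trace = """
GWRITE 0x1000
G_ACT0 0x2000
COMP 0x2000
COMP 0x2040
READRES 0x3000
""".splitlines()

print(count_pim_commands(trace))  # COMP appears twice, the rest once
```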

tanvisharma commented 1 year ago

@yongwonshin Thanks for replying!

  1. I was looking for the implementation. I can see that for Newton you added GWRITE, G_ACT0, G_ACT1, G_ACT2, G_ACT3, COMP, and READRES in GDDR6.cpp, among other changes.
     a. How did you decide the latency for READRES?
     b. I could not find the PIM_trace_partition_*.pim files in any of the directories. Could you direct me to them?
     c. Also, is my understanding correct that these PIM command files are generated by TVM and can be independently processed by PIMFlow_ramulator to get performance metrics?

  2. I am still trying to debug this; the final build is not picking up my cuDNN library. I have added the path to the environment variables, and during cmake (for PIMFlow's TVM) I can see that the `CUDNN<>` variables are correctly set. Which version of cuDNN did you use?

> @tanvisharma What exactly is the error that you face when you try running ./pimflow? Would you have an error log?

@sbird-kick Thanks again for pitching in! It was an environment setup issue. I don't have sudo access, so I am trying to install all the dependencies using conda, and I didn't want to pull you into the same rabbit hole. I will share an error log if I face other issues.

sbird-kick commented 1 year ago

Ah, fair. I tried the same thing and couldn't get it to run at all outside of the Docker container they provide. There are too many places where /root/ is hard-coded in their code (or maybe there is some way to replace it before installing that I didn't find).

sbird-kick commented 1 year ago

Also, those files are only present after you have managed to run the simulator for a model (in particular, the command ./pimflow -m=profile -t=split -n=mobilenet-v2), after which they will appear somewhere under PIMFlow/layerwise/result_simulate/mobilenet-v2/.

yongwonshin commented 1 year ago

  1. a. I assumed that the READRES latency is 1 memory clock cycle (the same as tCCD_S) since the result is stored in a latch.
     b. You can get the trace files after profiling (e.g., ./pimflow -m=profile -t=split -n=mobilenet-v2).
     c. Yes, you're correct: TVM is used to generate the PIM commands.

  2. I used Ubuntu 20.04, CUDA 11.3.1, and cuDNN 8.2; the host NVIDIA driver version is 535.54.03. The driver version is not important as long as it supports CUDA 11.3.1. If you don't use Docker, make sure you set the environment variables related to CUDA and cuDNN. It may also help to completely remove the (TVM) build directory and rebuild from scratch, to make sure build caches are erased and changed cmake configs are picked up.

Thank you!
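To make the READRES assumption above concrete, here is a toy cycle model that charges each PIM command a fixed latency in memory clock cycles. Only READRES = 1 (matching tCCD_S, as stated in the reply) comes from this thread; the other per-command latencies are made-up placeholders, not PIMFlow's actual timing parameters.

```python
# Toy PIM cycle model: each command costs a fixed number of memory
# clock cycles. Only READRES = 1 (== tCCD_S, result held in a latch)
# reflects the discussion above; the rest are illustrative placeholders.
LATENCY = {
    "GWRITE": 4,    # placeholder value
    "G_ACT0": 2,    # placeholder value
    "COMP": 1,      # placeholder value
    "READRES": 1,   # 1 cycle, same as tCCD_S
}

def pim_cycles(commands):
    """Sum the (toy) latencies of a PIM command sequence."""
    return sum(LATENCY[c] for c in commands)

print(pim_cycles(["GWRITE", "G_ACT0", "COMP", "COMP", "READRES"]))  # 9
```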

tanvisharma commented 1 year ago

Can I independently use run --pim_codegen and a modified version of run --stats for given ONNX-derived csv files to get the PIM cycles for certain kernels?

I have arrived at this conclusion because I could run these commands independently even when my NVBit tracer and other commands failed. After taking a deeper look at the code while running ./pimflow -m=profile -t=split -n=mobilenet-v2, I have come up with the following understanding (please correct anything that is wrong):

  1. inspect_shape.py: transforms the original onnx graph for a given neural network and generates the transformed .onnx and corresponding onnx_conv.csv files.
  2. run --trace: runs the NVBit tracer from Accel-Sim to generate trace files (kernelslist.g) corresponding to the conv/matmul kernels generated above.
  3. run --simulate: runs Accel-Sim with these generated traces (creates trace-$NAME.txt).
  4. run --pim_codegen: generates the .pim files for each channel, where GWRITE, COMP, and the other Newton commands are added. The addresses are generated based on the kernel configuration, and this step only needs the onnx_conv.csv files.
  5. run --stats: runs Ramulator on the .pim files to get the PIM cycles. This uses some scaling factor that I did not understand; what is the scaling factor based on? (Creates the model_...split...csv files with PIM and GPU cycles; it would need the gpgpusim output files for the GPU cycles.)
  6. process_csv.py: adds total_cycle, speedup, and other stats to the earlier csv files (creates the max_perf and newton_perf files).
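To illustrate what a per-layer summary along the lines of steps 5 and 6 could look like, here is a guess at the arithmetic: pick the cheaper of GPU and PIM per kernel, sum, and report speedup over GPU-only. The column names and the min() offload policy are my assumptions, not necessarily what process_csv.py actually does.

```python
def summarize(layers):
    """layers: list of (gpu_cycles, pim_cycles) pairs, one per kernel.

    Hypothetical reconstruction of a 'max_perf'-style summary: the
    offload policy (run each kernel on the cheaper device) and the
    speedup definition are assumptions for illustration.
    """
    gpu_only = sum(g for g, _ in layers)
    best = sum(min(g, p) for g, p in layers)
    return {"gpu_only": gpu_only,
            "total_cycle": best,
            "speedup": gpu_only / best}

stats = summarize([(100, 60), (80, 120), (50, 50)])
print(stats)  # total_cycle = 60 + 80 + 50 = 190, speedup = 230/190
```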

The code for --profile -t=pipeline is similar, except for the way it transforms the onnx graph; eventually it calculates the end-to-end performance stats.

My understanding is that the profile option does not require a modified TVM or Accel-Sim: it creates the transformed graph based on the kernel configs in the original graph and runs Accel-Sim and Ramulator separately to find the number of cycles.

The other commands listed in the README.md after profile, however, use a modified version of TVM to partition the graph and create the graphs for the different policies. Please let me know if I have misunderstood anything. It would also be very helpful if you could briefly explain how you generated the addresses for the PIM commands during .pim file creation.
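On the address-generation question (which only the authors can answer definitively), one common approach in trace generators is to pack channel/bank/row/column coordinates into a flat physical address by bit slicing. The field order and widths below are arbitrary, purely to illustrate the idea; PIMFlow's actual mapping would live in its codegen and the Ramulator config.

```python
def make_addr(channel, bank, row, col,
              col_bits=7, row_bits=14, bank_bits=4):
    """Pack DRAM coordinates into a flat address by bit slicing.

    The field order (channel | bank | row | col) and the bit widths are
    arbitrary illustrative choices, not PIMFlow's real address mapping.
    """
    addr = channel
    addr = (addr << bank_bits) | bank  # append bank field
    addr = (addr << row_bits) | row    # append row field
    addr = (addr << col_bits) | col    # append column field
    return addr

# Example: coordinates packed into one address, printed in hex
print(hex(make_addr(channel=1, bank=2, row=5, col=3)))
```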