Testing TPU - Githubissues

yuyuranium commented 2 years ago

I'm now testing the TPU. A top module has been added to connect all the components: the TPU and 3 global buffers (A, B, and P). The top module also simulates the behavior of memory-mapped control registers, similar to the AXI ones. I think we will eventually package the TPU into an AXI IP.

A testbench written in C++ has also been added. Make sure you have Verilator installed. A Makefile is provided, so you can simply use make to run the simulation. After that, a waveform file top.vcd should be generated.

I have only tested the controller for now. I found several bugs in the controller and have fixed some of them. There may still be bugs in other modules. I'll test them later.

@japoka410666: If you have free time, could you help me test whether our PE can utilize the DSP module?

Note that the repository name has been changed to simple-tpu.

BTW, I will continue testing after I finish my ESL term project QQ.

japoka410666 commented 2 years ago

Thanks for your hard work! I will run the test within the next two days.

japoka410666 commented 2 years ago

After synthesizing pe.v, you can see that one DSP module, several LUTs, and several flip-flops with asynchronous reset are utilized. I think this is the desired result.

WillyChennnn commented 2 years ago

Thanks for the hard work. If there is anything I can help with, please tell me too.

yuyuranium commented 2 years ago

Looks so good! Thank you for testing! There are still some topics to survey:

How do we use the DMA IP provided by Xilinx to move data from DRAM to BRAM (that is, from PS to PL)?
What is the I/O of the DMA?
Do we have to design another module to convert the DMA output into a form that can be written to our global_buffer module?
If the DMA approach doesn't work, how do we move data from PS to PL? (Which modules need to be added? How should the system be designed?)

Please help me survey these topics. I worry it will be too late if they are still unresolved by the time I finish testing.

Besides, we may have to change the global_buffer into a dual-port one. It currently has only one port, which is enough for simulation. We will use the second port to receive data from the PS. Therefore, if you have free time, maybe you can help me write a dual-port version of global_buffer. You may create another file called, for example, global_buffer_tdp.v, to distinguish it from the single-port one. We may have to follow the coding style provided by Xilinx so that BRAM is inferred correctly.
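To make the intended dual-port behavior concrete before writing any RTL, here is a minimal behavioral sketch in Python. It is only a reference model that a testbench could compare against, not synthesizable code and not the Xilinx template. The class name DualPortBuffer, the cycle() interface, the read-first behavior, and the rule that port B wins a same-address write collision are all assumptions made for illustration; only the 128-bit word width comes from our design.

```python
class DualPortBuffer:
    """Behavioral model of a dual-port memory with 128-bit words (hypothetical)."""

    WORD_MASK = (1 << 128) - 1  # a word is 128 bits wide

    def __init__(self, depth):
        self.mem = [0] * depth

    def cycle(self, a=None, b=None):
        """Model one clock edge.

        Each port is None (idle) or a tuple (addr, write_enable, write_data).
        Returns (a_read_data, b_read_data). Assumption: read-first mode, so a
        read returns the value stored before this edge.
        """
        # Sample read data for both ports before any write takes effect.
        reads = tuple(None if p is None else self.mem[p[0]] for p in (a, b))
        # Apply writes. Assumption: if both ports write the same address,
        # port B overwrites port A (real BRAM behavior may be undefined).
        for p in (a, b):
            if p is not None and p[1]:
                self.mem[p[0]] = p[2] & self.WORD_MASK
        return reads


gb = DualPortBuffer(depth=256)
gb.cycle(b=(3, True, 0xDEADBEEF))        # port B (PS side) writes address 3
rd_a, rd_b = gb.cycle(a=(3, False, 0))   # port A (TPU side) reads address 3
# Same-cycle read and write on one address: the read sees the old value.
old, _ = gb.cycle(a=(3, False, 0), b=(3, True, 0x1234))
new, _ = gb.cycle(a=(3, False, 0))
```

Here port A stands for the existing TPU-side port and port B for the new port that receives data from the PS.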

@WillyChennnn If you have finished the ESL term project, could you help me test our inference_FxPt16 function and check whether it executes correctly on the PYNQ board? Also, I was wondering:

How many images from the dataset can be stored on the board?
How do we load the images into the Python runtime?
How fast is our inference_FxPt16 function, i.e., what is its latency?

Please tell me the answers to these questions. Thank you!
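For the latency question, something like the following timing harness could be used. This is only a sketch: measure_latency is a hypothetical helper, and fake_inference is a stand-in with a made-up signature, because I don't know yet how inference_FxPt16 will be called from Python. Replace it with the real call when testing on the board.

```python
import statistics
import time


def measure_latency(fn, args, warmup=3, runs=20):
    """Time repeated calls of fn(*args); return (median, best, worst) in seconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples), max(samples)


def fake_inference(image):
    """Hypothetical stand-in for inference_FxPt16 (the real signature is unknown)."""
    return sum(image) & 0xFFFF


dummy_image = [1] * 1024  # arbitrary dummy input, not a real dataset image
median_s, best_s, worst_s = measure_latency(fake_inference, (dummy_image,))
print(f"median {median_s * 1e6:.1f} us, best {best_s * 1e6:.1f} us, worst {worst_s * 1e6:.1f} us")
```

Reporting the median rather than a single run should make the number less sensitive to one-off delays on the board.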

WillyChennnn commented 2 years ago

I will complete this as soon as possible!

yuyuranium commented 2 years ago

Thank you! I think I will switch to working on ESL for a while... QQ

japoka410666 commented 2 years ago

Here are a few survey results for the first three topics:

In the official tutorial (Part1, Part2), DMA is used to transfer data between the PS and a FIFO IP in the PL. It also shows how to control the DMA with the PYNQ Python library. Inspired by this, I think there are 3 ways to use DMA in our project:

Replace FIFO with global_buffer
Connect FIFO to global_buffer
Directly use FIFO as global_buffer

If the 1st way is considered, we have to deal with the DMA directly. It has two channels: MM2S (Memory-Mapped to Stream) and S2MM (Stream to Memory-Mapped). To connect our global_buffer, we should focus on S_AXIS_S2MM and M_AXIS_MM2S, which are the ports the PS uses to read from and write to the PL IP, respectively. The other ports will be connected automatically.

If the 2nd or 3rd way is considered, we have to deal with the FIFO. Block RAM can be chosen as its memory type.

Which one do you think is better?

yuyuranium commented 2 years ago

Sorry for my late reply. As you mentioned, I think the most important part is that we must deal with the AXI Stream port, which we may not be familiar with.

For the 1st or 2nd way, we would have to redesign one of the ports so that it supports AXI Stream. I'm not sure whether we can implement the protocol correctly. Maybe there is something like an AXI Stream wrapper, so that we can package our global_buffer into an AXI Stream IP?

For the 3rd solution, can a FIFO be randomly accessed? Since our controller doesn't issue its requests in sequential address order, the storage must be randomly addressable.

Here is another issue. We defined a word as 128 bits, but the DMA seems to transfer only 32 or 64 bits at a time (?). So we may have to design a simple module with a state machine that collects, for example, four 32-bit beats, assembles them into a 128-bit word, and writes the words to the global_buffer sequentially from a given base address. This module can be considered the interface between the AXI Stream and our address-based global_buffer, as I mentioned above.
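To illustrate the state machine I have in mind, here is a small behavioral sketch in Python (not RTL). WordPacker and its push() method are hypothetical names, the dict stands in for the global_buffer, and the beat order (first beat in the least significant 32 bits) is an assumption that we would still need to check against how the DMA actually delivers data.

```python
class WordPacker:
    """Collects 32-bit beats and writes 128-bit words to a buffer sequentially."""

    BEATS_PER_WORD = 4  # 4 x 32 bits = 128 bits
    BEAT_BITS = 32

    def __init__(self, buffer, base_addr=0):
        self.buffer = buffer   # dict: word address -> 128-bit int (stand-in for global_buffer)
        self.addr = base_addr  # next word address to write
        self.count = 0         # state: number of beats collected so far
        self.word = 0          # partially assembled word

    def push(self, beat):
        """Accept one 32-bit beat; write a full word once four beats have arrived."""
        beat &= 0xFFFFFFFF
        # Assumption: the first beat lands in the least significant 32 bits.
        self.word |= beat << (self.BEAT_BITS * self.count)
        self.count += 1
        if self.count == self.BEATS_PER_WORD:
            self.buffer[self.addr] = self.word
            self.addr += 1  # write sequentially from the base address
            self.count = 0
            self.word = 0


buf = {}
packer = WordPacker(buf, base_addr=0x10)
for beat in [0x11111111, 0x22222222, 0x33333333, 0x44444444,
             0xAAAAAAAA, 0xBBBBBBBB, 0xCCCCCCCC, 0xDDDDDDDD]:
    packer.push(beat)
```

After these eight beats, buf holds two words at addresses 0x10 and 0x11. In hardware, count would be a 2-bit counter and word a 128-bit shift or holding register.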

As for the output of our TPU, the data words are sent sequentially, which may map more easily onto the AXI Stream. All we have to do is split each 128-bit word into, say, four 32-bit beats, and use the FIFO to connect to the DMA.
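The output direction can be sketched the same way. The function split_word is a hypothetical helper, the deque only stands in for the FIFO, and sending the least significant 32 bits first is an assumption, not something we have decided.

```python
from collections import deque


def split_word(word, beats=4, beat_bits=32):
    """Split one wide word into `beats` chunks, least significant chunk first."""
    mask = (1 << beat_bits) - 1
    return [(word >> (beat_bits * i)) & mask for i in range(beats)]


fifo = deque()  # stand-in for the FIFO between the TPU output and the DMA
tpu_output = [0x44444444333333332222222211111111,
              0xDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA]
for word in tpu_output:
    fifo.extend(split_word(word))  # four 32-bit beats per 128-bit word
```

Each 128-bit word becomes four consecutive FIFO entries, so the stream seen by the DMA is four times as long as the TPU output.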

Overall, I think we must come up with a good solution for the AXI Stream. There seems to be an AXI DataMover by Xilinx that converts between the AXI Stream and AXI memory-mapped domains, which is what we want. I haven't dug into it deeply yet. Maybe we'll get some inspiration there.

Thanks for your summary. It's clear and helpful.

yuyuranium commented 2 years ago

Oh! Good news! There is another type of DMA called AXI Central DMA (CDMA) that provides high-bandwidth Direct Memory Access (DMA) between a memory-mapped source address and a memory-mapped destination address using the AXI4 protocol. Maybe this is a better solution?

yuyuranium commented 2 years ago

https://support.xilinx.com/s/question/0D52E00006lLhFuSAK/%D1%81onnection-ps-and-pl-in-zynq?language=en_US

yuyuranium commented 2 years ago

I think this PR is ready for review. There is still a small bug that could write garbage data to the global buffer. However, the location where the garbage data is written does not overlap the location of the result, so it won't affect correctness. I think I will fix this later. Let's merge this version first.

yuyuranium / FPGA-Project-2022-simple-tpu

Testing TPU #3