Systolic array based simple TPU for CNN on PYNQ-Z2

FPGA Term Project - Simple TPU

Team members: 蕭又瑜 (E24076459), 陳志瑜 (E24071069), 鍾震 (E24073037)

Introduction

CNNs are widely used in image-related machine learning applications due to their high accuracy in image recognition. Convolutional and fully-connected layers are the two essential building blocks of a CNN. Our goal is to design a simple TPU to accelerate both of them.

Algorithm

Target model for our TPU: CNN

2D Convolutional layers

GEMM

ref: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/

GEMM on Fully-Connected Layers
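As a minimal numpy sketch (not the project's actual code), the example below shows how a whole fully-connected layer over a batch of inputs reduces to a single GEMM; the layer sizes are arbitrary placeholders.

```python
import numpy as np

# Placeholder sizes: a batch of 4 flattened inputs and a 128-unit FC layer.
batch, in_dim, out_dim = 4, 784, 128

x = np.random.randn(batch, in_dim)     # activations, one row per sample
w = np.random.randn(in_dim, out_dim)   # layer weights
b = np.random.randn(out_dim)           # bias

y = x @ w + b                          # the whole FC layer is one GEMM plus a bias add
print(y.shape)                         # (4, 128)
```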

GEMM for Convolutional Layers

Architecture

Dataflow of GEMM

Iterative Decomposition

After iterative decomposition

Why systolic architecture?

How GEMM can utilize systolic architecture

Use 1D systolic array

Use 2D systolic array
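The following is a rough behavioral sketch in Python (not the Verilog design) of how an output-stationary 8x8 systolic array computes C = A·B: operands are fed in skewed from the left and top edges, every PE does one multiply-accumulate per cycle and passes its operands to its right and bottom neighbours. The array size, the inner dimension K, and the output-stationary dataflow are assumptions for illustration.

```python
import numpy as np

N, K = 8, 8                        # 8x8 PE array; inner dimension K is an assumption
A = np.random.randint(-8, 8, (N, K)).astype(float)
B = np.random.randint(-8, 8, (K, N)).astype(float)

a_reg = np.zeros((N, N))           # operand latched in each PE, moving right
b_reg = np.zeros((N, N))           # operand latched in each PE, moving down
acc   = np.zeros((N, N))           # output-stationary accumulators

for t in range(K + 2 * (N - 1)):   # enough cycles for the last PE to finish
    a_in = np.zeros((N, N))
    b_in = np.zeros((N, N))
    for i in range(N):             # skewed feed on the left edge
        a_in[i, 0] = A[i, t - i] if 0 <= t - i < K else 0.0
    for j in range(N):             # skewed feed on the top edge
        b_in[0, j] = B[t - j, j] if 0 <= t - j < K else 0.0
    a_in[:, 1:] = a_reg[:, :-1]    # interior PEs take their left neighbour's latched value
    b_in[1:, :] = b_reg[:-1, :]    # and their top neighbour's latched value
    acc += a_in * b_in             # one MAC per PE per cycle
    a_reg, b_reg = a_in, b_in

assert np.allclose(acc, A @ B)
```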

What if the output matrix C has a dimension larger than (8, 8)?
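One common answer, sketched below under the assumption that the array produces one 8x8 block of C per pass, is to tile the output matrix and run the array once per tile (padding when the dimensions are not multiples of 8). The per-tile product here is plain numpy standing in for a hardware pass.

```python
import numpy as np

TILE = 8  # the systolic array produces one 8x8 block of C per pass (assumption)

def tiled_gemm(A, B, tile=TILE):
    """Compute C = A @ B one 8x8 output tile at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0  # real design would pad
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            # One "hardware pass": rows i..i+7 of A and columns j..j+7 of B
            # are streamed through the array to produce C[i:i+8, j:j+8].
            C[i:i+tile, j:j+tile] = A[i:i+tile, :] @ B[:, j:j+tile]
    return C

A = np.random.randn(16, 32)
B = np.random.randn(32, 24)
assert np.allclose(tiled_gemm(A, B), A @ B)
```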

Implementation

Software

Target model

In this project, the target model is a simple CNN for the MNIST dataset, trained with the TensorFlow framework. Our goal is to design a TPU with high compatibility, so models with RGB inputs are also supported. Here is our model summary.

Im2col function

Two functions are designed to do the transformation between matrices and images (see the sketch after this list):

  1. Define im2col to turn RGB images and multi-channel weights into matrices.
  2. Define col2im to reshape the matrices back, so we can apply activation and max-pooling with the correct spatial relations.
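A minimal numpy sketch of the idea (stride 1, no padding, and the data layout are assumptions; the project's functions may differ in detail):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) image into a matrix whose rows are C*kh*kw patches."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.zeros((out_h * out_w, C * kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[:, i:i+kh, j:j+kw].ravel()
    return cols

def col2im(cols, out_c, out_h, out_w):
    """Reshape the GEMM output rows back into a (out_c, out_h, out_w) feature map."""
    return cols.reshape(out_h, out_w, out_c).transpose(2, 0, 1)

# Convolution as GEMM: each (C, kh, kw) filter becomes one column of the weight matrix.
x = np.random.randn(3, 28, 28)                  # an RGB input
w = np.random.randn(8, 3, 5, 5)                 # 8 filters of size 3x5x5
cols = im2col(x, 5, 5)                          # (576, 75)
out = cols @ w.reshape(8, -1).T                 # one GEMM: (576, 8)
fmap = col2im(out, 8, 24, 24)                   # back to (8, 24, 24)
```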

Eliminate TensorFlow dependency

Although the convolution and fully-connected layers are accelerated by the TPU in the PL, the PS still has to handle all the other functions, such as activation, pooling, and softmax. Moreover, the Keras framework is not supported on the ARM CPU. To eliminate the dependency, the model's weights are extracted and the model is rebuilt on the ZYNQ processor without importing the TensorFlow library. We also use self-made CONV and FC functions for algorithm-level simulation to verify correctness before the hardware design; all of them also have a fixed-point version. A minimal sketch is shown below.
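A minimal sketch of the idea with random stand-in weights (in practice the arrays would come from the weights exported from the trained Keras model, e.g. via np.save/np.load); the layer sizes are placeholders:

```python
import numpy as np

# Helper ops re-implemented with numpy so the PS needs no TensorFlow/Keras.
def relu(x):
    return np.maximum(x, 0.0)

def maxpool2x2(x):                       # x: (C, H, W), H and W even
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fc(x, w, b):                         # the GEMM part that the TPU accelerates
    return x @ w + b

# Toy forward tail with random stand-in weights.
feat = maxpool2x2(relu(np.random.randn(8, 24, 24)))   # output of a conv layer
w, b = np.random.randn(8 * 12 * 12, 10), np.random.randn(10)
probs = softmax(fc(feat.ravel(), w, b))
```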

Fixed point quantization

For hardware design, it is easier to treat floating-point numbers as fixed-point. However, a bad quantization can seriously decrease accuracy. The following shows some experiments with different fraction bit lengths for our model. We found that setting the fraction bit length to 4~8 yields better results; 8 is finally chosen.
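A small sketch of the kind of experiment described, assuming a 16-bit word length for illustration (the actual word length used in the project may differ):

```python
import numpy as np

def quantize(x, frac_bits, word_bits=16):
    """Round x to signed fixed-point with `frac_bits` fraction bits, then dequantize."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    q = np.clip(np.round(x * scale), lo, hi).astype(np.int32)
    return q / scale

w = np.random.randn(1000) * 0.5          # stand-in for a layer's weights
for f in (2, 4, 8, 12):
    err = np.abs(quantize(w, f) - w).max()
    print(f"frac_bits={f:2d}  max abs error={err:.5f}")
```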

A simple model with only one convolution layer is used to visualize our quantization results.

Hardware

CDMA

Since our matrices are quite large (at least 16 KB each), the heavy data transfers may become the bottleneck.

Our solution: AXI Central DMA (CDMA) provides high-bandwidth Direct Memory Access (DMA) between a memory-mapped source address and a memory-mapped destination address using the AXI4 protocol. In our project it connects the ZYNQ processor's DDR3 memory (through a High Performance port) and the block RAM. An AXI Interconnect module is also introduced to handle the master/slave connections.
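A hedged sketch of driving the CDMA from the PS with the PYNQ Python API: the bitstream name and IP instance names below are placeholders, and the register offsets follow the Xilinx AXI CDMA register map (simple transfer mode).

```python
import numpy as np
from pynq import Overlay, allocate, MMIO

ol = Overlay("tpu.bit")                               # placeholder bitstream name
cdma = MMIO(ol.ip_dict["axi_cdma_0"]["phys_addr"], 0x1000)
bram_base = ol.ip_dict["axi_bram_ctrl_0"]["phys_addr"]

src = allocate(shape=(4096,), dtype=np.uint32)        # 16 KB buffer in DDR
src[:] = np.arange(4096, dtype=np.uint32)

CDMACR, CDMASR, SA, DA, BTT = 0x00, 0x04, 0x18, 0x20, 0x28

cdma.write(SA, src.physical_address)                  # source: DDR buffer
cdma.write(DA, bram_base)                             # destination: TPU global_buffer BRAM
cdma.write(BTT, src.nbytes)                           # writing BTT starts the transfer
while not cdma.read(CDMASR) & 0x2:                    # poll the Idle bit
    pass
```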

Memory map layout

The following figures show the memory map layout for the TPU's three I/O buffers (global_buffer). Based on the computation below, three 64 KB buffers are implemented. Note that the word length in the PL is 128 bits.
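For reference, a small sketch of how the buffer sizing works out and how a quantized int8 matrix could be packed into 128-bit words; the 8-bit element width, row-major order, and zero padding are assumptions here, and the actual layout is defined by the memory-map figures.

```python
import numpy as np

WORD_BITS = 128                            # PL word length
ELEM_BITS = 8                              # assumed element width after quantization
ELEMS_PER_WORD = WORD_BITS // ELEM_BITS    # 16 elements per 128-bit word
BUF_BYTES = 64 * 1024                      # one global_buffer

print(BUF_BYTES // (WORD_BITS // 8), "words per buffer")   # 4096

def pack_matrix(m):
    """Flatten an int8 matrix into 128-bit words, represented as 16-byte rows."""
    flat = m.astype(np.int8).ravel()
    pad = (-flat.size) % ELEMS_PER_WORD
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.int8)])
    return flat.reshape(-1, ELEMS_PER_WORD)

words = pack_matrix(np.random.randint(-128, 128, (28, 28)))
print(words.shape)                         # (49, 16) for a 28x28 int8 matrix
```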

Infer Block RAM & DSP module

  1. global_buffer: the I/O buffers that the TPU uses to buffer the input matrices and the output matrix; they can utilize block RAM. We have two types of I/O buffers, inferred with the coding style provided by Xilinx.

The total block memory on the PYNQ is 630 KB: 630 KB × 11.43% ≈ 72 KB per I/O buffer. The difference between 64 KB and 72 KB is due to the ECC (parity) bits.

  2. PE: the essential component of the TPU that performs the MAC operation; it can utilize the DSP module. We infer it by describing the hardware architecture of the DSP48E1, and the multiplier's pipeline register is also taken into account (a small behavioral sketch follows this list).
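As referenced in the list above, here is a tiny behavioral sketch of one PE's MAC with the multiplier's pipeline register modeled as a one-cycle delay (roughly how the DSP48E1's MREG stage behaves); this is illustration only, not the Verilog PE.

```python
def pe_mac(a_stream, b_stream):
    acc, prod_reg = 0, 0
    for a, b in zip(a_stream, b_stream):
        acc += prod_reg        # accumulate last cycle's registered product
        prod_reg = a * b       # the multiply happens one pipeline stage earlier
    return acc + prod_reg      # flush the pipeline register

assert pe_mac([1, 2, 3], [4, 5, 6]) == 1*4 + 2*5 + 3*6
```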

Block design

Two versions of the block design:

  1. "CDMA + TDP BRAM" version
  2. "SDP Asymmetric BRAM" version

Results

Utilization

Unfortunately, on the PYNQ board, the results from our system do not match the software results. We found that the output after the convolution layer is all zeros.

Discussion