Systolic array based simple TPU for CNN on PYNQ-Z2

FPGA Term Project - Simple TPU

Team members: 蕭又瑜 (E24076459), 陳志瑜 (E24071069), 鍾震 (E24073037)

Introduction

CNNs are widely used in image-related machine learning applications due to their high accuracy in image recognition. Convolutional and fully-connected layers are the two essential building blocks of a CNN. Our goal is to design a simple TPU to accelerate both of them.

Algorithm

Target model for our TPU: CNN

2D Convolutional layers

GEMM

ref: https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/

GEMM on Fully-Connected Layers
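As a minimal numpy sketch (not the project's actual code), the example below shows how a whole fully-connected layer over a batch of inputs reduces to a single GEMM; the layer sizes are arbitrary placeholders.

```python
import numpy as np

# Placeholder sizes: a batch of 4 flattened inputs and a 128-unit FC layer.
batch, in_dim, out_dim = 4, 784, 128

x = np.random.randn(batch, in_dim)     # activations, one row per sample
w = np.random.randn(in_dim, out_dim)   # layer weights
b = np.random.randn(out_dim)           # bias

y = x @ w + b                          # the whole FC layer is one GEMM plus a bias add
print(y.shape)                         # (4, 128)
```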

GEMM for Convolutional Layers

Architecture

Dataflow of GEMM

Iterative Decomposition

After iterative decomposition

Why systolic architecture?

How GEMM can utilize systolic architecture

Use 1D systolic array

Use 2D systolic array
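The following is a rough behavioral sketch in Python (not the Verilog design) of how an output-stationary 8x8 systolic array computes C = A·B: operands are fed in skewed from the left and top edges, every PE does one multiply-accumulate per cycle and passes its operands to its right and bottom neighbours. The array size, the inner dimension K, and the output-stationary dataflow are assumptions for illustration.

```python
import numpy as np

N, K = 8, 8                        # 8x8 PE array; inner dimension K is an assumption
A = np.random.randint(-8, 8, (N, K)).astype(float)
B = np.random.randint(-8, 8, (K, N)).astype(float)

a_reg = np.zeros((N, N))           # operand latched in each PE, moving right
b_reg = np.zeros((N, N))           # operand latched in each PE, moving down
acc   = np.zeros((N, N))           # output-stationary accumulators

for t in range(K + 2 * (N - 1)):   # enough cycles for the last PE to finish
    a_in = np.zeros((N, N))
    b_in = np.zeros((N, N))
    for i in range(N):             # skewed feed on the left edge
        a_in[i, 0] = A[i, t - i] if 0 <= t - i < K else 0.0
    for j in range(N):             # skewed feed on the top edge
        b_in[0, j] = B[t - j, j] if 0 <= t - j < K else 0.0
    a_in[:, 1:] = a_reg[:, :-1]    # interior PEs take their left neighbour's latched value
    b_in[1:, :] = b_reg[:-1, :]    # and their top neighbour's latched value
    acc += a_in * b_in             # one MAC per PE per cycle
    a_reg, b_reg = a_in, b_in

assert np.allclose(acc, A @ B)
```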

What if the output matrix C has a dimension larger than (8, 8)?
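One common answer, sketched below under the assumption that the array produces one 8x8 block of C per pass, is to tile the output matrix and run the array once per tile (padding when the dimensions are not multiples of 8). The per-tile product here is plain numpy standing in for a hardware pass.

```python
import numpy as np

TILE = 8  # the systolic array produces one 8x8 block of C per pass (assumption)

def tiled_gemm(A, B, tile=TILE):
    """Compute C = A @ B one 8x8 output tile at a time."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0  # real design would pad
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            # One "hardware pass": rows i..i+7 of A and columns j..j+7 of B
            # are streamed through the array to produce C[i:i+8, j:j+8].
            C[i:i+tile, j:j+tile] = A[i:i+tile, :] @ B[:, j:j+tile]
    return C

A = np.random.randn(16, 32)
B = np.random.randn(32, 24)
assert np.allclose(tiled_gemm(A, B), A @ B)
```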

Implementation

Software

Target model

In this project, the target model is a simple CNN for the MNIST dataset, trained with the TensorFlow framework. Our goal is to design a TPU with high compatibility, so models with RGB inputs are also supported. Here is our model summary.

Im2col function

Two functions are designed to do the transformation between matrices and images (see the sketch after this list):

  1. Define im2col to turn RGB images and multi-channel weights into matrices.
  2. Define col2im to reshape the matrices back, so we can apply activation and max-pooling with the correct spatial relations.
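A minimal numpy sketch of the idea (stride 1, no padding, and the data layout are assumptions; the project's functions may differ in detail):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) image into a matrix whose rows are C*kh*kw patches."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.zeros((out_h * out_w, C * kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[:, i:i+kh, j:j+kw].ravel()
    return cols

def col2im(cols, out_c, out_h, out_w):
    """Reshape the GEMM output rows back into a (out_c, out_h, out_w) feature map."""
    return cols.reshape(out_h, out_w, out_c).transpose(2, 0, 1)

# Convolution as GEMM: each (C, kh, kw) filter becomes one column of the weight matrix.
x = np.random.randn(3, 28, 28)                  # an RGB input
w = np.random.randn(8, 3, 5, 5)                 # 8 filters of size 3x5x5
cols = im2col(x, 5, 5)                          # (576, 75)
out = cols @ w.reshape(8, -1).T                 # one GEMM: (576, 8)
fmap = col2im(out, 8, 24, 24)                   # back to (8, 24, 24)
```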

Eliminate TensorFlow dependency

Although the convolution and fully-connected layers are accelerated by the TPU in the PL, the PS still has to handle all the other functions, such as activation, pooling, and softmax. Moreover, the Keras framework is not supported on the ARM CPU. To eliminate the dependency, the model's weights are extracted and the model is rebuilt on the ZYNQ processor without importing the TensorFlow library. We also use self-made CONV and FC functions for algorithm-level simulation to verify correctness before the hardware design; all of them also have a fixed-point version. A minimal sketch is shown below.
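A minimal sketch of the idea with random stand-in weights (in practice the arrays would come from the weights exported from the trained Keras model, e.g. via np.save/np.load); the layer sizes are placeholders:

```python
import numpy as np

# Helper ops re-implemented with numpy so the PS needs no TensorFlow/Keras.
def relu(x):
    return np.maximum(x, 0.0)

def maxpool2x2(x):                       # x: (C, H, W), H and W even
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fc(x, w, b):                         # the GEMM part that the TPU accelerates
    return x @ w + b

# Toy forward tail with random stand-in weights.
feat = maxpool2x2(relu(np.random.randn(8, 24, 24)))   # output of a conv layer
w, b = np.random.randn(8 * 12 * 12, 10), np.random.randn(10)
probs = softmax(fc(feat.ravel(), w, b))
```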

Fixed point quantization

For hardware design, it is easier to treat floating-point numbers as fixed-point. However, a bad quantization can seriously decrease accuracy. The following shows some experiments with different fraction bit lengths for our model. We found that setting the fraction bit length to 4~8 yields better results; 8 is finally chosen.
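A small sketch of the kind of experiment described, assuming a 16-bit word length for illustration (the actual word length used in the project may differ):

```python
import numpy as np

def quantize(x, frac_bits, word_bits=16):
    """Round x to signed fixed-point with `frac_bits` fraction bits, then dequantize."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    q = np.clip(np.round(x * scale), lo, hi).astype(np.int32)
    return q / scale

w = np.random.randn(1000) * 0.5          # stand-in for a layer's weights
for f in (2, 4, 8, 12):
    err = np.abs(quantize(w, f) - w).max()
    print(f"frac_bits={f:2d}  max abs error={err:.5f}")
```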

A simple model with only one convolution layer is used to visualize our quantization results.

Hardware

CDMA

Since our matrices are quite large (at least 16 KB each), the heavy data transfers may become the bottleneck.

Our solution: AXI Central DMA (CDMA) provides high-bandwidth Direct Memory Access (DMA) between a memory-mapped source address and a memory-mapped destination address using the AXI4 protocol. In our project it connects the ZYNQ processor's DDR3 memory (through a High Performance port) and the block RAM. An AXI Interconnect module is also introduced to handle the master/slave connections.
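A hedged sketch of driving the CDMA from the PS with the PYNQ Python API: the bitstream name and IP instance names below are placeholders, and the register offsets follow the Xilinx AXI CDMA register map (simple transfer mode).

```python
import numpy as np
from pynq import Overlay, allocate, MMIO

ol = Overlay("tpu.bit")                               # placeholder bitstream name
cdma = MMIO(ol.ip_dict["axi_cdma_0"]["phys_addr"], 0x1000)
bram_base = ol.ip_dict["axi_bram_ctrl_0"]["phys_addr"]

src = allocate(shape=(4096,), dtype=np.uint32)        # 16 KB buffer in DDR
src[:] = np.arange(4096, dtype=np.uint32)

CDMACR, CDMASR, SA, DA, BTT = 0x00, 0x04, 0x18, 0x20, 0x28

cdma.write(SA, src.physical_address)                  # source: DDR buffer
cdma.write(DA, bram_base)                             # destination: TPU global_buffer BRAM
cdma.write(BTT, src.nbytes)                           # writing BTT starts the transfer
while not cdma.read(CDMASR) & 0x2:                    # poll the Idle bit
    pass
```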

Memory map layout

The following figures show the memory map layout for the TPU's three I/O buffers (global_buffer). Based on the computation below, three 64 KB buffers are implemented. Note that the word length in the PL is 128 bits.
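For reference, a small sketch of how the buffer sizing works out and how a quantized int8 matrix could be packed into 128-bit words; the 8-bit element width, row-major order, and zero padding are assumptions here, and the actual layout is defined by the memory-map figures.

```python
import numpy as np

WORD_BITS = 128                            # PL word length
ELEM_BITS = 8                              # assumed element width after quantization
ELEMS_PER_WORD = WORD_BITS // ELEM_BITS    # 16 elements per 128-bit word
BUF_BYTES = 64 * 1024                      # one global_buffer

print(BUF_BYTES // (WORD_BITS // 8), "words per buffer")   # 4096

def pack_matrix(m):
    """Flatten an int8 matrix into 128-bit words, represented as 16-byte rows."""
    flat = m.astype(np.int8).ravel()
    pad = (-flat.size) % ELEMS_PER_WORD
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.int8)])
    return flat.reshape(-1, ELEMS_PER_WORD)

words = pack_matrix(np.random.randint(-128, 128, (28, 28)))
print(words.shape)                         # (49, 16) for a 28x28 int8 matrix
```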

Infer Block RAM & DSP module

  1. global_buffer: the I/O buffers that the TPU uses to buffer the input matrices and the output matrix; they can utilize block RAM. We have two types of I/O buffers, inferred with the coding style provided by Xilinx.

The total block memory on the PYNQ is 630 KB: 630 KB × 11.43% ≈ 72 KB per I/O buffer. The difference between 64 KB and 72 KB is due to the ECC (parity) bits.

  2. PE: the essential component of the TPU that performs the MAC operation; it can utilize the DSP module. We infer it by describing the hardware architecture of the DSP48E1, and the multiplier's pipeline register is also taken into account (a small behavioral sketch follows this list).
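As referenced in the list above, here is a tiny behavioral sketch of one PE's MAC with the multiplier's pipeline register modeled as a one-cycle delay (roughly how the DSP48E1's MREG stage behaves); this is illustration only, not the Verilog PE.

```python
def pe_mac(a_stream, b_stream):
    acc, prod_reg = 0, 0
    for a, b in zip(a_stream, b_stream):
        acc += prod_reg        # accumulate last cycle's registered product
        prod_reg = a * b       # the multiply happens one pipeline stage earlier
    return acc + prod_reg      # flush the pipeline register

assert pe_mac([1, 2, 3], [4, 5, 6]) == 1*4 + 2*5 + 3*6
```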

Block design

Two versions of the block design:

  1. "CDMA + TDP BRAM" version
  2. "SDP Asymmetric BRAM" version

Results

Utilization

Unfortunately, on the PYNQ board, the results from our system do not match the software results. We found that the output after the convolution layer is all zeros.

Discussion