zeldin / litehyperram

LiteX driver for Cypress HyperRAM

VexRiscV-SMP read speed: dcache width impact 32bit vs 64bit #1

Closed pottendo closed 1 year ago

pottendo commented 1 year ago

Hello @zeldin, sorry to bother you on this channel; it seems to be the right place, though (see the discussion referenced here). I had an exchange with @Dolu1990 about the memory read speed shown during the LiteX boot of RVCop64 using the VexRiscv-SMP CPU, see https://github.com/SpinalHDL/VexRiscv/issues/368 Maybe that's an interesting challenge for you to solve?!

Feel free to close it if you can't look into this subject. Happy hacking, pottendo

zeldin commented 1 year ago

The HyperRAM interface supports neither 32-bit nor 64-bit bursts, only 16-bit bursts (since the interface width is 16 bits (actually 8 bits, but with DDR)). So regardless of whether you use a 32-bit or a 64-bit system bus, the accesses get converted by a StrideConverter, inserted by litehyperram.frontend.wishbone.LiteHyperRAMWishbone2Native. So in both cases there should be 16-bit bursts, just of different lengths. And larger bursts should be more efficient (= higher throughput). No idea why it would be slower.
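
To illustrate that point, here is a minimal sketch (a hypothetical helper, not litehyperram code) of how one bus word decomposes into 16-bit beats; a wider system bus simply produces a longer burst of the same 16-bit beats:

```python
# Hypothetical illustration, not litehyperram code: one bus word always
# becomes a burst of 16-bit beats on the HyperRAM side; a 64-bit bus just
# produces twice as many beats per access as a 32-bit bus.
def to_16bit_beats(word, bus_width):
    """Split one little-endian bus word into its 16-bit beats."""
    assert bus_width in (32, 64)
    return [(word >> (16 * i)) & 0xFFFF for i in range(bus_width // 16)]

print(to_16bit_beats(0x11223344, 32))          # 2 beats
print(to_16bit_beats(0x1122334455667788, 64))  # 4 beats
```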

zeldin commented 1 year ago

Hm, one thing does come to mind actually: In RVCop64 the wishbone interface connected to the HyperRAM is created as just Interface(), which creates a wishbone interface with the default data width of 32 bits. This line: https://github.com/zeldin/RVCop64/blob/687403605952c44469a474794271abcf38632ad9/hw/rtl/platform/orangecart.py#L159 You might want to try changing that to Interface(data_width=64) for the 64-bit case to see if it makes any difference...
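
For reference, a sketch of the suggested change (the variable name is assumed; see the linked line in orangecart.py for the actual context):

```python
from litex.soc.interconnect import wishbone

# Before: defaults to a 32-bit wishbone data path
# wb = wishbone.Interface()

# After: match the 64-bit CPU bus so no width conversion is needed at this point
wb = wishbone.Interface(data_width=64)
```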

pottendo commented 1 year ago

Hi, yes, it makes a difference, unfortunately not for the better. It locks up now here:

$ litex_term /dev/ttyACM0 
[...]
 BIOS built on Oct  1 2023 09:35:12
 BIOS CRC passed (a4a84f22)

 LiteX git sha1: 4959c6c8

--=============== SoC ==================--
CPU:        VexRiscv SMP-LINUX @ 80MHz
BUS:        WISHBONE 32-bit @ 4GiB
CSR:        32-bit data
ROM:        48KiB
SRAM:       16KiB
MAIN-RAM:   16384KiB 

--========== Initialization ============--
Memtest at 0x40000000 (2.0MiB)...
memtest_access error @ 0x40000000, exiting memtest.
Memspeed at 0x40000000 (Sequential, 2.0MiB)...
  Write speed: 20.3MiB/s

The write path seems to work, the read path locks up. :-(

bye, pottendo

Dolu1990 commented 1 year ago

Could you try with --wishbone-force-32b ?

This will force VexRiscv to generate a 32-bit memory bus to connect to LiteX, instead of propagating a 64-bit one (I think).

pottendo commented 1 year ago

Hi, thanks for the hint. No, I haven't tried this yet. Will do. I'm traveling, so it will take a few days.

zeldin commented 1 year ago

It appears that the reason sequential read is slow on VexRiscV-SMP is that it doesn't drive the CTI signals on the wishbone bus. Without these signals the HyperRAM controller can't know that the next access will be to the following address, and it is forced to terminate the burst. Regular VexRiscV by contrast drives these signals and achieves a much higher throughput because of it. I'm attaching litescope captures of the two setups. Notice how in the non-smp dump eight words can be fetched at a time due to the presence of CTI. dumps.zip
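
For reference, these are the Wishbone registered-feedback CTI encodings involved here (per the Wishbone B4 spec; the constant names below are just illustrative):

```python
# Wishbone B4 Cycle Type Identifier (CTI) encodings:
CTI_CLASSIC      = 0b000  # classic cycle, no burst hint
CTI_CONST_BURST  = 0b001  # constant-address burst
CTI_INCR_BURST   = 0b010  # incrementing-address burst: next access is addr + 1
CTI_END_OF_BURST = 0b111  # last access of the current burst
```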

zeldin commented 1 year ago

I believe the problem is that litex.soc.interconnect.wishbone.DownConverter doesn't pass through CTI, so it gets lost at the 64->32 bit boundary.
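
A minimal sketch of the translation the down-converter needs (a hypothetical helper, not the actual LiteX patch): while one wide access is split into `ratio` narrow beats, every beat except the last is by construction followed by the next consecutive address, so the narrow bus can signal an incrementing burst; only at the word boundary does the master's own CTI matter.

```python
CTI_INCR_BURST   = 0b010
CTI_END_OF_BURST = 0b111

def downconverted_cti(beat, ratio, master_cti):
    """CTI to drive on the narrow bus for sub-beat `beat` (0-based) of `ratio`."""
    if beat < ratio - 1:
        return CTI_INCR_BURST  # next narrow access is guaranteed to be addr + 1
    return master_cti          # word boundary: propagate the wide-side hint

# 64 -> 32 bit (ratio 2): a wide incrementing burst stays an incrementing burst,
# and the wide master's end-of-burst only shows up on the final narrow beat.
assert downconverted_cti(0, 2, CTI_END_OF_BURST) == CTI_INCR_BURST
assert downconverted_cti(1, 2, CTI_END_OF_BURST) == CTI_END_OF_BURST
assert downconverted_cti(1, 2, CTI_INCR_BURST) == CTI_INCR_BURST
```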

zeldin commented 1 year ago

Yup, after fixing DownConverter to convert CTI as well, I do get CTI to the HyperRAM controller, and the speed is back:

Memspeed at 0x40000000 (Sequential, 2.0MiB)...
  Write speed: 18.7MiB/s
   Read speed: 48.8MiB/s

I should upstream this, but I need to sleep now. :smile:

pottendo commented 1 year ago

Great work! Thanks & good night!

pottendo commented 1 year ago

Hi, @Dolu1990 mentioned that the write speed is normally expected to be higher with this CPU; one can see that in simulation. Would this apply to every hardware target?

Furthermore, I couldn't build the bitstream after updating all the submodules in deps/*.

As said, I'm traveling. I'll test this once I'm back home.

Thanks for the efforts! Great progress! Bye pottendo

pottendo commented 1 year ago

Hi @zeldin, I'm back and have merged the changes you've done in RVCop64 into my tree. Unfortunately my tree is now broken and the boot fails with a memtest error.

[...]
--========== Initialization ============--
Memtest at 0x40000000 (2.0MiB)...
memtest_access error @ 0x40000000, exiting memtest.

I've merged:

commit 1989b93e35a1cdd325e7443877917f994be423f3 (zeldin-origin/master)
Author: Marcus Comstedt <marcus@mc.pp.se>
Date:   Sat Oct 7 11:57:45 2023 +0200

    Avoid use of deprecated register_mem

and

commit f75ffffd550248ee9f416778c14ef15d9383186e
Author: Marcus Comstedt <marcus@mc.pp.se>
Date:   Fri Oct 6 21:42:16 2023 +0200

    mailbox: Break back-to-back wishbone cycles

Further I've updated the deps:

hw/deps/litehyperram(master)$ git status
On branch master
Your branch is up to date with 'origin/master'
hw/deps/litex(wb_downconverter_cti)$ git status
On branch wb_downconverter_cti
Your branch is up to date with 'origin/wb_downconverter_cti'.

and

hw/deps/pythondata-cpu-vexriscv_smp(master)$ git status
On branch master
Your branch is up to date with 'origin/master'.

In addition to switching the branches, I also pulled the submodules in hw/deps/pythondata-cpu-vexriscv_smp and set them to their respective master branches (otherwise the build would fail).

I've tried to build with the vexriscv_smp CPU and also with vexriscv, even just 32-bit (i.e. not using the FPU or an explicit --dcache-width=64). So I obviously need to update more things?

thx, pottendo

PS: I've merged your changes into my clone of RVCop64 (not my fork where I added vexriscv_smp); there I can build a working bitstream (with the standard CPU, of course), so I think my tooling is still OK.

zeldin commented 1 year ago

@pottendo The wb_downconverter_cti branch is the pull request branch. It only contains the upstreaming of the CTI fix, not other things needed for RVCop64 that I'm not trying to upstream at this juncture. Specifically it doesn't have the adjusted clock generator phase setup, which is needed for the HyperRAM to work properly. So it's not surprising that HyperRAM would not work if you use that branch as is.

I intend to rebase my wip branch on litex master once the PR has been merged, and update the submodule references in RVCop64 at that time. Please wait for this instead of mixing and matching branches. :smile:

pottendo commented 1 year ago

Could you try with --wishbone-force-32b ?

This will force VexRiscv to generate a 32-bit memory bus to connect to LiteX, instead of propagating a 64-bit one (I think).

Hi @Dolu1990, at the moment I can't try this, as @zeldin's project needs some porting/updating and the 'old' frozen litex package I've been using doesn't support --wishbone-force-32b. Thanks for the feedback & hint. I'll let you know how things are going, pottendo

Dolu1990 commented 1 year ago

@pottendo

@Dolu1990 mentioned that the write speed is normally expected to be higher with this CPU; one can see that in simulation. Would this apply to every hardware target?

Yes, likely. The thing is, VexRiscv has a write-through cache, so as long as the memory system can absorb the bandwidth, the CPU will never be blocked on stores. For memory loads, on the other hand, the CPU has to wait for the cache line to be refilled, which is around 30 cycles on a 100 MHz core with DDR3 litedram.

pottendo commented 1 year ago

Well, then we have the next challenge for @zeldin! ;-) all the best, pottendo

zeldin commented 1 year ago

I'm pretty sure there's just some confusion here. Because the cache is write-through, the writes are sent to the memory individually, so you will not get any bursts and you pay the full brunt of the memory latency for each single access. For linear reads, on the other hand, you will only suffer the memory latency once per cache line. So obviously read performance is going to be much better than write performance when doing linear access to a high-latency memory system... The simulation is probably done using SRAM, which is a completely different cup of tea. (You can test that in RVCop64 since it has SRAM as well; just pick some region in the middle of the memory where you will hit neither global variables nor the stack...)

pottendo commented 1 year ago

...as always, @zeldin knows!

--=============== SoC ==================--
CPU:        VexRiscv_Debug @ 64MHz
BUS:        WISHBONE 32-bit @ 4GiB
CSR:        32-bit data
ROM:        48KiB
SRAM:       16KiB
MAIN-RAM:   16384KiB 

--========== Initialization ============--
Memtest at 0x40000000 (2.0MiB)...
  Write: 0x40000000-0x40200000 2.0MiB
   Read: 0x40000000-0x40200000 2.0MiB
Memtest OK
Memspeed at 0x40000000 (Sequential, 2.0MiB)...
  Write speed: 18.7MiB/s
   Read speed: 48.7MiB/s
[...]
--============= Console ================--

litex> mem_speed 0x10001000 0x2800
Memspeed at 0x10001000 (Sequential, 10.0KiB)...
  Write speed: 107.8MiB/s
   Read speed: 51.7MiB/s

litex> mem_list
Available memory regions:
SRAM      0x10000000 0x4000 
MAIN_RAM  0x40000000 0x1000000 
ROM       0x70000000 0xc000 
C64       0x00000000 0x10000 
MAILBOX   0xe0000000 0x40 
CSR       0xf0000000 0x10000 

and with vexriscv_smp (+FPU etc.):

--=============== SoC ==================--
CPU:        VexRiscv SMP-LINUX @ 64MHz
BUS:        WISHBONE 32-bit @ 4GiB
CSR:        32-bit data
ROM:        48KiB
SRAM:       16KiB
MAIN-RAM:   16384KiB 

--========== Initialization ============--
Memtest at 0x40000000 (2.0MiB)...
  Write: 0x40000000-0x40200000 2.0MiB
   Read: 0x40000000-0x40200000 2.0MiB
Memtest OK
Memspeed at 0x40000000 (Sequential, 2.0MiB)...
  Write speed: 18.7MiB/s
   Read speed: 11.0MiB/s
[...]
--============= Console ================--

litex> mem_speed 0x10001000 0x2800
Memspeed at 0x10001000 (Sequential, 10.0KiB)...
  Write speed: 80.6MiB/s
   Read speed: 41.7MiB/s

litex> mem_list
Available memory regions:
OPENSBI   0x40f00000 0x80000 
PLIC      0xf0c00000 0x400000 
CLINT     0xf0010000 0x10000 
SRAM      0x10000000 0x4000 
MAIN_RAM  0x40000000 0x1000000 
ROM       0x00000000 0xc000 
C64       0x0f000000 0x10000 
MAILBOX   0x80000000 0x40 
CSR       0xf0000000 0x10000 

thanks for clarifying! pottendo

Dolu1990 commented 1 year ago

For HyperRAM in particular, it is quite possible that the HyperRAM can't accept writes as fast as the CPU can issue them, as each memory store from the CPU becomes its own access, unlike sequential memory loads, which are grouped into a single nice burst.

zeldin commented 1 year ago

The hyperram has a bandwidth of 32 bits per cycle (which even SRAM can't beat given that the system bus is 32 bits), but if the master can't keep the burst going for even a single cycle (or needs to switch to a different address), you have to end the transaction and start a new one when you want to read or write more, which is a 10+ cycle latency. Technically it's possible to stop the clock if you want to pause a burst, but this is not supported by the PHY module.
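
A rough back-of-the-envelope illustration of that trade-off (assumed numbers only: the 64 MHz clock from the boot log, the "10+ cycle" figure quoted above, and an assumed 32-byte cache line; nothing here is measured):

```python
SYS_CLK_HZ    = 64e6  # SoC clock from the boot log
OVERHEAD_CYC  = 10    # per-transaction setup/teardown latency ("10+ cycles")
BYTES_PER_CYC = 4     # 32 bits per cycle once a burst is running
LINE_BYTES    = 32    # assumed dcache line size (8 x 32-bit words)

# Write-through store: one 4-byte transaction per store, full overhead every time.
write_mibs = SYS_CLK_HZ * 4 / (OVERHEAD_CYC + 1) / 2**20

# Cached read: the overhead is amortised over a whole cache-line refill burst.
read_mibs = SYS_CLK_HZ * LINE_BYTES / (OVERHEAD_CYC + LINE_BYTES // BYTES_PER_CYC) / 2**20

print(f"writes ~{write_mibs:.0f} MiB/s, line-refill reads ~{read_mibs:.0f} MiB/s")
# -> roughly 22 MiB/s vs 109 MiB/s; the measured numbers are lower because the
#    real transaction overhead is larger than this optimistic estimate.
```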

pottendo commented 1 year ago

I have merged your changes and now the speed is where it's supposed to be. vexriscv_smp is still a bit slower than vexriscv, but this is probably because of the richer functionality of the larger CPU. I still need to investigate a significant difference when using the mailbox shared memory; I need to do more tests to find out more. I saw that even before the change, with the 32-bit CPU (without FPU).

And, by the way, the access to the shared memory is now OK with respect to the data, so no duplication of every second 4 bytes happens anymore.

Good work @zeldin - thanks, pottendo