zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0
10.93k stars 6.65k forks source link

Hardcoded and unchecked IMR address map for Intel ADSP can cause memory corruption #76247

Open marc-hb opened 4 months ago

marc-hb commented 4 months ago

Describe the bug

This is a follow-up to https://github.com/thesofproject/sof/issues/9308 where the audio DSP firmware on MTL stopped booting across the board after a random combination of Pull Requests that all passed separately. After a long and tedious investigation by @tmleman and @lyakh , it was found that the IMR memory had been corrupted for a long time - we just never noticed it before.

The IMR (Isolated Memory Region) is an area of the main/host memory that stays up when the audio DSP (including its local memory) is off. It speeds up restarting the audio firmware. Starting with the "ACE" generation it is the standard way to restart (there is a sof_debug=0x80 debug bit/option to boot from scratch instead). Among others, the IMR contains a complete image of the firmware code which happened to be partially overwritten.

The reason it was partially overwritten is because the adsp_memory.h file (see https://github.com/zephyrproject-rtos/zephyr/commit/6069f946be1bd502) maps that IMR region with hardcoded constants that became disconnected with reality.

The urgent and temporary fix submitted in #76196 re-hardcodes "better" values but that's obviously not sustainable. This GitHub issue is to discuss and track the longer term solution(s).

To Reproduce

git clone git@github.com:thesofproject/sof.git
git checkout d629e521020c
west update
sof/scripts/xtensa-build-zephyr.py mtl

Inspect the code size and notice that it is to big for the IMR area meant for it.

Originally posted by @tmleman in https://github.com/thesofproject/sof/issues/9308#issuecomment-2244736317

@lyakh thank you for the (temporary) fix. I had a problem with decoding all these macros and I wasn't sure if it would be enough to shift the IMR stack.

To better outline what the problem was, here is a small map of the IMR along with the addresses from which the FW was copied:

L3_MEM_BASE_ADDR             = 0xa1000000
IMR_BOOT_LDR_MANIFEST_BASE   = 0xa1042000
IMR_BOOT_LDR_TEXT_ENTRY_BASE = 0xa1048000
IMR_BOOT_LDR_LIT_BASE        = 0xa1048180
IMR_BOOT_LDR_TEXT_BASE       = 0xa10481c0
IMR_BOOT_LDR_DATA_BASE       = 0xa1049000 # The beginning of the memory copied to the hpsram.
                               0xa1049000 -> 0xa0030000
IMR_BOOT_LDR_BSS_BASE        = 0xa1110000
IMR_BOOT_LDR_STACK_BASE      = 0xa1120000
IMR_L3_HEAP_BASE             = 0xa1121000
The last copied address      = 0xa1133000 -> 0xa011a000

The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively. CI caught the problem only in builds with assertions enabled, but also in a normal build, the heap initialization was overwriting our FW, it just didn't cause crashes.

Expected behavior

At build time, a linker or elfutils script must 1. either compute some optimal allocation, or 2. at the very least check that areas are big enough for their intended purposes and fail when not big enough.

If possible, additional runtime checks/protections cannot hurt.

Impact

As usual with memory corruption: critical.

Failure to boot, security risks, etc.

andyross commented 4 months ago

@ceolin @kv2019i will want to be in on this too I suspect.

FWIW an awful lot of that stuff doesn't need to be managed automatically manually at all. The .text/.literal/.data offsets all come out cleanly from the linker, the stack is just a memory array, we automatically compute general heap boundaries with symbols marking the extents. Things like the "boot loader stack" should really be unified with the existing core 0 interrupt stack for efficiency. The memory windows can be sized as general buffers with an __aligned tag, etc... There's a ton of manual management here that is just historical and needless.

And the remaining areas that can't be automatic will end up corresponding to "addresses the CSME, boot ROM or host loader code knows about external to the firmware". And those probably belong in device tree or an evolved loader protocol.

marc-hb commented 3 months ago

Tentative Zephyr upgrade in sof/west.yml with the temporary fix:

EDI: sorry for the noise this comment was meant for related https://github.com/thesofproject/sof/issues/9308

dcpleung commented 2 months ago

Assigning this to @lyakh for now, as I believe this would require some firmware loader works too for the addresses.

github-actions[bot] commented 3 weeks ago

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.