zephyrproject-rtos / zephyr

Primary Git Repository for the Zephyr Project. Zephyr is a new generation, scalable, optimized, secure RTOS for multiple hardware architectures.
https://docs.zephyrproject.org
Apache License 2.0

Non-contiguous MPIDR support for AArch64 #32187

Open jharris-intel opened 3 years ago

jharris-intel commented 3 years ago

Is your enhancement proposal related to a problem? Please describe.

Zephyr assumes that core IDs are within a restricted range, 0..CONFIG_MP_NUM_CPUS-1. A number of data structures in Zephyr scale linearly in size w.r.t. CONFIG_MP_NUM_CPUS.

ARM's core ID is in the form of four affinity values, each a byte, with a given core identified by the unique tuple (aff3, aff2, aff1, aff0). Usually aff0 gives the core ID within a cluster and aff1 gives the cluster ID, although this is not guaranteed. (E.g. on a hypothetical large system with two sockets, each containing multiple clusters of ARM cores with hyperthreading, you might have aff0 be the hyperthread, aff1 the core ID, aff2 the cluster ID within the socket, and aff3 the socket ID... although I don't know of any ARM systems that do this as of yet.) The only thing "guaranteed" by the spec is that this gives some indication as to scheduling.

For instance, a CPU with two clusters of two cores each might have the following mpidr_el1 aff2/aff1/aff0 values for the various cores:

  1. 0x000000
  2. 0x000001
  3. 0x000100
  4. 0x000101

As a result, the current AArch64 code does not support multiple CPU clusters, or rather does not support non-contiguous mpidr_el1 values in general. (At least, not without setting CONFIG_MP_NUM_CPUS to a very large value and wasting a lot of memory as a result... and that has other issues.)

Describe the solution you'd like

Two components: one for the "common" case of symmetric affinities, and one override for the uncommon case.

For the common case of symmetric affinities (note: I am terrible at naming things):

  1. Add config flags CONFIG_NUM_AFF0_VALUES / CONFIG_NUM_AFF1_VALUES / CONFIG_NUM_AFF2_VALUES / CONFIG_NUM_AFF3_VALUES for AArch64 (and maybe AArch32? I honestly haven't looked too much at how AArch32 handles this).
    1. Note that the existing CONFIG_MP_NUM_CPUS on AArch64 maps to CONFIG_NUM_AFF0_VALUES, with CONFIG_NUM_AFF1_VALUES / CONFIG_NUM_AFF2_VALUES / CONFIG_NUM_AFF3_VALUES all set to 1.
  2. Change existing uses of mpidr_el1 for getting core ID to, roughly:
/* Validate each affinity field against the configured limits. */
__ASSERT_NO_MSG(mpidr_el1.aff0 < MAX(CONFIG_NUM_AFF0_VALUES, 1));
__ASSERT_NO_MSG(mpidr_el1.aff1 < MAX(CONFIG_NUM_AFF1_VALUES, 1));
__ASSERT_NO_MSG(mpidr_el1.aff2 < MAX(CONFIG_NUM_AFF2_VALUES, 1));
__ASSERT_NO_MSG(mpidr_el1.aff3 < MAX(CONFIG_NUM_AFF3_VALUES, 1));
/* Zero out unused affinity levels so const-prop can discard them. */
uint32_t mpidr_el1_aff0 = CONFIG_NUM_AFF0_VALUES <= 1 ? 0 : mpidr_el1.aff0;
uint32_t mpidr_el1_aff1 = CONFIG_NUM_AFF1_VALUES <= 1 ? 0 : mpidr_el1.aff1;
uint32_t mpidr_el1_aff2 = CONFIG_NUM_AFF2_VALUES <= 1 ? 0 : mpidr_el1.aff2;
uint32_t mpidr_el1_aff3 = CONFIG_NUM_AFF3_VALUES <= 1 ? 0 : mpidr_el1.aff3;
/* Fold the fields into a dense linear index, most-significant first. */
uint32_t aff = mpidr_el1_aff3;
aff *= MAX(CONFIG_NUM_AFF2_VALUES, 1);
aff += mpidr_el1_aff2;
aff *= MAX(CONFIG_NUM_AFF1_VALUES, 1);
aff += mpidr_el1_aff1;
aff *= MAX(CONFIG_NUM_AFF0_VALUES, 1);
aff += mpidr_el1_aff0;
__ASSERT_NO_MSG(aff < CONFIG_MP_NUM_CPUS);
return aff;

(Pardon the "cute" but rather non-obvious code in the above. It is designed to handle someone setting a CONFIG_NUM_AFFn_VALUES option to zero or one, while still producing decent code - and I know that the above doesn't properly handle undefined CONFIG_NUM_AFFn_VALUES options.)

This covers the common case, and is reasonably efficient in translating from mpidr_el1 values to linear core IDs.

In the example, for instance, you would set CONFIG_NUM_AFF0_VALUES = 2, CONFIG_NUM_AFF1_VALUES=2, and the rest 0 (or 1), which would result in (after const-prop and assuming asserts are disabled):

return mpidr_el1.aff1*2 + mpidr_el1.aff0;

...which is pretty close to optimal. (Actual optimal would be if you could replace the two byte-masks with one, but meh.)

For the uncommon case (e.g. a cluster of 4 and a cluster of 2, or more likely e.g. two clusters of 4 where 2 cores are used for other things), I suggest adding a CONFIG override indicating that the platform supplies a function to get the linear core ID.

(Honestly, it may not be worth it to add the complexity for the former so long as the latter is supported.)

Both of these would also need a change for SGIs (and GIC in general) to map back from linear core ID to affinity ID, although this is relatively straightforward (because the linear core ID is, well, linear; worst comes to worst you can just store an array of affinity IDs set up during early boot).

Describe alternatives you've considered

Use some per-cpu register to store the current linear core ID, and during startup initialize it somehow. However, I couldn't find any suitable registers. You could use VMPIDR_EL2, except that you're not guaranteed that it's even implemented. You could use TPIDR_EL0 / TPIDRRO_EL0, but other things in the system potentially use them. (And they are generally considered to be thread ID, not core ID.)

Do whatever AArch32 does? I haven't really looked too much at this.

Additional context

This is not required in the short term, and is largely dependent on #30676, but would be helpful to have upstream support for (even if it's best-effort).

I am fully willing to do the initial work for this - but I'd prefer to have someone poke at any obvious holes in the high-level approach before I spend the time to implement it :-)

nashif commented 3 years ago

@andyross FYI

andyross commented 3 years ago

For clarity: the Zephyr core ID is really nothing but an index within the _kernel.cpus[] array. There is no expectation that it correspond to any particular hardware feature. Things like x86 APIC IDs have an even more complicated architecture, and IIRC the CPU hardware ID values for the two CPUs on esp32 are essentially random-looking 32 bit numbers (though the intel_adsp parts, based on largely the same xtensa IP, use sequential indexes). The whole field is a mess, thus why we have our own IDs.

If this is needed in a low-latency way, I'd suggest adding a arch-specific substruct to the CPU record just like we do for struct thread_base (we don't have one now simply because we haven't needed one) and store the ID in that. And if there's a need for some kind of reverse lookup to map hardware ID to Zephyr index, that can best be done at the architecture level (IIRC x86_64 has something like this already).

But even then, note that the reason you'd need to track the more complicated hardware IDs is probably because you want to do something with the resulting cluster/affinity/whatever graph. And note that Zephyr has almost no support for any of that. We treat all SMP processors 100% symmetrically. We don't understand things like hyperthreading, for example. We have only the most minimal idea of interrupt affinity, our IPIs are expected to go to all CPUs at once, etc...

andyross commented 3 years ago

Oh, and upleveling a bit, I see where some of the disconnect is here. The low level code in the PR is using the (oddly sparse) MPIDR_EL1 register as an index within a CPU array, which is the bug you're looking at. But the Zephyr API it's implementing, arch_curr_cpu(), doesn't care about any of that. All it's supposed to do is return a per-CPU pointer to the CPU record in question, set up in a reasonably fast way however the arch layer wants.

So you can see that on x86, it's implemented as a record pointed to by the %GS segment descriptor. On Xtensa, it's stored in one of several different special registers (either MISC or one of the EXCSAVEn, depending on what the particular instantiation has available), etc...

So what you're really looking for is not a way to map that ID to a smaller index, but a way to stash a pointer away such that routine kernel code won't mess it up. Some quick googling (note that I am not the ARM expert on the project!) turns up a family of TPIDR_* registers ("thread pointer" I guess?) which seem to be intended for more or less exactly this purpose. Maybe those are what we want to be looking at?

jharris-intel commented 3 years ago

But even then, note that the reason you'd need to track the more complicated hardware IDs is probably because you want to do something with the resulting cluster/affinity/whatever graph.

Unfortunately, not really. It's a combination of two things:

  1. The more complicated HW ID (MPIDR) here is all that the hardware provides. So yes, a linear ID is what Zephyr expects, but this question is essentially "how do I get here from there?"
  2. The GIC (ARM's interrupt controller) uses MPIDR values for differentiating cores, as again, that's what the hardware provides. This shows up in e.g. the internals of gic_raise_sgi. As a result, I do need the reverse map available. (I could add MPIDR values within the cpu_t struct itself I suppose.)

Some quick googling (note that I am not the ARM expert on the project!) turns up a family of TPIDR_* registers ("thread pointer" I guess?) which seem to be intended for more or less exactly this purpose.

I considered that as an alternative above ("You could use TPIDR_EL0 / TPIDRRO_EL0, but other things in the system potentially use them. (And they are generally considered to be thread ID, not core ID.)"). I'm loath to use them because e.g. some thread-local-storage implementations use them for, well, a pointer to thread-local storage.

Also, this doesn't solve the problem, just moves it. TPIDR_EL0 / TPIDRRO_EL0 are really just per-core registers (one of which is always writable, and one of which is writable only from higher privilege levels) that we'd have to set up. And if I could set up TPIDR_EL0, I could use the same logic directly to get the linear core ID.

I guess the question here becomes: on the architectures where "CPU hardware ID values for the two CPUs on esp32 are essentially random-looking 32 bit numbers", how do you set up the initial linear IDs? On x86, how do you set up the segment descriptor such that every CPU gets a unique linear ID? On Xtensa, how do you set up the special registers such that every CPU gets a unique linear ID?

I suppose I could:

  1. Require that the full MPIDR affinity values are specified for every core in the device tree (not just every CPU).
  2. Do a linear scan through all device tree cores on every core in early boot, looking for the core with my affinity values.
  3. Store the resulting linear core ID in e.g. TPIDRRO_EL0 (I'd say TPIDR_EL1, but arch_is_in_isr directly calls arch_curr_cpu, which could be a problem in userspace mode).
    1. I could directly store the pointer to the cpu_t, but then mapping back to actual core ID becomes a pain. Not sure which is best to be honest.

...although I dislike the combination here of:

  1. Being very heavily tied to the exact layout of MPIDR values.
  2. Requiring a bunch of devicetree nodes.
  3. Requiring a linear scan over all cores in boot.
  4. Potentially shooting ourselves in the foot later if we want to optimize IPC such that you don't wake up all cores, only necessary cores. (If we have a pointer to cpu_t here, then suddenly mapping linear ID -> MPIDR value becomes annoying.)

arch_curr_cpu() doesn't care about any of that.

Sure, if there's a different method of getting a pointer to a unique cpu_t for each CPU I'm all ears.

andyross commented 3 years ago

I still think you're overthinking this. On x86, for example, there's a compiled-in table (probably should be in DT, but it's just a C array) of APIC IDs that gets searched linearly in arch_mp_start_cpu() (or maybe on the CPU that just got launched, I forget) to identify the CPU being started. Then it just stuffs the pointer to that CPU record into %GS on the initialization path and never touches it again. Same deal on esp32 really, though there, because there are only two CPUs, the hardware ID isn't involved and so the auxiliary CPU knows it should put &_kernel.cpus[1] in the MISC register.

As far as register choice: yes, we need one register for TLS, but just one. The read-only TPIDR looks to me like a perfectly acceptable home for the CPU pointer (we can't use the R/W one because you couldn't trust it in a syscall and would need to recover the CPU record some other way for security reasons, defeating the purpose).

zephyrbot commented 7 months ago

Hi @carlocaione,

This issue, marked as an Enhancement, was opened a while ago and did not get any traction. It was just assigned to you based on the labels. If you don't consider yourself the right person to address this issue, please re-assign it to the right person.

Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.

@jharris-intel you are also encouraged to help moving this issue forward by providing additional information and confirming this request/issue is still relevant to you.

Thanks!