Open manitofigh opened 1 month ago
Hi @manitofigh,
Thank you for reporting this issue. I will take a more detailed look this week. Just a quick question: could you try setting the `NUM_L3_SLICES` environment variable and see if the problem persists?
Thank you for the quick response!
```bash
INFO: Algorithm: default
WARN: Number of sets: 40960 is not a power of 2, please double check the number of slices
INFO: 20 L3 slices detected, does it look right?
ERROR: Failed to detect cache latencies after 5 retries
INFO: Cache latencies: L1D: 40; L2: 100; L3: 102; DRAM: 198
INFO: Cache hit thresholds: L1D: 0; L2: 0; L3: 0
INFO: Latency upper bound for interrupts: 0
ERROR: Failed to initialize cache env!
```
It's worth mentioning that Transparent Huge Pages (THP) are enabled on both the host and the VM, and set to `[always]`.
Also, the VM's vCPUs are pinned to the host's cores to reduce unexpected behavior from events like vCPU migration.
A very quick check suggests the fault likely occurred at this line: without setting the correct number of L3 slices, `cache_congruent_stride(cache)` can return a value greater than the hugepage size, resulting in `cands_per_page = 0` and thus triggering a divide-by-zero exception at line 67. I will add some extra checks to safeguard against that.
As for the case where you manually set the number of slices, the reason those thresholds are 0 is that the detected L2/L3 latencies do not pass the sanity check: they are too close. The root cause of that is very likely Issue #2.
Thank you for the insight. Your observation was correct, and I validated that `cands_per_page` was 0 right before the division. But the exception seems to be raised even when the correct number of L3 slices is hard-coded and the thresholds pass the sanity check:
```bash
ubuntu@vm1 build git:(master) ✗ ./osc-single-evset LLC -H
INFO: Algorithm: default
WARN: Number of sets: 40960 is not a power of 2, please double check the number of slices
INFO: 20 L3 slices detected, does it look right?
INFO: Cache latencies: L1D: 36; L2: 106; L3: 216; DRAM: 532
INFO: Cache hit thresholds: L1D: 64; L2: 150; L3: 342
INFO: Latency upper bound for interrupts: 2660
INFO: Test target huge page: 0
*** CANDS PER PAGE ***: 0
[1]    1680140 floating point exception (core dumped)  ./osc-single-evset LLC -H
```
I also see that the core count (which is assumed to equal the slice count) is retrieved with the `CPUID` instruction, but unfortunately such values (core, socket, etc.) are often inaccurate when observed from inside a VM. E.g., if we dedicate 16 vCPUs to a VM, the guest sees 16 sockets, each containing 1 core, hence 1 slice.
Yes, the number of cores is determined by `cpuid`, as the repo mostly targeted non-virtualized environments when it started.
I am wondering if the new crash behavior is deterministic and whether you can share some information about the crash site (e.g., the line number).
Thank you!
We found that exposing the right topology to the VM, as well as setting the cache mode to `passthrough`, resolves the division-by-zero exception, and the program manages to finish its computation. It's worth noting, though, that in most cases the threshold sanity check still fails.
Here are our XML config changes for the VM:

```xml
<cpu mode='host-passthrough' check='none' migratable='on'>
  <topology sockets='1' dies='1' cores='20' threads='1'/>
  <cache mode='passthrough'/>
</cpu>
```
As of now, these are some of the common results the program produces with `-H` passed in:

1. Failed cache latency detection:

```bash
ubuntu@vm1 build ➜ ./osc-single-evset LLC -H
INFO: Algorithm: default
INFO: 20 L3 slices detected, does it look right?
ERROR: Failed to detect cache latencies after 5 retries
INFO: Cache latencies: L1D: 32; L2: 60; L3: 208; DRAM: 194
INFO: Cache hit thresholds: L1D: 0; L2: 0; L3: 0
INFO: Latency upper bound for interrupts: 0
ERROR: Failed to initialize cache env!
```
2. Not enough candidates left after filtering:

```bash
ubuntu@vm1 build ➜ ./osc-single-evset LLC -H
INFO: Algorithm: default
INFO: 20 L3 slices detected, does it look right?
INFO: Cache latencies: L1D: 32; L2: 38; L3: 72; DRAM: 160
INFO: Cache hit thresholds: L1D: 34; L2: 51; L3: 107
INFO: Latency upper bound for interrupts: 800
INFO: Test target huge page: 0
*** CANDS PER PAGE ***: 16
INFO: Need to allocate 42 huge pages
INFO: L2 Filter Duration: 3320us
ERROR: Not enough candidates due to filtering!
```
3. `-2` confidence level:

```bash
ubuntu@vm1 build ➜ ./osc-single-evset LLC -H
INFO: Algorithm: default
INFO: 20 L3 slices detected, does it look right?
INFO: Cache latencies: L1D: 32; L2: 38; L3: 70; DRAM: 166
INFO: Cache hit thresholds: L1D: 34; L2: 50; L3: 108
INFO: Latency upper bound for interrupts: 830
INFO: Test target huge page: 10
*** CANDS PER PAGE ***: 16
INFO: Need to allocate 42 huge pages
INFO: L2 Filter Duration: 3578us
INFO: Alloc: 0us; Population: 0us; Build: 37707us; Pruning: 0us; Extension: 0us; Retries: 10; Backtracks: 187; Tests: 3314; Mem Acc.: 533766; Pos unsure: 8; Neg unsure: 0; OOH: 0; OOC: 0; NoNex: 0; Timeout: 0
Pure acc: 466702; Pure tests: 2729; Pure acc 2: 284390; Pure tests 2: 1631
INFO: Retry dist: 10: 0/1-0/37ms;
INFO: Backtrack dist: 14: 0/1-0/3ms; 15: 0/1-0/3ms; 18: 0/1-0/4ms; 20: 0/7-0/26ms;
INFO: Meet: 12; Retry: 32467us
INFO: Duration: 37.708ms; Size: 13; Candidates: 672
INFO: LLC EV Test Level: -2
INFO: SF EV Test Level: -2
```
Also, my other question is: why is there a need for L2 filtering when using a huge page? All the cache set bits are already exposed to us (e.g., bits `[5:15]` in our CPU's case, with 20 slices and 40960 sets total), since we control the first 21 bits within a 2 MiB huge page, so we could just stride over those bits.
Thank you!
Thank you for reporting these results! Indeed, the support for the VM environment needs some work.
To answer your question: There's no need to enable L2 filtering when using hugepages. I added the logic that disables L2 filtering when hugepages are used.
Thank you for your response. We actually managed to stop the program from failing in our VM after exposing the right topology to the VM and pinning its vCPUs to the same socket on the host.
However, as of now only ~20% of the computations return level-2 confidence (`EV_POS`) with `-H`; the rest tend to be `-2`.
We realized, however, that if we hardcode more accurate latencies for the different cache levels in the VM (higher than the latencies measured on the host, due to virtualization overhead), we get `EV_POS` ~50% of the time using `-H`.
Regarding my question: I was curious because the output prints the line `INFO: L2 Filter Duration: 3578us` even when `-H` is used.
Also, could you please share whether there is a reason `./osc-multi-evset` does not have huge page support?
Thank you!
Because huge pages are not the main focus of our work, I did not implement huge page support in `osc-multi-evset`. There's nothing fundamental that prevents huge pages from working in a multi-eviction-set setting.
Understandable. I was just wondering if there was/is anything restrictive, or just outside of the paper's focus. Thank you for your time and clarification!
As a follow-up on the L2 filtering question: how come the L2 filtering duration message is present in the output when using `-H`? I see that `l2_filter` is not set to `false` in the switch-case statement for `-H`. Is behavior similar to `--no-filter` expected (i.e., `l2_filter = false;`)?
Thank you for your time and clarification.
Thank you for your time and clarification.
Hi, I've disabled `l2_filter` when huge pages are used, in commit 3a237c9.
Great! Thank you very much.
Hello @zzrcxb ,
I know you are short on cycles; however, we came across this odd behavior of the program in a virtual machine, and we thought you might have a quick explanation, if possible.
When running `./osc-single-evset LLC -H`, the following occurs. As you can see, a floating point exception is raised, which does not tend to happen when the same version of the program is run on the host.
Have you encountered this before in a virtual machine?
Through some basic debugging and print statements, we narrowed this error down to this line (L111 of osc-single-evset.c).
But the stranger thing is that when we try to narrow it down further, everything up to the very last line of the `evcands_new` function (right before `return cands`) executes, yet nothing after line 111 tends to execute. Any insight into this problem would be appreciated!