Open manitofigh opened 1 month ago
Hi @manitofigh,
Thank you for reporting this issue. I will take a more detailed look this week. Just a quick question: could you try setting the `NUM_L3_SLICES` environment variable and see if the problem persists?
Thank you for the quick response!
```bash
INFO: Algorithm: default
WARN: Number of sets: 40960 is not a power of 2, please double check the number of slices
INFO: 20 L3 slices detected, does it look right?
ERROR: Failed to detect cache latencies after 5 retries
INFO: Cache latencies: L1D: 40; L2: 100; L3: 102; DRAM: 198
INFO: Cache hit thresholds: L1D: 0; L2: 0; L3: 0
INFO: Latency upper bound for interrupts: 0
ERROR: Failed to initialize cache env!
```
It's worth mentioning that Transparent Huge Pages (THP) are enabled on both the host and the VM, and set to `[always]`.
Also, the VM's vCPUs are pinned to the host's cores to reduce unexpected behavior from events like vCPU migration.
A very quick check suggests the fault likely occurred at this line: without setting the correct number of L3 slices, `cache_congruent_stride(cache)` can return a value greater than the hugepage size, resulting in `cands_per_page = 0` and thus triggering a divide-by-zero exception at line 67. I will add some extra checks to safeguard against that.
As for the case where you manually set the number of slices, the reason those thresholds are 0 is that the detected L2/L3 latencies do not pass the sanity check: they are too close. The root cause of that is very likely Issue #2.
Thank you for the insight. Your observation was correct, and I validated that `cands_per_page` was 0 right before the division. But the exception seems to be raised even when the correct number of L3 slices is hard-coded and the thresholds pass the sanity check:
```bash
ubuntu@vm1 build git:(master) ✗ ./osc-single-evset LLC -H
INFO: Algorithm: default
WARN: Number of sets: 40960 is not a power of 2, please double check the number of slices
INFO: 20 L3 slices detected, does it look right?
INFO: Cache latencies: L1D: 36; L2: 106; L3: 216; DRAM: 532
INFO: Cache hit thresholds: L1D: 64; L2: 150; L3: 342
INFO: Latency upper bound for interrupts: 2660
INFO: Test target huge page: 0
*** CANDS PER PAGE ***: 0
[1]    1680140 floating point exception (core dumped)  ./osc-single-evset LLC -H
```
I also see that the core count (which is assumed to equal the slice count) is retrieved with the `CPUID` instruction, but unfortunately such values (core, socket, etc.) are often inaccurate when observed from inside a VM. E.g., if we dedicate 16 vCPUs to a VM, the guest sees 16 sockets, each containing 1 core, hence 1 slice.
Yes, the number of cores is determined by `cpuid`, as the repo mostly targeted non-virtualized environments when it started.
I am wondering if the new crash behavior is deterministic and whether you can share some information about the crash site (e.g., the line number).
Thank you!
We found that exposing the right topology to the VM, as well as setting the cache mode to `passthrough`, resolves the division-by-zero exception, and the program manages to finish its computation. It's worth noting, though, that in most cases the threshold sanity check still fails.
Here are our XML config changes for the VM:

```xml
<cpu mode='host-passthrough' check='none' migratable='on'>
  <topology sockets='1' dies='1' cores='20' threads='1'/>
  <cache mode='passthrough'/>
</cpu>
```
As of now, these are some of the common results the program produces with `-H` passed in:

1. Failed cache latency detection:

```bash
ubuntu@vm1 build ➜ ./osc-single-evset LLC -H
INFO: Algorithm: default
INFO: 20 L3 slices detected, does it look right?
ERROR: Failed to detect cache latencies after 5 retries
INFO: Cache latencies: L1D: 32; L2: 60; L3: 208; DRAM: 194
INFO: Cache hit thresholds: L1D: 0; L2: 0; L3: 0
INFO: Latency upper bound for interrupts: 0
ERROR: Failed to initialize cache env!
```
2. Not enough candidates left after filtering:

```bash
ubuntu@vm1 build ➜ ./osc-single-evset LLC -H
INFO: Algorithm: default
INFO: 20 L3 slices detected, does it look right?
INFO: Cache latencies: L1D: 32; L2: 38; L3: 72; DRAM: 160
INFO: Cache hit thresholds: L1D: 34; L2: 51; L3: 107
INFO: Latency upper bound for interrupts: 800
INFO: Test target huge page: 0
*** CANDS PER PAGE ***: 16
INFO: Need to allocate 42 huge pages
INFO: L2 Filter Duration: 3320us
ERROR: Not enough candidates due to filtering!
```
3. `-2` confidence level:

```bash
ubuntu@vm1 build ➜ ./osc-single-evset LLC -H
INFO: Algorithm: default
INFO: 20 L3 slices detected, does it look right?
INFO: Cache latencies: L1D: 32; L2: 38; L3: 70; DRAM: 166
INFO: Cache hit thresholds: L1D: 34; L2: 50; L3: 108
INFO: Latency upper bound for interrupts: 830
INFO: Test target huge page: 10
*** CANDS PER PAGE ***: 16
INFO: Need to allocate 42 huge pages
INFO: L2 Filter Duration: 3578us
INFO: Alloc: 0us; Population: 0us; Build: 37707us; Pruning: 0us; Extension: 0us; Retries: 10; Backtracks: 187; Tests: 3314; Mem Acc.: 533766; Pos unsure: 8; Neg unsure: 0; OOH: 0; OOC: 0; NoNex: 0; Timeout: 0
Pure acc: 466702; Pure tests: 2729; Pure acc 2: 284390; Pure tests 2: 1631
INFO: Retry dist: 10: 0/1-0/37ms;
INFO: Backtrack dist: 14: 0/1-0/3ms; 15: 0/1-0/3ms; 18: 0/1-0/4ms; 20: 0/7-0/26ms;
INFO: Meet: 12; Retry: 32467us
INFO: Duration: 37.708ms; Size: 13; Candidates: 672
INFO: LLC EV Test Level: -2
INFO: SF EV Test Level: -2
```
Also, my other question is: why is there a need for L2 filtering when using a huge page? All the cache set bits are already exposed to us (e.g., bits `[5:15]` in our CPU's case, with 20 slices and 40960 sets total), since we control the first 21 bits within a 2 MiB huge page, so we could just stride over those bits.
Thank you!
Thank you for reporting these results! Indeed, the support for the VM environment needs some work.
To answer your question: There's no need to enable L2 filtering when using hugepages. I added the logic that disables L2 filtering when hugepages are used.
Thank you for your response. We actually managed to stop the program from failing in our VM after exposing the right topology to the VM and pinning its vCPUs to the same socket on the host.
However, as of now only ~20% of the computations return level-2 confidence (`EV_POS`) with `-H`; the rest tend to be `-2`.
We realized, however, that if we hardcode more accurate latencies for the different cache levels in the VM (higher than the latencies measured on the host, due to virtualization overhead), we get `EV_POS` ~50% of the time using `-H`.
Regarding my question: I was curious because the output prints the line `INFO: L2 Filter Duration: 3578us` even when `-H` is used.
Also, could you please share whether there is a reason `./osc-multi-evset` does not have huge page support?
Thank you!
Because huge pages are not the main focus of our work, I did not implement huge page support in `osc-multi-evset`. There's nothing fundamental that prevents huge pages from working in a multi-eviction-set setting.
Understandable. I was just wondering if there was/is anything restrictive, or just outside of the paper's focus. Thank you for your time and clarification!
As a follow-up on the L2 filtering question: how come the L2 filtering duration message is present in the output when using `-H`? I see that `l2_filter` is not set to `false` in the switch-case statement for `-H`. Is behavior similar to `--no-filter` expected (i.e., `l2_filter = false;`)?
Thank you for your time and clarification.
Thank you for your time and clarification.
Hi, I've disabled `l2_filter` when huge pages are used, in commit 3a237c9.
Great! Thank you very much.
Hello @zzrcxb ,
I know you are short on cycles; however, we came across this odd behavior of the program in a virtual machine, and we thought you might have a quick explanation, if possible.
When running `./osc-single-evset LLC -H`, the following occurs. As you can see, a floating point exception is raised, which does not tend to happen when the same version of the program is run on the host.
Have you encountered this before in a virtual machine?
Through some basic debugging and print statements, we narrowed this error down to this line (L111 of osc-single-evset.c).
But the stranger thing is that when we try to narrow it down further, everything up to the very last line of the `evcands_new` function (right before `return cands`) executes, yet nothing after line 111 tends to execute. Any insight into this problem would be appreciated!