xylix / dotfiles

My dotfiles

Ryzen (3950X, 3900X) MCE that appears on Linux in video calls and some other loads #45

Open xylix opened 3 years ago

xylix commented 3 years ago

(Apologies for the random texts and edit weirdness in the thread. This was originally just a tracking issue for miscellaneous configuration issues on my personal machine, but it later turned into a discussion thread on the specific MCE.)

I hid some of my comments because they were basically overclocking logs in Finnish that didn't reach any tangible conclusions. The MCE seems to look something like this:

Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: Machine check events logged
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: baa0000000010145
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d00002e IPID b000000000 
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1616494331 SOCKET 0 APIC 0 microcode 8701021

Over- and undervolting seem to affect how often the error appears. It also seems to occur mostly (or exclusively?) on Linux: multiple people report being affected while their dual-boot Windows installs run stably.
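
For anyone comparing their own logs against the lines above, here is a minimal Python sketch that pulls the CPU, bank, status value and timestamp out of such journal lines. The regular expressions are only assumptions based on the log format shown in this thread, and `parse_mce` is a hypothetical helper name, not part of any existing tool.

```python
import re
from datetime import datetime, timezone

# Assumed format, taken from the log lines in this thread, e.g.
# "... CPU 0: Machine Check: 0 Bank 0: baa0000000010145"
STATUS_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")
# The PROCESSOR line carries "TIME <seconds since the Unix epoch>"
TIME_RE = re.compile(r"\bTIME (\d+)\b")


def parse_mce(lines):
    """Collect CPU, bank, status and timestamp from the lines of one MCE record."""
    record = {}
    for line in lines:
        status_match = STATUS_RE.search(line)
        if status_match:
            record["cpu"] = int(status_match.group(1))
            record["bank"] = int(status_match.group(2))
            record["status"] = int(status_match.group(3), 16)
        time_match = TIME_RE.search(line)
        if time_match:
            seconds = int(time_match.group(1))
            record["time_utc"] = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return record


sample = [
    "Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: baa0000000010145",
    "Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1616494331 SOCKET 0 APIC 0 microcode 8701021",
]
record = parse_mce(sample)
print(record["cpu"], record["bank"], hex(record["status"]), record["time_utc"].isoformat())
```

The `TIME` field is a Unix timestamp; for the record above it converts to 2021-03-23 10:12:11 UTC.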

See bottom of thread for up-to-date info https://github.com/xylix/dotfiles/issues/45#issuecomment-1005649328

xylix commented 3 years ago

done ones:

crashes:

xylix commented 3 years ago

Crash issue:

Possible related journalctl logs:

Mar 03 16:18:12 freedom kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000002080b
Mar 03 16:18:12 freedom kernel: mce: [Hardware Error]: TSC 0 MISC d012000200000000 SYND 5d020002 IPID 1002e00000500
Mar 03 16:18:12 freedom kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1614781091 SOCKET 0 APIC 0 microcode 8701021
__common_interrupt: 7.55 No irq handler for vector

https://bbs.archlinux.org/viewtopic.php?id=256227

It can't be just Zoom; it happened in Google Meet as well.

04.03.2021 09:00: Currently testing with C-states disabled.
04.03.2021 15:00: No crashes yet.
04.03.2021 18:30: No crashes. Haven't run Docker today though. Maybe that affects it?
04.03.2021 19:50: Crash.

Mar 04 19:51:40 freedom kernel: mce: [Hardware Error]: Machine check events logged
Mar 04 19:51:40 freedom kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000002080b
Mar 04 19:51:40 freedom kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 5d020002 IPID 1002e00000500 
Mar 04 19:51:40 freedom kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1614880297 SOCKET 0 APIC 0 microcode 8701021
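
For the disabled C-states test above, it may help to confirm what the kernel itself reports. The sketch below reads the `name` and `disable` files that the Linux cpuidle sysfs interface exposes per idle state. Note that `list_cstates` is a hypothetical helper, the demo runs against a made-up directory tree with made-up state names so it works anywhere, and a BIOS-level C-state switch may or may not be reflected in these files.

```python
import tempfile
from pathlib import Path


def list_cstates(cpuidle_dir):
    """Return [(index, name, disabled)] for each state* directory found."""
    states = []
    for state_dir in Path(cpuidle_dir).glob("state[0-9]*"):
        index = int(state_dir.name[len("state"):])
        name = (state_dir / "name").read_text().strip()
        # The kernel writes "0" when the state is enabled.
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((index, name, disabled))
    return sorted(states)


# Demo on a fake tree; the state names and flags here are invented.
with tempfile.TemporaryDirectory() as tmp:
    for index, (name, disable) in enumerate([("POLL", "0"), ("C1", "0"), ("C2", "1")]):
        state_dir = Path(tmp) / f"state{index}"
        state_dir.mkdir()
        (state_dir / "name").write_text(name + "\n")
        (state_dir / "disable").write_text(disable + "\n")
    demo_states = list_cstates(tmp)

for index, name, disabled in demo_states:
    print(index, name, "disabled" if disabled else "enabled")
```

On a real machine the call would be `list_cstates("/sys/devices/system/cpu/cpu0/cpuidle")`, assuming the cpuidle driver is loaded.
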
xylix commented 3 years ago

Memtest crashes but does not find any memory errors before crashing. At the moment I would assume a CPU fault. Raising LLC seems to stabilize things slightly (before raising LLC it crashed in under a minute with prime95 small FFTs).

Before raising LLC (and maybe after as well), prime95 small FFTs produces rounding errors.

Rounding errors after the raise as well, at least in worker #17.

xylix commented 3 years ago

The minimum LLC on the motherboard is 5 (stock). Raising it to 3 or 2 did not fix the prime95 rounding errors.

xylix commented 3 years ago

Testing a 0.075 offset voltage.

No rounding errors, crash at around the 10 min mark.

Second crash at the 10 min mark.

Next: Precision Boost Overdrive disabled, still 0.075 offset voltage.

PBO disabled with 0.075 offset lasted the longest so far, about 25 minutes until the crash.

xylix commented 3 years ago

In theory it is also possible that the crashing bug and the CPU fault are separate from each other... If, for example, bad mains power has damaged components.

xylix commented 3 years ago

https://community.amd.com/t5/processors/my-games-crash-when-using-precision-boost-overdrive/td-p/168472
https://community.amd.com/t5/processors/ryzen-3000-safe-voltages-and-degradation/td-p/317194
https://www.reddit.com/r/Amd/comments/cahr5r/max_safe_all_core_voltage_for_zen_2_is_1325v/

xylix commented 3 years ago

Setting the SOC and vCore voltages manually to recommended values seems to stabilize things. LLC level 2, precision boost still disabled.

Manual voltages: 1.1 SOC, 1.3 vCore.

mPrime started at 9:50, running stably at 10:20; these runs were stopped at 10:25.

What SOC voltage is: https://www.reddit.com/r/overclocking/comments/7abqgn/overclocking_ryzen_soc_voltage/
Safe voltage and LLC limits: https://www.reddit.com/r/Amd/comments/eht7zz/ryzen_3800x_safe_voltage_and_llc/
https://www.reddit.com/r/Amd/comments/5zmg6s/maximum_safe_vcore_voltages_for_ryzen/

Motherboard VRM tier list: https://linustechtips.com/topic/1137619-motherboard-vrm-tier-list-v2-currently-amd-only/

To do: find a sensible way to monitor voltages on Linux (Google: "ryzen input voltage linux").
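
One possible way to watch voltages on Linux is the kernel's hwmon sysfs interface, where voltage inputs appear as `in*_input` files holding millivolts. The sketch below is only an illustration: `read_voltages` is a hypothetical helper, the demo uses a made-up directory tree with invented chip and label names, and whether vCore or SOC voltage shows up at all depends on which sensor driver the board needs (an assumption not verified for this machine).

```python
import tempfile
from pathlib import Path


def read_voltages(hwmon_root="/sys/class/hwmon"):
    """Return {(chip, label): volts} for every in*_input file under hwmon_root."""
    readings = {}
    for chip_dir in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("in*_input")):
            channel = input_file.name[: -len("_input")]
            label_file = chip_dir / f"{channel}_label"
            label = label_file.read_text().strip() if label_file.exists() else channel
            try:
                millivolts = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors refuse reads; skip them
            readings[(chip, label)] = millivolts / 1000.0
    return readings


# Demo on a fake tree; "fakechip" and its values are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    chip_dir = Path(tmp) / "hwmon0"
    chip_dir.mkdir()
    (chip_dir / "name").write_text("fakechip\n")
    (chip_dir / "in0_input").write_text("1300\n")
    (chip_dir / "in0_label").write_text("Vcore\n")
    (chip_dir / "in1_input").write_text("1100\n")
    demo_voltages = read_voltages(tmp)

for (chip, label), volts in demo_voltages.items():
    print(f"{chip} {label}: {volts:.3f} V")
```

Calling `read_voltages()` with no argument reads the real `/sys/class/hwmon` tree; an empty result would suggest that no loaded driver exposes voltage inputs.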

When the VID voltage is taken off Auto (necessary in order to overclock the frequency), temperatures shoot sky-high already at a value of 1.1 and everything crashes. Apparently 1.1 is a much higher value than Auto.

xylix commented 3 years ago

I tried raising the clocks at 1.3 V; it did not work. Switching the multiplier from Auto to manual, together with some new voltage setting becoming mandatory, raised the temperatures.

Now returning to the earlier settings and stability, though with core boost disabled in the BIOS (this is different from PBO).

xylix commented 3 years ago

1.3 V and LLC 2 no longer seems stable... Either I noted the earlier LLC down incorrectly or something else changed. Testing 1.325 V and LLC 2.

1.325 V with LLC 2 did not work; testing 1.3 V with LLC 1.

1.3 V with LLC 1 is not stable.

^ All of these failures happened after reinstalling the microcode... Next to test: uninstalling it again.

xylix commented 3 years ago

docs on going deeper https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/7

about power usage: https://www.igorslab.de/en/a-class-of-its-own-amd-ryzen-3950x-power-consumption-and-efficiency-analysed-and-documented-in-detail/

xylix commented 3 years ago

3600 MHz RAM, 1.3 V + level 2 vCore LLC, core boost was not stable. I will still try level 1 LLC and 2666 MHz RAM with core boost.

xylix commented 3 years ago

With level 2 LLC, 1.3 V stopped at about 1.3125; with level 1 LLC it runs at 1.35 V.

LLC 1 and 2666 MHz + core boost is prime95-stable for at least 30 min, and boosts to a single-core frequency of 4.6 GHz.

xylix commented 3 years ago

Post 16GB RAM, most things seem pretty stable at 2667 MHz with level 1 LLC and 1.3 V CPU voltage (which is pretty high for normal use and lets temps reach 80 °C again under prime95 small FFTs), but I had an idle crash during lunch today.

MCE exception in boot log (freedom is the hostname of the machine)

Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: Machine check events logged
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: baa0000000010145
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d00002e IPID b000000000 
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1616494331 SOCKET 0 APIC 0 microcode 8701021
I wonder if this is separate from the high-load crashes and related to the idle power usage problem described on the Gentoo wiki?

Mar 23 12:12:14 freedom kernel: .... node #0, CPUs: #1 #2 #3 #4 #5 #6
Mar 23 12:12:14 freedom kernel: __common_interrupt: 6.55 No irq handler for vector
Mar 23 12:12:14 freedom kernel: #7 #8 #9 #10 #11 #12 #13 #14 #15 #16
Mar 23 12:12:14 freedom kernel: __common_interrupt: 16.55 No irq handler for vector
Mar 23 12:12:14 freedom kernel: #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31

asperling commented 3 years ago

@xylix Did you gain any more insight into this? You have a very unique MCE there: search the web for it and you will find exactly 3 matches, one of which is this very issue.

Now, I happen to have the very same MCE (not once but frequently) and I tried to get more information on it. Since it's essentially an error code, I tried to get something more human-readable out of it. You could dive into the kernel implementation (https://github.com/torvalds/linux/blob/master/drivers/edac/mce_amd.c) or use a nice little Python script to give you the string representation (https://github.com/DimitriFourny/MCE-Ryzen-Decoder). Either way, it's

Bank: Load-Store Unit (LS)
Error: Store queue parity error (STQ 0x1)
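
As a rough illustration of where such a decoding starts, the sketch below splits the status value `baa0000000010145` into the architectural x86 machine-check status flags and the error code fields. It is not the linked decoder script or the kernel code. The 6-bit extended error code at bits 21:16 is an assumption based on how `mce_amd.c` handles SMCA banks, and mapping the bank to "Load-Store Unit" needs the IPID value, which is not decoded here. `decode_status` is a hypothetical helper name.

```python
# Architectural MCi_STATUS flag bits shared by x86 machine-check banks.
FLAG_BITS = {
    "valid": 63,
    "overflow": 62,
    "uncorrected": 61,
    "enabled": 60,
    "misc_valid": 59,
    "addr_valid": 58,
    "context_corrupt": 57,
}


def decode_status(status):
    """Split a 64-bit MCA status value into flags and error code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["error_code"] = status & 0xFFFF
    # Assumption: SMCA banks carry a 6-bit extended error code in bits 21:16.
    decoded["ext_error_code"] = (status >> 16) & 0x3F
    return decoded


decoded = decode_status(0xBAA0000000010145)
for name, value in decoded.items():
    print(f"{name}: {hex(value) if isinstance(value, int) and not isinstance(value, bool) else value}")
```

For this value the extended error code comes out as 0x1, which is consistent with the "STQ 0x1" in the decoded output above, and the uncorrected and context-corrupt flags are both set.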

Additionally, I had that with Ubuntu too (I'm on Manjaro now), I don't have any issues with Windows (dual boot), and I am kinda stuck. So sorry to hijack this issue, but do you have any news or insights into your specific problem?

defunctl commented 2 years ago

Also not trying to hijack your issue, xylix, but as @asperling mentioned, there are very few search results for this specific MCE.

My reboots would almost always happen on a Google Meet video call, randomly. I could have 4-5 calls no problem, and then the next one would cause a reboot with an MCE in journalctl.

Using Manjaro XFCE with Kernel 4.14, my MCE is:

 Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: Machine check events logged
 Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 0: baa0000000010145
 Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000  SYND 4d00002e IPID b000000000 
 Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1635256438 SOCKET 0 APIC 1 microcode 8701021

Hardware:

I don't want to jinx it, but one of these two things seems to have fixed it for me:

  1. In the Asus bios, setting the PBO Fmax Enhancer under Extreme Tweaker/Precision Boost Overdrive to disabled.
  2. Setting a slight negative offset for SOC Voltage. This will depend on each person's chip, but I'm pretty sure it was the PBO Fmax enhancer.

Still using my RAM's D.O.C.P. profile with PBO enabled and no more reboots or MCEs so far :crossed_fingers:

xylix commented 2 years ago

(No worries about hijacking issues... These are for my personal bookkeeping but this is a relevant topic)

I never figured out what the exact problem was. Switching RAM did not seem to help. What actually helped was undervolting the CPU. I think I did set a slight negative offset to the SOC voltage like you did, @defunctl.

Extra details I never wrote out here: the issue seemed to come up specifically on Zoom / Google Meet calls, on Arch Linux.

I have used the PC a lot on Windows and the issue never came up there.

If this is actually a high-ish google hit for the issue, I will edit it somewhat to make it more clear to anybody who lands here in the future.

justinimel commented 2 years ago

Figured I would chime in. I'm having the exact same issues on one of my machines. Not finding much detail myself. Ironically, my system is very similar in build to defunctl's. The only difference is that I use mine as a home server, run ECC memory, and have a 2060. Wondering if there is some correlation there...

defunctl commented 2 years ago

I haven't had a single freeze up/restart since I've posted. I'm pretty sure it was the Fmax enhancer.

I'll try to get an export of my BIOS in a gist if anyone thinks that would help.

defunctl commented 2 years ago

@justinimel can you post the MCE errors from your logs?

justinimel commented 2 years ago

I've since cleared the logs. But if it happens again I certainly will.

My use case is entirely different from the above, but I saw this thread and figured I would take a look here. It was the most I had to go on.

I think I've gotten some idea of what might be happening for me.

I'm actually splitting up an IOMMU group with the ACS override patch available in some builds of Linux and passing devices on the same IOMMU group up to two separate VMs. The exceptions happen occasionally when I reboot one of the VMs. Thinking the VMs are probably writing into protected memory on the other VM during the reboot/init process. It's able to do this because true isolation isn't there between the devices. This is because the ACS override patch is a software hack to let us split IOMMU groups apart that are not physically capable of doing so at the hardware level. I think this upsets the processor and brings down the system in some edge cases. But not proven yet!
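
To see how the kernel currently groups devices, the IOMMU groups can be listed from sysfs under `/sys/kernel/iommu_groups/<group>/devices/`. The sketch below is only an illustration with hypothetical helper names (`list_iommu_groups`, `shared_groups`) and a made-up demo tree with invented PCI addresses. With the ACS override patch active, this listing presumably shows the already-split groups, so it reflects the kernel's view rather than the isolation the hardware actually provides.

```python
import tempfile
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: [device addresses]} as reported under root."""
    groups = {}
    for group_dir in Path(root).glob("*"):
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return dict(sorted(groups.items()))


def shared_groups(groups):
    """Keep only groups that contain more than one device."""
    return {group: devices for group, devices in groups.items() if len(devices) > 1}


# Demo on a fake tree; the PCI addresses below are invented.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"0": ["0000:01:00.0", "0000:01:00.1"], "1": ["0000:02:00.0"]}
    for group, devices in layout.items():
        devices_dir = Path(tmp) / group / "devices"
        devices_dir.mkdir(parents=True)
        for device in devices:
            (devices_dir / device).touch()
    demo_groups = list_iommu_groups(tmp)

for group, devices in shared_groups(demo_groups).items():
    print(f"group {group}: {', '.join(devices)}")
```

Comparing the output of `list_iommu_groups()` with and without the override enabled would show which devices only appear separated because of the patch.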