Open xylix opened 3 years ago
done ones:
[/] weird audio latency https://www.google.com/search?client=firefox-b-d&q=pulseaudio+latency
[x] encrypted /home/kerkko
[x] polkadot & casperlabs browser extensions
[x] fisher cannot update itself
[x] steam linux stuff https://wiki.archlinux.org/index.php/steam#Proton_Steam-Play
[x] mx master sensitivity and button binds https://wiki.archlinux.org/index.php/Logitech_MX_Master
[x] daily man db regen slows down boot
[x] openpgp remember passphrase
[x] timeshift https://github.com/teejee2008/timeshift
crashes:
Crash issue:
Possible related journalctl logs:
Mar 03 16:18:12 freedom kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000002080b
Mar 03 16:18:12 freedom kernel: mce: [Hardware Error]: TSC 0 MISC d012000200000000 SYND 5d020002 IPID 1002e00000500
Mar 03 16:18:12 freedom kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1614781091 SOCKET 0 APIC 0 microcode 8701021
__common_interrupt: 7.55 No irq handler for vector
https://bbs.archlinux.org/viewtopic.php?id=256227
Can't be just Zoom, happened in Google Meet as well.
04.03.2021 09:00 Currently testing disabled C-states.
04.03.2021 15:00 No crashes yet.
04.03.2021 18:30 No crashes. Haven't run Docker today though. Maybe that affects it?
04.03.2021 19:50 Crash.
Mar 04 19:51:40 freedom kernel: mce: [Hardware Error]: Machine check events logged
Mar 04 19:51:40 freedom kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000002080b
Mar 04 19:51:40 freedom kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 5d020002 IPID 1002e00000500
Mar 04 19:51:40 freedom kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1614880297 SOCKET 0 APIC 0 microcode 8701021
Memtest crashes but does not find any memory errors before crashing. At the moment I would assume a CPU fault. Raising LLC seems to stabilize things a little (before raising LLC it crashed in under a minute with prime95 small FFTs).
Before raising LLC (and maybe after as well), prime95 small FFTs produce rounding errors.
Rounding errors after the raise too, at least in worker #17.
The LLC minimum on the motherboard is 5 (stock). Raising it to 3 or 2 did not fix the prime95 rounding errors.
Now trying a 0.075 offset voltage.
No rounding errors, crash at around 10 min.
Second crash at around 10 min as well.
Next: Precision Boost Overdrive disabled, still 0.075 offset voltage.
PBO disabled with 0.075 offset lasted the longest so far, about 25 minutes until the crash.
In theory it is also possible that the crashing bug and the CPU fault are separate from each other... for example, if bad wall power has damaged components.
Is the motherboard VRM just too bad? Check an AM4 power delivery tier chart.
PBO disabling and overvolting seems to stabilize, but it still does crash sometimes (especially in prime95 small FFTs)
Since there are situations where no instability appears, could it be a non-CPU part causing the issue?
Manually setting the recommended SOC and vCore voltages seems to stabilize things. LLC level 2, precision boost still disabled.
Manual voltages: 1.1 SOC, 1.3 vCore.
mPrime started at 9:50, still running stable at 10:20; these runs were stopped at 10:25.
What SOC voltage is: https://www.reddit.com/r/overclocking/comments/7abqgn/overclocking_ryzen_soc_voltage/
Safe voltage and LLC limits: https://www.reddit.com/r/Amd/comments/eht7zz/ryzen_3800x_safe_voltage_and_llc/ https://www.reddit.com/r/Amd/comments/5zmg6s/maximum_safe_vcore_voltages_for_ryzen/
mobo vrm tier list https://linustechtips.com/topic/1137619-motherboard-vrm-tier-list-v2-currently-amd-only/
Find a sensible way to monitor voltages on Linux, g: "ryzen input voltage linux"
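One possible way to watch voltages on Linux is to read the kernel's hwmon sysfs files directly. The sketch below is only an illustration: it assumes a hwmon driver for the board's sensor chip is loaded and exposes `in*_input` files (values in millivolts), which is not guaranteed on every AM4 board. The function name `read_hwmon_voltages` is made up for this example.

```python
from pathlib import Path


def read_hwmon_voltages(root="/sys/class/hwmon"):
    """Return {(chip, label): volts} for every in*_input file under root.

    Assumes the standard hwmon sysfs layout: one hwmonN directory per chip,
    a "name" file, and inN_input files holding millivolts.
    """
    readings = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for value_file in sorted(chip_dir.glob("in*_input")):
            channel = value_file.name[: -len("_input")]
            label_file = chip_dir / (channel + "_label")
            label = label_file.read_text().strip() if label_file.exists() else channel
            try:
                millivolts = int(value_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now, skip it
            readings[(chip, label)] = millivolts / 1000.0
    return readings


if __name__ == "__main__":
    for (chip, label), volts in read_hwmon_voltages().items():
        print(f"{chip:12s} {label:10s} {volts:.3f} V")
```

Running this in a loop during a prime95 run would show whether vCore sags under load, which is what the LLC setting is meant to counter. Which channel corresponds to vCore or SOC depends on the sensor chip and board, so the labels need to be checked against BIOS readings.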
When the VID voltage is taken off auto (necessary in order to overclock the frequency), temperatures shoot sky-high already at a value of 1.1 and everything crashes. Apparently 1.1 is a much higher value than AUTO.
Tried raising clocks @1.3V, did not work. Switching the multiplier from auto to manual, together with some new voltage setting that then became mandatory, raised temperatures.
Now going back to the earlier settings and stability, though with core boost disabled in the BIOS (different from PBO).
1.3V and LLC 2 no longer seems stable... Either the earlier LLC was noted down wrong or something else changed. Now trying 1.325V and LLC 2.
1.325 LLC 2 did not work, testing 1.3 LLC 1.
1.3 LLC 1 not stable.
^ all of these failures happened after the microcode reinstall... Testing a re-uninstall.
3600 MHz RAM, 1.3V + level 2 vCore LLC, core boost was not stable. Will still try level 1 LLC and 2666 RAM with core boost.
With level 2 LLC, 1.3V stopped at about 1.3125; with level 1 LLC it runs at 1.35V.
LLC 1 and 2666 MHz + core boost is at least 30 min prime95 stable, and boosts to a 4.6 GHz single-core frequency.
Post 16GB ram most things seem pretty stable with 2667 mhz, lvl 1 LLC and 1.3V cpu voltage (which is pretty high for just normal use and lets temps get up to 80C again under prime95 small FFTs), but had some idle crash during lunch today.
MCE exception in boot log (freedom is the hostname of the machine)
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: Machine check events logged
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: baa0000000010145
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d00002e IPID b000000000
Mar 23 12:12:14 freedom kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1616494331 SOCKET 0 APIC 0 microcode 8701021
Wonder if this is separate from the high-load crashes and related to the idle power usage problem described on the Gentoo wiki?
Mar 23 12:12:14 freedom kernel: .... node #0, CPUs: #1 #2 #3 #4 #5 #6
Mar 23 12:12:14 freedom kernel: __common_interrupt: 6.55 No irq handler for vector
Mar 23 12:12:14 freedom kernel: #7 #8 #9 #10 #11 #12 #13 #14 #15 #16
Mar 23 12:12:14 freedom kernel: __common_interrupt: 16.55 No irq handler for vector
Mar 23 12:12:14 freedom kernel: #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
@xylix Did you gain any more insight in this? Because you have a very unique MCE there. Search the web for it and you will find exactly 3 matches where one of which is this very issue.
Now, I happen to have the very same MCE (not once but frequently) and I tried to get more information on it. Since it's kind of an error code I tried to get something more human readable out of it... You could dive into the kernel implementation (https://github.com/torvalds/linux/blob/master/drivers/edac/mce_amd.c) or use a nice little python script to give you the string representation (https://github.com/DimitriFourny/MCE-Ryzen-Decoder) - anyways, it's
Bank: Load-Store Unit (LS)
Error: Store queue parity error (STQ 0x1)
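For anyone who wants to sanity-check the raw status value by hand, here is a minimal sketch of splitting it into fields. It is not the decoder script linked above. The bit positions are assumed to follow the definitions in Linux's `arch/x86/include/asm/mce.h`, and the extended error code is assumed to sit in bits 21:16 as the kernel's AMD decoder extracts it. The function name `decode_mca_status` is made up for this example, and mapping the extended code to a text description still needs the per-bank tables in the kernel source.

```python
# Flag bit positions, assumed from Linux's arch/x86/include/asm/mce.h
# (AMD-specific meanings for bits 55 and 53).
STATUS_FLAGS = {
    63: "Val",       # register contains a valid error
    62: "Overflow",  # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "En",        # error reporting was enabled
    59: "MiscV",     # MISC register is valid
    58: "AddrV",     # ADDR register is valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt (AMD)
    53: "SyndV",     # syndrome register is valid (AMD)
}


def decode_mca_status(status):
    """Split a 64-bit MCA status value into flags and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "ext_code": (status >> 16) & 0x3F,  # extended error code, bits 21:16
        "code": status & 0xFFFF,            # base error code, bits 15:0
    }


if __name__ == "__main__":
    print(decode_mca_status(0xBAA0000000010145))
```

For the status `baa0000000010145` from this thread, the extended error code comes out as 1, which is consistent with the "STQ 0x1" reported by the decoder script. The UC and PCC flags are set, which fits an error that takes the whole machine down.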
Additionally, I had that with Ubuntu too (I'm on Manjaro now), don't have any issues with Windows (dual boot), and am kinda stuck. So sorry to hijack this issue, but do you have any news or insights into your specific problem?
Also not trying to hijack your issue, xylix, but as @asperling mentioned, there are very few search results for this specific MCE.
My reboots would almost always happen on a Google Meet video call, randomly. I could have 4-5 calls no problem, and then the next one would cause a reboot with an MCE in journalctl.
Using Manjaro XFCE with Kernel 4.14, my MCE is:
Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: Machine check events logged
Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 0: baa0000000010145
Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d00002e IPID b000000000
Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1635256438 SOCKET 0 APIC 1 microcode 8701021
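To compare reports like the ones in this thread, it can help to pull the key fields out of the journal lines. The following is only a rough sketch that assumes the line format shown above; the function name `parse_mce_lines` and the dictionary keys are made up for this example. It treats the TIME field as a Unix timestamp, which matches the journal timestamps in the logs posted here.

```python
import re
from datetime import datetime, timezone

# Patterns matching the "mce: [Hardware Error]" lines as they appear in this thread.
BANK_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]{16})")
TIME_RE = re.compile(r"TIME (\d+)")
MICROCODE_RE = re.compile(r"microcode ([0-9a-f]+)")


def parse_mce_lines(lines):
    """Collect CPU, bank, status, time and microcode from one MCE report."""
    record = {}
    for line in lines:
        match = BANK_RE.search(line)
        if match:
            record["cpu"] = int(match.group(1))
            record["bank"] = int(match.group(2))
            record["status"] = int(match.group(3), 16)
        match = TIME_RE.search(line)
        if match:
            # TIME is treated as seconds since the Unix epoch (assumption).
            record["time"] = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        match = MICROCODE_RE.search(line)
        if match:
            record["microcode"] = match.group(1)
    return record


if __name__ == "__main__":
    sample = [
        "Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 0: baa0000000010145",
        "Oct 26 13:54:00 xxx kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1635256438 SOCKET 0 APIC 1 microcode 8701021",
    ]
    print(parse_mce_lines(sample))
```

Feeding it the output of `journalctl -k` filtered for "Hardware Error" would make it easy to see whether every crash hits the same bank and status value.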
Hardware:
I don't want to jinx it, but one of these two things seems to have fixed it for me:
Still using my ram's D.O.C.P. profile with PBO enabled and no more reboots or MCE's so far :crossed_fingers:
(No worries about hijacking issues... These are for my personal bookkeeping but this is a relevant topic)
I never figured out what the exact problem was. Switching RAM did not seem to help. What actually helped was undervolting the CPU. I think I set a slight negative offset to SOC voltage like you did, @defunctl.
Extra details I never wrote out here: The issue seemed to come up specifically on Zoom / google meet calls, on arch linux.
I have used the PC a lot on Windows and the issue never came up there.
If this is actually a high-ish google hit for the issue, I will edit it somewhat to make it more clear to anybody who lands here in the future.
Figured I would chime in. I'm having the exact same issues on one of my machines. Not finding much detail myself. Ironically my system is very similar in build to defunctl. The only difference is that I use mine as a home server and run ECC memory and have a 2060. Wondering if there is some correlation there...
I haven't had a single freeze up/restart since I've posted. I'm pretty sure it was the Fmax enhancer.
I'll try to get an export of my BIOS in a gist if anyone thinks that would help.
@justinimel can you post the MCE errors from your logs?
I've since cleared the logs. But if it happens again I certainly will.
My use case is entirely different from the above, but I saw this thread so figured I would take a look here. It was the most I had to go on.
I think I've gotten some idea of what might be happening for me.
I'm actually splitting up an IOMMU group with the ACS override patch available in some builds of Linux and passing devices on the same IOMMU group up to two separate VMs. The exceptions happen occasionally when I reboot one of the VMs. Thinking the VMs are probably writing into protected memory on the other VM during the reboot/init process. It's able to do this because true isolation isn't there between the devices. This is because the ACS override patch is a software hack to let us split IOMMU groups apart that are not physically capable of doing so at the hardware level. I think this upsets the processor and brings down the system in some edge cases. But not proven yet!
(Apologies for the random texts and edit weirdness in the thread. This was originally just a tracking issue for miscellaneous configuration issues on my personal machine, but later sort of converted to a discussion thread on the specific MCE.)
I hid some of my comments because they were basically overclocking logs in Finnish that didn't reach any tangible conclusions. The MCE seems to look something like this:
And it seems over- and undervolting may affect how often the error appears. It also seems to appear mostly (or exclusively?) on Linux, with multiple people reporting being affected but dual boot Windows working stable.
See bottom of thread for up-to-date info https://github.com/xylix/dotfiles/issues/45#issuecomment-1005649328