xcp-ng / xcp

Entry point for issues and wiki. Also contains some scripts and sources.
https://xcp-ng.org

No longer able to load coretemp.ko module in XCP-NG 8.0 #249

Open RMerl opened 5 years ago

RMerl commented 5 years ago

With XCP-NG 7.x, I was using lm-sensors to monitor the CPU temperature of my Qotom server (running on an Intel i5-5250U CPU). This was provided through the coretemp kernel module.

After upgrading to 8.0, I can no longer load that module. It returns "No such device", as if it wasn't detecting the CPU.

[02:11 xcphost /]#  modprobe coretemp
modprobe: ERROR: could not insert 'coretemp': No such device

Output from /proc/cpuinfo:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
stepping    : 4
microcode   : 0x2d
cpu MHz     : 1596.366
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt
bugs        : null_seg cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 3192.61
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
stepping    : 4
microcode   : 0x2d
cpu MHz     : 1596.366
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt
bugs        : null_seg cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 3192.61
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
stepping    : 4
microcode   : 0x2d
cpu MHz     : 1596.366
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt
bugs        : null_seg cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 3192.61
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
stepping    : 4
microcode   : 0x2d
cpu MHz     : 1596.366
cache size  : 3072 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt
bugs        : null_seg cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips    : 3192.61
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

stormi commented 5 years ago

I can confirm that on my test pool in 8.0. The module is present but won't load, failing with the "No such device" error. I don't know whether this means that my CPUs are not supported or that there is a more general issue.

Maybe @rushikeshjadhav has an idea?

RMerl commented 5 years ago

Since it worked for me with 7.x, and I couldn't see any commit in kernel.org's commit log that removed support for older devices, I suspect it might be a regression. It could be a missing kernel option, for example. I had a quick glance at /proc/config.gz, and nothing in particular stood out to me.

rushikeshjadhav commented 5 years ago

I see that the module is present at /lib/modules/4.19.0+1/kernel/drivers/hwmon/coretemp.ko. I get the same error on my nested XCP-NG, but that could be valid there, as no CPU temperature monitors are passed through.

Need to check if there is any BIOS setting that enables this.

If it works on a factory installation of XCP-NG 7.6 and doesn't on XCP-NG 8.0 (same BIOS settings) then we can take a deeper look.

RMerl commented 5 years ago

I don't remember if I ever tried it with 7.6, but it was definitely working correctly with 7.4 without any BIOS change.

TurtleFX commented 4 years ago

Hi, Is there any update on this issue?

I have a similar PC/server to @RMerl's (Qotom Q355G4 based on the i5-5250U). After the upgrade to xcp-ng 8.0, lm-sensors no longer detects any sensors. It was working on xcp-ng 7.6; I checked it before the upgrade.

modprobe results in the same error:

[22:56 XCP ~]# modprobe coretemp
modprobe: ERROR: could not insert 'coretemp': No such device

olivierlambert commented 4 years ago

@rushikeshjadhav any idea what we could do in this case? Test a more recent kernel?

rushikeshjadhav commented 4 years ago

@olivierlambert I think we might have to backport coretemp from an earlier working kernel to this one and test. @TurtleFX can you bear with us for testing and help in fixing this issue?

TurtleFX commented 4 years ago

@rushikeshjadhav ok, I can try to help in testing.

rushikeshjadhav commented 4 years ago

So it seems there is not much change in coretemp.ko itself from previous versions. Can you share the output of # dmidecode -t processor?

RMerl commented 4 years ago

# dmidecode -t processor
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.

Handle 0x0041, DMI type 4, 42 bytes
Processor Information
    Socket Designation: SOCKET 0
    Type: Central Processor
    Family: Core i5
    Manufacturer: Intel(R) Corporation
    ID: D4 06 03 00 FF FB EB BF
    Signature: Type 0, Family 6, Model 61, Stepping 4
    Flags:
        FPU (Floating-point unit on-chip)
        VME (Virtual mode extension)
        DE (Debugging extension)
        PSE (Page size extension)
        TSC (Time stamp counter)
        MSR (Model specific registers)
        PAE (Physical address extension)
        MCE (Machine check exception)
        CX8 (CMPXCHG8 instruction supported)
        APIC (On-chip APIC hardware supported)
        SEP (Fast system call)
        MTRR (Memory type range registers)
        PGE (Page global enable)
        MCA (Machine check architecture)
        CMOV (Conditional move instruction supported)
        PAT (Page attribute table)
        PSE-36 (36-bit page size extension)
        CLFSH (CLFLUSH instruction supported)
        DS (Debug store)
        ACPI (ACPI supported)
        MMX (MMX technology supported)
        FXSR (FXSAVE and FXSTOR instructions supported)
        SSE (Streaming SIMD extensions)
        SSE2 (Streaming SIMD extensions 2)
        SS (Self-snoop)
        HTT (Multi-threading)
        TM (Thermal monitor supported)
        PBE (Pending break enabled)
    Version: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
    Voltage: 0.9 V
    External Clock: 100 MHz
    Max Speed: 1600 MHz
    Current Speed: 2500 MHz
    Status: Populated, Enabled
    Upgrade: Socket BGA1168
    L1 Cache Handle: 0x003E
    L2 Cache Handle: 0x003F
    L3 Cache Handle: 0x0040
    Serial Number: NULL
    Asset Tag: To Be Filled By O.E.M
    Part Number: To Be Filled By O.E.M
    Core Count: 2
    Core Enabled: 2
    Thread Count: 4
    Characteristics:
        64-bit capable

I checked the kernel.org commit history back when I was looking into this, and indeed saw very few changes to that code. I wonder if it might be a missing or misconfigured kernel config option? It would be worth comparing config.gz between 7.6 and 8.0.

rushikeshjadhav commented 4 years ago

Also, please try the following kernel module on your XCP-NG 8 host to read the CPU capability:

# wget https://gist.github.com/rushikeshjadhav/220fbe8ea68cef32bfe2a7a6ea99000d/raw/f6043688fd0b67dbeb88b8fa7e22e60acb463522/eax.ko
# insmod eax.ko
# dmesg | tail
# rmmod eax

RMerl commented 4 years ago

[253233.175185] Hi!
[253233.175186] No such device
[253233.175187] vendor  : 71
[253233.175191] cpuid_eax   : 0

RMerl commented 4 years ago

Could it be something as simple as a missing /dev node? I haven't studied the coretemp.ko code, so I don't know how it interfaces with the system.

rushikeshjadhav commented 4 years ago

So even though your CPU has TM (Thermal monitor supported), cpuid_eax (CPUID.06H:EAX.[7]) is returning 0. On a system where it's working, the output is cpuid_eax : 128.

What's your output for # sensors-detect?

RMerl commented 4 years ago

# sensors-detect
# sensors-detect revision 3.4.0-6 (2016-06-01)
# Board: INTEL Corporation Q3XXG4-P
# Kernel: 4.19.0+1 x86_64
# Processor: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz (6/61/4)

This program will help you determine which kernel modules you need
to load to use lm_sensors most effectively. It is generally safe
and recommended to accept the default answers to all questions,
unless you know what you're doing.

Some south bridges, CPUs or memory controllers contain embedded sensors.
Do you want to scan for them? This is totally safe. (YES/no): 
Module cpuid loaded successfully.
Silicon Integrated Systems SIS5595...                       No
VIA VT82C686 Integrated Sensors...                          No
VIA VT8231 Integrated Sensors...                            No
AMD K8 thermal sensors...                                   No
AMD Family 10h thermal sensors...                           No
AMD Family 11h thermal sensors...                           No
AMD Family 12h and 14h thermal sensors...                   No
AMD Family 15h thermal sensors...                           No
AMD Family 16h thermal sensors...                           No
AMD Family 17h thermal sensors...                           No
AMD Family 15h power sensors...                             No
AMD Family 16h power sensors...                             No
Intel digital thermal sensor...                             No
Intel AMB FB-DIMM thermal sensor...                         No
Intel 5500/5520/X58 thermal sensor...                       No
VIA C7 thermal sensor...                                    No
VIA Nano thermal sensor...                                  No

Some Super I/O chips contain embedded sensors. We have to write to
standard I/O ports to probe them. This is usually safe.
Do you want to scan for Super I/O sensors? (YES/no): 
Probing for Super-I/O at 0x2e/0x2f
Trying family `National Semiconductor/ITE'...               No
Trying family `SMSC'...                                     No
Trying family `VIA/Winbond/Nuvoton/Fintek'...               No
Trying family `ITE'...                                      Yes
Found unknown chip with ID 0x8785
Probing for Super-I/O at 0x4e/0x4f
Trying family `National Semiconductor/ITE'...               No
Trying family `SMSC'...                                     No
Trying family `VIA/Winbond/Nuvoton/Fintek'...               No
Trying family `ITE'...                                      No

Some systems (mainly servers) implement IPMI, a set of common interfaces
through which system health data may be retrieved, amongst other things.
We first try to get the information from SMBIOS. If we don't find it
there, we have to read from arbitrary I/O ports to probe for such
interfaces. This is normally safe. Do you want to scan for IPMI
interfaces? (YES/no): 
Probing for `IPMI BMC KCS' at 0xca0...                      No
Probing for `IPMI BMC SMIC' at 0xca8...                     No

Some hardware monitoring chips are accessible through the ISA I/O ports.
We have to write to arbitrary I/O ports to probe them. This is usually
safe though. Yes, you do have ISA I/O ports even if you do not have any
ISA slots! Do you want to scan the ISA I/O ports? (YES/no): 
Probing for `National Semiconductor LM78' at 0x290...       No
Probing for `National Semiconductor LM79' at 0x290...       No
Probing for `Winbond W83781D' at 0x290...                   No
Probing for `Winbond W83782D' at 0x290...                   No

Lastly, we can probe the I2C/SMBus adapters for connected hardware
monitoring devices. This is the most risky part, and while it works
reasonably well on most systems, it has been reported to cause trouble
on some systems.
Do you want to probe the I2C/SMBus adapters now? (YES/no): 
Found unknown SMBus adapter 8086:9ca2 at 0000:00:1f.3.
Sorry, no supported PCI bus adapters found.
Module i2c-dev loaded successfully.

Next adapter: SMBus I801 adapter at f040 (i2c-0)
Do you want to scan it? (YES/no/selectively): 
Client found at address 0x50
Probing for `Analog Devices ADM1033'...                     No
Probing for `Analog Devices ADM1034'...                     No
Probing for `SPD EEPROM'...                                 Yes
    (confidence 8, not a hardware monitoring chip)
Probing for `EDID EEPROM'...                                No

Sorry, no sensors were detected.
Either your system has no sensors, or they are not supported, or
they are connected to an I2C or SMBus adapter that is not
supported. If you find out what chips are on your board, check
http://www.lm-sensors.org/wiki/Devices for driver status.

rushikeshjadhav commented 4 years ago

It should show something similar to the following:

Intel digital thermal sensor...                             Success!
    (driver `coretemp')

Will check more.

RMerl commented 4 years ago

I wish I could run 7.6 to do the same tests for an A/B comparison against 8.0. Unfortunately, this server is in production, so that's not really possible in my case.

Let me know if you need any further tests to be done on my current server. We could possibly try inserting a module version with increased debug logging, for instance to see which actual function is returning the No such device error message.

rushikeshjadhav commented 4 years ago

"for instance to see which actual function is returning the No such device error message"

I did code that in eax.ko

What is the output of # cat /proc/cpuinfo | grep flags?

RMerl commented 4 years ago

See the first post for the complete cpuinfo output.

TurtleFX commented 4 years ago

On my machine I get exactly the same results from # dmidecode -t processor, eax.ko and sensors-detect as @RMerl.

My cpuflags:

# cat /proc/cpuinfo | grep flags
flags           : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt
flags           : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt
flags           : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt
flags           : fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt

rushikeshjadhav commented 4 years ago

The flags should include dtherm, which is the CPU flag that DTS/coretemp relies on to read the temperature.
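
As a rough illustration of that check, here is a minimal Python sketch that tests whether a given flag appears in a flags line copied from /proc/cpuinfo. The helper name has_cpu_flag and the shortened sample line are assumptions made for the example; only the dtherm flag name comes from the discussion above.

```python
def has_cpu_flag(flags_line, name):
    """Return True if `name` is one of the flags in a /proc/cpuinfo flags line.

    has_cpu_flag is a hypothetical helper written for this example.
    """
    # Drop the leading "flags :" label if it is there, then split on whitespace.
    _, _, value = flags_line.rpartition(":")
    return name in value.split()


# Shortened flags line in the style of the dom0 output posted in this thread.
sample = "flags           : fpu de tsc msr pae mce cx8 apic sep hypervisor lahf_lm abm cpuid_fault ssbd"

print(has_cpu_flag(sample, "dtherm"))      # False: the flag is not exposed
print(has_cpu_flag(sample, "hypervisor"))  # True
```

On a live host the same function could be fed each flags line read from /proc/cpuinfo.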

@TurtleFX @RMerl did either of you see thermal information in the BIOS, e.g. direct temperatures or some temperature-related settings?

RMerl commented 4 years ago

I can't remember if I did; it's been a while since I've checked (this is a headless server). I only remember that lm-sensors was working with XCP-NG 7.4.

I can't reboot the server right now, but I could check it later tonight, unless @TurtleFX remembers if he's seen that information on his own end.

TurtleFX commented 4 years ago

I can see temperature in the BIOS: Qotom_Temperature

rushikeshjadhav commented 4 years ago

OK, found something interesting: cpuid_fault, which usually occurs on CPU0 and means that Intel CPUID faulting is supported. I will look for more information on why cpuid would fault.

@olivierlambert it could be on XenServer 8 as well. It seems this needs to be logged in the XS bug system.

Edit: This is a new feature added to Intel processors and supported above 4.15. Ref: http://xenbits.xenproject.org/docs/xtf/test-cpuid-faulting.html

rushikeshjadhav commented 4 years ago

@RMerl @TurtleFX please install cpuid with # yum install cpuid --enablerepo=base and share the output of # cpuid.

RMerl commented 4 years ago

Here you go.

rmerl-cpuid-output.txt

TurtleFX commented 4 years ago

Here is my cpuid output: turtletx-cpuid-output.txt

rushikeshjadhav commented 4 years ago

@RMerl @TurtleFX Please try this new module with more verbose information.


# wget https://gist.github.com/rushikeshjadhav/220fbe8ea68cef32bfe2a7a6ea99000d/raw/5358bc1a1de826b4df5c6864b1481d5f26eb6844/eax2.ko
# insmod eax2.ko
# dmesg | tail -n 20
# rmmod eax2

Edit: Updated eax2.ko

TurtleFX commented 4 years ago

Here are the dmesg results with eax2.ko:

[75441.485233] Hi!
[75441.485235] m->vendor         0
[75441.485236] m->family         0
[75441.485236] m->model  0
[75441.485237] m->feature        448
[75441.485237] c->vendor        : 0
[75441.485238] c->family        : 6
[75441.485239] c->model  61
[75441.485239] c->vendor id     : GenuineIntel
[75441.485240] model name       : Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
[75441.485242] cpuid_eax 0      : 0
[75441.485244] cpuid_eax 6      : 0
[75441.485245] X86_VENDOR_ANY 65535 X86_FAMILY_ANY 0 X86_MODEL_ANY 0 X86_FEATURE_ANY 0 X86_FEATURE_DTHERM 448
[75441.485246] Pass 4
[75441.485246] No X86_FEATURE_DTHERM
[75441.485246] Has X86_FEATURE_FPU
[75441.485247] No Suitable Device for Coretemp

rushikeshjadhav commented 4 years ago

Are you using a pool, and were these fresh installs or upgrades? Can you fetch the output of # xe host-cpu-info and # xl dmesg?

@stormi can you try the same module on your test pool where coretemp did not work, and also share the output of # xe host-cpu-info and # xl dmesg?

TurtleFX commented 4 years ago

In my case, in the beginning it was a single host XenServer (hosting home router and not so important VMs), then it was upgraded to XCP-ng 7.4 or 7.5 (after XenServer licensing/features debacle, don't remember exact version), afterwards it was upgraded to XCP-ng 7.6 and 8.0.

Here is the output of # xe host-cpu-info:

# xe host-cpu-info
cpu_count       : 4
    socket_count: 1
          vendor: GenuineIntel
           speed: 1596.232
       modelname: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
          family: 6
           model: 61
        stepping: 4
           flags: fpu de tsc msr pae mce cx8 apic sep mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms rdseed adx xsaveopt
        features: 7ffafbbf-bfebfbff-00000121-2c100800
     features_pv: 1fc9cbf5-f6f83203-2191cbf5-00000123-00000001-000c0329-00000000-00000000-00001000-8c000400-00000000-00000000-00000000-00000000
    features_hvm: 1fcbfbff-f7fa3223-2d93fbff-00000523-00000001-001c07ab-00000000-00000000-00001000-9c000400-00000000-00000000-00000000-00000000

And the output of # xl dmesg is in the text file: TurtleFx-xl-dmesg.txt

rushikeshjadhav commented 4 years ago

I'm looking for the dtherm CPU flag, which is masked in some cases. Can you check whether your motherboard can export temperature information via IPMI?

# yum install freeipmi --enablerepo=base
# ipmi-locate
# ipmi-sensors

Edit: Ignore this if you have the following in sensors-detect:

Probing for `IPMI BMC KCS' at 0xca0...                      No
Probing for `IPMI BMC SMIC' at 0xca8...                     No

TurtleFX commented 4 years ago

This machine does not have IPMI and it has those two lines in sensors-detect.

rushikeshjadhav commented 4 years ago

Please install cpuid with # yum install cpuid --enablerepo=base and share the output of # cpuid -r -1.

Essentially, something like the following comes up:

# cpuid -r -1
Disclaimer: cpuid may not support decoding of all cpuid registers.
CPU:
   0x00000000 0x00: eax=0x0000000d ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x00000001 0x00: eax=0x000306c3 ebx=0x03100800 ecx=0x7ffafbff edx=0xbfebfbff
   0x00000002 0x00: eax=0x76036301 ebx=0x00f0b5ff ecx=0x00000000 edx=0x00c10000
   0x00000003 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000004 0x00: eax=0x1c004121 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000
   0x00000004 0x01: eax=0x1c004122 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000
   0x00000004 0x02: eax=0x1c004143 ebx=0x01c0003f ecx=0x000001ff edx=0x00000000
   0x00000004 0x03: eax=0x1c03c163 ebx=0x03c0003f ecx=0x00001fff edx=0x00000006
   0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x00042120
   0x00000006 0x00: eax=0x00000077 ebx=0x00000002 ecx=0x00000009 edx=0x00000000
   0x00000007 0x00: eax=0x00000000 ebx=0x000027ab ecx=0x00000000 edx=0x9c000400

The important line here is 0x00000006 0x00: eax=0x00000077, which indicates that a thermal sensor is present in the CPU. Ref: https://www.felixcloutier.com/x86/cpuid

In XCP-NG 8, it seems the kernel module is not able to read the correct value of the eax register:

[91155.555157] cpuid(0x06)  : eax:0 ebx:0 ecx:0 edx:0

Whereas in XCP-NG 7.x:

[716094.720401] cpuid(0x06)     : eax:7 ebx:1 ecx:8 edx:0

It could have been caused by the recent kernel-level CPU mitigations, but setting mitigations=off for the Dom0 kernel has no effect.

There is a difference between how kernel 4.4 (XCP-NG 7.x) and kernel 4.19 (XCP-NG 8) read and interpret the EAX register.
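
To make the three EAX values quoted above easier to compare, here is a small Python sketch that decodes CPUID leaf 0x06 EAX into feature names. The bit assignments follow the CPUID reference linked above (bit 0 is the digital temperature sensor, bit 6 is package thermal management); the table and function names are assumptions made for this example, not code from eax.ko or coretemp.

```python
# Bit positions in CPUID.06H:EAX, taken from the CPUID instruction reference.
# Only bits 0..7 are listed; bit 3 is reserved.
LEAF6_EAX_BITS = {
    0: "digital temperature sensor",
    1: "Intel Turbo Boost",
    2: "ARAT (always-running APIC timer)",
    4: "PLN (power limit notification)",
    5: "ECMD (clock modulation duty cycle extension)",
    6: "PTM (package thermal management)",
    7: "HWP",
}


def decode_leaf6_eax(eax):
    """Return the names of the known features whose bit is set in `eax`."""
    return [name for bit, name in sorted(LEAF6_EAX_BITS.items()) if eax & (1 << bit)]


# Values quoted in this thread.
for label, eax in (
    ("raw cpuid tool", 0x77),
    ("kernel module on XCP-NG 8", 0x0),
    ("kernel module on XCP-NG 7.x", 0x7),
):
    print(label, "->", decode_leaf6_eax(eax))
```

With these inputs, 0x77 and 0x7 both have bit 0 set, while the value seen by the kernel module on XCP-NG 8 has no thermal bits at all.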

RMerl commented 4 years ago

# cpuid -r -1
Disclaimer: cpuid may not support decoding of all cpuid registers.
CPU:
   0x00000000 0x00: eax=0x00000014 ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
   0x00000001 0x00: eax=0x000306d4 ebx=0x01100800 ecx=0x7ffafbbf edx=0xbfebfbff
   0x00000002 0x00: eax=0x76036301 ebx=0x00f0b5ff ecx=0x00000000 edx=0x00c30000
   0x00000003 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000004 0x00: eax=0x1c004121 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000
   0x00000004 0x01: eax=0x1c004122 ebx=0x01c0003f ecx=0x0000003f edx=0x00000000
   0x00000004 0x02: eax=0x1c004143 ebx=0x01c0003f ecx=0x000001ff edx=0x00000000
   0x00000004 0x03: eax=0x1c03c163 ebx=0x02c0003f ecx=0x00000fff edx=0x00000006
   0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x11142120
   0x00000006 0x00: eax=0x00000077 ebx=0x00000002 ecx=0x00000009 edx=0x00000000
   0x00000007 0x00: eax=0x00000000 ebx=0x021c27ab ecx=0x00000000 edx=0x9c000400
   0x00000008 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000009 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x0000000a 0x00: eax=0x07300403 ebx=0x00000000 ecx=0x00000000 edx=0x00000603
   0x0000000b 0x00: eax=0x00000001 ebx=0x00000002 ecx=0x00000100 edx=0x00000001
   0x0000000b 0x01: eax=0x00000004 ebx=0x00000004 ecx=0x00000201 edx=0x00000001
   0x0000000c 0x00: eax=0x00000000 ebx=0x00000001 ecx=0x00000001 edx=0x00000000
   0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
   0x0000000d 0x01: eax=0x00000001 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
   0x0000000e 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x0000000f 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000010 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000011 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000012 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000013 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x00000014 0x00: eax=0x00000000 ebx=0x00000001 ecx=0x00000001 edx=0x00000000
   0x80000000 0x00: eax=0x80000008 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000121 edx=0x2c100800
   0x80000002 0x00: eax=0x65746e49 ebx=0x2952286c ecx=0x726f4320 edx=0x4d542865
   0x80000003 0x00: eax=0x35692029 ebx=0x3532352d ecx=0x43205530 edx=0x40205550
   0x80000004 0x00: eax=0x362e3120 ebx=0x7a484730 ecx=0x00000000 edx=0x00000000
   0x80000005 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80000006 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x01006040 edx=0x00000000
   0x80000007 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000100
   0x80000008 0x00: eax=0x00003027 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
   0x80860000 0x00: eax=0x00000000 ebx=0x00000001 ecx=0x00000001 edx=0x00000000
   0xc0000000 0x00: eax=0x00000000 ebx=0x00000001 ecx=0x00000001 edx=0x00000000
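
For anyone comparing several of these dumps, the following Python sketch pulls the registers out of text in the layout shown above. It assumes exactly that line layout (other cpuid versions may print something different), and the function name parse_raw_cpuid is made up for the example.

```python
import re

# One line of the raw dump, e.g.
#    0x00000006 0x00: eax=0x00000077 ebx=0x00000002 ecx=0x00000009 edx=0x00000000
LINE_RE = re.compile(
    r"^\s*0x([0-9a-f]{8})\s+0x([0-9a-f]{2}):\s+"
    r"eax=0x([0-9a-f]{8})\s+ebx=0x([0-9a-f]{8})\s+"
    r"ecx=0x([0-9a-f]{8})\s+edx=0x([0-9a-f]{8})\s*$"
)


def parse_raw_cpuid(text):
    """Map (leaf, subleaf) to a dict of register values."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skips the disclaimer and the "CPU:" header
        leaf, sub, eax, ebx, ecx, edx = (int(g, 16) for g in match.groups())
        result[(leaf, sub)] = {"eax": eax, "ebx": ebx, "ecx": ecx, "edx": edx}
    return result


# Two lines copied from the dump above.
sample = """\
CPU:
   0x00000005 0x00: eax=0x00000040 ebx=0x00000040 ecx=0x00000003 edx=0x11142120
   0x00000006 0x00: eax=0x00000077 ebx=0x00000002 ecx=0x00000009 edx=0x00000000
"""

regs = parse_raw_cpuid(sample)
print(hex(regs[(6, 0)]["eax"]))       # 0x77
print(bool(regs[(6, 0)]["eax"] & 1))  # True: bit 0 is set in the raw dump
```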

RMerl commented 4 years ago

lm-sensors 3.5.0 mentions a fix related to kernel 4.19 detection, could this provide a hint, if the actual commit is tracked down?

https://github.com/lm-sensors/lm-sensors/blob/master/CHANGES

rushikeshjadhav commented 4 years ago

I checked and tried the new version. However, it did not change the output for me. I'm stumped by the difference between the raw cpuid output and the output of the kernel call cpuid_eax(0x06). Ref: https://elixir.bootlin.com/linux/v4.19/source/arch/x86/include/asm/processor.h#L626

rushikeshjadhav commented 4 years ago

@RMerl @TurtleFX Please get these kernel modules and execute the commands below. I got these working on a host that was not reporting temperature earlier.

# cd /tmp/
# wget "https://gist.github.com/rushikeshjadhav/ef60707111b7b0fefe32c0c0e22effeb/raw/530e4f50eee20760f7fbab411c3287e6ea819b35/coretemp.ko"
# wget "https://gist.github.com/rushikeshjadhav/ef60707111b7b0fefe32c0c0e22effeb/raw/530e4f50eee20760f7fbab411c3287e6ea819b35/cpuid.ko"
# rmmod cpuid ; insmod cpuid.ko
# rmmod coretemp ; insmod coretemp.ko
# sensors-detect
# sensors
# lsmod | egrep 'cpuid|coretemp'

RMerl commented 4 years ago

Success here :)

... snip ...
Intel digital thermal sensor...                             Success!
    (driver `coretemp')
... snip ...
# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +47.0°C  (high = +105.0°C, crit = +105.0°C)

TurtleFX commented 4 years ago

Same here. Installed these kernel modules:

# lsmod | egrep 'cpuid|coretemp'
coretemp               16384  0
cpuid                  16384  0

Intel digital thermal sensor is detected and temperatures are shown:

# sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +40.0°C  (high = +105.0°C, crit = +105.0°C)

rushikeshjadhav commented 4 years ago

Thanks @TurtleFX @RMerl for testing through.

RMerl commented 4 years ago

Good job tracking this down.

Out of curiosity, is the issue upstream with Citrix, or upstream with kernel.org?

rushikeshjadhav commented 4 years ago

It looks like an upstream kernel issue. I'm checking with the kernel community. However, it is possible that this is an edge case caused by Xen security.

stormi commented 4 years ago

We now know more about it: recent Xen overrides the thermal and power management information exposed to guests, and the control domain is also a (privileged) guest. The reason for the override is that it never really worked correctly: dom0 would get information from the wrong real CPUs.

Unfortunately that change removes useful features.

@rushikeshjadhav may be able to give more details.

rushikeshjadhav commented 4 years ago

Correct. There are two CPU flags that coretemp requires in order to function. Xen hides these from Dom0 because this functionality is not correctly implemented there.

  1. PTS
  2. DTHERM

PTS is the Intel Package Thermal Sensor, which essentially reports a per-socket temperature rather than one for each core of the socket. The MSR from which the temperature values are read is readable from any domain, so one could have a special-purpose VM with its vCPUs pinned to certain pCPUs and expose thermal data that way.

Alternatively, a modified coretemp driver RPM could simply show the package temperature from whichever CPU it happens to run on.
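
A minimal Python sketch of how a package temperature can be derived from raw MSR values follows. The comment above does not name the MSRs, so this assumes the architectural IA32_PACKAGE_THERM_STATUS (0x1B1) and MSR_TEMPERATURE_TARGET (0x1A2) registers as documented in the Intel SDM; the function names and the sample raw values are invented for illustration, and read_msr assumes the Linux msr driver is loaded and the caller is root.

```python
import struct

# Assumed MSR addresses (Intel SDM), not taken from this thread.
MSR_TEMPERATURE_TARGET = 0x1A2
IA32_PACKAGE_THERM_STATUS = 0x1B1


def tjmax_c(target_raw):
    """TjMax in degrees C: bits 23:16 of MSR_TEMPERATURE_TARGET."""
    return (target_raw >> 16) & 0xFF


def package_temp_c(status_raw, target_raw):
    """Package temperature: TjMax minus the digital readout (bits 22:16)."""
    readout = (status_raw >> 16) & 0x7F
    return tjmax_c(target_raw) - readout


def read_msr(cpu, reg):
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr (needs the msr module and root)."""
    with open(f"/dev/cpu/{cpu}/msr", "rb") as dev:
        dev.seek(reg)
        return struct.unpack("<Q", dev.read(8))[0]


# Made-up raw values: TjMax of 105 and a readout of 58 degrees below it.
example_target = 105 << 16
example_status = 58 << 16
print(package_temp_c(example_status, example_target))  # 47
```

On real hardware the two raw values would come from read_msr(0, IA32_PACKAGE_THERM_STATUS) and read_msr(0, MSR_TEMPERATURE_TARGET); whether a given domain is allowed to read them depends on the hypervisor configuration.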

RMerl commented 4 years ago

I think having a custom version of coretemp through an RPM package would be an acceptable compromise (as long as it's well documented for people setting up a new Xen server and looking into monitoring their server's health).

rushikeshjadhav commented 4 years ago

I've made a temporary RPM available at https://github.com/rushikeshjadhav/coretemp. It reports the package temperature. It's a kernel driver, so consider it highly experimental.

stormi commented 4 years ago

Update: @rushikeshjadhav's kernel driver is now available in our repositories.

On XCP-ng 8.0 or 8.1:

yum install coretemp-module-alt
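
Once a coretemp driver is loaded, its readings are also available without lm-sensors through the standard Linux hwmon sysfs files. The Python sketch below assumes the usual hwmon layout (a name file plus tempN_input values in millidegrees Celsius); the function name is made up for the example, and the labels the alternative module exposes are not confirmed here.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon", chip="coretemp"):
    """Return {"hwmonN/label": degrees C} for every temp input of `chip`."""
    readings = {}
    for hwmon in sorted(Path(root).glob("hwmon*")):
        try:
            if (hwmon / "name").read_text().strip() != chip:
                continue
            for temp_input in sorted(hwmon.glob("temp*_input")):
                label_file = temp_input.with_name(
                    temp_input.name.replace("_input", "_label")
                )
                # Fall back to the file name when the driver provides no label.
                if label_file.is_file():
                    label = label_file.read_text().strip()
                else:
                    label = temp_input.name
                # hwmon reports temperatures in millidegrees Celsius.
                millideg = int(temp_input.read_text().strip())
                readings[f"{hwmon.name}/{label}"] = millideg / 1000.0
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip this device
    return readings


for key, temp in read_hwmon_temps().items():
    print(f"{key}: {temp:.1f} C")
```

If no matching hwmon device exists, the function returns an empty dict and nothing is printed.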

RMerl commented 4 years ago

The actual package name is coretemp-module-alt (just found it by browsing through the repo).

stormi commented 4 years ago

"The actual package name is coretemp-module-alt (just found it by browsing through the repo)."

Indeed! I've fixed my comment.