yosupo06 commented 6 months ago

Reported on Discord: https://discord.com/channels/1087310259447681114/1098365529217061014/1208898310321086574

yosupo06 commented 6 months ago

By launching multiple judge instances, it can be observed that

- some instances have stable performance
- some instances have unstable performance, e.g. just running ./a.out < max_line_00.in > /dev/null (a.out is compiled from the above source) takes anywhere between 60 and 120 ms at random
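
The run-to-run variation described above can be quantified with a small timing harness. This is only a sketch: `time_command` and `summarize` are hypothetical helper names, and the measured wall-clock time includes process start-up overhead, so it is an upper bound on the program's own run time.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=20, stdin_path=None):
    """Run `cmd` `runs` times and return the wall-clock durations in ms."""
    durations = []
    for _ in range(runs):
        # Feed the input file on stdin if given, otherwise an empty stdin.
        stdin = open(stdin_path, "rb") if stdin_path else subprocess.DEVNULL
        try:
            start = time.perf_counter()
            subprocess.run(cmd, stdin=stdin, stdout=subprocess.DEVNULL, check=True)
            durations.append((time.perf_counter() - start) * 1000.0)
        finally:
            if stdin_path:
                stdin.close()
    return durations


def summarize(durations):
    """Return min / median / max and the max/min ratio of the durations."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "spread": max(durations) / min(durations),
    }


# Hypothetical usage mirroring the command in this comment:
#   ms = time_command(["./a.out"], runs=20, stdin_path="max_line_00.in")
#   print(summarize(ms))
```

A "spread" close to 1 would correspond to a stable instance; the 60 ~ 120 ms range reported here would show up as a spread of about 2.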

yosupo06 commented 6 months ago

It seems the affected part is memory access? I ran mbw a few times. The "unstable" instance also has unstable results in this test

run mbw 10 -t2

stable instance

root@test-instance3:/home/yosupo# mbw 10 -t2
Long uses 8 bytes. Allocating 2*1310720 elements = 20971520 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MCBLOCK Elapsed: 0.00069        MiB: 10.00000   Copy: 14430.014 MiB/s
1       Method: MCBLOCK Elapsed: 0.00047        MiB: 10.00000   Copy: 21097.046 MiB/s
2       Method: MCBLOCK Elapsed: 0.00043        MiB: 10.00000   Copy: 23255.814 MiB/s
3       Method: MCBLOCK Elapsed: 0.00042        MiB: 10.00000   Copy: 23696.682 MiB/s
4       Method: MCBLOCK Elapsed: 0.00043        MiB: 10.00000   Copy: 23474.178 MiB/s
5       Method: MCBLOCK Elapsed: 0.00043        MiB: 10.00000   Copy: 23474.178 MiB/s
6       Method: MCBLOCK Elapsed: 0.00042        MiB: 10.00000   Copy: 23866.348 MiB/s
7       Method: MCBLOCK Elapsed: 0.00042        MiB: 10.00000   Copy: 23640.662 MiB/s
8       Method: MCBLOCK Elapsed: 0.00042        MiB: 10.00000   Copy: 23866.348 MiB/s
9       Method: MCBLOCK Elapsed: 0.00042        MiB: 10.00000   Copy: 23584.906 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00046        MiB: 10.00000   Copy: 21949.078 MiB/s
root@test-instance3:/home/yosupo# mbw 10 -t2
Long uses 8 bytes. Allocating 2*1310720 elements = 20971520 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MCBLOCK Elapsed: 0.00079        MiB: 10.00000   Copy: 12706.480 MiB/s
1       Method: MCBLOCK Elapsed: 0.00052        MiB: 10.00000   Copy: 19047.619 MiB/s
2       Method: MCBLOCK Elapsed: 0.00045        MiB: 10.00000   Copy: 22123.894 MiB/s
3       Method: MCBLOCK Elapsed: 0.00044        MiB: 10.00000   Copy: 22624.434 MiB/s
4       Method: MCBLOCK Elapsed: 0.00044        MiB: 10.00000   Copy: 22624.434 MiB/s
5       Method: MCBLOCK Elapsed: 0.00044        MiB: 10.00000   Copy: 22624.434 MiB/s
6       Method: MCBLOCK Elapsed: 0.00043        MiB: 10.00000   Copy: 23148.148 MiB/s
7       Method: MCBLOCK Elapsed: 0.00046        MiB: 10.00000   Copy: 21645.022 MiB/s
8       Method: MCBLOCK Elapsed: 0.00043        MiB: 10.00000   Copy: 23041.475 MiB/s
9       Method: MCBLOCK Elapsed: 0.00043        MiB: 10.00000   Copy: 23201.856 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00048        MiB: 10.00000   Copy: 20622.809 MiB/s

unstable instance

root@test-instance2:/home/yosupo# mbw 10 -t2
Long uses 8 bytes. Allocating 2*1310720 elements = 20971520 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MCBLOCK Elapsed: 0.00081        MiB: 10.00000   Copy: 12422.360 MiB/s
1       Method: MCBLOCK Elapsed: 0.00051        MiB: 10.00000   Copy: 19569.472 MiB/s
2       Method: MCBLOCK Elapsed: 0.00046        MiB: 10.00000   Copy: 21551.724 MiB/s
3       Method: MCBLOCK Elapsed: 0.00044        MiB: 10.00000   Copy: 22883.295 MiB/s
4       Method: MCBLOCK Elapsed: 0.00048        MiB: 10.00000   Copy: 21008.403 MiB/s
5       Method: MCBLOCK Elapsed: 0.00049        MiB: 10.00000   Copy: 20533.881 MiB/s
6       Method: MCBLOCK Elapsed: 0.00046        MiB: 10.00000   Copy: 21929.825 MiB/s
7       Method: MCBLOCK Elapsed: 0.00044        MiB: 10.00000   Copy: 22624.434 MiB/s
8       Method: MCBLOCK Elapsed: 0.00044        MiB: 10.00000   Copy: 22935.780 MiB/s
9       Method: MCBLOCK Elapsed: 0.00044        MiB: 10.00000   Copy: 22831.050 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00050        MiB: 10.00000   Copy: 20193.861 MiB/s
root@test-instance2:/home/yosupo# mbw 10 -t2
Long uses 8 bytes. Allocating 2*1310720 elements = 20971520 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0       Method: MCBLOCK Elapsed: 0.00152        MiB: 10.00000   Copy: 6565.988 MiB/s
1       Method: MCBLOCK Elapsed: 0.00136        MiB: 10.00000   Copy: 7369.197 MiB/s
2       Method: MCBLOCK Elapsed: 0.00131        MiB: 10.00000   Copy: 7627.765 MiB/s
3       Method: MCBLOCK Elapsed: 0.00135        MiB: 10.00000   Copy: 7407.407 MiB/s
4       Method: MCBLOCK Elapsed: 0.00135        MiB: 10.00000   Copy: 7407.407 MiB/s
5       Method: MCBLOCK Elapsed: 0.00138        MiB: 10.00000   Copy: 7262.164 MiB/s
6       Method: MCBLOCK Elapsed: 0.00141        MiB: 10.00000   Copy: 7092.199 MiB/s
7       Method: MCBLOCK Elapsed: 0.00133        MiB: 10.00000   Copy: 7547.170 MiB/s
8       Method: MCBLOCK Elapsed: 0.00137        MiB: 10.00000   Copy: 7304.602 MiB/s
9       Method: MCBLOCK Elapsed: 0.00141        MiB: 10.00000   Copy: 7107.321 MiB/s
AVG     Method: MCBLOCK Elapsed: 0.00138        MiB: 10.00000   Copy: 7257.421 MiB/s
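
To compare runs like the ones above without reading the numbers by eye, the per-iteration copy rates can be extracted from the mbw output. A minimal sketch, assuming the line format shown in these transcripts; `parse_mbw` and `spread` are hypothetical helper names, and the AVG line is skipped on purpose because it is not a separate measurement.

```python
import re

# Matches per-iteration lines such as:
# "0       Method: MCBLOCK Elapsed: 0.00069        MiB: 10.00000   Copy: 14430.014 MiB/s"
# Lines starting with "AVG" or a shell prompt do not match.
COPY_RE = re.compile(
    r"^(\d+)\s+Method:\s+(\S+)\s+Elapsed:\s+([\d.]+)\s+MiB:\s+([\d.]+)\s+Copy:\s+([\d.]+)\s+MiB/s"
)


def parse_mbw(text):
    """Return the list of copy rates (MiB/s) found in mbw output."""
    rates = []
    for line in text.splitlines():
        match = COPY_RE.match(line.strip())
        if match:
            rates.append(float(match.group(5)))
    return rates


def spread(rates):
    """Ratio between the fastest and the slowest iteration."""
    return max(rates) / min(rates)
```

Feeding the output of several `mbw 10 -t2` invocations on the same instance into `parse_mbw` and comparing the resulting lists makes the difference between the two test-instance2 runs explicit.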

yosupo06 commented 6 months ago

We use the c2-standard-4 instance type, which rents only part of a big node, so it may be affected by other users' applications (= noisy neighbor). But I feel 2x is too large...

Or maybe GCP just assigned me an "unstable" instance, e.g. one where the memory is far from the CPU? Like a NUMA thing.

yosupo06 commented 6 months ago

Tested with c2-standard-60 (= should rent an entire node). Note that I disabled SMT (= 30 physical cores and 30 vCPUs)

mbw test: stable

According to lscpu, it has 2 NUMA nodes

root@test-instance-big:/home/yosupo# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  30
  On-line CPU(s) list:   0-29
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 3.10GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  1
    Core(s) per socket:  15
    Socket(s):           2
    Stepping:            7
    BogoMIPS:            6200.46
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc
                         cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd
                         ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt
                         xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   960 KiB (30 instances)
  L1i:                   960 KiB (30 instances)
  L2:                    30 MiB (30 instances)
  L3:                    49.5 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-14
  NUMA node1 CPU(s):     15-29
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:              Not affected
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Mitigation; Enhanced IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; Clear CPU buffers; SMT Host state unknown

Note: lscpu of c2-standard-4

yosupo@test-instance:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 3.10GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  1
    Core(s) per socket:  2
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            6200.46
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc
                         cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd
                         ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt
                         xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   64 KiB (2 instances)
  L1i:                   64 KiB (2 instances)
  L2:                    2 MiB (2 instances)
  L3:                    24.8 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0,1
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:              Not affected
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Mitigation; Enhanced IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Mitigation; Clear CPU buffers; SMT Host state unknown

yosupo06 commented 6 months ago

On c2-standard-60, run mbw in 30 parallel processes with yes 100000 | head -n 30 | xargs -t -L1 -P30 mbw 10 -t2 -n

And we can see that the throughput drops a lot

8077    Method: MCBLOCK Elapsed: 0.00441        MiB: 10.00000   Copy: 2266.032 MiB/s
11564   Method: MCBLOCK Elapsed: 0.00190        MiB: 10.00000   Copy: 5274.262 MiB/s
7899    Method: MCBLOCK Elapsed: 0.00296        MiB: 10.00000   Copy: 3384.095 MiB/s
8041    Method: MCBLOCK Elapsed: 0.00299        MiB: 10.00000   Copy: 3340.013 MiB/s
7896    Method: MCBLOCK Elapsed: 0.00297        MiB: 10.00000   Copy: 3372.681 MiB/s
7699    Method: MCBLOCK Elapsed: 0.00300        MiB: 10.00000   Copy: 3333.333 MiB/s
10148   Method: MCBLOCK Elapsed: 0.00389        MiB: 10.00000   Copy: 2568.053 MiB/s
8378    Method: MCBLOCK Elapsed: 0.00250        MiB: 10.00000   Copy: 3993.610 MiB/s
9703    Method: MCBLOCK Elapsed: 0.00435        MiB: 10.00000   Copy: 2298.851 MiB/s
11765   Method: MCBLOCK Elapsed: 0.00193        MiB: 10.00000   Copy: 5184.033 MiB/s
6723    Method: MCBLOCK Elapsed: 0.00303        MiB: 10.00000   Copy: 3294.893 MiB/s
8353    Method: MCBLOCK Elapsed: 0.00304        MiB: 10.00000   Copy: 3293.808 MiB/s
11159   Method: MCBLOCK Elapsed: 0.00185        MiB: 10.00000   Copy: 5399.568 MiB/s
11423   Method: MCBLOCK Elapsed: 0.00187        MiB: 10.00000   Copy: 5347.594 MiB/s
8001    Method: MCBLOCK Elapsed: 0.00343        MiB: 10.00000   Copy: 2913.753 MiB/s
7780    Method: MCBLOCK Elapsed: 0.00302        MiB: 10.00000   Copy: 3311.258 MiB/s
8621    Method: MCBLOCK Elapsed: 0.00188        MiB: 10.00000   Copy: 5313.496 MiB/s
7611    Method: MCBLOCK Elapsed: 0.00303        MiB: 10.00000   Copy: 3301.420 MiB/s
6757    Method: MCBLOCK Elapsed: 0.00441        MiB: 10.00000   Copy: 2268.603 MiB/s
8142    Method: MCBLOCK Elapsed: 0.00389        MiB: 10.00000   Copy: 2568.053 MiB/s
7700    Method: MCBLOCK Elapsed: 0.00319        MiB: 10.00000   Copy: 3133.814 MiB/s

So... maybe the root cause is other users' usage?
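
The parallel experiment above can also be scripted so that the outputs are collected for later analysis. The following is a rough equivalent of the `xargs -P30` invocation, not the command actually used here; `run_parallel` is a hypothetical helper, and threads are only used to wait on the child processes, which run concurrently.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_parallel(cmd, copies):
    """Start `copies` instances of `cmd` at once and return their stdout."""

    def run_one(_index):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    # One worker thread per child process, so all children run concurrently.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(run_one, range(copies)))


# Hypothetical usage on a 30-core instance (requires mbw to be installed):
#   outputs = run_parallel(["mbw", "10", "-t2", "-n", "100000"], copies=30)
```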

yosupo06 commented 6 months ago

hypothesis: L3 cache

yosupo06 commented 6 months ago

mbw 10 -t2 reaches more than 20 GiB/s. However, mbw with a bigger size, e.g. mbw 100 -t2, returns lower results (around 6 GiB/s) => mbw 10 -t2 is probably served from the CPU cache.

mbw 10 touches 20 MiB of data (two 10 MiB arrays), which doesn't fit in the L1 / L2 cache, so it must be using the L3 cache.

In the C2 family (L3 cache is about 25 MiB), performance starts to degrade at mbw 20 -t2. On the other hand, in the C3 family (L3 cache is 105 MiB), we still see more than 20 GiB/s with mbw 30 -t2.
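
The arithmetic behind this hypothesis can be written down directly. This is a simplified model, assuming that the working set of `mbw N` is two arrays of N MiB each (as the "Allocating 2*1310720 elements" line suggests) and that the whole L3 is available to one process, which is not true on a shared node. The L3 sizes are the ones quoted in this thread.

```python
def mbw_working_set_mib(array_mib):
    """mbw copies between two arrays of `array_mib` MiB each."""
    return 2 * array_mib


def fits_in_l3(array_mib, l3_mib):
    """True if the whole mbw working set could sit in an L3 of `l3_mib` MiB."""
    return mbw_working_set_mib(array_mib) <= l3_mib


C2_L3_MIB = 24.8   # from the lscpu output of c2-standard-4
C3_L3_MIB = 105.0  # figure quoted in this comment

for size in (10, 20, 30, 100):
    print(size, fits_in_l3(size, C2_L3_MIB), fits_in_l3(size, C3_L3_MIB))
```

Under this model `mbw 10` fits in the C2 L3 but `mbw 20` does not, while `mbw 30` still fits in the C3 L3, which is consistent with the observations above.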

yosupo06 commented 6 months ago

So I'm almost sure the root cause is the L3 cache + noisy neighbor problem... but if this is true, I can't come up with a solution.

Do other cloud platforms, e.g. AWS / Azure, have the same problem? (probably yes)

yosupo06 commented 6 months ago

#399 tries to change the judge instance type from c2 to c3. At least in the us-east1 region, c3 seems to be more stable than c2.

yosupo06 commented 6 months ago

The C2D instance is interesting. It uses the EPYC 7B13 family, which probably contains 112 physical cores. However, the lscpu output of c2d-standard-112 (the maximum size) with the 1 vCPU = 1 physical core setting is as follows.

Cores per socket is 28, and each CCX contains 4 cores (56 CPUs share 14 L3 instances), whereas each CCX of T2D (which also uses the 7B13) contains 8 cores. So I guess GCP may disable half of the cores, probably to get a higher and more stable CPU clock? Anyway, if this is correct, we can get a dedicated CCX with c2d-standard-8, which still has a reasonable cost

yosupo@test-instance:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  56
  On-line CPU(s) list:   0-55
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7B13
    CPU family:          25
    Model:               1
    Thread(s) per core:  1
    Core(s) per socket:  28
    Socket(s):           2
    Stepping:            0
    BogoMIPS:            6099.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc
                         rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c
                         rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust
                         bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip vaes vpclmulqdq
                         rdpid fsrm
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1.8 MiB (56 instances)
  L1i:                   1.8 MiB (56 instances)
  L2:                    28 MiB (56 instances)
  L3:                    448 MiB (14 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-27
  NUMA node1 CPU(s):     28-55
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Vulnerable: Safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
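
The "4 cores per CCX" figure can be derived from the lscpu output by dividing the CPU count by the number of L3 instances. A minimal sketch, assuming the lscpu layout shown in this thread and "Thread(s) per core: 1" (with SMT enabled the result would count hardware threads, not cores); `cores_per_l3` is a hypothetical helper name.

```python
import re


def cores_per_l3(lscpu_text):
    """Return (CPUs per L3 instance, MiB of L3 per instance) from lscpu output."""
    # "CPU(s):" at the start of a line; indented lines such as
    # "  On-line CPU(s) list:" or "  NUMA node0 CPU(s):" do not match.
    cpus = int(re.search(r"^CPU\(s\):\s+(\d+)", lscpu_text, re.MULTILINE).group(1))
    l3 = re.search(
        r"^\s*L3:\s+([\d.]+)\s+MiB\s+\((\d+) instances?\)", lscpu_text, re.MULTILINE
    )
    l3_total_mib = float(l3.group(1))
    l3_instances = int(l3.group(2))
    return cpus / l3_instances, l3_total_mib / l3_instances
```

For the c2d-standard-112 output above (56 CPUs, 448 MiB of L3 in 14 instances) this gives 4 CPUs and 32 MiB per L3 instance.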

yosupo06 commented 6 months ago

I tested a few instance types, and the c2d and c3 instance types seem to be stable

c2d
- we can probably allocate an entire CCX with c2d-highcpu-8
- it uses AMD EPYC; AFAIK most online judges use Intel CPUs, so this may cause some trouble when we tune programs for other judges
- we can't use AVX-512
- it belongs to the compute-optimized machine family
c3
- because the L3 cache of an Intel CPU is shared by all cores, it is hard to get a dedicated L3 cache
- we can use AVX-512

yosupo06 commented 6 months ago

Just tried changing the instance type to c2d

yosupo06 commented 4 months ago

The situation has probably improved? Let's close this

yosupo06 / library-checker-judge

Judge performance is unstable #396
