Closed yosupo06 closed 4 months ago
By launching multiple judge instances, I can observe that ./a.out < max_line_00.in > /dev/null (where a.out is compiled from the source above) takes anywhere between 60 and 120 ms, seemingly at random.
The affected part seems to be memory access. I ran mbw a few times, and the "unstable" instance also gives unstable results in this test.
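The run-to-run variation described above can be quantified by timing the same command repeatedly. This is a minimal sketch, not the judge's actual measurement code: `time_command` and `summarize` are hypothetical helper names, and wall-clock time around `subprocess.run` includes process start-up overhead.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=20, stdin_path=None):
    """Run cmd `runs` times and return the wall-clock time of each run in ms."""
    timings_ms = []
    for _ in range(runs):
        # Feed the input file on stdin, mirroring `./a.out < max_line_00.in`.
        stdin = open(stdin_path, "rb") if stdin_path else subprocess.DEVNULL
        try:
            start = time.perf_counter()
            subprocess.run(cmd, stdin=stdin, stdout=subprocess.DEVNULL, check=True)
            timings_ms.append((time.perf_counter() - start) * 1000.0)
        finally:
            if stdin_path:
                stdin.close()
    return timings_ms


def summarize(timings_ms):
    """Return min, median and max so the spread between runs is easy to compare."""
    return {
        "min": min(timings_ms),
        "median": statistics.median(timings_ms),
        "max": max(timings_ms),
    }
```

For example, `summarize(time_command(["./a.out"], runs=20, stdin_path="max_line_00.in"))` would report the spread on one instance. A stable instance should show min and max close together.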
run mbw 10 -t2
stable instance
root@test-instance3:/home/yosupo# mbw 10 -t2
Long uses 8 bytes. Allocating 2*1310720 elements = 20971520 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0 Method: MCBLOCK Elapsed: 0.00069 MiB: 10.00000 Copy: 14430.014 MiB/s
1 Method: MCBLOCK Elapsed: 0.00047 MiB: 10.00000 Copy: 21097.046 MiB/s
2 Method: MCBLOCK Elapsed: 0.00043 MiB: 10.00000 Copy: 23255.814 MiB/s
3 Method: MCBLOCK Elapsed: 0.00042 MiB: 10.00000 Copy: 23696.682 MiB/s
4 Method: MCBLOCK Elapsed: 0.00043 MiB: 10.00000 Copy: 23474.178 MiB/s
5 Method: MCBLOCK Elapsed: 0.00043 MiB: 10.00000 Copy: 23474.178 MiB/s
6 Method: MCBLOCK Elapsed: 0.00042 MiB: 10.00000 Copy: 23866.348 MiB/s
7 Method: MCBLOCK Elapsed: 0.00042 MiB: 10.00000 Copy: 23640.662 MiB/s
8 Method: MCBLOCK Elapsed: 0.00042 MiB: 10.00000 Copy: 23866.348 MiB/s
9 Method: MCBLOCK Elapsed: 0.00042 MiB: 10.00000 Copy: 23584.906 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00046 MiB: 10.00000 Copy: 21949.078 MiB/s
root@test-instance3:/home/yosupo# mbw 10 -t2
Long uses 8 bytes. Allocating 2*1310720 elements = 20971520 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0 Method: MCBLOCK Elapsed: 0.00079 MiB: 10.00000 Copy: 12706.480 MiB/s
1 Method: MCBLOCK Elapsed: 0.00052 MiB: 10.00000 Copy: 19047.619 MiB/s
2 Method: MCBLOCK Elapsed: 0.00045 MiB: 10.00000 Copy: 22123.894 MiB/s
3 Method: MCBLOCK Elapsed: 0.00044 MiB: 10.00000 Copy: 22624.434 MiB/s
4 Method: MCBLOCK Elapsed: 0.00044 MiB: 10.00000 Copy: 22624.434 MiB/s
5 Method: MCBLOCK Elapsed: 0.00044 MiB: 10.00000 Copy: 22624.434 MiB/s
6 Method: MCBLOCK Elapsed: 0.00043 MiB: 10.00000 Copy: 23148.148 MiB/s
7 Method: MCBLOCK Elapsed: 0.00046 MiB: 10.00000 Copy: 21645.022 MiB/s
8 Method: MCBLOCK Elapsed: 0.00043 MiB: 10.00000 Copy: 23041.475 MiB/s
9 Method: MCBLOCK Elapsed: 0.00043 MiB: 10.00000 Copy: 23201.856 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00048 MiB: 10.00000 Copy: 20622.809 MiB/s
unstable instance
root@test-instance2:/home/yosupo# mbw 10 -t2
Long uses 8 bytes. Allocating 2*1310720 elements = 20971520 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0 Method: MCBLOCK Elapsed: 0.00081 MiB: 10.00000 Copy: 12422.360 MiB/s
1 Method: MCBLOCK Elapsed: 0.00051 MiB: 10.00000 Copy: 19569.472 MiB/s
2 Method: MCBLOCK Elapsed: 0.00046 MiB: 10.00000 Copy: 21551.724 MiB/s
3 Method: MCBLOCK Elapsed: 0.00044 MiB: 10.00000 Copy: 22883.295 MiB/s
4 Method: MCBLOCK Elapsed: 0.00048 MiB: 10.00000 Copy: 21008.403 MiB/s
5 Method: MCBLOCK Elapsed: 0.00049 MiB: 10.00000 Copy: 20533.881 MiB/s
6 Method: MCBLOCK Elapsed: 0.00046 MiB: 10.00000 Copy: 21929.825 MiB/s
7 Method: MCBLOCK Elapsed: 0.00044 MiB: 10.00000 Copy: 22624.434 MiB/s
8 Method: MCBLOCK Elapsed: 0.00044 MiB: 10.00000 Copy: 22935.780 MiB/s
9 Method: MCBLOCK Elapsed: 0.00044 MiB: 10.00000 Copy: 22831.050 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00050 MiB: 10.00000 Copy: 20193.861 MiB/s
root@test-instance2:/home/yosupo# mbw 10 -t2
Long uses 8 bytes. Allocating 2*1310720 elements = 20971520 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 10 runs per test.
0 Method: MCBLOCK Elapsed: 0.00152 MiB: 10.00000 Copy: 6565.988 MiB/s
1 Method: MCBLOCK Elapsed: 0.00136 MiB: 10.00000 Copy: 7369.197 MiB/s
2 Method: MCBLOCK Elapsed: 0.00131 MiB: 10.00000 Copy: 7627.765 MiB/s
3 Method: MCBLOCK Elapsed: 0.00135 MiB: 10.00000 Copy: 7407.407 MiB/s
4 Method: MCBLOCK Elapsed: 0.00135 MiB: 10.00000 Copy: 7407.407 MiB/s
5 Method: MCBLOCK Elapsed: 0.00138 MiB: 10.00000 Copy: 7262.164 MiB/s
6 Method: MCBLOCK Elapsed: 0.00141 MiB: 10.00000 Copy: 7092.199 MiB/s
7 Method: MCBLOCK Elapsed: 0.00133 MiB: 10.00000 Copy: 7547.170 MiB/s
8 Method: MCBLOCK Elapsed: 0.00137 MiB: 10.00000 Copy: 7304.602 MiB/s
9 Method: MCBLOCK Elapsed: 0.00141 MiB: 10.00000 Copy: 7107.321 MiB/s
AVG Method: MCBLOCK Elapsed: 0.00138 MiB: 10.00000 Copy: 7257.421 MiB/s
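To compare the stable and unstable instances numerically instead of by eye, the per-run lines of the mbw output above can be parsed. This is a sketch that assumes the exact line layout shown in the logs. `copy_rates` and `spread` are hypothetical helper names.

```python
import re
import statistics

# Matches per-run lines such as
# "3 Method: MCBLOCK Elapsed: 0.00042 MiB: 10.00000 Copy: 23696.682 MiB/s".
# The "AVG" summary line is skipped because it does not start with a run number.
RUN_LINE = re.compile(
    r"^\s*\d+\s+Method:\s+\S+\s+Elapsed:\s+[\d.]+\s+MiB:\s+[\d.]+\s+Copy:\s+([\d.]+)\s+MiB/s"
)


def copy_rates(mbw_output):
    """Extract the copy rate (MiB/s) of every individual run."""
    rates = []
    for line in mbw_output.splitlines():
        match = RUN_LINE.match(line)
        if match:
            rates.append(float(match.group(1)))
    return rates


def spread(rates):
    """Summarize the rates; cv is the population standard deviation over the mean."""
    mean = statistics.fmean(rates)
    return {
        "mean": mean,
        "min": min(rates),
        "max": max(rates),
        "cv": statistics.pstdev(rates) / mean,
    }
```

Comparing the `mean` of two invocations on the same instance makes the drop from roughly 20000 MiB/s to roughly 7000 MiB/s on the unstable instance explicit.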
We use the c2-standard-4 instance type, which rents only part of a big node, so it may be affected by other users' applications (the noisy neighbor problem). But a 2x difference feels too large...
Or did GCP just assign me an "unstable" instance, e.g. one where the CPU is far from its memory, like a NUMA effect?
Tested with c2-standard-60 (which should rent an entire node). Note that I disabled SMT (so 30 physical cores and 30 vCPUs).
mbw test: stable
According to lscpu, it has 2 NUMA nodes:
root@test-instance-big:/home/yosupo# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 30
On-line CPU(s) list: 0-29
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 3.10GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 15
Socket(s): 2
Stepping: 7
BogoMIPS: 6200.46
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 960 KiB (30 instances)
L1i: 960 KiB (30 instances)
L2: 30 MiB (30 instances)
L3: 49.5 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-14
NUMA node1 CPU(s): 15-29
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Meltdown: Not affected
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Retbleed: Mitigation; Enhanced IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
Note: lscpu of c2-standard-4:
yosupo@test-instance:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 3.10GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: 7
BogoMIPS: 6200.46
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 64 KiB (2 instances)
L1i: 64 KiB (2 instances)
L2: 2 MiB (2 instances)
L3: 24.8 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Meltdown: Not affected
Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Retbleed: Mitigation; Enhanced IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Mitigation; Clear CPU buffers; SMT Host state unknown
On c2-standard-60, run mbw in 30 parallel processes with yes 100000 | head -n 30 | xargs -t -L1 -P30 mbw 10 -t2 -n
The score drops considerably:
8077 Method: MCBLOCK Elapsed: 0.00441 MiB: 10.00000 Copy: 2266.032 MiB/s
11564 Method: MCBLOCK Elapsed: 0.00190 MiB: 10.00000 Copy: 5274.262 MiB/s
7899 Method: MCBLOCK Elapsed: 0.00296 MiB: 10.00000 Copy: 3384.095 MiB/s
8041 Method: MCBLOCK Elapsed: 0.00299 MiB: 10.00000 Copy: 3340.013 MiB/s
7896 Method: MCBLOCK Elapsed: 0.00297 MiB: 10.00000 Copy: 3372.681 MiB/s
7699 Method: MCBLOCK Elapsed: 0.00300 MiB: 10.00000 Copy: 3333.333 MiB/s
10148 Method: MCBLOCK Elapsed: 0.00389 MiB: 10.00000 Copy: 2568.053 MiB/s
8378 Method: MCBLOCK Elapsed: 0.00250 MiB: 10.00000 Copy: 3993.610 MiB/s
9703 Method: MCBLOCK Elapsed: 0.00435 MiB: 10.00000 Copy: 2298.851 MiB/s
11765 Method: MCBLOCK Elapsed: 0.00193 MiB: 10.00000 Copy: 5184.033 MiB/s
6723 Method: MCBLOCK Elapsed: 0.00303 MiB: 10.00000 Copy: 3294.893 MiB/s
8353 Method: MCBLOCK Elapsed: 0.00304 MiB: 10.00000 Copy: 3293.808 MiB/s
11159 Method: MCBLOCK Elapsed: 0.00185 MiB: 10.00000 Copy: 5399.568 MiB/s
11423 Method: MCBLOCK Elapsed: 0.00187 MiB: 10.00000 Copy: 5347.594 MiB/s
8001 Method: MCBLOCK Elapsed: 0.00343 MiB: 10.00000 Copy: 2913.753 MiB/s
7780 Method: MCBLOCK Elapsed: 0.00302 MiB: 10.00000 Copy: 3311.258 MiB/s
8621 Method: MCBLOCK Elapsed: 0.00188 MiB: 10.00000 Copy: 5313.496 MiB/s
7611 Method: MCBLOCK Elapsed: 0.00303 MiB: 10.00000 Copy: 3301.420 MiB/s
6757 Method: MCBLOCK Elapsed: 0.00441 MiB: 10.00000 Copy: 2268.603 MiB/s
8142 Method: MCBLOCK Elapsed: 0.00389 MiB: 10.00000 Copy: 2568.053 MiB/s
7700 Method: MCBLOCK Elapsed: 0.00319 MiB: 10.00000 Copy: 3133.814 MiB/s
So... maybe the root cause is other users' usage?
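The parallel run above can also be reproduced from a script, which makes it easier to vary the number of concurrent copies. This is a sketch under assumptions: `run_parallel` is a hypothetical helper, and each copy is drained by its own thread so that a full pipe never stalls a benchmark process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_parallel(cmd, copies):
    """Start `copies` processes of cmd at the same time and return each stdout."""

    def run_one(_):
        # capture_output reads the pipe while the process runs.
        return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

    # One worker thread per process, so all copies really run concurrently.
    with ThreadPoolExecutor(max_workers=copies) as pool:
        return list(pool.map(run_one, range(copies)))
```

For example, `run_parallel(["mbw", "10", "-t2", "-n", "100"], copies=30)` mirrors the xargs invocation, with a smaller run count chosen here only to keep the captured output short.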
Hypothesis: L3 cache
mbw 10 -t2 reaches >20 GiB/s. However, mbw with a bigger size, e.g. mbw 100 -t2, returns lower results (around 6 GiB/s) => mbw 10 -t2 is probably served from the CPU cache.
mbw 10 allocates 20 MiB of data, which shouldn't fit in the L1 / L2 cache, so it should be using the L3 cache.
In the C2 family (L3 cache is 25 MB), performance starts to degrade at mbw 20 -t2. On the other hand, in the C3 family (L3 cache is 105 MiB), we still see >20 GiB/s with mbw 30 -t2.
So I am almost sure the root cause is the L3 cache + noisy neighbor problem... but if this is true, I can't come up with a solution.
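The arithmetic behind this hypothesis can be written down directly. It is only a rough model: it assumes the working set is just the two arrays mbw allocates (the logs show 2*1310720 elements of 8 bytes for mbw 10) and uses the approximate L3 sizes quoted above. `working_set_mib` and `fits_in_l3` are hypothetical helper names.

```python
def working_set_mib(array_mib):
    """mbw allocates two arrays (source and destination) of the given size."""
    return 2 * array_mib


def fits_in_l3(array_mib, l3_mib):
    """True if both arrays together are no larger than the L3 cache."""
    return working_set_mib(array_mib) <= l3_mib


# L3 sizes are the approximate figures from the discussion, not measured values.
for family, l3_mib, array_mib in [("C2", 25, 10), ("C2", 25, 20), ("C3", 105, 30)]:
    verdict = "fits in" if fits_in_l3(array_mib, l3_mib) else "exceeds"
    print(
        f"{family}: mbw {array_mib} touches {working_set_mib(array_mib)} MiB, "
        f"which {verdict} the {l3_mib} MiB L3"
    )
```

Under this model, mbw 20 on C2 (40 MiB against a 25 MiB L3) is the first size that spills out of the cache. mbw 30 on C3 (60 MiB against 105 MiB) still fits. Both match the sizes at which the behavior changes.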
The C2D instance is interesting. It uses the EPYC 7B13 family, which probably contains 112 physical cores. However, the lscpu output of c2d-standard-112 (the maximum size) with the 1 vCPU = 1 physical core setting is as follows.
Cores per socket is 28, and each CCX contains 4 cores (56 CPUs share 14 L3 instances), whereas each CCX of T2D (which also uses the 7B13) contains 8 cores. So I guess GCP may disable half of the cores, probably to get a higher and more stable CPU clock? Anyway, if this is correct, we can get a dedicated CCX with c2d-standard-8, which is still a reasonable cost.
yosupo@test-instance:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7B13
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 28
Socket(s): 2
Stepping: 0
BogoMIPS: 6099.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip vaes vpclmulqdq rdpid fsrm
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 1.8 MiB (56 instances)
L1i: 1.8 MiB (56 instances)
L2: 28 MiB (56 instances)
L3: 448 MiB (14 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-27
NUMA node1 CPU(s): 28-55
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Vulnerable: Safe RET, no microcode
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
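The "4 cores per CCX" figure can be derived from the lscpu output above by dividing the CPU count by the number of L3 instances. This is a sketch that assumes one thread per core (as in these logs) and the lscpu layout shown here. `cores_per_l3` is a hypothetical helper name.

```python
import re


def cores_per_l3(lscpu_output):
    """Return (CPUs sharing one L3 instance, MiB of L3 per instance)."""
    # Only the line that starts with "CPU(s):" matches, so "On-line CPU(s) list"
    # and "NUMA node0 CPU(s)" are ignored.
    cpus = int(re.search(r"^\s*CPU\(s\):\s+(\d+)", lscpu_output, re.M).group(1))
    l3 = re.search(r"^\s*L3:\s+([\d.]+)\s+MiB\s+\((\d+) instances?\)", lscpu_output, re.M)
    total_mib, instances = float(l3.group(1)), int(l3.group(2))
    return cpus // instances, total_mib / instances
```

For the c2d-standard-112 output this gives 56 / 14 = 4 CPUs and 448 / 14 = 32 MiB per L3 instance. That is where the idea of getting a dedicated CCX comes from.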
I tested a few instance types, and the c2d and c3 instance types seem to be stable.
c2d
c2d-highcpu-8
c3
I just changed the instance type to c2d.
The situation has probably improved, so let's close this.
Reported in Discord: https://discord.com/channels/1087310259447681114/1098365529217061014/1208898310321086574