wenyuzhao / mmtk-core

Memory Management ToolKit
https://www.mmtk.io

LXR Crash on HBase Launch #69

Open kk2049 opened 8 months ago

kk2049 commented 8 months ago

Hello,

I am currently conducting performance testing on LXR with HBase and have encountered some issues. The commit ID of mmtk-core I am using is 333ffb8ad9f9089fcad4338eae0ca9bf81d4955e, which is the latest commit of branch lxr-2023-07-09. Below is the reproduction process for the identified bug.

wget https://archive.apache.org/dist/hbase/2.4.14/hbase-2.4.14-bin.tar.gz
tar -zxvf hbase-2.4.14-bin.tar.gz

export MMTK_PLAN=Immix 
export TRACE_THRESHOLD2=10 
export LOCK_FREE_BLOCKS=32 
export MAX_SURVIVAL_MB=256 
export SURVIVAL_PREDICTOR_WEIGHTED=1
export JAVA_HOME="MY-LXR-JDK-HOME"
export HBASE_OPTS="-Xms1650m -Xmx1650m  -XX:+UseThirdPartyHeap "

bash ./hbase-2.4.14/bin/start-hbase.sh

After the script finishes, HBase does not start correctly. Examining the logs under ./hbase-2.4.14/log shows the following error messages.

2024-03-01 02:05:27,911 ERROR [main] impl.MetricsSourceAdapter: Error getting metrics from source JvmMetrics
java.lang.IllegalArgumentException: committed = 9223372036854775807 should be < max = 1730150400
    at java.management/java.lang.management.MemoryUsage.<init>(MemoryUsage.java:166)
    at java.management/sun.management.MemoryImpl.getMemoryUsage0(Native Method)
    at java.management/sun.management.MemoryImpl.getHeapMemoryUsage(MemoryImpl.java:71)
    at org.apache.hadoop.metrics2.source.JvmMetrics.getMemoryUsage(JvmMetrics.java:146)
    at org.apache.hadoop.metrics2.source.JvmMetrics.getMetrics(JvmMetrics.java:138)
    at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMetrics(MetricsSourceAdapter.java:200)
    at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.updateJmxCache(MetricsSourceAdapter.java:183)
    at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.getMBeanInfo(MetricsSourceAdapter.java:156)
    at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getNewMBeanClassName(DefaultMBeanServerInterceptor.java:329)
    at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:315)
    at java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
    at org.apache.hadoop.metrics2.util.MBeans.register(MBeans.java:66)
    at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.startMBeans(MetricsSourceAdapter.java:223)
    at org.apache.hadoop.metrics2.impl.MetricsSourceAdapter.start(MetricsSourceAdapter.java:101)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.registerSource(MetricsSystemImpl.java:268)
    at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:233)
    at org.apache.hadoop.metrics2.source.JvmMetrics.create(JvmMetrics.java:111)
    at org.apache.hadoop.metrics2.source.JvmMetrics$Singleton.init(JvmMetrics.java:61)
    at org.apache.hadoop.metrics2.source.JvmMetrics.initSingleton(JvmMetrics.java:120)
    at org.apache.hadoop.hbase.metrics.BaseSourceImpl$DefaultMetricsSystemInitializer.init(BaseSourceImpl.java:54)
    at org.apache.hadoop.hbase.metrics.BaseSourceImpl.<init>(BaseSourceImpl.java:111)
    at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.<init>(MetricsIOSourceImpl.java:42)
    at org.apache.hadoop.hbase.io.MetricsIOSourceImpl.<init>(MetricsIOSourceImpl.java:37)
    at org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactoryImpl.createIO(MetricsRegionServerSourceFactoryImpl.java:95)
    at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:32)
    at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:172)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:591)
    at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:422)
    at org.apache.hadoop.hbase.master.HMasterCommandLine$LocalHMaster.<init>(HMasterCommandLine.java:318)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:122)
    at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:222)
    at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:169)
    at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:112)
    at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:240)
    at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:145)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:144)
    at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2934)

Thank you for your attention to this matter. Let me know if you need any further information.

wenyuzhao commented 8 months ago

Hi @kk2049, thanks for reporting the error. I can confirm that the error exists even with the latest MMTk. Our OpenJDK binding does not yet fully support some interfaces, e.g., inspecting heap usage.

Another issue I noticed is that you were probably running Immix, not LXR. Immix is a fully stop-the-world, single-generation, full-heap GC that differs from LXR in many ways. At the time of our artifact evaluation, we hadn't had enough time to clean up the codebase, so the version we submitted for evaluation was piggybacked directly on the Immix GC. Since May 2022, LXR has undergone significant refactoring; one change (https://github.com/wenyuzhao/mmtk-core/commit/b00a332be919f84a06b86a595598e02ff21ab155) moved the code into a separate GC plan that no longer piggybacks on Immix. You should now use export MMTK_PLAN=LXR to enable it correctly.

If you are building the JDK with MMTk/LXR on your own, please remove any additional cargo features. Also, for precise performance measurement: MMTk by default counts all malloc-ed metadata as part of the Java heap size, which differs from OpenJDK GCs. Please enable the Rust feature no_meta_counting in mmtk-core to disable this behavior, for a fair comparison with OpenJDK GCs.
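
As a rough sketch of what enabling that feature could look like (assuming you build the Rust crates with cargo directly; the binding's actual build scripts may forward features differently):

# Hedged sketch: enabling no_meta_counting when building mmtk-core directly ...
cargo build --release --features no_meta_counting
# ... or, when building a binding crate that depends on mmtk, enable the
# dependency's feature (adjust to your own build setup):
cargo build --release --features mmtk/no_meta_counting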

Also, the runtime GC parameters TRACE_THRESHOLD2, LOCK_FREE_BLOCKS, MAX_SURVIVAL_MB, and SURVIVAL_PREDICTOR_WEIGHTED should be removed. These parameters have changed significantly over the past year, and their current default values are reasonably well tuned and should just work.
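
Putting the two corrections together, the original reproduction environment would look roughly like this (same heap sizes and HBase script as in the report):

# Corrected environment for the original reproduction steps:
# select the LXR plan and drop the outdated tuning variables.
export MMTK_PLAN=LXR
export JAVA_HOME="MY-LXR-JDK-HOME"
export HBASE_OPTS="-Xms1650m -Xmx1650m -XX:+UseThirdPartyHeap"

bash ./hbase-2.4.14/bin/start-hbase.sh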

wenyuzhao commented 8 months ago

Another issue related to the JVM args: Based on what I found in your AE repo, -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2 did not work with MMTk in July 2023. LXR would always use all the hardware threads for both concurrent and stop-the-world GC. Using export MMTK_THREADS=8 would set both parallel and concurrent GC threads to 8.

However, the latest LXR in the LXR branch has fixed this issue and should now respect the numbers specified in -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2.
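
To illustrate the two options above, using the HBase setup from the original report as the example (a sketch, not the exact AE configuration):

# Older LXR builds (e.g. the July 2023 snapshot) ignore the HotSpot flags,
# so cap the GC threads at the MMTk level instead:
export MMTK_THREADS=8

# The latest LXR branch is reported to respect the standard HotSpot flags:
export HBASE_OPTS="-Xms1650m -Xmx1650m -XX:+UseThirdPartyHeap -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2"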

wenyuzhao commented 8 months ago

Re-submitted as an upstream issue: https://github.com/mmtk/mmtk-openjdk/issues/270

kk2049 commented 8 months ago

Thank you for providing valuable information. We are currently in the process of re-evaluating the performance of LXR based on your feedback. However, it appears that the performance we are observing still differs from what is documented in the performance history at https://github.com/wenyuzhao/mmtk-core/blob/lxr-2023-07-09/performance-history/history.md. Could you kindly share the parameters used to reproduce the data mentioned in the performance history?

Additionally, are there any recommended stable versions of LXR that you believe would be more suitable for our testing environment? We greatly appreciate your guidance!

wenyuzhao commented 8 months ago

We use an automation tool called running-ng for precise and reproducible performance evaluation. I use this tool to periodically monitor LXR's performance over time and check that it still matches the claims we made in the paper.

For example, the config for the latest run is stored in these two files. Some notable details:

Here are two lusearch logs produced by this evaluation. Each file contains the 40 invocations of LXR or G1. If you manually expand the modifiers in runbms.yml, they should match the first line of these two log files. The numbers in the log files should also match the graphs here: http://squirrel.anu.edu.au/plotty-public/wenyuz/v8/p/CaGxpz

The whole evaluation is a relatively automated process. You should be able to reproduce similar results if you use the same parameters, the same commits, and roughly the same machine specs.

You can also try using running-ng directly to run the benchmark. You need to update runbms.yml in the following places:

With all these fixed, use the following command to run the benchmark:

running runbms -p lxr-xput-regression ./logs runbms.yml 12 7

if there are any recommended stable versions of LXR

LXR is not yet fully stable, but the above evaluation run had the best performance results, so I'd suggest using those commits for your evaluation. You can also use the latest LXR commits; the performance should be a little better than in this evaluation (I haven't uploaded the results yet).

wenyuzhao commented 8 months ago

Sorry, one thing I forgot to mention: for a fair comparison with ZGC, you may want to disable compressed pointers. However, I'm dealing with some production issues right now, and I believe one of my workarounds only works when compressed oops are enabled; otherwise it will crash. Recent versions of LXR probably can't run without compressed pointers (but lxr-2023-07-09 probably can).
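
For reference, disabling compressed pointers just uses the standard HotSpot flag, roughly as below ($LXR_JDK is a placeholder for your MMTk JDK build; whether a given LXR build tolerates the flag is exactly the caveat above):

# Hedged sketch: -XX:-UseCompressedOops disables compressed pointers.
# Recent LXR builds may crash with this; lxr-2023-07-09 probably works.
$LXR_JDK/jdk/bin/java -XX:+UseThirdPartyHeap -XX:-UseCompressedOops -Xms1650m -Xmx1650m -version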

wenyuzhao commented 8 months ago

Hi @kk2049, I've recently taken a look at the AE repo again, particularly focusing on the script used for running the dacapo benchmarks. I noticed a few potential areas for refinement in the methodology that you're currently employing.

Firstly, a single benchmark invocation cannot produce a stable result, and hence a strong conclusion; there is noise across runs. It would be better to run more invocations, perhaps ~40 or more. Note that this is different from the -n 6 argument passed to dacapo, which controls warmup within a single invocation. The overall result should be the average or geomean across all invocations, with error bars; see the sketch below.
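
As a minimal sketch of that methodology (the benchmark, heap size, and paths below are placeholders rather than the exact AE setup):

# Run many independent JVM invocations; -n still controls in-process warmup
# within each invocation. Aggregate (mean/geomean with error bars) afterwards.
mkdir -p logs
for i in $(seq 1 40); do
  MMTK_PLAN=LXR "$LXR_JDK/jdk/bin/java" -XX:+UseThirdPartyHeap -Xms2382m -Xmx2382m \
    -jar "$DACAPO_PATH" -n 6 h2 > "logs/h2.lxr.$i.log" 2>&1
done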

Secondly, it is not a fair comparison when two JDK configs differ in more than one variable. For instance:

Lastly, it would be better to use the latest stable release of dacapo-23.11-chopin. The version you were using, dacapo-evaluation-git-b00bfa9, is a very early-stage snapshot (~2 years ago) and many issues have been resolved since then in the stable release.

We follow the above methodology in all our evaluations as well. I hope this is helpful both for your evaluation and for reproducing our results.

kk2049 commented 8 months ago

Thank you for your previous guidance! @wenyuzhao

We are working on addressing some of the shortcomings you mentioned. As you may know, we have a modified version of h2 which includes an additional parameter to limit the instantaneous request emission rate to a given value. Currently, we are measuring the response-time curves of both LXR and G1. The LXR launch command has been adjusted based on your instructions, and the LXR version has been updated according to your recommendation. Could you please assess whether the data we have obtained looks reasonable, and whether we are using LXR correctly?

Your guidance is highly valued and appreciated!

Launch script

# LXR
MMTK_PLAN=LXR numactl -C 0-7 -m0 $LXR_JDK/jdk/bin/java -XX:+UseThirdPartyHeap -XX:MaxMetaspaceSize=1024m -XX:MetaspaceSize=1024m -Xmx2382m -Xms2382m -Xlog:gc=info -jar $DACAPO_PATH h2 --throttle $Throughput

# G1
numactl -C 0-7 -m0 $JDK11/java -XX:MaxMetaspaceSize=1024m -XX:MetaspaceSize=1024m -Xmx2382m -Xms2382m  -XX:ParallelGCThreads=8 -XX:ConcGCThreads=2 -XX:InitiatingHeapOccupancyPercent=65 -XX:+UseG1GC -Xlog:gc=info -jar $DACAPO_PATH h2 --throttle $Throughput

RT Curve

G1
Throughput  metered-p99
1970.96     4319.8  
3882.24     3985    
5738.16     2650.8  
7535.32     3333.8  
9205.94     5249.6  
10656.96    5100.2  
11903.62    5932    
12901.96    6491    
13811.54    8696.4  
14546.1     11424.8 
15194.12    12427.6 
15818.56    15096.2 
16246.32    18284.6 

LXR
Throughput  metered-p99
1966.96     3581.2  
3864.72     4457.6  
5683.22     1686.4  
7421.88     7354.2  
8957.3      7377.8  
10240.92    6580.4  
11316.18    5844.2  
12263.82    12238.4 
12922.76    10192.8 
13581.58    20549.4 
14144.34    13480.6 
14696.84    13267.8 
15098.82    17917.2 

LXR version

commits:
    mmtk-core: 4ab99bb8270
    mmtk-openjdk: cdbc8decffa
    openjdk: 7caf8f7d

wenyuzhao commented 8 months ago

Looks like LXR is enabled correctly. FYI, this commit (4ab99bb8270) requires enabling the mmtk/opt cargo feature; otherwise it will have performance issues.
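
A rough sketch of enabling it when building the binding crate with cargo (the OpenJDK binding's make-based build may pass features through its own variables instead):

# Hedged sketch: enable the opt feature of the mmtk dependency from the
# binding crate's build.
cargo build --release --features mmtk/opt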

I'm not sure how h2 throttling is enabled. H2 in dacapo is a fixed-workload benchmark, so I'm not sure whether scaling throughput like this works well; it may affect the metered latency calculation(?)