xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

BUG: ConnectionError: Unable to connect to application when deploying on yarn #605

Closed. smartguo closed this issue 1 year ago.

smartguo commented 1 year ago

Describe the bug

When deploying on YARN, cluster creation fails with skein.exceptions.ConnectionError while new_cluster waits for the services to become ready.

To Reproduce

To help us to reproduce this bug, please provide information below:

  1. Your Python version: 3.9.12
  2. The version of Xorbits you use: 0.4.2
  3. Versions of crucial packages, such as numpy, scipy and pandas: pandas==1.4.2
  4. Full stack of the error: see "Error message" below.
  5. Minimized code to reproduce the error: see "Code" below.

Hadoop version: 3.2.2

Code

import os
from xorbits._mars.deploy.yarn import new_cluster
import xorbits.pandas as pd

os.environ['JAVA_HOME'] = '/usr/jdk64/jdk1.8.0_191'
os.environ['HADOOP_HOME'] = "/usr/local/service/hadoop"
os.environ['PATH'] = '/usr/local/service/hadoop:/usr/local/service/hadoop/bin:' + os.environ['PATH']
cluster = new_cluster(
    environment='hdfs:///python/senv/anaconda3.zip',
    supervisor_num=1,
    supervisor_cpu=1,
    supervisor_mem='4g',
    redirect=False,
    web_num=1,
    app_name="test-xorbits-deploy-on-yarn",
    app_queue="eng",
    worker_num=4,
    worker_cpu=1,
    worker_mem='4g',
    min_worker_num=2,
    timeout=6000,
    supervisor_extra_args='--log-level DEBUG',
    worker_extra_env={
        "MARS_USE_PROCESS_STAT": "1",
        'HADOOP_HOME': "/usr/local/service/hadoop"
    },
    supervisor_extra_env={
        "MARS_USE_PROCESS_STAT": "1",
    },
    worker_cache_mem='3g')
print(cluster.session.endpoint)
print(pd.DataFrame({'a': [1,2,3,4]}).sum())

Error message:

WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
23/07/19 13:54:04 INFO client.AHSProxy: Connecting to Application History server at /10.***.**.**:10200
23/07/19 13:54:04 INFO skein.Driver: Driver started, listening on 34513
23/07/19 13:54:05 INFO conf.Configuration: resource-types.xml not found
23/07/19 13:54:05 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
23/07/19 13:54:05 INFO skein.Driver: Uploading application resources to hdfs://HDFS***/user/***/.skein/application_16835364***_109379
23/07/19 13:54:05 INFO skein.Driver: Submitting application...
23/07/19 13:54:05 INFO impl.YarnClientImpl: Submitted application application_16835364***_109379
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
23/07/19 13:54:12 INFO client.AHSProxy: Connecting to Application History server at /10.***.**.**:10200
23/07/19 13:54:12 INFO skein.Driver: Driver started, listening on 37649
23/07/19 13:54:13 INFO impl.YarnClientImpl: Killed application application_16835364***_109379
Traceback (most recent call last):
  File "/home/***/xorbits/test.py", line 10, in <module>
    cluster = new_cluster(
  File "/opt/anaconda3/lib/python3.9/site-packages/xorbits/_mars/deploy/yarn/client.py", line 189, in new_cluster
    wait_services_ready(
  File "/opt/anaconda3/lib/python3.9/site-packages/xorbits/_mars/deploy/utils.py", line 42, in wait_services_ready
    readies[idx] = count_fun(selector)
  File "/opt/anaconda3/lib/python3.9/site-packages/xorbits/_mars/deploy/yarn/client.py", line 192, in <lambda>
    lambda svc: _get_ready_container_count(app_client, svc),
  File "/opt/anaconda3/lib/python3.9/site-packages/xorbits/_mars/deploy/yarn/client.py", line 64, in _get_ready_container_count
    c.yarn_container_id for c in app_client.get_containers([svc], ["RUNNING"])
  File "/opt/anaconda3/lib/python3.9/site-packages/skein/core.py", line 1090, in get_containers
    resp = self._call('getContainers', req)
  File "/opt/anaconda3/lib/python3.9/site-packages/skein/core.py", line 280, in _call
    raise ConnectionError("Unable to connect to %s" % self._server_name)
skein.exceptions.ConnectionError: Unable to connect to application
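
The traceback shows that the readiness poll (count_fun inside wait_services_ready) propagates the first ConnectionError it sees. The following is a minimal sketch, not part of Xorbits or skein, of how such a poll could be retried. It assumes the failure is transient (for example, the application master has not started listening yet), which this report does not confirm. The name poll_with_retry and its parameters are hypothetical; to use it with skein, the exception class to retry on (skein.exceptions.ConnectionError in the traceback) would be passed as retry_on.

```python
import time


def poll_with_retry(poll, retries=5, delay=1.0, retry_on=(ConnectionError,)):
    """Call poll() and return its result, retrying on the given exceptions.

    Hypothetical helper for illustration only. `poll` is any zero-argument
    callable, e.g. a lambda wrapping a container-count query.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        try:
            return poll()
        except retry_on:
            # Re-raise the original error once the retry budget is spent.
            if attempt == retries:
                raise
            time.sleep(delay)
```

If the application has actually failed on the YARN side, retrying will not help; in that case the error is simply re-raised after the last attempt, and the YARN application logs are the place to look.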

ChengjieLi28 commented 1 year ago

take