yahoo / storm-yarn

Storm-yarn enables Storm clusters to be deployed into machines managed by Hadoop YARN.

Storm supervisor configs overwrite each other #62

Open clockfly opened 10 years ago

clockfly commented 10 years ago

Hi,

When a supervisor container starts, it writes its config under /user/xx/.storm/appattempt_xx/conf. When multiple supervisors start, they all share the same path, so one supervisor may overwrite another's settings, which results in errors like the following.

2014-01-02 14:37:31,183 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id { app_attempt_id { application_id { id: 39 cluster_timestamp: 1388020955932 } attemptId: 1 } id: 3 } state: C_COMPLETE diagnostics: "Resource hdfs://IDHV22-01:8020/user/yarn/.storm/appattempt_1388020955932_0039_000001/conf changed on src filesystem (expected 1388644474340, was 1388644474889\n" exit_status: -1000

clockfly commented 10 years ago

Patch for this:

Index: StormAMRMClient.java
===================================================================
--- StormAMRMClient.java    (revision 23570)
+++ StormAMRMClient.java    (revision 23571)
@@ -182,8 +182,11 @@
               LocalResourceType.ARCHIVE, LocalResourceVisibility.APPLICATION));

     String appHome = Util.getApplicationHomeForId(appAttemptId.toString());
-    Path confDst = Util.createConfigurationFileInFs(fs, appHome,
+    String containerHome = appHome + Path.SEPARATOR + container.getId().getId();
+    
+    Path confDst = Util.createConfigurationFileInFs(fs, containerHome,
             this.storm_conf, this.hadoopConf);
+    
     localResources.put("conf", Util.newYarnAppResource(fs, confDst));

     launchContext.setLocalResources(localResources);
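The effect of the patch can be illustrated with a minimal standalone sketch (hypothetical class, not the actual storm-yarn code): appending the container id to the application home yields a distinct conf path per supervisor, so writes no longer collide.

```java
// Hypothetical illustration of the per-container path scheme from the
// patch above: each container gets its own conf directory under appHome.
public class ContainerConfPath {

    // Mirrors appHome + Path.SEPARATOR + container.getId().getId() + "/conf".
    public static String confPath(String appHome, int containerId) {
        return appHome + "/" + containerId + "/conf";
    }

    public static void main(String[] args) {
        String appHome = "/user/yarn/.storm/appattempt_1388020955932_0039_000001";
        // Containers 2 and 3 now write to different paths.
        System.out.println(confPath(appHome, 2));
        System.out.println(confPath(appHome, 3));
    }
}
```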
revans2 commented 10 years ago

The exact same config file should be output for each of the supervisors. This would only change if the AM/nimbus crashed and needed to be brought up on a new node, which should be handled by the app attempt id. Or if someone pushes new config values to the AM, which still needs some work so that the changes are reflected in the supervisors.

That is why it was made an info message and not an error or a warning. If you want to separate them out you can, but I would prefer moving the creation of the config file to when Nimbus comes up, and then it can be reused.

clockfly commented 10 years ago

Hi Bobby,

The key here is that the supervisor will fail to start! It is not about the content; it is about the timestamp.

Here is the sequence: the conf for supervisor A is written at time 100, then A is started. The conf for supervisor B is written at time 110, then B is started. Meanwhile, A is still starting...

When A tries to start, it fetches the conf from the distributed cache and expects the timestamp to be 100, but it actually finds 110, so A fails to download the resource and fails to start.
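The failure mode can be sketched as a small simulation (this is not Storm or YARN code; the names are hypothetical). The YARN localizer records a resource's modification time at container-submit time and rejects the download if the file's current timestamp no longer matches, which is exactly the "changed on src filesystem" error in the log above.

```java
import java.util.HashMap;
import java.util.Map;

// Simulation of the timestamp check performed when localizing a resource:
// the timestamp recorded at submit time must match the current mtime.
public class ConfTimestampRace {

    // Pretend HDFS: path -> modification time.
    static Map<String, Long> hdfs = new HashMap<>();

    // Writing the conf file bumps the path's modification time.
    static long writeConf(String path, long now) {
        hdfs.put(path, now);
        return now;
    }

    // Localization succeeds only if the recorded timestamp still matches.
    static boolean localize(String path, long expectedTs) {
        return hdfs.get(path) == expectedTs;
    }

    public static void main(String[] args) {
        String sharedPath = "/user/yarn/.storm/appattempt_x/conf";

        long tsA = writeConf(sharedPath, 100); // conf written for supervisor A
        long tsB = writeConf(sharedPath, 110); // conf overwritten for B

        // A expects mtime 100 but the file now has mtime 110 -> A fails.
        System.out.println("A localizes: " + localize(sharedPath, tsA));
        System.out.println("B localizes: " + localize(sharedPath, tsB));
    }
}
```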

clockfly commented 10 years ago

Before this fix, I had to start supervisors one by one. Trying to start multiple supervisors at once does not work: storm-yarn addSupervisors -appId=xx -supervisors=3

revans2 commented 10 years ago

I understand; that is why moving the creation of the file to happen only once would fix it: the timestamp would not change after that. If you want me to merge this in for now, I am fine with that. It just means we are now wasting a lot more HDFS namespace, which is a much smaller problem. I would just like another issue raised to try to reduce the namespace usage later on.
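The create-once alternative suggested here can be sketched roughly as follows (hypothetical names; `createConfInFs` stands in for `Util.createConfigurationFileInFs`). The conf path is created lazily on first use and cached, so every supervisor container reuses the same path with an unchanged timestamp.

```java
// Sketch of the "create the conf file once when nimbus comes up" approach:
// cache the path and timestamp so later containers see identical values.
public class ConfOnce {

    private String cachedConfPath = null;
    private long cachedTimestamp = -1;

    // Stand-in for writing the config file to HDFS and recording its mtime.
    private String createConfInFs(String appHome, long now) {
        cachedTimestamp = now;
        return appHome + "/conf";
    }

    // All containers get the same path; the timestamp never changes after
    // the first call, so localization cannot fail on a mismatch.
    public synchronized String getOrCreateConf(String appHome, long now) {
        if (cachedConfPath == null) {
            cachedConfPath = createConfInFs(appHome, now);
        }
        return cachedConfPath;
    }

    public long getTimestamp() {
        return cachedTimestamp;
    }
}
```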

clockfly commented 10 years ago

OK, I will create a pull request for this; let's merge this in and improve the file duplication later. Since we only create a copy when a supervisor starts, the problem should not be big, considering a cluster has only a limited number of supervisors.

And maybe we should clean up the folders on HDFS when shutting down the cluster: one folder for the master, one folder per supervisor, and then there are no problems anymore.
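A cleanup pass of that sort might look like the sketch below. It is illustrated against the local filesystem with `java.nio.file` so it is self-contained; the real storm-yarn code would instead call `org.apache.hadoop.fs.FileSystem.delete(path, true)` on the application's `.storm` directory at shutdown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical shutdown cleanup: recursively remove the per-application
// directory (conf copies for master and supervisors) once the cluster stops.
public class StormDirCleanup {

    public static void deleteAppDir(Path appDir) throws IOException {
        if (!Files.exists(appDir)) {
            return; // nothing to clean up
        }
        try (Stream<Path> walk = Files.walk(appDir)) {
            walk.sorted(Comparator.reverseOrder()) // children before parents
                .forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
        }
    }
}
```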
