yahoo / storm-yarn

Storm-yarn enables Storm clusters to be deployed into machines managed by Hadoop YARN.

DRPC servers on YARN #8

Open anfeng opened 11 years ago

anfeng commented 11 years ago

Currently, Storm-YARN launches the Nimbus, UI, and Supervisor servers. We should also enable DRPC servers to be launched if the community requests it.

revans2 commented 11 years ago

The issue with DRPC is that the clients and the topologies themselves need a way to reach the DRPC servers after they are launched. YARN does not have any service registration/virtual networking system yet. We could build a simple registration system using ZooKeeper, but that would require changes to the DRPC clients as well.

clockfly commented 11 years ago

I don't understand why changes to the DRPC client would be required.

  1. After starting a Storm cluster, we can use "storm-yarn getStormConfig" to get storm.yaml.
  2. Inside storm.yaml there are the "nimbus.host" address and "drpc.port".
  3. We can assume the "nimbus.host" address is the same as the DRPC server address.
  4. We can use that DRPC server address and DRPC port to submit a DRPC topology and issue DRPCClient requests.

Could you clarify?

revans2 commented 11 years ago

Ya that seems totally wrong. Is that on the wiki or something so I can correct it?

DRPC is not really a part of Storm on YARN yet because DRPC needs to be in a place that external services can easily reach. YARN does not have a service registry of any kind yet that would allow external DRPC clients to find the servers. So the correct way to use DRPC with Storm on YARN is to have a number of DRPC servers already launched. When you launch a Storm cluster, you include the addresses of these external DRPC servers in the config; you may also need to include them in the config when you launch a topology. The DRPC spouts reach out and connect to the servers to pull data down, so they need to know where to go to get the data.
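A minimal sketch of that configuration step, assuming the standard Storm settings `drpc.servers` and `drpc.port` (3772 is Storm's default DRPC port). The helper name `with_external_drpc` and the host names are hypothetical, and how the resulting config is handed to storm-yarn is not shown here.

```python
def with_external_drpc(storm_conf, drpc_hosts, drpc_port=3772):
    """Return a copy of storm_conf that points at pre-launched DRPC servers."""
    if not drpc_hosts:
        raise ValueError("at least one external DRPC server is required")
    conf = dict(storm_conf)                  # leave the caller's config untouched
    conf["drpc.servers"] = list(drpc_hosts)  # hosts the DRPC spouts will connect to
    conf["drpc.port"] = drpc_port
    return conf


# Hypothetical host names, for illustration only.
base = {"nimbus.host": "nimbus.example.com"}
conf = with_external_drpc(base, ["drpc1.example.com", "drpc2.example.com"])
```

The same two keys would go into the cluster config and, if needed, into the topology config at submit time.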

The thing to be aware of here is that DRPC servers are not designed to be shared between several different Storm clusters. It should not be a problem because they are essentially stateless. You just have to be careful that each topology has a unique function name. You have to be sure of that now, but only within a single cluster; if you are using shared DRPC servers, you have to be sure of that across all clusters.
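One way to check the uniqueness requirement before pointing several clusters at the same DRPC servers is sketched below. This is only an illustration: the function `find_function_collisions`, the cluster names, and the DRPC function names are all hypothetical, and nothing like it ships with Storm or storm-yarn.

```python
from collections import defaultdict


def find_function_collisions(functions_by_cluster):
    """Map each DRPC function name used by more than one cluster to those clusters."""
    owners = defaultdict(set)
    for cluster, functions in functions_by_cluster.items():
        for name in functions:
            owners[name].add(cluster)
    return {name: sorted(clusters)
            for name, clusters in owners.items()
            if len(clusters) > 1}


# Hypothetical clusters and function names.
collisions = find_function_collisions({
    "cluster-a": ["reach", "exclaim"],
    "cluster-b": ["exclaim", "wordcount"],
})
```

An empty result means the function names are unique across all listed clusters.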

--Bobby


clockfly commented 11 years ago

> The thing to be aware of here is that DRPC servers are **not designed to be shared between several different storm clusters**.

The DRPC server is stateful: there is a map container in the server. Requiring unique function names across all clusters is too strong an assumption. So wouldn't the DRPC server be better located inside each Storm cluster?

> DRPC is not really a part of storm on YARN yet because DRPC needs to be in a place that **external services can easily get to**

Maybe we can design another layer, a gateway server, which forwards requests to the DRPC server and relays the responses? The gateway server is visible from outside and can be shared by all clusters.

In this case:

  1. There will be multiple Storm clusters in YARN. In each cluster, the DRPC server is on the same host as the App Master.
  2. The gateway server can be designed to be stateless, as all state is stored in the upstream DRPC server.
  3. The gateway server can be implemented as a Thrift/REST server.
  4. The gateway accepts a tuple <"yarn application Id", "function Id", "parameter"> as a request. It looks up the host name of the DRPC server by the "yarn application Id".

Maybe this is a much cleaner approach.
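The lookup in step 4 could be sketched roughly as follows. Everything here is an assumption for illustration: the class `DrpcGateway`, the in-memory registry, the application ids, and the injected `send` callable that stands in for the actual Thrift/REST call to a DRPC server.

```python
class DrpcGateway:
    """Stateless gateway: routes a request to the DRPC server of one Storm cluster."""

    def __init__(self, registry, send):
        self._registry = registry  # YARN application id -> (host, port) of its DRPC server
        self._send = send          # callable(host, port, function_id, parameter) -> result

    def handle(self, request):
        app_id, function_id, parameter = request
        try:
            host, port = self._registry[app_id]
        except KeyError:
            raise LookupError(f"no DRPC server registered for {app_id}") from None
        # The gateway keeps no per-request state; the DRPC server does.
        return self._send(host, port, function_id, parameter)


# Stand-in transport that only echoes where the request would have gone.
def fake_send(host, port, function_id, parameter):
    return f"{host}:{port}/{function_id}({parameter})"


gateway = DrpcGateway({"application_0001": ("am1.example.com", 3772)}, fake_send)
reply = gateway.handle(("application_0001", "exclaim", "hello"))
```

Because function names are resolved per application id, two clusters can use the same function name without colliding.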

revans2 commented 11 years ago

That does sound interesting, but I would like to see how things play out in YARN too. There has been some discussion about a service registry on the long lived applications JIRA YARN-896. It would not be too difficult to have a way to discover where the DRPC servers are located for a given storm cluster. It is mostly a matter of getting something like that into YARN, and then updating the clients to use it to find the DRPC servers.

Alternatively we could update the storm on YARN App Master to play that role for the time being. It could provide an API that could be queried to see where the DRPC servers are located. Then the client could cache this information and refresh it periodically or when an error occurs. This might be a good intermediate step as the YARN work may take a while.
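A rough sketch of the client-side caching described here, under the assumption that the App Master exposes some way to list DRPC servers. That API does not exist yet, so it is represented by a hypothetical `fetch` callable; the class name, TTL, and retry policy are likewise illustrative.

```python
import time


class DrpcLocationCache:
    """Caches DRPC server locations; refreshes on expiry or after a connection error."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch        # hypothetical call to the App Master
        self._ttl = ttl_seconds
        self._clock = clock
        self._servers = None
        self._fetched_at = 0.0

    def servers(self):
        now = self._clock()
        if self._servers is None or now - self._fetched_at >= self._ttl:
            self._servers = list(self._fetch())
            self._fetched_at = now
        return self._servers

    def invalidate(self):
        self._servers = None

    def call(self, invoke):
        """Run invoke(servers); on a connection error, refresh once and retry."""
        try:
            return invoke(self.servers())
        except ConnectionError:
            self.invalidate()
            return invoke(self.servers())
```

A client would wrap each DRPC request in `cache.call(...)`, so a moved DRPC server is picked up on the next error or when the TTL expires.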

clockfly commented 11 years ago

But DRPC servers are not supposed to be shared by different clusters, as there is state in the DRPC server. How do we handle this with a single DRPC server? If there are multiple DRPC servers in the YARN cluster, then maybe another broker layer is needed to manage them; otherwise the client code needs to connect to multiple DRPC servers directly.

Or you can modify the existing Storm DRPC server code so that a single DRPC server can manage different Storm clusters.

Currently the DRPC server works as follows:

  1. First, the client pushes the request to the DRPC server.
  2. The DRPC server records the request in a state hash map.
  3. The DRPCSpout polls the DRPC server to find a request.
  4. After the transaction query finishes, the topology looks up the request id in the DRPC server state map and updates the map value.
  5. The result is returned to the user who initiated the request.
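The steps above can be sketched as a small in-memory model. This is a simplification and not the real Storm DRPC server: the class and method names are hypothetical, and the real server blocks the client until the result arrives, which is omitted here.

```python
import itertools


class InMemoryDrpcServer:
    """Toy model of the per-request state held by a DRPC server."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}  # function name -> list of (request id, args)
        self._results = {}  # request id -> result (None until the topology answers)

    def submit(self, function, args):
        """Steps 1-2: the client pushes a request; the server records it."""
        request_id = next(self._ids)
        self._results[request_id] = None
        self._pending.setdefault(function, []).append((request_id, args))
        return request_id

    def fetch_request(self, function):
        """Step 3: the spout polls for the next request of its function."""
        queue = self._pending.get(function)
        return queue.pop(0) if queue else None

    def report_result(self, request_id, result):
        """Step 4: the topology updates the map entry for the request id."""
        if request_id not in self._results:
            raise KeyError(request_id)
        self._results[request_id] = result

    def take_result(self, request_id):
        """Step 5: the result is handed back to the caller and the state dropped."""
        return self._results.pop(request_id)


server = InMemoryDrpcServer()
rid = server.submit("exclaim", "hello")
request_id, args = server.fetch_request("exclaim")
server.report_result(request_id, args + "!")
result = server.take_result(rid)
```

The two dictionaries are the state in question: requests are keyed by function name, which is why two clusters sharing one server must not reuse a function name.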
