yncxcw / big-c

Apache License 2.0

Questions about the project's structure #7

Open xiechengsheng opened 7 years ago

xiechengsheng commented 7 years ago

Hi, Chen Wei, I have read your ATC paper and looked through this repository, and I have some questions about the project's structure.

  1. In my opinion, the folder PyDockerMonitor contains the core code for preemptively scheduling a Hadoop-on-Docker cluster without the YARN framework. That is to say, you wrote a set of tools to manage the hadoop-on-docker cluster's resources, and the object being scheduled is the Docker container rather than the Hadoop framework. Also, PyDockerMonitor can be run as a standalone project. Is my understanding right?
  2. The big-c project has many other folders (such as hadoop-hdfs-project, hadoop-project, etc.). Are they used to apply the PyDockerMonitor scheduling framework to YARN and improve YARN's DRF scheduler?

Thanks in advance:smile:~

yncxcw commented 7 years ago

Hi Xiecheng, sorry for the misunderstanding. PyDockerMonitor is just the first version of this project, in which I implemented it as a standalone service that communicates with YARN through RPC, as you thought. Later I found this solution frustrating and hard to debug, so I dropped that option.

My final implementation is purely in YARN and uses only Java; you can check my commit logs to see which parts of the YARN code I changed. Most of the code I implemented is in the NodeManager and the Capacity Scheduler.

Also, if you have just started working on the Hadoop ecosystem, note that the newly released Hadoop 3.0 has a completely new implementation of Docker management and is not compatible with Hadoop 2.7.

Wei Chen

xiechengsheng commented 7 years ago

Hi @yncxcw, sorry to disturb you again. I have another question about the application scenario of the big-c system. As many papers mention, long jobs consume most of a cluster's resources, and if short jobs are scheduled after long jobs, head-of-line blocking occurs. But I wonder whether the head-of-line blocking problem exists in real production clusters. As Alibaba's public cluster trace shows, resource utilization in real production clusters is often below 50%, so in my opinion more than half of the servers in a real production cluster have enough idle resources to execute short jobs, and we would not see head-of-line blocking there. What is your opinion on head-of-line blocking in real production clusters?

yncxcw commented 7 years ago

Hi, xiecheng

That's fine. I am happy to discuss research questions.

  1. Head-of-line blocking really does exist in some production systems, since small jobs are often blocked by long jobs. But it is more a matter of scheduling strategy: how to partition jobs into different categories, and how to isolate resources between long jobs and short jobs (for example, by configuring different queue quotas).

  2. The second situation, where cluster utilization is low, is a different problem. It occurs because the system needs to guarantee a very strict SLA for some high-priority workloads (like HBase or Solr), in which case plenty of resources must be kept in reserve for request-load fluctuation. That reserved portion is why up to 50% of the cluster's resources end up wasted.

Point 1 is a special case of point 2: to ensure the SLA of short jobs and avoid head-of-line blocking, the cluster needs to reserve part of its resources so that a burst of short jobs will not be blocked by long jobs.
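The queue-quota isolation mentioned above can be expressed directly in YARN's Capacity Scheduler configuration. A minimal sketch, assuming two hypothetical queues named `long` and `short` with an 80/20 split (the queue names and percentages are illustrative, not from big-c):

```xml
<!-- capacity-scheduler.xml sketch: isolate long and short jobs by quota -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>long,short</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.long.capacity</name>
  <value>80</value>
</property>
<property>
  <!-- cap long jobs so a short-job burst always finds headroom -->
  <name>yarn.scheduler.capacity.root.long.maximum-capacity</name>
  <value>80</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.short.capacity</name>
  <value>20</value>
</property>
```

The trade-off Wei describes follows directly: the 20% guaranteed to `short` is unavailable to `long` whenever `maximum-capacity` caps it, whether or not short jobs are actually using it.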

xiechengsheng commented 7 years ago

@yncxcw Thanks again for your detailed explanation.

  1. For reply 1, as you said, scheduling a burst of highly parallel short jobs puts heavy pressure on the cluster's primary scheduler. So there is a lot of work on increasing the number of schedulers in the cluster, or on scheduling short jobs preemptively ahead of long jobs, in order to decrease the scheduling delay for short jobs (e.g., Sparrow, Omega). The common motivation of these systems is to minimize short jobs' waiting time before they are dispatched to a destination machine.
  2. The big-c system, as far as I can see, is different from the above-mentioned systems. Its optimization takes effect after a short job is scheduled to a destination machine where long jobs are already running and consuming most of the resources: the node resource manager treats the newly arriving short job as higher priority and limits the resources that long jobs can use, so the destination machine has enough resources to execute the short job and meet its QoS requirements. Is my understanding right? If so, consider this scenario: more than half of the servers in a real production cluster have enough idle resources, so short jobs can be scheduled onto idle resources successfully and their QoS is guaranteed. In such a case, can the big-c system still gain better performance than the native YARN scheduler?

Thanks in advance:smile:~

yncxcw commented 7 years ago

Hi, Xiecheng

That's OK.

For 1, yes, the motivation of these projects is to minimize queueing delays, whether jobs queue at the master node (as in YARN) or at the slave nodes (as in Sparrow).

For 2, yes, our design goal is a mechanism that implements "preemption without killing", like a traditional OS. One thing to note here is that we preempt before the jobs are scheduled to target nodes, since the resource manager has the full picture of cluster utilization and can make the optimal decision.
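The "preemption without killing" idea can be sketched in a few lines. This is only an illustrative model (the class and method names are hypothetical, not big-c's API; the real mechanism lives in YARN's NodeManager, implemented in Java on top of container resource limits): a long job's container is throttled down rather than killed, and its grant is restored once the burst is over.

```python
# Sketch of "preemption without killing": instead of killing a long job's
# container, shrink its CPU share and later restore it. All names and
# numbers are illustrative; big-c's actual implementation is in YARN/Java.

class Container:
    def __init__(self, name, cpus):
        self.name = name
        self.cpus = cpus      # currently granted CPU cores
        self.granted = cpus   # original grant, remembered for restoration

    def preempt(self, cores):
        """Reclaim up to `cores` CPUs; the container keeps running, throttled."""
        taken = min(cores, self.cpus)
        self.cpus -= taken
        return taken

    def restore(self):
        """Give resources back once the short-job burst is over."""
        self.cpus = self.granted


long_job = Container("long-batch", cpus=8)
freed = long_job.preempt(6)   # a short-job burst needs 6 cores
print(freed, long_job.cpus)   # -> 6 2
long_job.restore()
print(long_job.cpus)          # -> 8
```

The key contrast with kill-based preemption is that the long job loses no completed work; it only runs slower while the short jobs hold its cores.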

Regarding your understanding of the real-cluster scenario: I think the 50% idleness comes from resource reservation, in which 50% of the resources are reserved for short jobs to guarantee their QoS, no matter what their real usage is. Sometimes the short jobs' usage is below 10%, but 50% of the cluster's resources still need to be reserved. That causes huge resource waste: the average usage of short jobs may be only 10%, reaching 50% only during occasional bursts. The goal of big-c is to overcome this situation, because big-c does not rely on resource reservation. When a burst of short jobs arrives, big-c preempts resources from long jobs, and it gives the preempted resources back to long jobs when the burst is over.
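The waste being described can be put in rough numbers. A back-of-envelope sketch using the illustrative figures from the discussion (50% reserved, 10% average short-job usage; all values are assumptions, not measurements from the paper):

```python
# Back-of-envelope comparison of static reservation vs. preemption.
cluster = 100      # total cores (illustrative)
reserved = 50      # cores statically set aside for short jobs
short_avg = 10     # cores short jobs actually use on average
short_peak = 50    # cores short jobs need during a burst

# Static reservation: long jobs may never touch the reserved half,
# so on average (reserved - short_avg) cores sit idle.
idle_reserved = reserved - short_avg
print(idle_reserved)   # -> 40 cores wasted on average

# Preemption (the big-c approach): long jobs can use the whole cluster
# between bursts; during a burst, short_peak cores are preempted from
# long jobs and returned afterwards.
long_usable_avg = cluster - short_avg
long_usable_burst = cluster - short_peak
print(long_usable_avg, long_usable_burst)   # -> 90 50
```

Under these assumed figures, preemption lets long jobs use 90 cores on average instead of 50, with the same 50-core headroom available to short jobs at peak.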

Wei

xiechengsheng commented 7 years ago

@yncxcw That's great, thanks for your patient explanation. So the big-c system doesn't need extra dedicated servers to execute short jobs, and the final goal of preemptive scheduling is to make full use of the cluster's resources and save cost. Your explanation cleared up my misunderstandings about this system.

yncxcw commented 7 years ago

That's OK~