OPtimized Shuffle (OPS) is a distributed shuffle management system that optimizes the shuffle phase of DAG computing frameworks. The name comes from Ops, a fertility deity and earth goddess of Sabine origin in ancient Roman religion.
Figure 1: OPS Architecture
Figure 2: Internal View Comparison between Legacy Hadoop and Hadoop with OPS
Figure 3: Lifecycle of OPS
Figure 4: OPS Structure
- ops/
    - jobs/
        - job-${jobId}: JobConf
    - mapTaskFirstAlloc/
        - mapTaskFirstAlloc-${jobId}: MapTaskAlloc
    - mapTaskSecondAlloc/
        - mapTaskSecondAlloc-${jobId}: MapTaskAlloc
    - reduceTaskAlloc/
        - reduceTaskAlloc-${jobId}: ReduceTaskAlloc
    - shuffle/
        - reduceNum/
            - reduceNum-${nodeIp}-${jobId}-${reduceId}: num
        - mapCompleted/
            - mapCompleted-${nodeIp}-${jobId}-${mapId}: MapReport
        - shuffleCompleted/
            - shuffleCompleted-${dstNodeIp}-${jobId}-${num}-${mapId}: path
        - indexRecords/
            - indexRecords-${nodeIp}-${jobId}-${mapId}: CollectionConf
    - tasks/
        - reduceTasks/
            - reduceTask-${nodeIp}-${jobId}-${reduceId}: ReduceConf
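
The key layout above can be illustrated with a small helper that builds a few of the listed keys. This is a minimal sketch, not code from OPS: the class `OpsKeys` and its methods are hypothetical, and it assumes that the entries form a hierarchy rooted at `ops/`, that `shuffle/` and `tasks/` sit directly under that root, and that levels are joined with `/`. The sample IDs in `main` are made up.

```java
import java.util.StringJoiner;

/** Hypothetical helper (not part of OPS) that builds keys following the layout above. */
public final class OpsKeys {
    // Assumption: every key lives under a single "ops" root and levels are joined with "/".
    private static final String ROOT = "ops";

    private OpsKeys() {}

    /** Builds "ops/<parent>/<prefix>-<field>-<field>...". */
    private static String key(String parent, String prefix, String... fields) {
        StringJoiner leaf = new StringJoiner("-");
        leaf.add(prefix);
        for (String field : fields) {
            leaf.add(field);
        }
        return ROOT + "/" + parent + "/" + leaf;
    }

    /** jobs/job-${jobId} */
    public static String job(String jobId) {
        return key("jobs", "job", jobId);
    }

    /** shuffle/mapCompleted/mapCompleted-${nodeIp}-${jobId}-${mapId} */
    public static String mapCompleted(String nodeIp, String jobId, String mapId) {
        return key("shuffle/mapCompleted", "mapCompleted", nodeIp, jobId, mapId);
    }

    /** shuffle/shuffleCompleted/shuffleCompleted-${dstNodeIp}-${jobId}-${num}-${mapId} */
    public static String shuffleCompleted(String dstNodeIp, String jobId, int num, String mapId) {
        return key("shuffle/shuffleCompleted", "shuffleCompleted",
                dstNodeIp, jobId, String.valueOf(num), mapId);
    }

    /** tasks/reduceTasks/reduceTask-${nodeIp}-${jobId}-${reduceId} */
    public static String reduceTask(String nodeIp, String jobId, String reduceId) {
        return key("tasks/reduceTasks", "reduceTask", nodeIp, jobId, reduceId);
    }

    public static void main(String[] args) {
        // Sample values are illustrative only.
        System.out.println(job("job_1_0001"));
        System.out.println(mapCompleted("10.0.0.1", "job_1_0001", "m_0"));
        System.out.println(shuffleCompleted("10.0.0.2", "job_1_0001", 3, "m_0"));
        System.out.println(reduceTask("10.0.0.2", "job_1_0001", "r_0"));
    }
}
```

Because fields are joined with `-`, a scheme like this only stays unambiguous if the individual IDs contain no hyphens themselves.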
Using OPS requires modifying the DAG computing framework itself. So far, OPS is implemented only for Hadoop MapReduce; our customized Hadoop is available here. We believe the cost of enabling OPS on other DAG computing frameworks is also very low.
$ mvn clean install
OPS: Optimized Shuffle

Usage:
  ops.sh [command]

Commands:
  master    OpsMaster
  worker    OpsWorker

Options:
  -h, --help    Show usage

Use ops.sh [command] --help for more information about a command.