Auto-MAP: A DQN Framework for Exploring

Distributed Execution Plans for DNN Workloads

Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng,

LanSong Diao, Guoping Long, Jun Yang, Xiaoyong Liu, Wei Lin

Alibaba Group

{siyu.wsy, rongyi.ry, shiqing.fsq, james.zz,

lansong.dls, guopinglong.lgp, muzhuo.yj, xiaoyong.liu, weilin.lw}@alibaba-inc.com

ABSTRACT

e last decade has witnessed growth in the computational require-

ments for training deep neural networks. Current approaches (e.g.,

data/model parallelism, pipeline parallelism) parallelize training

tasks onto multiple devices. However, these approaches always rely

on specic deep learning frameworks and requires elaborate man-

ual design, which make it dicult to maintain and share between

dierent type of models. In this paper, we propose Auto-MAP,

a framework for exploring distributed execution plans for DNN

workloads, which can automatically discovering fast paralleliza-

tion strategies through reinforcement learning on IR level of deep

learning models. Ecient exploration remains a major challenge

for reinforcement learning. We leverage DQN with task-specic

pruning strategies to help eciently explore the search space in-

cluding optimized strategies. Our evaluation shows that Auto-MAP

can nd the optimal solution in two hours, while achieving beer

throughput on several NLP and convolution models.

KEYWORDS

deep learning, data parallelism, pipeline parallelism, model paral-

lelism, DQN algorithm, HLO IR.

1 INTRODUCTION

Deep learning (DL) models has become increasingly complicated

in articial intelligence (AI) community to acquire beer accuracy.

Training deep models is extremely both time and resources con-

suming. Distributed training with multiple devices is a irreversible

trend for training especially for large models. [

]. To harness

computing power to achieve beer throughput, a critical challenge

is how to map diversied workloads to hardware accelerators auto-

matically and eciently.

Existing solutions like data parallelism, model parallelism and

pipeline parallelism make trade-os between computation, com-

munication, and development eciency. Data Parallelism (DP) is

workload-neutral to models that could be t into single device while

facing the problem of memory footprint pressure for large models.

Model parallelism (MP) [

–

] and pipeline parallelism (PP) [

]

are eective way for alleviating the memory issue of large models,

which split the model among processes, in vertical and horizontal

way respectively. But the experts experiences are required to de-

sign a specic strategy to fully utilize hardware under the limited

computation resources.

Some previous works[

–

] oriented to exploring distributed

plans which combine dynamic programming and heuristic methods

have been proposed as promising approaches for training complex

models. But these approaches are designed for specic category

of parallelism strategy. [

] aims to nding the best solutions of

PP in a synchronous way and [

] tries to search the OPP.

is leads to the limited scenarios mainly because the optimal

strategies for diversied workloads are very dierent, and all these

planners do not contain solution space of DP, operator partitioning

parallelism (OPP) and PP at the same time. e heuristic method in

[

] results in the lack of generalization in NLP models. Another

issue is that the planners are coupled with the specic APIs and

deep learning frameworksso that they only take eect in a limited

usage. Moreover, the coarse granularity exploration on layer- or

operator-level loses the potentials for beer solutions. Moreover,

to integrate the planner to other frameworks is unfeasible in real-

word.

Recently, a trend of machine learning oriented approaches to

optimize performance of systems has been receiving increasingly

aention in AI research community. [

] adopts reinforcement

learning to learn the proper parallelism strategies, which inspired

researchers to use learning approaches to extract features of deep

models. However, it only search for simple model parallelism strat-

egy without OPP and PP solution space for the given workload and

clusters at the expense of huge time- and resources-consuming,

which leads to non-applicable in industry.

Above all, we conclude the following deciencies of these ap-

proaches: (1) Limited applicable scenarios, which lacks the coverage

of convolution, language model (LM), search/recommendation mod-

els at the same time. (2) Limited parallelism scenarios. None of

these eorts have achieved the support of model parallelism, data

parallelism, and pipeline parallelism on a unied computing layer

(e.g, TF graph). (3) Inevitable code intrusion. e planners only

take eect when specic APIs are called. It fails to to shield users

from low-level distributed details.

We propose Auto-MAP, a unied framework for exploring dis-

tributed execution plans, which works on HLO IR via DQN method

for DNN workloads.

Auto-MAP works on HLO IR instead of operators or layers. HLO

IR is an intermediate representation which produced by XLA (Ac-

celerated Linear Algebra) from TensorFlow framework, which de-

scribes the entire training task with more general and expressive

computation Instructions instead of the operation like GraphDef in

TensorFlow. Each instruction contains all necessary information for

computation. Some extra information such as the corresponding

operator name it belongs to is also recorded. ere are two reasons

for choosing HLO IR as the operational level of Auto-MAP. One

is that to explore distributed plans on HLO IR can achieve beer

performance benets from its ner granularity than operators. e

arXiv:2007.04069v1 [cs.DC] 8 Jul 2020

other is XLA can exist independently from TensorFlow and it has

the ability to support other front ends like Jax[

] and Trax[

which leads to no invasion to user codes.

Figure 1 gives the high-level design of TF’XLA compiler. As

the gure shows, the XLA compiler compiles a TF graph (an ML

network in TF) into executable machine code through a sequence

of stages. e TF graph is rst transformed into HLO IR by a

front-end (e.g., the

xla.compile

API[

]). Optimizations, such as

operator fusion and common-subexpression elimination [

] are

performed on HLO before the graph is transformed into a lower-

level representation for a target hardware architecture.

Deep Q network (DQN) is a reinforcement learning (RL) is the

approach to teach machines to interact with the environments and

receive rewards for performing the right actions until they suc-

cessfully meet their goals. It is adopted in Auto-MAP to learn the

features of deep models and provide workload-neutral distributed

plans on given computation resources. It should be noted that the

solution space is still huge even with DQN. erefore, some heuris-

tic pruning methods is also integrated in our approach. As far as we

know, there is no previous work focusing on exploring strategies in-

cluding the three category of parallelism simultaneously mentioned

above with DQN.

As shown in gure 2, Auto-MAP performs distributed plans ex-

ploration at

HLO

layer. Compared with previous approach, this has

the following advantages: (1) Free user code intrusion. e user only

needs to provide a single-device model and the distributed details

generated by our Auto-MAP framework are absolutely shielded.

(1) Rich and unied parallelism and application scenarios. Unify

DP/MP/PP for CNN/LM/Recommendation models. (3) Diverse pro-

gramming abstractions over HLO IR. Popular AI frameworks such

Tensorow, PyTorch[

], Flax[

]/Trax[

] can all map to

HLO

layer. In this work, we leverage DQN algorithm [

] to automati-

cally explore the search space of operator partitioning parallelism,

auto data parallelism and pipeline parallelism over

HLO

with device

and network interconnect topology specied.

In this paper, we focus on solving the two main challenges of dis-

tributing diverse and complex models to distributed heterogeneous

hardware platforms: leverage DQN algorithm to build a search

space including optimized strategies over

HLO IR

, and leverage

task-specic pruning method for more eciently exploration of

search space.

To summarize, our contributions are:

(1)

We propose a unied framework named Auto-MAP for

three typical parallelism strategies (i.e., operation parti-

tioning, auto data parallel and pipeline) and two typical

model types (i.e., CNN and language models);

(2)

We leverage DQN with task-specic pruning strategies to

help eciently explore the search space including opti-

mized strategies;

(3)

We fully simplies the burden of users in the selection and

implementation of distributed execution plans. With our

framework, users only need to provide a single-card graph,

and our framework automatically explores the distributed

execution plans that is compatible with the hardware com-

puting power/interconnection topology;

(4)

We show that our framework can nd the optimal solution

in a limited time compared to enumeration.

TensorFlow Graph

xla.compile

HLO IR

XLA Backend

Figure 1: Illustration of the high-level design of Tensor-

ow’s XLA compiler.

Python Code

GraphDef

HLO IR

XLA Backend

LLVM

Python Code

GraphDef

HLO IR

XLA Backend

LLVM

Parallelism

Exploration

Figure 2: Illustration of our approach over TF XLA com-

piler’s work.

2 PROBLEM FORMULATION AND

PRELIMINARIES

Data and model parallelism

have been widely used by existing

deep learning frameworks to distribute the models across devices.

Data parallelism is parallelization across multiple devices in paral-

lel computing environments, which allows to operate on the data

in parallel. For large models which cannot t on single device,

model parallelism turns out to be a good choice. Model parallelism

(MP) [

] partitions a DNN into disjoint subsets and trains each

subset on a dedicated device, which reduces communication costs

for synchronizing network parameters in a DNN but exposes lim-

ited parallelism as well as extra communication between model

partitions.

Pipeline parallelism

(PP) [

] goes beyond DP and MP,

mixing inter-batch and intra-batch parallelism. In pipeline scheme,

Input

Output

(a)

(b)

(1) Data Parallel

(2) Pipeline Parallel

(4) Hybrid of above

three approaches.

(3) Model Parallel

Figure 3: Typical types of parallelism.

one or more consecutive layers are grouped into stages and pro-

cessed with separate GPU(s), and both the forward pass and back-

ward pass of all the layers are scheduled in one stage.

Planner

[

] in PP is responsible for cuing model layers into stages and this

approach improves device utilization through pipelining multiple

micro-batches. Figure 3 shows the schematic of those three parallel

strategies.

Deep RL

has been proven to be successful with Deep Q-Learning

(DQN)[

] introducing the idea of using neural networks as a Q-

function approximator.

Rainbow DQN

[

] combining improve-

ments in deep RL, and has been shown to be promising for further

improvements of deep RL agents in benchmark environments. Al-

though not so straightforward, We try to leverage rainbow agent to

assist the

automatic

search of massive distributed strategies space.

e Rainb ow agent.

Following the methodology from [

we extend the DQN algorithm with prioritized experience replay,

double DQN, and dueling network architecture[

]. Furthermore

in contrast to [

], we apply the following changes to successfully

train the Rainbow agent: (1) we discard the noisy linear layers [

relying on

-greedy exploration instead. Since the agent was already

required to learn environmental noises from the user simulator, a

possible explanation could be that the inclusion of a second noise

distribution might have been too dicult to learn. (2) We adjust

the number of DNN layers for dierent tasks. As the greater the

number of layers, the stronger the network learning ability. Figure

4 shows the workow of our leveraged DQN method.

Problem formulation for DQN algorithm on HLO IR.

For-

mally, we dene our learning task as follows. In reinforcement

learning, the sequential decision-making problem is modeled us-

ing the Markov Decision Process formulation dened by the tuple

< S, A, R, S

. For any Q-learning task, we need to dene the

following ve aspects: state space, actions, rewards, policy and

termination.

We will illustrate our framework solving three optimization

problems over directed HLO graphs. Let

G(V , E)

denotes a directed

HLO graph, where

is the set of nodes,

the set of edges. In

our seings, each HLO instruction refers to one node of

and the

data-ow between producer instruction and consumer instruction

refers to the corresponding edge of

. Specially, we refer the nodes

with no inputs as

SourceNodes

, those nodes with no outputs as

SinkNodes

and the others as

ComputeNodes

. Give device topology

D, these optimization problems are:

Environment

Q Network

Target

Q Network

DQN Loss

Replay

Memory

Store (s, a, r, s’)

argmax

Q(s,a;θ)

Copy Every

N updates

(s,a)

Gradient

wrt loss

Q(s,a;θ)

max

a’

Q(s’,a’;θ

)

s’

Reward

Function

State

Transition

Function

Figure 4: Workow of leveraged DQN method.

• Auto Data Parallelism (ADP)

: Given a graph

, nd a

subset of dimensions of

Sourcenodes

from

such that com-

munication overhead of the propagation graph from the

selected slicing dimension of

SourceNodes

SinkNodes

is minimized.

• Operator Partitioning Parallelism (OPP)

: Given a graph

, nd a slicing strategy of all dimensions of all

trainable variables

, such that the average device utilization is maximized.

• Pipeline Parallelism (PP)

: Given a graph

and the num-

ber of stages (

) expected to be split, nd a subset of nodes

S ⊆ V

such that the pipeline length with cross-stage com-

munication overlap considered is minimized.

3 AUTO-MAP APPROACH

1. xla.compile

User Input

Data Parallelism

Explorer

Operation Partitioning

Explorer

Pipelined Explorer

2. Plans Explorer

computation

resources

Available

Strategies

HLO

deep model

Figure 5: Workow of our approach.

3.1 Exploration Workow

In order to decouple the distributed plans from the APIs of spe-

cic deep learning framework, the exploration module should be

constructed on an intermediate representation layer designed for

describing computation ow of deep learning task. Specically,

we build our search algorithm over HLO borrowed from Tensor-

Flow XLA. Figure 5 shows the workow of our approach. Taken

deep models wrien by any framework (e.g. TensorFlow, PyTorch,

MXNet[

]), the XLA compiles and transfers the original exible

computation graph into HLO IR. e Plans Explorer will search

three dierent categories of plans including data parallelism, op-

erator partitioning parallelism and pipeline parallelism over HLO

based on given computation resources.

For pipeline parallelism, we only do cut on the forward com-

putation subgraph in HLO, which can be detected by the meta

information of each instruction. Both the online inference and on-

line training approach are provided respectively to explore pipeline

parallelism. Users need to specify the number of stages in advance

for both approaches. Finally, the workow produces the best one

among all available candidate plans.

We explore these three dierent categories of plans separately.

To cope with the huge solution space and provide totally work-

load neutral plan, we use DQN approach combined with heuristic

pruning instead of ordinary heuristic algorithms to search. For a

specied workload, the corresponding solution would be found

during training stage or inferred from models that has been trained

oine. To adapt to the reinforcement learning training ow, state,

action and reward should be carefully designed according to their

objectives. We briey introduce our approach in the following

subsections, and the details of design and implementations will be

discussed in 4.

3.2 Operator Partitioning Parallelism

3.2.1 DQN flow Setup. Since the trend goes to increase the

size of deep learning models, the on-device memory is a scarce

resource for training tasks. Fortunately, the memory issue can

be alleviated through model parallelism. In practice, an eective

way to parallelize deep models is to partition operators, which

not only alleviates the memory pressure but also parallelizes the

computations. With operator partitioning, the saved memory can

be used for injecting larger batch size to improve the GPU cluster

utilization.

Each instruction produces a unique variable in HLO. erefore,

to partition operators is identical to partition variables of each in-

struction. e derivation rules of each instruction are designed

carefully for inferring partitioning decisions of unknown variables

or parameters from the known ones. Obviously, some partition-

ing plans are invalid because some derivation rules are violated.

is can only be detected during the procedure called propagation,

which performs derivation rules for each instruction when given

the known partitioning decisions of variables or parameters. e

propagation terminates when encountering the following three

situations. (1) ere is no enough information to derive the remains

variables. (2) A conict case is encountered for the violation of

derivation rules. (3) All variables have been inferred without any

conict.

We agree that only trainable variables which respect to model

parameters may be partitioned in our approach. We also set the

heuristic objective for operator partitioning parallelism to partition

trainable variables as much as possible.

In Auto-MAP, since each trainable variable may has dierent

dimension size, we make decisions for each dimension of each

trainable variable about whether to be replicated or partitioned

across all devices. ese dimension status of all trainable variables

are viewed as one strategy and the feasibility need to be veried by

doing propagation in HLO.

a (0)

c (0)

d (0)

b (-1)

e (0)

a (0)

d (0)

%a = parameter()

%b = constant

%c = dot(%a, %b)

%d = constant()

%e = add(%c, %d)

%f = broadcast(%e)

Extract

Figure 6: Linkage group example.

State and Action

. We dene state as one dimension vector

which concatenates all dimensions partition status for trainable

variables and there are three possible values at each position. And

the action is a binary ag which is True for partitioning across all

devices and False for replicating among all devices.

Reward

. According to the objective mentioned above, we en-

courage partitioning by giving higher reward than replicating and

punish the conict case by giving negative reward.

3.2.2 Linkage Group. e search space of operator partitioning

is so huge that even DQN requires lots of time based on the above

setup, thus we introduce an heuristic pruning technique called link-

age groups. Linkage group exists for each trainable variable, which

records the deterministic partitioning decisions for other trainable

variables caused by itself. Figure 6 illustrates the concept of link-

age group. When the partition status of one dimension has been

decided, the linkage group will be detected that whether current di-

mension with its partition exists. All the deterministic decisions of

caused by current decision should be inferred via linkage group so

that the search process can be greatly pruned to avoid unnecessary

exploration.

Due to the termination conditions of propagation mentioned

above, the linkage group does contain only parts of partitioning

decisions of other trainable variables. at is because the propaga-

tion procedure driven by one trainable variable with its decision

always stops early when no enough information is given. However,

larger linkage groups always perform beer pruning eect.

3.3 Auto Data Parallelism

Implementing data parallelism over HLO is not intuitive because

variables representing the input batch size cannot be easily identi-

ed. It is observed that the batch size dimensions will follow the

data ow throughout the entire model. As a result, most variables

expected to be inuenced when the partition happens on the batch

size dimension. With the help of propagation procedure, the vari-

ables represented training data and labels with their batch-size

dimensions can be easily detected.

More formally, the objective is to nd the partition strategy for

all input tensors, which results in the largest number of tensors

to be partitioned. Moreover, the more tensors to be partitioned

on the input tensor under the propagation rule, the closer to our

objective. In Auto-MAP, the action and reward is almost the same

compared to the operator partitioning task except the state. Speci-

cally, we dene state as one dimension vector which concatenates

all dimensions partition status for all input tensors.

3.4 Pipeline Parallelism Exploration by Online

Training

ere are two key issues in pipeline partitioning. One is to cut the

model into multiple stages, and the other is to place them onto a

given GPU cluster. In industry, GPU clusters are always hierarchi-

cal which has relatively higher communication bandwidth within

each node than across nodes[

]. In Auto-MAP, we highlight that

the exploration should be performed only on HLO. e main idea

is that the distributed plan should allocate computation resources

according to the computation ratio among all stages and the stage

that allocated with more than one devices should be replicated. Fig-

ure 7 shows the common mapping from HLO to devices. Stage 0 is

assigned with two devices with NVLink connection so that the gra-

dients reduction could achieve good performance with NCCL[

e activation between stage 0 and stage 1 are transmied via

Ethernet.

Softmax

Encoder Layer 4

Encoder Layer 3

Encoder Layer 2

Encoder Layer 1

Encoder Layer 0

Embedding

Stage 0

Stage 133.3%

66.7%

GPU 0 GPU 1

NVLink

GPU 2

Network

Three devices clusterModel

Computation

Ratio

Ring AllReduce for

gradients reduction

Figure 7: Cuts mapping from HLO to devices.

State and Action

. Pipeline length is an eective way to estimate

its performance and is inuenced by the activation size across stage,

the computation time and gradients all-reduce time in each stage. In

Auto-MAP, we pre-compute these features at every possible pivot

and encode them into one vector before applying the nal cuts at

the current step. e action outputs the pivot at each step. If one

cut has been applied on HLO, the model will be further split into

two stages and we limit the next cuing point should not happened

at the previous stage.

reward

. For a pipeline model, we can calculate pipeline length

to estimate performance. In this case, we use

√

as reward for

the higher performance could be achieved when L is shorter.

3.5 Pipeline Parallelism by Online Inference

3.5.1 Motivation. For Pipeline Parallelism Planning, we also

present an alternative approach for a faster and generalizable way

of inferencing an optimal hybrid parallelism strategy. is would

allow us to train at a large generated dataset and inference on a

real-world model.

In order to nd the optimal partitioning solution that yields max-

imal performance under our pipeline parallelism strategy, we need

to 1) partition the model into dierent stages, 2) decide the replica-

tion factor for each stage, and lastly 3) map the stages to underlying

hardware devices. In the following section, we will formulate our

pipeline partitioning problem into a pure mathematical problem

whose data can be randomly generated.

3.5.2 Problem Formulation. e original problem states: Given

a/an HLO module

, and the number of stages

, nd the optimal

pipeline partition

that minimizes the end-to-end time of one-batch

L with pipeline parallelism.

And with our proler, we can get the per-instruction perfor-

mance data

, which represents the execution time on a given

proling device for each instruction in milliseconds. For communi-

cation, we use our DFG analyzer on

to calculate the parameter

sizes

later used for allreduce time calculation, and activation

communication

for each instruction if we partition the model at

that specic instruction.

So this problem is now equivalent to: given three arrays of

proling data of an HLO model

and

, each of length the

number of instruction in the original model

, nd a partition

that minimizes the end-to-end time

, which we can calculate with

our value function: L = V (P | C, A,W ).

Since the number of instructions would certainly vary between

models, and their proling data might not even be close, we pro-

ceed with a round of data normalization described in the following

section to ensure the training data has a consistent size and a

reasonably close measure. And this this problem is now a array

partitioning problem irrelevant to the input model, and the three

arrays C, A, W can be generated on large scales.

Our rst approach presented above uses DQN to search through

the solution space of

for proling data generated by each given

model

. is approach tries to train our DQN with generated data

for this abstract array partitioning problem that could apply to real

models upon inference.

3.5.3 DQN Workflow.

State and Action

. We use the three

performance metrics mentioned above (

and

), and process

the data along with device topology metrics to form the nal state

representation. e data processing will be detailed in section 4.

Reward

. Since we want to minimize the time of completing one

global batch, we use 1/L as our reward.

Training and inference

. First we use the data generation

method detailed above to generate the training dataset, which will

be then used to create a large number of environments ready to be

interacted with. During the training process, since each environ-

ment represents a distribution of performance data, we will restrict

the number of interactions with one environment to a very small

number. In practice, we set each environment to be explored and

exploited 50 times.

For testing, we used a freshly generated environment that is

not in the training set, and evaluate its performance by leing the

network inference the best partitioning solution, and assess its

performance with our value function.

1 0 -1 -1 -1 -1

var 1 var 2 var 3 var 4

partitioned

replicated

-1

undecided

current

Figure 8: State representation in operator partitioning task.

For real-world model inference, we do the same data pre-processing

described in section 4, and output the best network inference result.

4 IMPLEMENTATION

4.1 Overview

All distributed execution plans can be unied into the same DQN

workow. In Auto-MAP, we select RAINBOW[

] as the DQN

framework built on PyTorch to go parallel search for all three cate-

gory of strategies. We leverage cost model to estimate the perfor-

mance of dierent plans so that the workow can produce the best

one among all candidates.

e key issues of DQN workow for dierent scenarios are

environment, state, action and reward. We introduce our imple-

mentation of those for operator partitioning parallelism, auto data

parallelism and pipeline parallelism, respectively.

4.2 Operator Partitioning Parallelism

State and Action

. In our current implementation, the state con-

tains a decision vector and acurrent position. Figure 8 shows the

representation of decision vector. All dimensions of trainable vari-

ables are concatenated into an one dimensional vector. e 1, 0, -1

stands for partitioned, replicated and undecided status, respectively.

e information of current position is an integer which indicates

the index in the decision vector that will be decided in the next

step.

Initially, the decision vector is lled with all -1, which means no

dimension is decided. en, each dimension will be decided step

by step in one episode until encountering an propagation conict

or all dimensions status have been decided safely. Figure 9 shows

one complete episode.

e action is implemented as a binary value, which the positive

and negative represent to partition and to replicate, respectively,

and the decision result will take eect on the current position.

When all dimensions of one variable are marked with -1, it means

this variable should be replicated across all devices.

Reward

We assign +0.4 and +0.1 reward to the case of partition-

ing and replication. A -1 reward will be given as the punishment

when the conict case is encountered caused by propagation in the

entire HLO, while terminating current episode.

-1 -1 -1 -1 -1 -1

1 -1 -1 -1 -1 -1

1 0 -1 -1 -1 -1

1 0 0 -1 -1 -1

1 0 0 1 -1 -1

1 0 0 1 1 -1

1 0 0 1 1 0

var 1 var 2 var 3 var 4

Figure 9: Partitioning variables in one episode.

Linkage Group

Linkage group should be extracted at the be-

ginning of the DQN training task. e extracting procedure is

displayed in gure 10.

name: shape

----------------

v0: [32, 64]

v1: [64]

v2: [128]

1. Pick one variable 'X' to partition

3. Get linkage group based on 'X' and record

multiply

......

add

matmul

V0 data

2. Do propagate

Figure 10: Linkage group extraction procedure.

Linkage groups are formed by propagating each variable and its

possible decision in the entire HLO. Specically, we pick only one

variable with its decision and send this pair into the propagation

module to infer other variables’ decision. Since propagation by only

one variable and its decision cannot make deterministic decisions

for every tensor, we only extract those deterministic ones. Aer

all linkage groups have bee, the decision order of every dimension

in DQN task will be sorted according to the size of linkage group

from large to small.

With linkage groups, the reward is calculated according to the

actual numbers of partitioned and replicated dimensions caused by

current step if some decisions trigger more than one dimensions to

be decided.

4.3 Auto Data Parallelism

State and Action

. e philosophy of designing the state and ac-

tion are the same with the case in operator partitioning parallelism.

Since the trainable variables, hyper-parameters and training data

with labels are all in the input list, we need to lter trainable vari-

ables and hyper-parameters out as much as possible. ere is a

heuristic that the constant tensors are denitely hyper-parameters

and the trainable variables are marked outside HLO, so it is not

dicult to nd all possible candidate tensors.

We construct all candidate tensors into an one dimensional vec-

tor. Moreover, the current position index is also needed. And the

action and reward are the same as we design in searching operator

partitioning parallelism plan so that the Q network is guided to

partition tensors as much as possible. is is always consistent

with the reality that the greater the number we partition, the more

intermediate tensors will be aected. Above all, the only dierence

is that we do not have any linkage groups.

Reward

. We use exactly the same reward as in operator parti-

tioning parallelism problem for guiding the Q network to partition

variables as much as possible.

4.4 Pipeline Parallelism Exploration by Online

Training

In order to reasonably simplify the placement problem which maps

from HLO-cuts to device-cuts, we treat the hierarchical computation

topology as linear model which starts from the rst device in rst

node. However, the search space is still huge and contains lots of

solutions that are unnecessary to be explored. From a practical

perspective, the solutions of beer quality always happened when

cuing on the pivots that exactly maps to the network boundaries

or their nearby. Moreover, each stage contains at least one variable

is also required in our implementation for the objective that to

balance variables loading on dierent devices. We apply these

two heuristic pruning methods to lter out some candidates pivots

before training in our implementation.

Firstly, we take the device-cuts which performs cuing on net-

work boundary as center solution. en, A threshold number is

specied as radius to represent the available range around each

device-cut in the center solution. irdly, the device-cuts are l-

tered out according to the center solution and radius. Finally, All

possible pivots which maps from the device-cuts will be le as our

candidate pivots.

State and Action

. We pre-compute three features at each stage

to encode state representation. We pre-compute the gradients all-

reduce time of entire pipeline if we cut at any pivot in HLO at cur-

rent step. is feature is very useful when there is a non-negligible

bandwidth gap between devices within one node and cross nodes.

e gradients reduction will be time-consuming when some stages

cross nodes caused by cuing on the inappropriate position. Figure

11 shows an example when cuing a deep model into four stages in

one episode and the corresponding time cost of gradients reduction

in each stage at each step.

e maximum activation transmission time is also required

among all stages if we cut at any pivot at current step. To guide the

time cost of each state towards balance, the computation balance

ratios between minimum stage and maximum stage at any pivots

are pre-computed.

Masking the unnecessary pivots is necessary when making cut-

ting decisions on HLO. Actually, there are two kinds of pivots

should be applied with mask. One is the pivot that we have ltered

out in pruning stage, the other is the pivot that in an previous stage.

In order to mask them on the output of Q network, we set their

0 1 2 last

(a) Initial

0 1 2 last

(b) After 1st Cut

0 1 2 last

(c) After 2nd Cut

0 1 2 last

(d) After last Cut

Figure 11: e change of AllReduce time cost for each stage

when cutting deep model in one episode.

100

150

200

250

Figure 12: e input C array, containing performance infor-

mation of more than 50000 instructions.

Q values to

−inf

that represents the lowest expectation on that

action.

Reward

. e pipeline length

of a deep model can be calculated

when given the pipeline parallelism plan by our cost model. As

mentioned in 3.4, the pipeline reward is designed as

. Moreover,

the memory constraint should also be taken into consideration

because some cuing strategy may encounter out of device memory.

We give an −

√

to punish this case.

4.5 Pipeline Parallelism Exploration by Online

Inference

We rst describe data processing procedure, then introduce the

DQN workow.

4.5.1 Data Processing.

•

Data Coarsening & Normalization Given a real-world model

, we normalize the data into the same scale and size as

data generated in the next section. is process is done

in three steps: 1) building prex sum, 2) coarsening array,

and 3) normalization into [0, 1].

–

Step 1: Prex Sum From proling and DFG analysis

, we can get the proling data

, and

. We

rst build the prex sum array for computational data

10000

20000

30000

40000

Figure 13: e input C array, aer building prex sum.

100

200

300

400

500

Figure 14: e C prex sum, coarsened to granularity 128.

and parameter size

= pre f ix s um(C)

pre f ix sum(W ).

e two updated array

and

accessed at index

now represents the computational time / AllReduce

size of the rst

instructions.

array is le untouched

because it does not make sense to sum up all the

cross-stage communication before a specic instruc-

tion.

A[i]

still represents the estimated cross-stage

communication if the model was cut at instruction i.

–

Step 2: Coarsening In order to adapt to models of

dierent sizes, we need to scale the proling data to

a xed number, which we empirically set to 128. For

the above three arrays, we evenly take 128 points to

form the new arrays: C

, A

We can do this to

and

because they are already

in prex sum form, and

also because the cross-stage

communication is specic to each instruction. Aer

the coarsening, we lost the possibility to partition into

instructions that are not in those 128 points, but the

problem is now irrelevant to the input model size.

–

Step 3: Normalization Since we want to generalize

across dierent models, and to generate large number

of randomized data, some form of normalization is

needed to keep all the data under a similar scale. In

practice, we scale the three arrays simultaneously to

[0, 1]:

MAX = max(C

, A

)

∗

= C

/MAX

0.00

0.25

0.50

0.75

1.00

Figure 15: e C 128-length prex sum, renormalized to [0,

1].

∗

= A

/MAX

∗

= W

/MAX

Aer this step, regardless of what the original model

is, the resulting arrays

∗

each has length 128,

and the elements are all within

[

]

. ese three ar-

rays essentially describe

the distribution of compu-

tational times, activation sizes, and parameters

throughout the model in the time dimension.

•

Data Generation To complement our existing model data-

base, we choose to generate random data of dierent dis-

tributions that satises the requirement presented above.

During the data generation, we use random number gen-

erator with optional distribution parameter (e.g. uniform,

normal, binomial) to generate three arrays of oat numbers

ranging from 0 to 1, and then do the same transformation

described above: build the prex sum, coarsening and nor-

malizing the array to get the generated C

∗

, A

∗

In the actual training process, we will need to generate

hundreds of thousands of these array groups to construct

the training set and test set.

4.5.2 DQN workflow.

State

. For our state representation, we have the following data

fed into the network:

• Computational Times C

∗

• Activation Sizes A

∗

• AllReduce Sizes W

∗

•

Device Topology (square matrix describing the intercon-

nect speed between any to device)

• Intermediate Partition

All of them are resized to one-dimension tensor, scaled to

[

]

and concatenated into one single array to form the state.

Action

. For this approach, we consider both HLO partitioning

and device assignment as actions, and they share the same action

space.

Reward.

with

being the end-to-end training time for one batch, so that

maximizing the reward means minimizing the end-to-end training

time. is is the same as we used in 4.4.

Table 1: Benchmark models for each experiments.

Task Model Params

Language Model BERT-48[29] 640M

Machine Translation

T5-base[12]

T5-3B[12]

T5-11B[12]

51M

11B

Image Classication VGG-19[30] 137M

Table 2: Simulated hardware congurations.

Cong Servers

GPU(s) per

server(N

)

Intra-server

connnections

Inter-server

connections

A 2 8x V100 NVLink 25 Gbps

B 3 8x V100 NVLink 25 Gbps

C 4 8x V100 NVLink 25 Gbps

5 EXPERIMENTS

5.1 Experimental Setup

Benchmarks. We evaluate workloads for each distributed exe-

cution plan. Table 1 summarizes all the ve representative DNN

models that we use as benchmarks in this section.

HLO of workload. We feed HLO Json les and trainable variable

list of each workload as inputs into Auto-MAP framework. Another

HLO text le is also provided for debugging in our experiments.

Simulated Hardware Congurations. Table 2 summarizes three

hardware environments in our experiments. In our observation,

the resources of 4 servers with 8 cards each are enough for training

tasks. erefore, we will give our execution plans with less than 4

servers.

Hyper-parameters. We xed the training batch size to 64 and

use the Adam optimizer[

] with dierent initial learning rate to

optimize dierent exploration tasks. For pipeline tasks, the initial

learning rate would be set to 0.001. But for operator partitioning

and auto data parallelism tasks, we set a smaller learning rate to

0.0005.

As for the specic hyper-parameters in DQN, we xed the

with 0.6 to all training tasks. We decay the exploration coecient

from 1.0 to 0.1 for all tasks, but the decay speed is totally dierent

with respect to task type, which decay to the minimum aer 2000,

500 and 10000 iterations for operator partitioning parallelism, auto

data parallelism and pipeline parallelism, respectively.

Some general tricks for improving DQN convergence are also

integrated in our training tasks. Specically, we select the priori-

tized replay buer[

] and double DQN[

] in rainbow and xed

the alpha and beta to 0.2 and 0.6 respectively. e frequency for

updating target network is set to 100 and the replay buer size is

xed to 2000 in all training tasks.

Table 3: T5 family partition results for each variable. We

use -1 to represent to replicate the variable, and the positive

number means the partition index of variable.

Block or Layer Variable Partition Strategy

Self-aention {q=1, k=1, v=1, o=0}

MLP

{conv1/kernel=1, conv1/bias=0,

conv2/kernel=0, conv2/bias=-1}

Embedding { embedding weights=0 }

Layer normalization {scale=-1, bias=-1}

Table 4: VGG-19 partition results for each variable . We use -

1 to represent to replicate the variable, and the positive num-

ber means the partition index of variable.

Block or Layer Variable Partition Strategy

Conv layers -1 for all conv layers

FC Layer

{fc1/kernel=1, fc1/bias=0,

fc2/kernel=0, fc2/bias=0}

Somax Layer predictions/kernel=1, predictions/bias=0

Table 5: e performance for searching OPP with cong B

and C. PC is short for partitioning count.

Model PC target

PC in

stage 1

1st stage

time cost

PC in

stage 2

time cost

VGG-19 38 5 30s - -

T5-base 111 111 0.5h - -

T5-3B 432 397 0.74h 432 0.2h

T5-11B 432 386 1h 432 0.45h

5.2 Evaluation Results and Analysis

5.2.1 Operator Partitioning Parallelism. ere are already some

partitioning strategies for transformer models[

][

]. It features to

partition each aention block and the following MLP layer and

all embedding variables while replicating other trainable variables,

which is the same as the objective of Auto-MAP. For VGG-19, the

eective way is to partition the last MLP block when given an hi-

erarchical hardware conguration like Cong B or Cong C[

Table 3 and Table 4 show our partitioning strategy of trainable

variables for T5 family and VGG-19, where

−

1 means replication

and the number greater than 0 represents the index of partitioned

dimension. ese partition strategies are consistent with our expec-

tation. We have already known the ground truth of these workloads

so that the quality of strategies could be measured in our exper-

iments. We count the variables that should be partitioned as the

target for each workload and to observe time cost to approach it. It

is should be noted that some workloads need a netuning stage to

explore solutions of beer quality.

We give the convergence for exploring operator partitioning

parallelism on T5-base in gure 16. T5-base has 314 dimensions to

be decided in total and 111 of them need to be partitioned according

-1.5

-1

-0.5

0.5

1.5

2.5

3.5

100

150

200

250

300

350

0 100 200 300 400 500 600 700 800

Scores

Progress

Episode

progress for T5-base

tot al scores for T5-bas e

Figure 16: e convergence of T5-base in operator partition-

ing parallelism exploration task.

to the ground truth. With the help of linkage groups, DQN learns

to avoid making conicting decision quickly. It reaches the peak

propagation progress and behaves more stable with higher scores

as the time grows.

Table 5 shows our searching performance on all benchmark

models. We pay aention to the time cost to partition all variables

which is required to be partitioned. We divide the search process

into two stages. e rst stage will search from scratch and may

converge into a local minimum, while the second stage is to netune

from that result. Some workloads like VGG-19 and T5-base may not

need netuning stage mainly because the state space is relatively

smaller than others, so it is easy to nd the partition strategy as

quickly as possible in rst stage. However, some workloads like

T5-3B and T5-11B with more trainable variables should involve

with a netuning stage. Specically, when the partition strategy is

stable in rst stage, the program will stop current training phase

and backtrace some variables which are marked with replication

according to the linkage groups and start a netuning stage. We

found that even with the complicated case like T5-11B, the expected

strategy could be found in two hours.

VGG-19

As shown in Table 4, the solution that OPP algorithm

found is to replicate VGG-19’s convolution layers while to partition

the fully connected layers. is approach makes sense that for

VGG-19 the last two FC layers occupy 86% of the total parameters

while the corresponding calculation time only accounts for 5%. For

such FC layer we prefer partitioning to replication for reducing

gradients communication overhead in synchronous training. is

desired distribution strategy as described above occurs in 30s (Ta-

ble 5) while our DQN scores keep oscillating slowly and cannot

converge quickly. One reasonable explanation is that our reward

func encourages spliing more variables while for VGG-19 is not

the case as explained above. is implies that we need a more

general reward function for models with dierent calculations and

parameter distributions.

T5-base

. e nal solution is to split 111 variables and the

partitioning results is the same with table 3. It is observed that

our T5-base takes 0.5 hour to nd the expected solution without a

netuning stage.

Table 6: e experiment result of auto data parallelism. We

use -1 to represent to replicate the variable, and the positive

number means the partition index of variable.

Model Candidate count Partition results Time cost

T5-base 10

{arg0.1=0, arg1.2=0,

arg2.3=0, arg3.4=0,

arg12.13=0,

arg17.18=0

arg22.23=0}

0.27h

VGG-19 4

{arg0.1=0, arg1.2=0}

70s

T5-3B and T5-11B

. 3B and 11B has the same layers and vari-

ables counts but the variables size and the propagation time cost.

e expected variables to be partitioned are 432 and the netuning

stages are required, which take 0.94 and 1.45 hour for 3B and 11B,

respectively.

We infer that the DQN searching behaves beer than enumera-

tion. For example, T5-base has 188 trainable variables with at most

two dimensions each, leads to a 376 binary vector which contains

336

solutions in total and T5-3B and T5-11B contains 2

1116

solu-

tions to search. It is impossible for searching the expected solution

within a limited time, while the DQN method could reach within 2

hours.

5.2.2 Auto Data Parallelism. We rst lter out all trainable vari-

ables and constant tensors in input list in HLO IR to nd the candi-

date tensors that possible to be training data.

T5-3B and T5-11B are not available for data parallelism for the

memory issue. T5-3B needs at least two devices to load balance its

variables and T5-11B consumes more devices, thus we display the

results of T5-base in this part. Table 6 shows the results of auto

data parallelism and all the tensor names in the table can be found

in HLO text le.

VGG-19.

ere are only 4 candidate tensors (with at most 4

dimensions each) need to be partitioned for VGG-19 as shown in

Table 6. Our ADP algorithm can converge steadily in 70s to the

rst dimension of two tensors (namely arg0.1 and arg0.2). Aer

manual verication, it is found that these two tensors are exactly

the two inputs of the model: labels and features tensor respectively,

and their rst dimensions are exactly the batch size dimension in

the traditional sense.

T5-base.

In our observation, we found that this procedure could

be nished in half an hour. e search space is much less than

in the operator partitioning problem. Specically, there are 10

candidates with at most 4 dimensions each, leads to 2

solutions.

e DQN found the exact ground truth within 0.27 hour, while the

enumeration would behave worse not only for the relative large

solution space but also aected by the propagation time cost.

As the results shown in 6, there are 7 tensors need to be parti-

tioned in total and all of them choose to partition the rst dimension,

which is consistent with our intuition. In machine learning training

task, we feed them with some sequence and other format data. e

batch dimension is always at the rst rank for each tensor.

100

200

300

400

500

600

100

120

0 500 1000 1500 2000 2500

Scores

Loss

Episodes

loss for T5-base

tot al scor es for T5-bas e

moving average of total scores

Figure 17: e convergence of T5-base in pipeline paral-

lelism by online training task.

5.2.3 Pipeline Parallelism Exploration by Online Training. We

xed all micro-batch sizes with 16 in all experiments. en we do

strategy search on Cong A, B and C respectively. e number

of stages to cut is depend on the number of servers under each

hardware conguration. Both the strategy produced by online

training and inference will be displayed.

In online training experiments, we set the center solution which

performs the device-cuts on network boundary and set the radius

to 3. Table 7 shows the online training experiments for searching

pipeline parallelism. In order to make the experiments results

human readable, we report not only the pivots cut on HLOs but

also the corresponding layers nearby. Since each instruction in

HLO produces a new tensor named with a % prex, we display that

tensor to indicates our HLO pivots. e device-cuts is displayed

with an array which is lled with the cuing index of device. To

cut the network boundary in the hierarchical topology hardware

conguration like table 2, the index should be the multiple of 8

because there are 8 cards within one server.

We address that the time cost of DQN method is far beer than

enumeration, especially when we increase the stage number. at

is mainly because each HLO contains at least thousands of instruc-

tions. Although we have ltered out some unexpected pivots and

get a more concise candidates set, the search space is still large

which costs more time by enumeration.

We take the convergence of exploring pipeline parallelism by

online training on T5-base as an example to show the training

procedure with DQN. In gure 17, we e total scores is smoothed

by applying moving average in order to show its trend. e gure

shows that the loss drops very fast at the beginning and the trend

of the total scores rises overall although the jier is large.

Bert-48 and T5-3B

. e two models are very similar from the

results. All strategies on Cong A, B and C proved that the cut-

ting should be happened on the network boundaries, which are

consistent with our expectation. Moreover, the pivots mapping

to corresponding layer lead to almost uniform stages so that the

computation on each stage are balanced. It takes about no more

than 5 minutes to nd them all.

T5-base

. e strategies on Cong A and B are similar with the

case of Bert-48 and T5-3B. e time cost to converge is no more

than 4 minutes. e strategy on Cong C is dierent for the last

cut happens on the 22th device, which is a NVLink boundary. is

is because the constraint that each stage contains one trainable

variable at least. T5-model is too small for cuing 4 stages that the

last cut should not happen beyond the 22th index.

T5-11B

. is model is huge enough so that it will cause OOM

if it is cut less than 4 stages. erefore, the DQN cannot nd even

one available strategy on Cong A and Cong B. For Cong C, the

result is consistent with our expectation for cuing on the network

boundaries of device topology. e time cost order of magnitude is

the same with other models.

5.2.4 Pipeline Parallelism Exploration by Online Inference. Here

we also present the results of our online inference approach. We

trained our for 10

episodes of environments constructed by random

number generated with uniform and normal distribution. Our

model is able to output the best hybrid parallelism solution for

the NLP family models like BERT and Transformer-11B. For the

CNN family, we need to netune the model with the corresponding

distribution for those models for another 10

episodes before the

model could correctly inference the best pipeline partitioning.

e detailed parallelism plan is presented in Table 8.

6 RELATED WORKS

Large DNN models are increasingly computational intensive and

seriously consumption on device memory. It is a common practice

to parallelize training by leveraging multiple GPUs[

]. Data par-

allelism, operator partitioning parallelism and pipeline parallelism

are common approaches for distributed training of DNN models.

Auto Data Parallelism

. ere are some high level frameworks

aim at reducing the burden of users to automatically parallelizeing

deep models using data parallelism[36].

Operator Partitioning Parallelism

. For NLP models with at-

tention blocks, some heuristic operator partitioning approaches[

] have already been proposed in recent years. For some convolu-

tional networks like VGG-19 and AlexNet, it is a common practice

to partition the last linear layers[5, 34].

Some prior works and studies[

] focus on nding optimal

distribution strategies over DNN layers.

Pipeline Parallelism

. [

–

] has been proposed to train

DNN by pipelining DNN models. GPipe[

] explores synchronous

pipeline approach to train large models while PipeDream[

] ex-

plores the hybrid approach of data and pipeline parallelism for

asynchronous training. e RL approach has been proposed to nd

optimal placement strategy for a given DNN[40].

Rainbow DQN.

Reinforcement learning (RL) is a general frame-

work where agents learn to perform actions in an environment

so as to maximize a reward. DQN[

] is a RL algorithm that com-

bines Q-Learning with deep neural networks to let RL work for

complex, high-dimensional environments, like video games, or ro-

botics. Double DQN[

], Dueling DQN[

], Noisy DQN[

] and

DQN with Prioritized Experience Replay[

] are these four impor-

tant supplements which each of them handle a dierent aspect of

an agent. Rainbow DQN[

] is an o-policy deep reinforcement

learning algorithm that is the state-of-the-art technique in the eld

of reinforcement learning.

Table 7: e experiment result for searching pipeline parallelism by online training.

Model Cong Pivots on HLO Corresponding Layer Nearby Device cuts Time cost

Bert-48

(%fusion.4004)

(%dot.17952, %fusion.3779)

(%dot.16644, %dot.22128, %dot.27627)

(layer23)

(layer17, layer32)

(layer14, layer27, layer40)

(8)

(8, 16)

(8, 16, 24)

95s

157s

262s

T5-base

(%fusion.1039)

(%reshape.6314, %transpose.12159)

(%reshape.5622, %fusion.1015, %fusion.909)

(dec/layer2)

(enc/layer5, dec/layer5)

(enc/layer4/conv1,

dec/layer3, somax)

(8)

(8, 16)

(8, 16, 22)

42s

50s

198s

T5-3B

(%dot.2204)

(%reshape.13395, %multiply.15030)

(%reshape.13379,

%multiply.15009, %multiply.15049)

(dec/layer2)

(enc/layer19, dec/layer11)

(enc/layer14,

dec/layer5, dec/layer17)

(8)

(8, 16)

(8, 16, 24)

115s

202s

262s

T5-11B

(%fusion.4585, %fusion.4241, %dot.40105)

(enc/layer13, dec/layer3, dec/layer15)

(8, 16, 24)

280s

Table 8: e experiment result for searching pipeline parallelism by online inference.

Model Cong Partition Boundary Corresponding Layer Nearby Device cuts

BERT-48

C (34, 66, 98), granularity=128 (layer14, layer27, layer40) (8, 16, 24)

T5-base

C (34, 66, 98), granularity=128 (enc/layer4/conv1, dec/layer3, somax) (8, 16, 24)

T5-11B

C (34, 66, 98), granularity=128 (enc/layer13, dec/layer3, dec/layer15) (8, 16, 24)

7 CONCLUSION

7.1 Summary

We introduce Auto-MAP, a framework for exploring distribution

strategies based on model architectures, which works on HLO

IR and automatically discovers fast parallelization strategies with

optimized DQN algorithm. Data parallelism, operation partition-

ing parallelism and pipelined parallelism are all included in the

exploration space. We leverage DQN with task-specic pruning

strategies to help eciently explore the search space including

optimized strategies. Auto-MAP fully simplies the users burden

in the selection and implementation of distribution strategies. Our

experiments show that Auto-MAP can nd the optimal solution

within two hours while achieving beer throughput on several NLP

and convolution models.

7.2 Future Work

Combination of HLO IR and DQN algorithm show convincing con-

vergence results and performance. ere are still some interesting

works to follow. First of all, replacing discrete DQN states with con-

tinues one for operation partitioning task for beer interpretation

and convergence. Secondly, currently our Auto-MAP framework

can only give a single parallelization strategy automatically (i.e.,

DP, PP, operation partitioning), which may result in sub-optimal

runtime performance in large-scale distributed training. In the

future we will support exploring hybrid of these three strategies

automatically. Auto-MAP is open-source and will be made available

to the public.

REFERENCES

[1]

D. H. D. Amodei, AI-and-compute, 2019, hps://openai.com/blog/

ai-and-compute/.

[2]

S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimization

towards training a trillion parameter models,” arXiv preprint arXiv:1910.02054,

2019.

[3]

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro,

“Megatron-lm: Training multi-billion parameter language models using gpu

model parallelism,” arXiv preprint arXiv:1909.08053, 2019.

[4]

N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool,

P. Hawkins, H. Lee, M. Hong, C. Young et al., “Mesh-tensorow: Deep learning

for supercomputers,” in Advances in Neural Information Processing Systems, 2018,

pp. 10 414–10 423.

[5]

Z. Jia, S. Lin, C. R. Qi, and A. Aiken, “Exploring the hidden dimension in acceler-

ating convolutional neural networks,” 2018.

[6]

J. Geng, D. Li, and S. Wang, “Horizontal or vertical? a hybrid approach to

large-scale distributed machine learning,” in Proceedings of the 10th Workshop on

Scientic Cloud Computing, 2019, pp. 1–4.

[7]

N. Dryden, N. Maruyama, T. Moon, T. Benson, M. Snir, and B. Van Essen, “Chan-

nel and lter parallelism for large-scale cnn training,” in Proceedings of the

International Conference for High Performance Computing, Networking, Storage

and Analysis, 2019, pp. 1–20.

[8]

D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer,

and Z. Chen, “Gshard: Scaling giant models with conditional computation and

automatic sharding,” arXiv preprint arXiv:2006.16668, 2020.

[9]

Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam,

Q. V. Le, Y. Wu et al., “Gpipe: Ecient training of giant neural networks using

pipeline parallelism,” in Advances in neural information processing systems, 2019,

pp. 103–112.

[10]

D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R.

Ganger, P. B. Gibbons, and M. Zaharia, “Pipedream: generalized pipeline paral-

lelism for dnn training,” in Proceedings of the 27th ACM Symposium on Operating

Systems Principles, 2019, pp. 1–15.

[11]

A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger,

and P. Gibbons, “Pipedream: Fast and ecient pipeline parallel dnn training,”

arXiv preprint arXiv:1806.03377, 2018.

[12]

C. Rael, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li,

and P. J. Liu, “Exploring the limits of transfer learning with a unied text-to-text

transformer,” arXiv preprint arXiv:1910.10683, 2019.

[13]

Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep

neural networks,” arXiv preprint arXiv:1807.05358, 2018.

[14]

A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar,

M. Norouzi, S. Bengio, and J. Dean, “Device placement optimization with rein-

forcement learning,” arXiv preprint arXiv:1706.04972, 2017.

[15]

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and

S. Wanderman-Milne, “JAX: composable transformations of Python+NumPy

programs,” 2018. [Online]. Available: hp://github.com/google/jax

[16]

T. T. authors, Trax  Deep Learning with Clear Code and Speed, 2020, hps:

//github.com/google/trax.

[17]

XLA: Optimizing Compiler for Machine LearningOperation Semantics, 2019,

hps://www.tensorow.org/xla/operation semantics.

[18]

S. Muchnick et al., Advanced compiler design implementation. Morgan kaufmann,

1997.

[19]

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Des-

maison, L. Antiga, and A. Lerer, “Automatic dierentiation in pytorch,” 2017.

[20]

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and

M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint

arXiv:1312.5602, 2013.

[21]

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly

learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.

[22]

S. Fan, Y. Rong, C. Meng, Z. Cao, S. Wang, Z. Zheng, C. Wu, G. Long, J. Yang,

L. Xia et al., “Dapple: A pipelined data parallel approach for training large models,”

arXiv preprint arXiv:2007.01045, 2020.

[23]

M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Hor-

gan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in

deep reinforcement learning,” in irty-Second AAAI Conference on Articial

Intelligence, 2018.

[24]

Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Duel-

ing network architectures for deep reinforcement learning,” in International

conference on machine learning, 2016, pp. 1995–2003.

[25]

M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih,

R. Munos, D. Hassabis, O. Pietquin et al., “Noisy networks for exploration,” arXiv

preprint arXiv:1706.10295, 2017.

[26]

T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang,

and Z. Zhang, “Mxnet: A exible and ecient machine learning library for

heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.

[27] NVDIA DGX-1, 2019, hps://www.nvidia.com/en-us/data-center/dgx-1/.

[28] NCCL, 2019, hps://developer.nvidia.com/nccl.

[29]

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of

deep bidirectional transformers for language understanding,” arXiv preprint

arXiv:1810.04805, 2018.

[30]

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-

scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[31]

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv

preprint arXiv:1412.6980, 2014.

[32]

T. Schaul, J. an, I. Antonoglou, and D. Silver, “Prioritized experience replay,”

arXiv preprint arXiv:1511.05952, 2015.

[33]

H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double

q-learning,” in irtieth AAAI conference on articial intelligence, 2016.

[34]

A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,”

arXiv preprint arXiv:1404.5997, 2014.

[35]

S. Pal, E. Ebrahimi, A. Zulqar, Y. Fu, V. Zhang, S. Migacz, D. Nellans, and

P. Gupta, “Optimizing multi-gpu parallelization strategies for deep learning

training,” IEEE Micro, vol. 39, no. 5, pp. 91–101, 2019.

[36]

H.-T. Cheng, Z. Haque, L. Hong, M. Ispir, C. Mewald, I. Polosukhin, G. Roumpos,

D. Sculley, J. Smith, D. Soergel et al., “Tensorow estimators: Managing simplicity

vs. exibility in high-level machine learning frameworks,” in Proce edings of the

23rd ACM SIGKDD International Conference on Knowledge Discovery and Data

Mining, 2017, pp. 1763–1771.

[37]

J. Zhan and J. Zhang, “Pipe-torch: Pipeline-based distributed deep learning in

a gpu cluster with heterogeneous networking,” in 2019 Seventh International

Conference on Advanced Cloud and Big Data (CBD). IEEE, 2019, pp. 55–60.

[38]

J. Geng, D. Li, and S. Wang, “Elasticpipe: An ecient and dynamic model-parallel

solution to dnn training,” in Proceedings of the 10th Workshop on Scientic Cloud

Computing, 2019, pp. 5–9.

[39]

B. Yang, J. Zhang, J. Li, C. R

e, C. R. Aberger, and C. De Sa, “Pipemare: Asynchro-

nous pipeline parallel dnn training,” arXiv preprint arXiv:1910.05124, 2019.

[40]

A. Goldie and A. Mirhoseini, “Placement optimization with deep reinforcement

learning,” in Proceedings of the 2020 International Symposium on Physical Design,

2020, pp. 3–7.

[41]

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,

A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level

control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp.

529–533, 2015.