Auto-MAP: A DQN Framework for Exploring
Distributed Execution Plans for DNN Workloads
Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng,
LanSong Diao, Guoping Long, Jun Yang, Xiaoyong Liu, Wei Lin
Alibaba Group
{siyu.wsy, rongyi.ry, shiqing.fsq, james.zz,
lansong.dls, guopinglong.lgp, muzhuo.yj, xiaoyong.liu, weilin.lw}@alibaba-inc.com
ABSTRACT
e last decade has witnessed growth in the computational require-
ments for training deep neural networks. Current approaches (e.g.,
data/model parallelism, pipeline parallelism) parallelize training
tasks onto multiple devices. However, these approaches always rely
on specic deep learning frameworks and requires elaborate man-
ual design, which make it dicult to maintain and share between
dierent type of models. In this paper, we propose Auto-MAP,
a framework for exploring distributed execution plans for DNN
workloads, which can automatically discovering fast paralleliza-
tion strategies through reinforcement learning on IR level of deep
learning models. Ecient exploration remains a major challenge
for reinforcement learning. We leverage DQN with task-specic
pruning strategies to help eciently explore the search space in-
cluding optimized strategies. Our evaluation shows that Auto-MAP
can nd the optimal solution in two hours, while achieving beer
throughput on several NLP and convolution models.
KEYWORDS
deep learning, data parallelism, pipeline parallelism, model paral-
lelism, DQN algorithm, HLO IR.
1 INTRODUCTION
Deep learning (DL) models has become increasingly complicated
in articial intelligence (AI) community to acquire beer accuracy.
Training deep models is extremely both time and resources con-
suming. Distributed training with multiple devices is a irreversible
trend for training especially for large models. [
1
,
2
]. To harness
computing power to achieve beer throughput, a critical challenge
is how to map diversied workloads to hardware accelerators auto-
matically and eciently.
Existing solutions like data parallelism, model parallelism and
pipeline parallelism make trade-os between computation, com-
munication, and development eciency. Data Parallelism (DP) is
workload-neutral to models that could be t into single device while
facing the problem of memory footprint pressure for large models.
Model parallelism (MP) [
3
8
] and pipeline parallelism (PP) [
9
,
10
]
are eective way for alleviating the memory issue of large models,
which split the model among processes, in vertical and horizontal
way respectively. But the experts experiences are required to de-
sign a specic strategy to fully utilize hardware under the limited
computation resources.
Some previous works[
11
13
] oriented to exploring distributed
plans which combine dynamic programming and heuristic methods
have been proposed as promising approaches for training complex
models. But these approaches are designed for specic category
of parallelism strategy. [
11
] aims to nding the best solutions of
PP in a synchronous way and [
12
,
13
] tries to search the OPP.
is leads to the limited scenarios mainly because the optimal
strategies for diversied workloads are very dierent, and all these
planners do not contain solution space of DP, operator partitioning
parallelism (OPP) and PP at the same time. e heuristic method in
[
12
] results in the lack of generalization in NLP models. Another
issue is that the planners are coupled with the specic APIs and
deep learning frameworksso that they only take eect in a limited
usage. Moreover, the coarse granularity exploration on layer- or
operator-level loses the potentials for beer solutions. Moreover,
to integrate the planner to other frameworks is unfeasible in real-
word.
Recently, a trend of machine learning oriented approaches to
optimize performance of systems has been receiving increasingly
aention in AI research community. [
14
] adopts reinforcement
learning to learn the proper parallelism strategies, which inspired
researchers to use learning approaches to extract features of deep
models. However, it only search for simple model parallelism strat-
egy without OPP and PP solution space for the given workload and
clusters at the expense of huge time- and resources-consuming,
which leads to non-applicable in industry.
Above all, we conclude the following deciencies of these ap-
proaches: (1) Limited applicable scenarios, which lacks the coverage
of convolution, language model (LM), search/recommendation mod-
els at the same time. (2) Limited parallelism scenarios. None of
these eorts have achieved the support of model parallelism, data
parallelism, and pipeline parallelism on a unied computing layer
(e.g, TF graph). (3) Inevitable code intrusion. e planners only
take eect when specic APIs are called. It fails to to shield users
from low-level distributed details.
We propose Auto-MAP, a unied framework for exploring dis-
tributed execution plans, which works on HLO IR via DQN method
for DNN workloads.
Auto-MAP works on HLO IR instead of operators or layers. HLO
IR is an intermediate representation which produced by XLA (Ac-
celerated Linear Algebra) from TensorFlow framework, which de-
scribes the entire training task with more general and expressive
computation Instructions instead of the operation like GraphDef in
TensorFlow. Each instruction contains all necessary information for
computation. Some extra information such as the corresponding
operator name it belongs to is also recorded. ere are two reasons
for choosing HLO IR as the operational level of Auto-MAP. One
is that to explore distributed plans on HLO IR can achieve beer
performance benets from its ner granularity than operators. e
arXiv:2007.04069v1 [cs.DC] 8 Jul 2020
other is XLA can exist independently from TensorFlow and it has
the ability to support other front ends like Jax[
15
] and Trax[
16
],
which leads to no invasion to user codes.
Figure 1 gives the high-level design of TF’XLA compiler. As
the gure shows, the XLA compiler compiles a TF graph (an ML
network in TF) into executable machine code through a sequence
of stages. e TF graph is rst transformed into HLO IR by a
front-end (e.g., the
xla.compile
API[
17
]). Optimizations, such as
operator fusion and common-subexpression elimination [
18
] are
performed on HLO before the graph is transformed into a lower-
level representation for a target hardware architecture.
Deep Q network (DQN) is a reinforcement learning (RL) is the
approach to teach machines to interact with the environments and
receive rewards for performing the right actions until they suc-
cessfully meet their goals. It is adopted in Auto-MAP to learn the
features of deep models and provide workload-neutral distributed
plans on given computation resources. It should be noted that the
solution space is still huge even with DQN. erefore, some heuris-
tic pruning methods is also integrated in our approach. As far as we
know, there is no previous work focusing on exploring strategies in-
cluding the three category of parallelism simultaneously mentioned
above with DQN.
As shown in gure 2, Auto-MAP performs distributed plans ex-
ploration at
HLO
layer. Compared with previous approach, this has
the following advantages: (1) Free user code intrusion. e user only
needs to provide a single-device model and the distributed details
generated by our Auto-MAP framework are absolutely shielded.
(1) Rich and unied parallelism and application scenarios. Unify
DP/MP/PP for CNN/LM/Recommendation models. (3) Diverse pro-
gramming abstractions over HLO IR. Popular AI frameworks such
Tensorow, PyTorch[
19
], Flax[
15
]/Trax[
16
] can all map to
HLO
layer. In this work, we leverage DQN algorithm [
20
] to automati-
cally explore the search space of operator partitioning parallelism,
auto data parallelism and pipeline parallelism over
HLO
with device
and network interconnect topology specied.
In this paper, we focus on solving the two main challenges of dis-
tributing diverse and complex models to distributed heterogeneous
hardware platforms: leverage DQN algorithm to build a search
space including optimized strategies over
HLO IR
, and leverage
task-specic pruning method for more eciently exploration of
search space.
To summarize, our contributions are:
(1)
We propose a unied framework named Auto-MAP for
three typical parallelism strategies (i.e., operation parti-
tioning, auto data parallel and pipeline) and two typical
model types (i.e., CNN and language models);
(2)
We leverage DQN with task-specic pruning strategies to
help eciently explore the search space including opti-
mized strategies;
(3)
We fully simplies the burden of users in the selection and
implementation of distributed execution plans. With our
framework, users only need to provide a single-card graph,
and our framework automatically explores the distributed
execution plans that is compatible with the hardware com-
puting power/interconnection topology;
(4)
We show that our framework can nd the optimal solution
in a limited time compared to enumeration.
TensorFlow Graph
xla.compile
HLO IR
XLA Backend
Figure 1: Illustration of the high-level design of Tensor-
ow’s XLA compiler.
Python Code
GraphDef
HLO IR
XLA Backend
LLVM
Python Code
GraphDef
HLO IR
XLA Backend
LLVM
Parallelism
Exploration
Figure 2: Illustration of our approach over TF XLA com-
piler’s work.
2 PROBLEM FORMULATION AND
PRELIMINARIES
Data and model parallelism
have been widely used by existing
deep learning frameworks to distribute the models across devices.
Data parallelism is parallelization across multiple devices in paral-
lel computing environments, which allows to operate on the data
in parallel. For large models which cannot t on single device,
model parallelism turns out to be a good choice. Model parallelism
(MP) [
21
] partitions a DNN into disjoint subsets and trains each
subset on a dedicated device, which reduces communication costs
for synchronizing network parameters in a DNN but exposes lim-
ited parallelism as well as extra communication between model
partitions.
Pipeline parallelism
(PP) [
9
,
11
,
22
] goes beyond DP and MP,
mixing inter-batch and intra-batch parallelism. In pipeline scheme,
2
Input
Output
(a)
(b)
(1) Data Parallel
(2) Pipeline Parallel
(4) Hybrid of above
three approaches.
(3) Model Parallel
Figure 3: Typical types of parallelism.
one or more consecutive layers are grouped into stages and pro-
cessed with separate GPU(s), and both the forward pass and back-
ward pass of all the layers are scheduled in one stage.
Planner
[
10
,
22
] in PP is responsible for cuing model layers into stages and this
approach improves device utilization through pipelining multiple
micro-batches. Figure 3 shows the schematic of those three parallel
strategies.
Deep RL
has been proven to be successful with Deep Q-Learning
(DQN)[
20
] introducing the idea of using neural networks as a Q-
function approximator.
Rainbow DQN
[
23
] combining improve-
ments in deep RL, and has been shown to be promising for further
improvements of deep RL agents in benchmark environments. Al-
though not so straightforward, We try to leverage rainbow agent to
assist the
automatic
search of massive distributed strategies space.
e Rainb ow agent.
Following the methodology from [
23
],
we extend the DQN algorithm with prioritized experience replay,
double DQN, and dueling network architecture[
24
]. Furthermore
in contrast to [
23
], we apply the following changes to successfully
train the Rainbow agent: (1) we discard the noisy linear layers [
25
],
relying on
ϵ
-greedy exploration instead. Since the agent was already
required to learn environmental noises from the user simulator, a
possible explanation could be that the inclusion of a second noise
distribution might have been too dicult to learn. (2) We adjust
the number of DNN layers for dierent tasks. As the greater the
number of layers, the stronger the network learning ability. Figure
4 shows the workow of our leveraged DQN method.
Problem formulation for DQN algorithm on HLO IR.
For-
mally, we dene our learning task as follows. In reinforcement
learning, the sequential decision-making problem is modeled us-
ing the Markov Decision Process formulation dened by the tuple
< S, A, R, S
0
>
. For any Q-learning task, we need to dene the
following ve aspects: state space, actions, rewards, policy and
termination.
We will illustrate our framework solving three optimization
problems over directed HLO graphs. Let
G(V , E)
denotes a directed
HLO graph, where
V
is the set of nodes,
E
the set of edges. In
our seings, each HLO instruction refers to one node of
V
and the
data-ow between producer instruction and consumer instruction
refers to the corresponding edge of
V
. Specially, we refer the nodes
with no inputs as
SourceNodes
, those nodes with no outputs as
SinkNodes
and the others as
ComputeNodes
. Give device topology
D, these optimization problems are:
Environment
Q Network
Target
Q Network
DQN Loss
Replay
Memory
Store (s, a, r, s’)
s
argmax
a
Q(s,a;θ)
Copy Every
N updates
(s,a)
Gradient
wrt loss
Q(s,a;θ)
max
a’
Q(s’,a’;θ
-
)
s’
r
Reward
Function
State
Transition
Function
Figure 4: Workow of leveraged DQN method.
Auto Data Parallelism (ADP)
: Given a graph
G
, nd a
subset of dimensions of
Sourcenodes
from
V
such that com-
munication overhead of the propagation graph from the
selected slicing dimension of
SourceNodes
to
SinkNodes
is minimized.
Operator Partitioning Parallelism (OPP)
: Given a graph
G
, nd a slicing strategy of all dimensions of all
trainable variables
of
G
, such that the average device utilization is maximized.
Pipeline Parallelism (PP)
: Given a graph
G
and the num-
ber of stages (
K
) expected to be split, nd a subset of nodes
S V
such that the pipeline length with cross-stage com-
munication overlap considered is minimized.
3 AUTO-MAP APPROACH
1. xla.compile
User Input
Data Parallelism
Explorer
Operation Partitioning
Explorer
Pipelined Explorer
2. Plans Explorer
computation
resources
Available
Strategies
HLO
deep model
Figure 5: Workow of our approach.
3.1 Exploration Workow
In order to decouple the distributed plans from the APIs of spe-
cic deep learning framework, the exploration module should be
constructed on an intermediate representation layer designed for
describing computation ow of deep learning task. Specically,
we build our search algorithm over HLO borrowed from Tensor-
Flow XLA. Figure 5 shows the workow of our approach. Taken
deep models wrien by any framework (e.g. TensorFlow, PyTorch,
MXNet[
26
]), the XLA compiles and transfers the original exible
computation graph into HLO IR. e Plans Explorer will search
3
three dierent categories of plans including data parallelism, op-
erator partitioning parallelism and pipeline parallelism over HLO
based on given computation resources.
For pipeline parallelism, we only do cut on the forward com-
putation subgraph in HLO, which can be detected by the meta
information of each instruction. Both the online inference and on-
line training approach are provided respectively to explore pipeline
parallelism. Users need to specify the number of stages in advance
for both approaches. Finally, the workow produces the best one
among all available candidate plans.
We explore these three dierent categories of plans separately.
To cope with the huge solution space and provide totally work-
load neutral plan, we use DQN approach combined with heuristic
pruning instead of ordinary heuristic algorithms to search. For a
specied workload, the corresponding solution would be found
during training stage or inferred from models that has been trained
oine. To adapt to the reinforcement learning training ow, state,
action and reward should be carefully designed according to their
objectives. We briey introduce our approach in the following
subsections, and the details of design and implementations will be
discussed in 4.
3.2 Operator Partitioning Parallelism
3.2.1 DQN flow Setup. Since the trend goes to increase the
size of deep learning models, the on-device memory is a scarce
resource for training tasks. Fortunately, the memory issue can
be alleviated through model parallelism. In practice, an eective
way to parallelize deep models is to partition operators, which
not only alleviates the memory pressure but also parallelizes the
computations. With operator partitioning, the saved memory can
be used for injecting larger batch size to improve the GPU cluster
utilization.
Each instruction produces a unique variable in HLO. erefore,
to partition operators is identical to partition variables of each in-
struction. e derivation rules of each instruction are designed
carefully for inferring partitioning decisions of unknown variables
or parameters from the known ones. Obviously, some partition-
ing plans are invalid because some derivation rules are violated.
is can only be detected during the procedure called propagation,
which performs derivation rules for each instruction when given
the known partitioning decisions of variables or parameters. e
propagation terminates when encountering the following three
situations. (1) ere is no enough information to derive the remains
variables. (2) A conict case is encountered for the violation of
derivation rules. (3) All variables have been inferred without any
conict.
We agree that only trainable variables which respect to model
parameters may be partitioned in our approach. We also set the
heuristic objective for operator partitioning parallelism to partition
trainable variables as much as possible.
In Auto-MAP, since each trainable variable may has dierent
dimension size, we make decisions for each dimension of each
trainable variable about whether to be replicated or partitioned
across all devices. ese dimension status of all trainable variables
are viewed as one strategy and the feasibility need to be veried by
doing propagation in HLO.
a (0)
c (0)
d (0)
b (-1)
e (0)
a (0)
d (0)
%a = parameter()
%b = constant
%c = dot(%a, %b)
%d = constant()
%e = add(%c, %d)
%f = broadcast(%e)
Extract
Figure 6: Linkage group example.
State and Action
. We dene state as one dimension vector
which concatenates all dimensions partition status for trainable
variables and there are three possible values at each position. And
the action is a binary ag which is True for partitioning across all
devices and False for replicating among all devices.
Reward
. According to the objective mentioned above, we en-
courage partitioning by giving higher reward than replicating and
punish the conict case by giving negative reward.
3.2.2 Linkage Group. e search space of operator partitioning
is so huge that even DQN requires lots of time based on the above
setup, thus we introduce an heuristic pruning technique called link-
age groups. Linkage group exists for each trainable variable, which
records the deterministic partitioning decisions for other trainable
variables caused by itself. Figure 6 illustrates the concept of link-
age group. When the partition status of one dimension has been
decided, the linkage group will be detected that whether current di-
mension with its partition exists. All the deterministic decisions of
caused by current decision should be inferred via linkage group so
that the search process can be greatly pruned to avoid unnecessary
exploration.
Due to the termination conditions of propagation mentioned
above, the linkage group does contain only parts of partitioning
decisions of other trainable variables. at is because the propaga-
tion procedure driven by one trainable variable with its decision
always stops early when no enough information is given. However,
larger linkage groups always perform beer pruning eect.
3.3 Auto Data Parallelism
Implementing data parallelism over HLO is not intuitive because
variables representing the input batch size cannot be easily identi-
ed. It is observed that the batch size dimensions will follow the
data ow throughout the entire model. As a result, most variables
expected to be inuenced when the partition happens on the batch
size dimension. With the help of propagation procedure, the vari-
ables represented training data and labels with their batch-size
dimensions can be easily detected.
More formally, the objective is to nd the partition strategy for
all input tensors, which results in the largest number of tensors
to be partitioned. Moreover, the more tensors to be partitioned
on the input tensor under the propagation rule, the closer to our
objective. In Auto-MAP, the action and reward is almost the same
4
compared to the operator partitioning task except the state. Speci-
cally, we dene state as one dimension vector which concatenates
all dimensions partition status for all input tensors.
3.4 Pipeline Parallelism Exploration by Online
Training
ere are two key issues in pipeline partitioning. One is to cut the
model into multiple stages, and the other is to place them onto a
given GPU cluster. In industry, GPU clusters are always hierarchi-
cal which has relatively higher communication bandwidth within
each node than across nodes[
27
]. In Auto-MAP, we highlight that
the exploration should be performed only on HLO. e main idea
is that the distributed plan should allocate computation resources
according to the computation ratio among all stages and the stage
that allocated with more than one devices should be replicated. Fig-
ure 7 shows the common mapping from HLO to devices. Stage 0 is
assigned with two devices with NVLink connection so that the gra-
dients reduction could achieve good performance with NCCL[
28
].
e activation between stage 0 and stage 1 are transmied via
Ethernet.
Softmax
Encoder Layer 4
Encoder Layer 3
Encoder Layer 2
Encoder Layer 1
Encoder Layer 0
Embedding
Stage 0
Stage 133.3%
66.7%
GPU 0 GPU 1
NVLink
GPU 2
Network
Network
Three devices clusterModel
Computation
Ratio
Ring AllReduce for
gradients reduction
Figure 7: Cuts mapping from HLO to devices.
State and Action
. Pipeline length is an eective way to estimate
its performance and is inuenced by the activation size across stage,
the computation time and gradients all-reduce time in each stage. In
Auto-MAP, we pre-compute these features at every possible pivot
and encode them into one vector before applying the nal cuts at
the current step. e action outputs the pivot at each step. If one
cut has been applied on HLO, the model will be further split into
two stages and we limit the next cuing point should not happened
at the previous stage.
reward
. For a pipeline model, we can calculate pipeline length
L
to estimate performance. In this case, we use
1
L
as reward for
the higher performance could be achieved when L is shorter.
3.5 Pipeline Parallelism by Online Inference
3.5.1 Motivation. For Pipeline Parallelism Planning, we also
present an alternative approach for a faster and generalizable way
of inferencing an optimal hybrid parallelism strategy. is would
allow us to train at a large generated dataset and inference on a
real-world model.
In order to nd the optimal partitioning solution that yields max-
imal performance under our pipeline parallelism strategy, we need
to 1) partition the model into dierent stages, 2) decide the replica-
tion factor for each stage, and lastly 3) map the stages to underlying
hardware devices. In the following section, we will formulate our
pipeline partitioning problem into a pure mathematical problem
whose data can be randomly generated.
3.5.2 Problem Formulation. e original problem states: Given
a/an HLO module
H
, and the number of stages
S
, nd the optimal
pipeline partition
P
that minimizes the end-to-end time of one-batch
L with pipeline parallelism.
And with our proler, we can get the per-instruction perfor-
mance data
C
, which represents the execution time on a given
proling device for each instruction in milliseconds. For communi-
cation, we use our DFG analyzer on
H
to calculate the parameter
sizes
W
later used for allreduce time calculation, and activation
communication
A
for each instruction if we partition the model at
that specic instruction.
So this problem is now equivalent to: given three arrays of
proling data of an HLO model
C
,
A
and
W
, each of length the
number of instruction in the original model
H
, nd a partition
P
that minimizes the end-to-end time
L
, which we can calculate with
our value function: L = V (P | C, A,W ).
Since the number of instructions would certainly vary between
models, and their proling data might not even be close, we pro-
ceed with a round of data normalization described in the following
section to ensure the training data has a consistent size and a
reasonably close measure. And this this problem is now a array
partitioning problem irrelevant to the input model, and the three
arrays C, A, W can be generated on large scales.
Our rst approach presented above uses DQN to search through
the solution space of
P
for proling data generated by each given
model
H
. is approach tries to train our DQN with generated data
for this abstract array partitioning problem that could apply to real
models upon inference.
3.5.3 DQN Workflow.
State and Action
. We use the three
performance metrics mentioned above (
C
,
A
and
W
), and process
the data along with device topology metrics to form the nal state
representation. e data processing will be detailed in section 4.
Reward
. Since we want to minimize the time of completing one
global batch, we use 1/L as our reward.
Training and inference
. First we use the data generation
method detailed above to generate the training dataset, which will
be then used to create a large number of environments ready to be
interacted with. During the training process, since each environ-
ment represents a distribution of performance data, we will restrict
the number of interactions with one environment to a very small
number. In practice, we set each environment to be explored and
exploited 50 times.
For testing, we used a freshly generated environment that is
not in the training set, and evaluate its performance by leing the
network inference the best partitioning solution, and assess its
performance with our value function.
5
1 0 -1 -1 -1 -1
var 1 var 2 var 3 var 4
1
partitioned
0
replicated
-1
undecided
current
Figure 8: State representation in operator partitioning task.
For real-world model inference, we do the same data pre-processing
described in section 4, and output the best network inference result.
4 IMPLEMENTATION
4.1 Overview
All distributed execution plans can be unied into the same DQN
workow. In Auto-MAP, we select RAINBOW[
23
] as the DQN
framework built on PyTorch to go parallel search for all three cate-
gory of strategies. We leverage cost model to estimate the perfor-
mance of dierent plans so that the workow can produce the best
one among all candidates.
e key issues of DQN workow for dierent scenarios are
environment, state, action and reward. We introduce our imple-
mentation of those for operator partitioning parallelism, auto data
parallelism and pipeline parallelism, respectively.
4.2 Operator Partitioning Parallelism
State and Action
. In our current implementation, the state con-
tains a decision vector and acurrent position. Figure 8 shows the
representation of decision vector. All dimensions of trainable vari-
ables are concatenated into an one dimensional vector. e 1, 0, -1
stands for partitioned, replicated and undecided status, respectively.
e information of current position is an integer which indicates
the index in the decision vector that will be decided in the next
step.
Initially, the decision vector is lled with all -1, which means no
dimension is decided. en, each dimension will be decided step
by step in one episode until encountering an propagation conict
or all dimensions status have been decided safely. Figure 9 shows
one complete episode.
e action is implemented as a binary value, which the positive
and negative represent to partition and to replicate, respectively,
and the decision result will take eect on the current position.
When all dimensions of one variable are marked with -1, it means
this variable should be replicated across all devices.
Reward
We assign +0.4 and +0.1 reward to the case of partition-
ing and replication. A -1 reward will be given as the punishment
when the conict case is encountered caused by propagation in the
entire HLO, while terminating current episode.
Figure 9: Partitioning variables in one episode.
Linkage Group
Linkage group should be extracted at the be-
ginning of the DQN training task. e extracting procedure is
displayed in gure 10.
name: shape
----------------
v0: [32, 64]
v1: [64]
v2: [128]
1. Pick one variable 'X' to partition
3. Get linkage group based on 'X' and record
multiply
V2
......
add
matmul
V1
V0 data
2. Do propagate
Figure 10: Linkage group extraction procedure.
Linkage groups are formed by propagating each variable and its
possible decision in the entire HLO. Specically, we pick only one
variable with its decision and send this pair into the propagation
module to infer other variables’ decision. Since propagation by only
one variable and its decision cannot make deterministic decisions
for every tensor, we only extract those deterministic ones. Aer
all linkage groups have bee, the decision order of every dimension
in DQN task will be sorted according to the size of linkage group
from large to small.
With linkage groups, the reward is calculated according to the
actual numbers of partitioned and replicated dimensions caused by
current step if some decisions trigger more than one dimensions to
be decided.
4.3 Auto Data Parallelism
State and Action
. e philosophy of designing the state and ac-
tion are the same with the case in operator partitioning parallelism.
Since the trainable variables, hyper-parameters and training data
with labels are all in the input list, we need to lter trainable vari-
ables and hyper-parameters out as much as possible. ere is a
heuristic that the constant tensors are denitely hyper-parameters
6
and the trainable variables are marked outside HLO, so it is not
dicult to nd all possible candidate tensors.
We construct all candidate tensors into an one dimensional vec-
tor. Moreover, the current position index is also needed. And the
action and reward are the same as we design in searching operator
partitioning parallelism plan so that the Q network is guided to
partition tensors as much as possible. is is always consistent
with the reality that the greater the number we partition, the more
intermediate tensors will be aected. Above all, the only dierence
is that we do not have any linkage groups.
Reward
. We use exactly the same reward as in operator parti-
tioning parallelism problem for guiding the Q network to partition
variables as much as possible.
4.4 Pipeline Parallelism Exploration by Online
Training
In order to reasonably simplify the placement problem which maps
from HLO-cuts to device-cuts, we treat the hierarchical computation
topology as linear model which starts from the rst device in rst
node. However, the search space is still huge and contains lots of
solutions that are unnecessary to be explored. From a practical
perspective, the solutions of beer quality always happened when
cuing on the pivots that exactly maps to the network boundaries
or their nearby. Moreover, each stage contains at least one variable
is also required in our implementation for the objective that to
balance variables loading on dierent devices. We apply these
two heuristic pruning methods to lter out some candidates pivots
before training in our implementation.
Firstly, we take the device-cuts which performs cuing on net-
work boundary as center solution. en, A threshold number is
specied as radius to represent the available range around each
device-cut in the center solution. irdly, the device-cuts are l-
tered out according to the center solution and radius. Finally, All
possible pivots which maps from the device-cuts will be le as our
candidate pivots.
State and Action
. We pre-compute three features at each stage
to encode state representation. We pre-compute the gradients all-
reduce time of entire pipeline if we cut at any pivot in HLO at cur-
rent step. is feature is very useful when there is a non-negligible
bandwidth gap between devices within one node and cross nodes.
e gradients reduction will be time-consuming when some stages
cross nodes caused by cuing on the inappropriate position. Figure
11 shows an example when cuing a deep model into four stages in
one episode and the corresponding time cost of gradients reduction
in each stage at each step.
e maximum activation transmission time is also required
among all stages if we cut at any pivot at current step. To guide the
time cost of each state towards balance, the computation balance
ratios between minimum stage and maximum stage at any pivots
are pre-computed.
Masking the unnecessary pivots is necessary when making cut-
ting decisions on HLO. Actually, there are two kinds of pivots
should be applied with mask. One is the pivot that we have ltered
out in pruning stage, the other is the pivot that in an previous stage.
In order to mask them on the output of Q network, we set their
0 1 2 last
(a) Initial
0 1 2 last
(b) After 1st Cut
0 1 2 last
(c) After 2nd Cut
0 1 2 last
(d) After last Cut
Figure 11: e change of AllReduce time cost for each stage
when cutting deep model in one episode.
0
50
100
150
200
250
Figure 12: e input C array, containing performance infor-
mation of more than 50000 instructions.
Q values to
inf
that represents the lowest expectation on that
action.
Reward
. e pipeline length
L
of a deep model can be calculated
when given the pipeline parallelism plan by our cost model. As
mentioned in 3.4, the pipeline reward is designed as
1
L
. Moreover,
the memory constraint should also be taken into consideration
because some cuing strategy may encounter out of device memory.
We give an
1
L
to punish this case.
4.5 Pipeline Parallelism Exploration by Online
Inference
We rst describe data processing procedure, then introduce the
DQN workow.
4.5.1 Data Processing.
Data Coarsening & Normalization Given a real-world model
H
, we normalize the data into the same scale and size as
data generated in the next section. is process is done
in three steps: 1) building prex sum, 2) coarsening array,
and 3) normalization into [0, 1].
Step 1: Prex Sum From proling and DFG analysis
on
H
, we can get the proling data
C
,
A
, and
W
. We
rst build the prex sum array for computational data
7
0
10000
20000
30000
40000
Figure 13: e input C array, aer building prex sum.
0
100
200
300
400
500
Figure 14: e C prex sum, coarsened to granularity 128.
C
and parameter size
W
:
C
0
= pre f ix s um(C)
,
W
0
=
pre f ix sum(W ).
e two updated array
C
0
and
W
0
accessed at index
i
now represents the computational time / AllReduce
size of the rst
i
instructions.
A
array is le untouched
because it does not make sense to sum up all the
cross-stage communication before a specic instruc-
tion.
A[i]
still represents the estimated cross-stage
communication if the model was cut at instruction i.
Step 2: Coarsening In order to adapt to models of
dierent sizes, we need to scale the proling data to
a xed number, which we empirically set to 128. For
the above three arrays, we evenly take 128 points to
form the new arrays: C
00
, A
00
,W
00
.
We can do this to
C
0
and
W
0
because they are already
in prex sum form, and
A
also because the cross-stage
communication is specic to each instruction. Aer
the coarsening, we lost the possibility to partition into
instructions that are not in those 128 points, but the
problem is now irrelevant to the input model size.
Step 3: Normalization Since we want to generalize
across dierent models, and to generate large number
of randomized data, some form of normalization is
needed to keep all the data under a similar scale. In
practice, we scale the three arrays simultaneously to
[0, 1]:
MAX = max(C
00
, A
00
,W
00
)
C
= C
00
/MAX
0.00
0.25
0.50
0.75
1.00
Figure 15: e C 128-length prex sum, renormalized to [0,
1].
A
= A
00
/MAX
W
= W
00
/MAX
Aer this step, regardless of what the original model
is, the resulting arrays
C
,
W
,
A
each has length 128,
and the elements are all within
[
0
,
1
]
. ese three ar-
rays essentially describe
the distribution of compu-
tational times, activation sizes, and parameters
throughout the model in the time dimension.
Data Generation To complement our existing model data-
base, we choose to generate random data of dierent dis-
tributions that satises the requirement presented above.
During the data generation, we use random number gen-
erator with optional distribution parameter (e.g. uniform,
normal, binomial) to generate three arrays of oat numbers
ranging from 0 to 1, and then do the same transformation
described above: build the prex sum, coarsening and nor-
malizing the array to get the generated C
, A
,W
.
In the actual training process, we will need to generate
hundreds of thousands of these array groups to construct
the training set and test set.
4.5.2 DQN workflow.
State
. For our state representation, we have the following data
fed into the network:
Computational Times C
Activation Sizes A
AllReduce Sizes W
Device Topology (square matrix describing the intercon-
nect speed between any to device)
Intermediate Partition
All of them are resized to one-dimension tensor, scaled to
[
0
,
1
]
and concatenated into one single array to form the state.
Action
. For this approach, we consider both HLO partitioning
and device assignment as actions, and they share the same action
space.
Reward.
1
L
with
L
being the end-to-end training time for one batch, so that
maximizing the reward means minimizing the end-to-end training
time. is is the same as we used in 4.4.
8
Table 1: Benchmark models for each experiments.
Task Model Params
Language Model BERT-48[29] 640M
Machine Translation
T5-base[12]
T5-3B[12]
T5-11B[12]
51M
3B
11B
Image Classication VGG-19[30] 137M
Table 2: Simulated hardware congurations.
Cong Servers
GPU(s) per
server(N
s
)
Intra-server
connnections
Inter-server
connections
A 2 8x V100 NVLink 25 Gbps
B 3 8x V100 NVLink 25 Gbps
C 4 8x V100 NVLink 25 Gbps
5 EXPERIMENTS
5.1 Experimental Setup
Benchmarks. We evaluate workloads for each distributed exe-
cution plan. Table 1 summarizes all the ve representative DNN
models that we use as benchmarks in this section.
HLO of workload. We feed HLO Json les and trainable variable
list of each workload as inputs into Auto-MAP framework. Another
HLO text le is also provided for debugging in our experiments.
Simulated Hardware Congurations. Table 2 summarizes three
hardware environments in our experiments. In our observation,
the resources of 4 servers with 8 cards each are enough for training
tasks. erefore, we will give our execution plans with less than 4
servers.
Hyper-parameters. We xed the training batch size to 64 and
use the Adam optimizer[
31
] with dierent initial learning rate to
optimize dierent exploration tasks. For pipeline tasks, the initial
learning rate would be set to 0.001. But for operator partitioning
and auto data parallelism tasks, we set a smaller learning rate to
0.0005.
As for the specic hyper-parameters in DQN, we xed the
γ
with 0.6 to all training tasks. We decay the exploration coecient
ϵ
from 1.0 to 0.1 for all tasks, but the decay speed is totally dierent
with respect to task type, which decay to the minimum aer 2000,
500 and 10000 iterations for operator partitioning parallelism, auto
data parallelism and pipeline parallelism, respectively.
Some general tricks for improving DQN convergence are also
integrated in our training tasks. Specically, we select the priori-
tized replay buer[
32
] and double DQN[
33
] in rainbow and xed
the alpha and beta to 0.2 and 0.6 respectively. e frequency for
updating target network is set to 100 and the replay buer size is
xed to 2000 in all training tasks.
Table 3: T5 family partition results for each variable. We
use -1 to represent to replicate the variable, and the positive
number means the partition index of variable.
Block or Layer Variable Partition Strategy
Self-aention {q=1, k=1, v=1, o=0}
MLP
{conv1/kernel=1, conv1/bias=0,
conv2/kernel=0, conv2/bias=-1}
Embedding { embedding weights=0 }
Layer normalization {scale=-1, bias=-1}
Table 4: VGG-19 partition results for each variable . We use -
1 to represent to replicate the variable, and the positive num-
ber means the partition index of variable.
Block or Layer Variable Partition Strategy
Conv layers -1 for all conv layers
FC Layer
{fc1/kernel=1, fc1/bias=0,
fc2/kernel=0, fc2/bias=0}
Somax Layer predictions/kernel=1, predictions/bias=0
Table 5: e performance for searching OPP with cong B
and C. PC is short for partitioning count.
Model PC target
PC in
stage 1
1st stage
time cost
PC in
stage 2
stage 2
time cost
VGG-19 38 5 30s - -
T5-base 111 111 0.5h - -
T5-3B 432 397 0.74h 432 0.2h
T5-11B 432 386 1h 432 0.45h
5.2 Evaluation Results and Analysis
5.2.1 Operator Partitioning Parallelism. ere are already some
partitioning strategies for transformer models[
3
][
4
]. It features to
partition each aention block and the following MLP layer and
all embedding variables while replicating other trainable variables,
which is the same as the objective of Auto-MAP. For VGG-19, the
eective way is to partition the last MLP block when given an hi-
erarchical hardware conguration like Cong B or Cong C[
34
].
Table 3 and Table 4 show our partitioning strategy of trainable
variables for T5 family and VGG-19, where
1 means replication
and the number greater than 0 represents the index of partitioned
dimension. ese partition strategies are consistent with our expec-
tation. We have already known the ground truth of these workloads
so that the quality of strategies could be measured in our exper-
iments. We count the variables that should be partitioned as the
target for each workload and to observe time cost to approach it. It
is should be noted that some workloads need a netuning stage to
explore solutions of beer quality.
We give the convergence for exploring operator partitioning
parallelism on T5-base in gure 16. T5-base has 314 dimensions to
be decided in total and 111 of them need to be partitioned according
9
-1.5
-1
-0.5
0
0.5
1
1.5
2
2.5
3
3.5
0
50
100
150
200
250
300
350
0 100 200 300 400 500 600 700 800
Scores
Progress
Episode
progress for T5-base
tot al scores for T5-bas e
Figure 16: e convergence of T5-base in operator partition-
ing parallelism exploration task.
to the ground truth. With the help of linkage groups, DQN learns
to avoid making conicting decision quickly. It reaches the peak
propagation progress and behaves more stable with higher scores
as the time grows.
Table 5 shows our searching performance on all benchmark
models. We pay aention to the time cost to partition all variables
which is required to be partitioned. We divide the search process
into two stages. e rst stage will search from scratch and may
converge into a local minimum, while the second stage is to netune
from that result. Some workloads like VGG-19 and T5-base may not
need netuning stage mainly because the state space is relatively
smaller than others, so it is easy to nd the partition strategy as
quickly as possible in rst stage. However, some workloads like
T5-3B and T5-11B with more trainable variables should involve
with a netuning stage. Specically, when the partition strategy is
stable in rst stage, the program will stop current training phase
and backtrace some variables which are marked with replication
according to the linkage groups and start a netuning stage. We
found that even with the complicated case like T5-11B, the expected
strategy could be found in two hours.
VGG-19
As shown in Table 4, the solution that OPP algorithm
found is to replicate VGG-19’s convolution layers while to partition
the fully connected layers. is approach makes sense that for
VGG-19 the last two FC layers occupy 86% of the total parameters
while the corresponding calculation time only accounts for 5%. For
such FC layer we prefer partitioning to replication for reducing
gradients communication overhead in synchronous training. is
desired distribution strategy as described above occurs in 30s (Ta-
ble 5) while our DQN scores keep oscillating slowly and cannot
converge quickly. One reasonable explanation is that our reward
func encourages spliing more variables while for VGG-19 is not
the case as explained above. is implies that we need a more
general reward function for models with dierent calculations and
parameter distributions.
T5-base
. e nal solution is to split 111 variables and the
partitioning results is the same with table 3. It is observed that
our T5-base takes 0.5 hour to nd the expected solution without a
netuning stage.
Table 6: e experiment result of auto data parallelism. We
use -1 to represent to replicate the variable, and the positive
number means the partition index of variable.
Model Candidate count Partition results Time cost
T5-base 10
{arg0.1=0, arg1.2=0,
arg2.3=0, arg3.4=0,
arg12.13=0,
arg17.18=0
arg22.23=0}
0.27h
VGG-19 4
{arg0.1=0, arg1.2=0}
70s
T5-3B and T5-11B
. 3B and 11B has the same layers and vari-
ables counts but the variables size and the propagation time cost.
e expected variables to be partitioned are 432 and the netuning
stages are required, which take 0.94 and 1.45 hour for 3B and 11B,
respectively.
We infer that the DQN searching behaves beer than enumera-
tion. For example, T5-base has 188 trainable variables with at most
two dimensions each, leads to a 376 binary vector which contains
2
336
solutions in total and T5-3B and T5-11B contains 2
1116
solu-
tions to search. It is impossible for searching the expected solution
within a limited time, while the DQN method could reach within 2
hours.
5.2.2 Auto Data Parallelism. We rst lter out all trainable vari-
ables and constant tensors in input list in HLO IR to nd the candi-
date tensors that possible to be training data.
T5-3B and T5-11B are not available for data parallelism for the
memory issue. T5-3B needs at least two devices to load balance its
variables and T5-11B consumes more devices, thus we display the
results of T5-base in this part. Table 6 shows the results of auto
data parallelism and all the tensor names in the table can be found
in HLO text le.
VGG-19.
ere are only 4 candidate tensors (with at most 4
dimensions each) need to be partitioned for VGG-19 as shown in
Table 6. Our ADP algorithm can converge steadily in 70s to the
rst dimension of two tensors (namely arg0.1 and arg0.2). Aer
manual verication, it is found that these two tensors are exactly
the two inputs of the model: labels and features tensor respectively,
and their rst dimensions are exactly the batch size dimension in
the traditional sense.
T5-base.
In our observation, we found that this procedure could
be nished in half an hour. e search space is much less than
in the operator partitioning problem. Specically, there are 10
candidates with at most 4 dimensions each, leads to 2
40
solutions.
e DQN found the exact ground truth within 0.27 hour, while the
enumeration would behave worse not only for the relative large
solution space but also aected by the propagation time cost.
As the results shown in 6, there are 7 tensors need to be parti-
tioned in total and all of them choose to partition the rst dimension,
which is consistent with our intuition. In machine learning training
task, we feed them with some sequence and other format data. e
batch dimension is always at the rst rank for each tensor.
10
0
100
200
300
400
500
600
0
20
40
60
80
100
120
0 500 1000 1500 2000 2500
Scores
Loss
Episodes
loss for T5-base
tot al scor es for T5-bas e
moving average of total scores
Figure 17: e convergence of T5-base in pipeline paral-
lelism by online training task.
5.2.3 Pipeline Parallelism Exploration by Online Training. We
xed all micro-batch sizes with 16 in all experiments. en we do
strategy search on Cong A, B and C respectively. e number
of stages to cut is depend on the number of servers under each
hardware conguration. Both the strategy produced by online
training and inference will be displayed.
In online training experiments, we set the center solution which
performs the device-cuts on network boundary and set the radius
to 3. Table 7 shows the online training experiments for searching
pipeline parallelism. In order to make the experiments results
human readable, we report not only the pivots cut on HLOs but
also the corresponding layers nearby. Since each instruction in
HLO produces a new tensor named with a % prex, we display that
tensor to indicates our HLO pivots. e device-cuts is displayed
with an array which is lled with the cuing index of device. To
cut the network boundary in the hierarchical topology hardware
conguration like table 2, the index should be the multiple of 8
because there are 8 cards within one server.
We address that the time cost of DQN method is far beer than
enumeration, especially when we increase the stage number. at
is mainly because each HLO contains at least thousands of instruc-
tions. Although we have ltered out some unexpected pivots and
get a more concise candidates set, the search space is still large
which costs more time by enumeration.
We take the convergence of exploring pipeline parallelism by
online training on T5-base as an example to show the training
procedure with DQN. In gure 17, we e total scores is smoothed
by applying moving average in order to show its trend. e gure
shows that the loss drops very fast at the beginning and the trend
of the total scores rises overall although the jier is large.
Bert-48 and T5-3B
. e two models are very similar from the
results. All strategies on Cong A, B and C proved that the cut-
ting should be happened on the network boundaries, which are
consistent with our expectation. Moreover, the pivots mapping
to corresponding layer lead to almost uniform stages so that the
computation on each stage are balanced. It takes about no more
than 5 minutes to nd them all.
T5-base
. e strategies on Cong A and B are similar with the
case of Bert-48 and T5-3B. e time cost to converge is no more
than 4 minutes. e strategy on Cong C is dierent for the last
cut happens on the 22th device, which is a NVLink boundary. is
is because the constraint that each stage contains one trainable
variable at least. T5-model is too small for cuing 4 stages that the
last cut should not happen beyond the 22th index.
T5-11B
. is model is huge enough so that it will cause OOM
if it is cut less than 4 stages. erefore, the DQN cannot nd even
one available strategy on Cong A and Cong B. For Cong C, the
result is consistent with our expectation for cuing on the network
boundaries of device topology. e time cost order of magnitude is
the same with other models.
5.2.4 Pipeline Parallelism Exploration by Online Inference. Here
we also present the results of our online inference approach. We
trained our for 10
7
episodes of environments constructed by random
number generated with uniform and normal distribution. Our
model is able to output the best hybrid parallelism solution for
the NLP family models like BERT and Transformer-11B. For the
CNN family, we need to netune the model with the corresponding
distribution for those models for another 10
4
episodes before the
model could correctly inference the best pipeline partitioning.
e detailed parallelism plan is presented in Table 8.
6 RELATED WORKS
Large DNN models are increasingly computational intensive and
seriously consumption on device memory. It is a common practice
to parallelize training by leveraging multiple GPUs[
5
,
35
]. Data par-
allelism, operator partitioning parallelism and pipeline parallelism
are common approaches for distributed training of DNN models.
Auto Data Parallelism
. ere are some high level frameworks
aim at reducing the burden of users to automatically parallelizeing
deep models using data parallelism[36].
Operator Partitioning Parallelism
. For NLP models with at-
tention blocks, some heuristic operator partitioning approaches[
3
,
4
] have already been proposed in recent years. For some convolu-
tional networks like VGG-19 and AlexNet, it is a common practice
to partition the last linear layers[5, 34].
Some prior works and studies[
5
,
13
] focus on nding optimal
distribution strategies over DNN layers.
Pipeline Parallelism
. [
9
,
11
,
37
39
] has been proposed to train
DNN by pipelining DNN models. GPipe[
9
] explores synchronous
pipeline approach to train large models while PipeDream[
11
] ex-
plores the hybrid approach of data and pipeline parallelism for
asynchronous training. e RL approach has been proposed to nd
optimal placement strategy for a given DNN[40].
Rainbow DQN.
Reinforcement learning (RL) is a general frame-
work where agents learn to perform actions in an environment
so as to maximize a reward. DQN[
41
] is a RL algorithm that com-
bines Q-Learning with deep neural networks to let RL work for
complex, high-dimensional environments, like video games, or ro-
botics. Double DQN[
33
], Dueling DQN[
24
], Noisy DQN[
25
] and
DQN with Prioritized Experience Replay[
32
] are these four impor-
tant supplements which each of them handle a dierent aspect of
an agent. Rainbow DQN[
23
] is an o-policy deep reinforcement
learning algorithm that is the state-of-the-art technique in the eld
of reinforcement learning.
11
Table 7: e experiment result for searching pipeline parallelism by online training.
Model Cong Pivots on HLO Corresponding Layer Nearby Device cuts Time cost
Bert-48
A
B
C
(%fusion.4004)
(%dot.17952, %fusion.3779)
(%dot.16644, %dot.22128, %dot.27627)
(layer23)
(layer17, layer32)
(layer14, layer27, layer40)
(8)
(8, 16)
(8, 16, 24)
95s
157s
262s
T5-base
A
B
C
(%fusion.1039)
(%reshape.6314, %transpose.12159)
(%reshape.5622, %fusion.1015, %fusion.909)
(dec/layer2)
(enc/layer5, dec/layer5)
(enc/layer4/conv1,
dec/layer3, somax)
(8)
(8, 16)
(8, 16, 22)
42s
50s
198s
T5-3B
A
B
C
(%dot.2204)
(%reshape.13395, %multiply.15030)
(%reshape.13379,
%multiply.15009, %multiply.15049)
(dec/layer2)
(enc/layer19, dec/layer11)
(enc/layer14,
dec/layer5, dec/layer17)
(8)
(8, 16)
(8, 16, 24)
115s
202s
262s
T5-11B
A
B
C
-
-
(%fusion.4585, %fusion.4241, %dot.40105)
-
-
(enc/layer13, dec/layer3, dec/layer15)
-
-
(8, 16, 24)
-
-
280s
Table 8: e experiment result for searching pipeline parallelism by online inference.
Model Cong Partition Boundary Corresponding Layer Nearby Device cuts
BERT-48
C (34, 66, 98), granularity=128 (layer14, layer27, layer40) (8, 16, 24)
T5-base
C (34, 66, 98), granularity=128 (enc/layer4/conv1, dec/layer3, somax) (8, 16, 24)
T5-11B
C (34, 66, 98), granularity=128 (enc/layer13, dec/layer3, dec/layer15) (8, 16, 24)
7 CONCLUSION
7.1 Summary
We introduce Auto-MAP, a framework for exploring distribution
strategies based on model architectures, which works on HLO
IR and automatically discovers fast parallelization strategies with
optimized DQN algorithm. Data parallelism, operation partition-
ing parallelism and pipelined parallelism are all included in the
exploration space. We leverage DQN with task-specic pruning
strategies to help eciently explore the search space including
optimized strategies. Auto-MAP fully simplies the users burden
in the selection and implementation of distribution strategies. Our
experiments show that Auto-MAP can nd the optimal solution
within two hours while achieving beer throughput on several NLP
and convolution models.
7.2 Future Work
Combination of HLO IR and DQN algorithm show convincing con-
vergence results and performance. ere are still some interesting
works to follow. First of all, replacing discrete DQN states with con-
tinues one for operation partitioning task for beer interpretation
and convergence. Secondly, currently our Auto-MAP framework
can only give a single parallelization strategy automatically (i.e.,
DP, PP, operation partitioning), which may result in sub-optimal
runtime performance in large-scale distributed training. In the
future we will support exploring hybrid of these three strategies
automatically. Auto-MAP is open-source and will be made available
to the public.
REFERENCES
[1]
D. H. D. Amodei, AI-and-compute, 2019, hps://openai.com/blog/
ai-and-compute/.
[2]
S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimization
towards training a trillion parameter models, arXiv preprint arXiv:1910.02054,
2019.
[3]
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro,
“Megatron-lm: Training multi-billion parameter language models using gpu
model parallelism, arXiv preprint arXiv:1909.08053, 2019.
[4]
N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool,
P. Hawkins, H. Lee, M. Hong, C. Young et al., “Mesh-tensorow: Deep learning
for supercomputers, in Advances in Neural Information Processing Systems, 2018,
pp. 10 414–10 423.
[5]
Z. Jia, S. Lin, C. R. Qi, and A. Aiken, “Exploring the hidden dimension in acceler-
ating convolutional neural networks, 2018.
[6]
J. Geng, D. Li, and S. Wang, “Horizontal or vertical? a hybrid approach to
large-scale distributed machine learning, in Proceedings of the 10th Workshop on
Scientic Cloud Computing, 2019, pp. 1–4.
[7]
N. Dryden, N. Maruyama, T. Moon, T. Benson, M. Snir, and B. Van Essen, “Chan-
nel and lter parallelism for large-scale cnn training, in Proceedings of the
International Conference for High Performance Computing, Networking, Storage
and Analysis, 2019, pp. 1–20.
[8]
D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer,
and Z. Chen, “Gshard: Scaling giant models with conditional computation and
automatic sharding, arXiv preprint arXiv:2006.16668, 2020.
[9]
Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam,
Q. V. Le, Y. Wu et al., “Gpipe: Ecient training of giant neural networks using
pipeline parallelism, in Advances in neural information processing systems, 2019,
pp. 103–112.
[10]
D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R.
Ganger, P. B. Gibbons, and M. Zaharia, “Pipedream: generalized pipeline paral-
lelism for dnn training, in Proceedings of the 27th ACM Symposium on Operating
Systems Principles, 2019, pp. 1–15.
[11]
A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger,
and P. Gibbons, “Pipedream: Fast and ecient pipeline parallel dnn training,
arXiv preprint arXiv:1806.03377, 2018.
[12]
C. Rael, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li,
and P. J. Liu, “Exploring the limits of transfer learning with a unied text-to-text
transformer, arXiv preprint arXiv:1910.10683, 2019.
12
[13]
Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep
neural networks, arXiv preprint arXiv:1807.05358, 2018.
[14]
A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar,
M. Norouzi, S. Bengio, and J. Dean, “Device placement optimization with rein-
forcement learning, arXiv preprint arXiv:1706.04972, 2017.
[15]
J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and
S. Wanderman-Milne, “JAX: composable transformations of Python+NumPy
programs, 2018. [Online]. Available: hp://github.com/google/jax
[16]
T. T. authors, Trax Deep Learning with Clear Code and Speed, 2020, hps:
//github.com/google/trax.
[17]
XLA: Optimizing Compiler for Machine LearningOperation Semantics, 2019,
hps://www.tensorow.org/xla/operation semantics.
[18]
S. Muchnick et al., Advanced compiler design implementation. Morgan kaufmann,
1997.
[19]
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Des-
maison, L. Antiga, and A. Lerer, “Automatic dierentiation in pytorch, 2017.
[20]
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
M. Riedmiller, “Playing atari with deep reinforcement learning, arXiv preprint
arXiv:1312.5602, 2013.
[21]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly
learning to align and translate, arXiv preprint arXiv:1409.0473, 2014.
[22]
S. Fan, Y. Rong, C. Meng, Z. Cao, S. Wang, Z. Zheng, C. Wu, G. Long, J. Yang,
L. Xia et al., “Dapple: A pipelined data parallel approach for training large models,
arXiv preprint arXiv:2007.01045, 2020.
[23]
M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Hor-
gan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in
deep reinforcement learning, in irty-Second AAAI Conference on Articial
Intelligence, 2018.
[24]
Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Duel-
ing network architectures for deep reinforcement learning, in International
conference on machine learning, 2016, pp. 1995–2003.
[25]
M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih,
R. Munos, D. Hassabis, O. Pietquin et al., “Noisy networks for exploration, arXiv
preprint arXiv:1706.10295, 2017.
[26]
T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang,
and Z. Zhang, “Mxnet: A exible and ecient machine learning library for
heterogeneous distributed systems, arXiv preprint arXiv:1512.01274, 2015.
[27] NVDIA DGX-1, 2019, hps://www.nvidia.com/en-us/data-center/dgx-1/.
[28] NCCL, 2019, hps://developer.nvidia.com/nccl.
[29]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of
deep bidirectional transformers for language understanding, arXiv preprint
arXiv:1810.04805, 2018.
[30]
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[31]
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization, arXiv
preprint arXiv:1412.6980, 2014.
[32]
T. Schaul, J. an, I. Antonoglou, and D. Silver, “Prioritized experience replay,
arXiv preprint arXiv:1511.05952, 2015.
[33]
H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double
q-learning, in irtieth AAAI conference on articial intelligence, 2016.
[34]
A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,
arXiv preprint arXiv:1404.5997, 2014.
[35]
S. Pal, E. Ebrahimi, A. Zulqar, Y. Fu, V. Zhang, S. Migacz, D. Nellans, and
P. Gupta, “Optimizing multi-gpu parallelization strategies for deep learning
training, IEEE Micro, vol. 39, no. 5, pp. 91–101, 2019.
[36]
H.-T. Cheng, Z. Haque, L. Hong, M. Ispir, C. Mewald, I. Polosukhin, G. Roumpos,
D. Sculley, J. Smith, D. Soergel et al., “Tensorow estimators: Managing simplicity
vs. exibility in high-level machine learning frameworks, in Proce edings of the
23rd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 2017, pp. 1763–1771.
[37]
J. Zhan and J. Zhang, “Pipe-torch: Pipeline-based distributed deep learning in
a gpu cluster with heterogeneous networking, in 2019 Seventh International
Conference on Advanced Cloud and Big Data (CBD). IEEE, 2019, pp. 55–60.
[38]
J. Geng, D. Li, and S. Wang, “Elasticpipe: An ecient and dynamic model-parallel
solution to dnn training, in Proceedings of the 10th Workshop on Scientic Cloud
Computing, 2019, pp. 5–9.
[39]
B. Yang, J. Zhang, J. Li, C. R
´
e, C. R. Aberger, and C. De Sa, “Pipemare: Asynchro-
nous pipeline parallel dnn training, arXiv preprint arXiv:1910.05124, 2019.
[40]
A. Goldie and A. Mirhoseini, “Placement optimization with deep reinforcement
learning, in Proceedings of the 2020 International Symposium on Physical Design,
2020, pp. 3–7.
[41]
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,
A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level
control through deep reinforcement learning, nature, vol. 518, no. 7540, pp.
529–533, 2015.
13