[13]
Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep
neural networks,” arXiv preprint arXiv:1807.05358, 2018.
[14]
A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar,
M. Norouzi, S. Bengio, and J. Dean, “Device placement optimization with rein-
forcement learning,” arXiv preprint arXiv:1706.04972, 2017.
[15]
J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and
S. Wanderman-Milne, “JAX: composable transformations of Python+NumPy
programs,” 2018. [Online]. Available: hp://github.com/google/jax
[16]
T. T. authors, Trax Deep Learning with Clear Code and Speed, 2020, hps:
//github.com/google/trax.
[17]
XLA: Optimizing Compiler for Machine LearningOperation Semantics, 2019,
hps://www.tensorow.org/xla/operation semantics.
[18]
S. Muchnick et al., Advanced compiler design implementation. Morgan kaufmann,
1997.
[19]
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Des-
maison, L. Antiga, and A. Lerer, “Automatic dierentiation in pytorch,” 2017.
[20]
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint
arXiv:1312.5602, 2013.
[21]
D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly
learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[22]
S. Fan, Y. Rong, C. Meng, Z. Cao, S. Wang, Z. Zheng, C. Wu, G. Long, J. Yang,
L. Xia et al., “Dapple: A pipelined data parallel approach for training large models,”
arXiv preprint arXiv:2007.01045, 2020.
[23]
M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Hor-
gan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in
deep reinforcement learning,” in irty-Second AAAI Conference on Articial
Intelligence, 2018.
[24]
Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Duel-
ing network architectures for deep reinforcement learning,” in International
conference on machine learning, 2016, pp. 1995–2003.
[25]
M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih,
R. Munos, D. Hassabis, O. Pietquin et al., “Noisy networks for exploration,” arXiv
preprint arXiv:1706.10295, 2017.
[26]
T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang,
and Z. Zhang, “Mxnet: A exible and ecient machine learning library for
heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
[27] NVDIA DGX-1, 2019, hps://www.nvidia.com/en-us/data-center/dgx-1/.
[28] NCCL, 2019, hps://developer.nvidia.com/nccl.
[29]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of
deep bidirectional transformers for language understanding,” arXiv preprint
arXiv:1810.04805, 2018.
[30]
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[31]
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[32]
T. Schaul, J. an, I. Antonoglou, and D. Silver, “Prioritized experience replay,”
arXiv preprint arXiv:1511.05952, 2015.
[33]
H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double
q-learning,” in irtieth AAAI conference on articial intelligence, 2016.
[34]
A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,”
arXiv preprint arXiv:1404.5997, 2014.
[35]
S. Pal, E. Ebrahimi, A. Zulqar, Y. Fu, V. Zhang, S. Migacz, D. Nellans, and
P. Gupta, “Optimizing multi-gpu parallelization strategies for deep learning
training,” IEEE Micro, vol. 39, no. 5, pp. 91–101, 2019.
[36]
H.-T. Cheng, Z. Haque, L. Hong, M. Ispir, C. Mewald, I. Polosukhin, G. Roumpos,
D. Sculley, J. Smith, D. Soergel et al., “Tensorow estimators: Managing simplicity
vs. exibility in high-level machine learning frameworks,” in Proce edings of the
23rd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 2017, pp. 1763–1771.
[37]
J. Zhan and J. Zhang, “Pipe-torch: Pipeline-based distributed deep learning in
a gpu cluster with heterogeneous networking,” in 2019 Seventh International
Conference on Advanced Cloud and Big Data (CBD). IEEE, 2019, pp. 55–60.
[38]
J. Geng, D. Li, and S. Wang, “Elasticpipe: An ecient and dynamic model-parallel
solution to dnn training,” in Proceedings of the 10th Workshop on Scientic Cloud
Computing, 2019, pp. 5–9.
[39]
B. Yang, J. Zhang, J. Li, C. R
´
e, C. R. Aberger, and C. De Sa, “Pipemare: Asynchro-
nous pipeline parallel dnn training,” arXiv preprint arXiv:1910.05124, 2019.
[40]
A. Goldie and A. Mirhoseini, “Placement optimization with deep reinforcement
learning,” in Proceedings of the 2020 International Symposium on Physical Design,
2020, pp. 3–7.
[41]
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,
A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level
control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp.
529–533, 2015.
13