in neural information processing systems. 1223–1231.
[15] Julien Demouth. 2015. CUDA Pro Tip: Minimize the Tail Effect. https://devblogs.nvidia.com/cuda-pro-tip-minimize-the-tail-effect/
[16] Nikoli Dryden, Naoya Maruyama, Tim Moon, Tom Benson, Marc Snir, and Brian Van Essen. 2019. Channel and filter parallelism for large-scale CNN training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–20.
[17] Jinkun Geng, Dan Li, and Shuai Wang. 2019. Elasticpipe: An efficient and dynamic model-parallel solution to DNN training. In Proceedings of the 10th Workshop on Scientific Cloud Computing. ACM, 5–9.
[18] Jinkun Geng, Dan Li, and Shuai Wang. 2019. Horizontal or Vertical?: A Hybrid Approach to Large-Scale Distributed Machine Learning. In Proceedings of the 10th Workshop on Scientific Cloud Computing. ACM, 1–4.
[19] Jinkun Geng, Dan Li, and Shuai Wang. 2019. Rima: An RDMA-Accelerated Model-Parallelized Solution to Large-Scale Matrix Factorization. In 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 100–111.
[20] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).
[21] Juncheng Gu, Mosharaf Chowdhury, Kang G Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU cluster manager for distributed deep learning. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 485–500.
[22] Lei Guan, Wotao Yin, Dongsheng Li, and Xicheng Lu. 2019. XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training. arXiv preprint arXiv:1911.04610 (2019).
[23] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377 (2018).
[24] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems. 103–112.
[25] Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. 2018. Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574 (2018).
[26] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. 2020. Checkmate: Breaking the memory wall with optimal tensor rematerialization. Proceedings of Machine Learning and Systems 2 (2020), 497–511.
[27] Anand Jayarajan, Jinliang Wei, Garth Gibson, Alexandra Fedorova, and Gennady Pekhimenko. 2019. Priority-based parameter propagation for distributed DNN training. arXiv preprint arXiv:1905.03960 (2019).
[28] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. 2018. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205 (2018).
[29] Zhihao Jia, Sina Lin, Charles R Qi, and Alex Aiken. 2018. Exploring the Hidden Dimension in Accelerating Convolutional Neural Networks. (2018).
[30] Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358 (2018).
[31] Chiheon Kim, Heungsub Lee, Myungryong Jeong, Woonhyuk Baek, Boogeon Yoon, Ildoo Kim, Sungbin Lim, and Sungwoong Kim. 2020. torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models. arXiv preprint arXiv:2004.09910 (2020).
[32] Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997 (2014).
[33] Tung D Le, Taro Sekiyama, Yasushi Negishi, Haruki Imai, and Kiyokuni Kawachiya. 2018. Involving CPUs into Multi-GPU Deep Learning. In Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering. ACM, 56–67.
[34] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020).
[35] Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. 2017. Device placement optimization with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2430–2439.
[36] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. ACM, 1–15.
[37] Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2020. Memory-Efficient Pipeline-Parallel DNN Training. arXiv preprint arXiv:2006.09503 (2020).
[38] Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar, Yaosheng Fu, Victor Zhang, Szymon Migacz, David Nellans, and Puneet Gupta. 2019. Optimizing multi-GPU parallelization strategies for deep learning training. IEEE Micro 39, 5 (2019), 91–101.
[39] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv preprint arXiv:1910.10683 (2019).
[40] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD. arXiv preprint arXiv:1806.03822 (2018).
[41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
[42] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891 (2016).
[43] Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018).
[44] Rich Sutton. 2019. The Bitter Lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html.
[45] Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 839–848.
[46] Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019. 1–17.
[47] Mengdi Wang, Chen Meng, Guoping Long, Chuan Wu, Jun Yang, Wei Lin, and Yangqing Jia. 2019. Characterizing Deep Learning Training Workloads on Alibaba-PAI. arXiv preprint arXiv:1910.05930 (2019).
[48] Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng, LanSong Diao, Guoping Long, Jun Yang, Xiaoyong Liu, and Wei Lin. 2020. Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads. arXiv preprint arXiv:2007.04069 (2020).