In-depth analyses of unified virtual memory system for GPU accelerated computing | Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

research-article

Public Access


Authors: Tyler Allen and Rong Ge

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2021

Article No.: 64, Pages 1 - 15

Published: 13 November 2021

Related Artifact: Implementation of the article "In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated Computing", November 2021. Software: https://doi.org/10.5281/zenodo.5148930

Metrics

Total Citations: 22
Total Downloads: 2,270
Downloads (Last 12 Months): 1,151
Downloads (Last 6 Weeks): 125


    Abstract

    The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for the ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is presently the primary real-world implementation of such an abstraction, and it offers a functionally equivalent testbed for a novel in-depth performance study of both UVM and future Linux Heterogeneous Memory Management (HMM)-compatible systems. The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on a UVM-based system and investigate the root causes of UVM overhead, a non-trivial task due to the complex interactions of multiple hardware and software constituents and the need for a targeted analysis methodology.

    In this paper, we take a deep dive into the UVM system architecture and the internal behaviors of page fault generation and servicing. We reveal specific GPU hardware limitations using targeted benchmarks, and characterize the driver as a real-time system processing the resulting fault workload. We further provide a quantitative evaluation of fault handling for various applications under different scenarios, including prefetching and oversubscription. We find that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. We determine that the cost of host OS components is significant and present across implementations, warranting close attention. This study serves as a proxy for future shared memory systems, such as those that interface with HMM.
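    For context, the UVM mechanism the paper analyzes is the one exercised by CUDA managed memory: first-touch accesses on the GPU raise page faults that the driver services by migrating pages, while an explicit prefetch moves the pages up front and bypasses that fault-handling path. The following is a minimal sketch of the two modes using the standard CUDA runtime API; the kernel, names, and sizes are illustrative, not taken from the paper's benchmarks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel that touches every page of the managed allocation.
__global__ void scale(float *x, size_t n, float a) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 24;  // 64 MiB of floats
    float *x;
    // One allocation addressable from both CPU and GPU; pages migrate
    // between memory domains on demand (the fault path studied here).
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; i++) x[i] = 1.0f;  // pages resident on host

    int dev;
    cudaGetDevice(&dev);
    // Optional: prefetch the range to the GPU before the kernel runs,
    // avoiding per-page fault servicing on first touch.
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0f);
    cudaDeviceSynchronize();  // after this, the host may read x again
    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```

    Commenting out the `cudaMemPrefetchAsync` call forces pure demand paging, which is the scenario whose driver-side cost the paper dissects; oversubscription arises when `n * sizeof(float)` exceeds device memory and the driver must also evict pages.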

    Supplementary Material

    MP4 File (In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated Computing.mp4)

    Presentation video

    • Download
    • 218.89 MB



    Index Terms

    • Computer systems organization
    • General and reference → Cross-computing tools and techniques → Performance
    • Hardware → Integrated circuits → Semiconductor memory → Dynamic memory
    • Software and its engineering → Software organization and properties → Contextual software domains

    Index terms have been assigned to the content through auto-classification.

    Recommendations

    • On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
      SAAHPC '11: Proceedings of the 2011 Symposium on Application Accelerators in High-Performance Computing
      The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers ...

    • Architecture-Aware Mapping and Optimization on a 1600-Core GPU
      ICPADS '11: Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems
      The graphics processing unit (GPU) continues to make in-roads as a computational accelerator for high-performance computing (HPC). However, despite its increasing popularity, mapping and optimizing GPU code remains a difficult task; it is a multi-...

    • GPU virtualization for high performance general purpose computing on the ESX hypervisor
      HPC '14: Proceedings of the High Performance Computing Symposium
      Graphics Processing Units (GPU) have become important components in high performance computing (HPC) systems for their massively parallel computing capability and energy efficiency. Virtualization technologies are increasingly applied to HPC to reduce ...

            Comments

            Information & Contributors

            Information

            Published In


            SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

            November 2021

            1493 pages

            ISBN:9781450384421

            DOI:10.1145/3458817

            • General Chair:
            • Bronis R. de Supinski,
            • Program Chairs:
            • Mary Hall,
            • Todd Gamblin

            Copyright © 2021 ACM.

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Sponsors

            • SIGHPC: ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing

            In-Cooperation

            • IEEE CS

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Author Tags

            1. GPGPU
            2. GPU
            3. HMM
            4. NVIDIA
            5. UVM
            6. virtual memory

            Qualifiers

            • Research-article


            Conference

            SC '21

            Sponsor:

            • SIGHPC

            Acceptance Rates

            Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


            Cited By

            • Cooper B., Scogland T., Ge R. (2024). Shared Virtual Memory: Its Design and Performance Implications for Diverse Applications. In Proceedings of the 38th ACM International Conference on Supercomputing, 26–37. https://doi.org/10.1145/3650200.3656608

            • Wagley B., Markthub P., Crea J., Wu B., Belviranli M. (2024). Exploring Page-based RDMA for Irregular GPU Workloads: A Case Study on NVMe-backed GNN Execution. In Proceedings of the 16th Workshop on General Purpose Processing Using GPU, 7–12. https://doi.org/10.1145/3649411.3649413

            • Elis B., Pearce O., Boehme D., Burmark J., Schulz M. (2024). Non-Blocking GPU-CPU Notifications to Enable More GPU-CPU Parallelism. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 1–11. https://doi.org/10.1145/3635035.3635036

            • Choi J., Jung S., Yeom H., Hong J., Park J. (2024). GPU Memory Reallocation Techniques in Fully Homomorphic Encryption Workloads. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, 1525–1532. https://doi.org/10.1145/3605098.3636037

            • Kang P. (2023). Programming for High-Performance Computing on Edge Accelerators. Mathematics 11(4), 1055. https://doi.org/10.3390/math11041055

            • Allen T., Cooper B., Ge R. (2023). Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory. ACM Transactions on Architecture and Code Optimization 21(1), 1–24. https://doi.org/10.1145/3632953

            • Iwata S., Arpaci-Dusseau R., Kasagi A. (2023). An Analysis of Graph Neural Network Memory Access Patterns. In Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 915–921. https://doi.org/10.1145/3624062.3624168

            • Zhang H., Zhou Y., Xue Y., Liu Y., Huang J. (2023). G10: Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor Migrations. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 395–410. https://doi.org/10.1145/3613424.3614309

            • Li B., Guo Y., Wang Y., Jaleel A., Yang J., Tang X. (2023). IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE Invalidations. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 1163–1177. https://doi.org/10.1145/3613424.3614269

            • Huang W., Du Y., Liu M. (2023). GPU Performance Acceleration via Intra-Group Sharing TLB. In Proceedings of the 52nd International Conference on Parallel Processing, 705–714. https://doi.org/10.1145/3605573.3605593

