Тип работы:
Предмет:
Язык работы:


Generalized sparse linear algebra library with vendor-agnostic GPUs acceleration

Работа №142042

Тип работы

Магистерская диссертация

Предмет

информатика

Объем работы63
Год сдачи2023
Стоимость5500 руб.
ПУБЛИКУЕТСЯ ВПЕРВЫЕ
Просмотрено
27
Не подходит работа?

Узнай цену на написание


Introduction 5
1. Problem statement 7
2. Background of study 8
2.1. Related Work 8
2.2. GraphBLAS concepts 11
2.3. GraphBLAS limitations 14
2.4. GPU computations 16
2.5. GPU architecture 18
2.6. OpenCL concepts 20
2.7. Implementation challenges on GPUs 22
3. Proposed solution description 24
3.1. Design Principles 24
3.2. Architecture overview 25
3.3. Data Containers 26
3.4. Algebraic Operations 27
3.5. Differences with GraphBLAS standard 29
4. Implementation details 31
4.1. Project structure 31
4.2. Compile-time dependencies 32
4.3. Development automation 33
4.4. Library interface 35
4.5. Algorithms registry 38
4.6. Storage formats 41
4.7. OpenCL backend 43
4.8. Linear Algebra Operations 44
4.9. Graph Algorithms 45
4.10. Running example 46
5. Evaluation 48
5.1. Research questions 48
5.2. Evaluation setup 48
5.3. Methodology 49
5.4. Graph algorithms 49
5.5. Dataset 50
5.6. Results Summary 51
6. Results 59
References 62

Graph model is a natural way to structure data in a number of a real practical tasks, such as graph queries, graph databases, social net­works analysis, RDF data analysis, bioinformatics and static code analysis. In the graph model the entity is represented as a graph vertex. Relations between entities are directed labeled edge. This notation allows one to model the domain of the analysis with a little effort, preserv­ing complex relationships between objects. What is not easy and clear to do, for example, in a classic relational model, based on tables.
Practical real-world graph data tends to be sparse and counts dozens of millions of vertices and billions of edges. Thus, scalable high- performance graph analysis is demanding area and an actual challenge for a research. There is a big number of ways to attack this challenge and the first promising idea is to utilize general-purpose graphic processing units. Such existing solutions, as CuSha and Gunrock show that utiliza­tion of GPUs can improve the performance of graph analysis, moreover it is shown that solutions may be scaled to multi-GPU systems. However, low flexibility and high complexity of API are problems of these solutions.
The second promising thing which provides a user-friendly API for high- performance graph analysis algorithms creation is a GraphBLAS API, which provides linear algebra based building blocks to create graph analy­sis algorithms. The idea of GraphBLAS is based on a well-known fact that linear algebra operations can be efficiently implemented on parallel hard­ware. Along with that, a graph can be natively represented using matrices: adjacency matrix, incidence matrix, etc. While reference CPU-based imple­mentation of GraphBLAS, SuiteSparse, demonstrates good performance in real-world tasks, GPU-based implementation is challenging.
One of the challenges in this way is that real data are often sparse, thus underlying matrices and vectors are also sparse, and, as a result, classical data structures and respective algorithms are inefficient. So, it is necessary to use advanced data structures and procedures to implement sparse linear algebra, but the efficient implementation of them on GPUs is hard due to the irregularity of workload and data access patterns. Though such well- known libraries as cuSPARSE, clSPARSE, bhSPARSE, Cusp show that sparse linear algebra operations can be efficiently implemented for GPUs, it is not so trivial to implement GraphBLAS on GPU. First of all, it requires generalized sparse linear algebra, thus it is impossible just to reuse existing libraries which are almost all specified for operations over floats with classical element-wise functions. The second problem is specific optimizations, such as masking fusion, which can not be natively implemented on top of existing kernels. Nevertheless, there is a number of implementations of GraphBLAS on GPU, such as GraphBLAST, GBTL, which show that GPUs utilization can improve the performance of GraphBLAS-based graph analysis solutions. But these solutions are not portable across different device vendors because they are based on Nvidia Cuda stack.
Although GraphBLAS is solid and mature standard with a number of implementation, it has limitations and shortcomings discussed in a talk given by John R. Gilbert. Some of them are lack of interoperability and introspection, what is an obstacle on the way of GraphBLAS integra­tion into real-world data analysis pipelines. Implicit zeroes mechanism and masking, which uses mix of engineering and math, leads to unpredictable memory usage in some cases, keeping API complex for both implementation and usage.
Summarizing, there is still no portable and high-performance library, which provides generalized linear-algebra based building blocks for real- world large sparse graph analysis problems. Such a library can potentially solve a number of problems, such as complex user API for particular graph algorithms implementation, GPUs performance and portability issues. Al­though GraphBLAS standard is a good reference point for implementation, it is still possible to design a bit different API. The solution can solve some of technical GraphBLAS limitations while still being coherent with the standard for further co-operation.

Возникли сложности?

Нужна помощь преподавателя?

Помощь в написании работ!


The following results were achieved in this work.
• The survey of the field was conducted. Model for a graph analy­sis were shown. Also, the concept of the linear algebra based approach was described in a great detail with a respect to a graph traversal and existing solutions. Introduction into a GraphBLAS standard was pro­vided. Existing implementations, frameworks and most significant contributions for a graph analytic were studied. Their limitations were highlighted. General-purpose GPU computations concept was covered. Different APIs for GPU programming were presented. Their advantages and disadvantages were covered. General GPU program­ming challenges and pitfalls were highlighted.
• The architecture of the library for a generalized sparse linear algebra for GPU computations was developed. The architec­ture and library design was based on a project requirements, as well as on a limitation and experience of the existing solutions. The dif­ferences with GraphBLAS standard were stated.
• The library was implemented accordingly to the developed architecture. The core of the library, interface, data containers, built-in scalar data types, built-in element-wise functions, expressions processing, OpenCL backend functionality, common linear algebra operations implementations were covered. Several graph algorithms for a graph analysis were implemented using developed library API.
• The preliminary experimental study of the proposed arti­facts was conducted. Obtained results allowed to conclude, that the chosen method of the library development is a promising way to a high-performance graph analysis in terms of the linear algebra on a wide family of GPU devices. The proposed solution showed compara­ble performance to the GraphBLAST, reaching up to x36 speedup in some cases, and showed constantly better performance, than SuiteS- parse, reaching up to x20 speedup in some cases. The developed library showed a scalability on different device vendors GPUs. Also, the proposed solutions got an acceptable performance on integrated GPUs in some cases even if compared with highly-optimized multi­core CPU frameworks.
All in all, there are still a plenty of research questions and directions for improvement. Some of them are listed bellow.
• Performance tuning. There is a still space for optimizations. Bet­ter workload balancing must be done. Performance must be improved on AMD and Intel devices. More optimized algorithms must be imple­mented, such as SpGEMM algorithm proposed by Nagasaka et al. [21] for general mxm operation.
• Operations. Additional linear algebra operations must be imple­mented as well as useful subroutines for filtering, joining, loading, saving data, and other manipulations involved in typical graphs anal­ysis.
• Graph streaming. The next important direction of the study is streaming of data from CPU to GPU for out-of-memory graphs anal­ysis. CuSha adopt data partitioning techniques for graphs processing which do not fit single GPU. Modern GPUs have a limited VRAM. Even high-end devices allow only a moderate portion of the memory to be addressed by the kernel at the same time. Thus, manual streaming of the data from CPU to GPU is required in order to support analysis of extremely large graphs, which count billions of edges to process.
• Multi-GPU. Finally, scaling of the library to multiple GPUs must be implemented. Gunrock shows, that such approach can increase overall throughput and speedup processing of really dense graph. In connection with a streaming, it can be an ultimate solution for a large real-world graphs analysis.


[1] Barcelo Baeza Pablo. Querying Graph Databases // Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. — New York, NY, USA : Association for Comput­ing Machinery. — 2013. — PODS ’13. — P. 175-188. — Access mode: https://doi.org/10.1145/2463664.2465216.
[2] Buluq Aydin and Gilbert John R. The Combinatorial BLAS: Design, Implementation, and Applications // Int. J. High Perform. Comput. Appl. — 2011. — nov. — Vol. 25, no. 4. — P. 496-509. — Access mode: https://doi.org/10.1177/1094342011403516.
[3] Yzelman A. N., Di Nardo D., Nash J. M., and Suijlen W. J. A C++ GraphBLAS: specification, implementation, parallelisation, and evaluation. — 2020.—Preprint. Access mode: http://albert-jan. yzelman.net/PDFs/yzelman20.pdf.
[4] Greathouse Joseph L., Knox Kent, Pola Jakub, Varaganti Kiran, and Daga Mayank. ClSPARSE: A Vendor-Optimized Open-Source Sparse BLAS Library // Proceedings of the 4th International Workshop on OpenCL. — New York, NY, USA : Association for Computing Ma­chinery. — 2016. — IWOCL ’16. — Access mode: https://doi.org/ 10.1145/2909437.2909442.
[5] Coimbra Miguel E., Francisco Alexandre P., and Veiga Luis. An anal­ysis of the graph processing landscape // Journal of Big Data. — 2021. — Apr.—Vol. 8, no. 1.—Access mode: https://doi.org/10. 1186/s40537-021-00443-9.
[6] Zhang Xiaowang, Feng Zhiyong, Wang Xin, Rao Guozheng, and Wu Wenrui. Context-Free Path Queries on RDF Graphs // CoRR. — 2015. — Vol. abs/1506.00743. —1506.00743.
[7] Khorasani Farzad, Vora Keval, Gupta Rajiv, and Bhuyan Laxmi N. CuSha: Vertex-Centric Graph Processing on GPUs // Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing. — New York, NY, USA : Association for Computing Machinery. — 2014. — HPDC ’14. — P. 239-252. — Access mode: https://doi.org/10.1145/2600212.2600227.
[8] Dalton Steven, Bell Nathan, Olson Luke, and Garland Michael. Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computa­tions.— 2014.—Version 0.5.0. Access mode: http://cusplibrary. github.io/.
[9] Davis Timothy A. Algorithm 1000: SuiteSparse:GraphBLAS: Graph Algorithms in the Language of Sparse Linear Algebra // ACM Trans. Math. Softw. — 2019. — Dec. — Vol. 45, no. 4. — Access mode: https: //doi.org/10.1145/3322125.
[10] Davis Timothy A. and Hu Yifan. The University of Florida Sparse Matrix Collection // ACM Trans. Math. Softw. — 2011. — dec.— Vol. 38, no. 1.—Access mode: https://doi.org/10.1145/2049662. 2049663.
[11] Zhang Qirun, Lyu Michael R., Yuan Hao, and Su Zhendong. Fast Al­gorithms for Dyck-CFL-Reachability with Applications to Alias Anal­ysis // SIGPLAN Not.— 2013.— June.—Vol. 48, no. 6. — P. 435­446. — Access mode: https://doi.org/10.1145/2499370.2462159.
[12] Fu Zhisong, Personick Michael, and Thompson Bryan. MapGraph: A High Level API for Fast Development of High Performance Graph An­alytics on GPUs // Proceedings of Workshop on GRAph Data Man­agement Experiences and Systems. — New York, NY, USA : Associa­tion for Computing Machinery. — 2014. — GRADES’14. — P. 1-6. — Access mode: https://doi.org/10.1145/2621934.2621936.
[13] Zhang Peter, Zalewski Marcin, Lumsdaine Andrew, Misurda Saman­tha, and McMillan Scott. GBTL-CUDA: Graph Algorithms and Prim­itives for GPUs // 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). — 2016. — P. 912-920.
[14] Gilbert John R. What did the GraphBLAS get wrong? // HPEC GraphBLAS BoF. — 2022. — Access mode: https://sites.cs.ucsb. edu/~gilbert/talks/talks.htm.
[15] Shi Xuanhua, Zheng Zhigao, Zhou Yongluan, Jin Hai, He Ligang, Liu Bo, and Hua Qiang-Sheng. Graph Processing on GPUs: A Sur­vey // ACM Comput. Surv. — 2018.—jan.—Vol. 50, no. 6.—Access mode: https://doi.org/10.1145/3128571.
... всего 31 источник


Работу высылаем на протяжении 30 минут после оплаты.



Подобные работы


©2025 Cервис помощи студентам в выполнении работ