The graph model is a natural way to structure data in many practical tasks, such as graph queries, graph databases, social network analysis, RDF data analysis, bioinformatics, and static code analysis. In this model an entity is represented as a graph vertex, and relations between entities are directed labeled edges. This notation allows one to model the analysis domain with little effort while preserving complex relationships between objects, which is not easy to achieve, for example, in a classic table-based relational model.
Practical real-world graph data tends to be sparse and contains tens of millions of vertices and billions of edges. Thus, scalable high-performance graph analysis is a demanding area and a pressing research challenge. There are many ways to attack this challenge, and the first promising idea is to utilize general-purpose graphics processing units (GPUs). Existing solutions such as CuSha and Gunrock show that GPUs can improve the performance of graph analysis, and that such solutions can be scaled to multi-GPU systems. However, these solutions suffer from low flexibility and high API complexity.
The second promising direction, which provides a user-friendly API for creating high-performance graph analysis algorithms, is the GraphBLAS API, which offers linear-algebra-based building blocks for constructing such algorithms. The idea of GraphBLAS rests on the well-known fact that linear algebra operations can be efficiently implemented on parallel hardware. Along with that, a graph can be natively represented using matrices: an adjacency matrix, an incidence matrix, etc. While the reference CPU-based implementation of GraphBLAS, SuiteSparse, demonstrates good performance on real-world tasks, a GPU-based implementation is challenging.
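The matrix view of a graph can be made concrete with a small sketch: breadth-first search expressed as repeated sparse matrix-vector products, where each product expands the current frontier by one hop and a mask excludes already visited vertices. The graph, the function name, and the use of SciPy here are purely illustrative and not part of any library discussed in this work.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Adjacency matrix of a tiny directed graph: 0->1, 0->2, 1->3, 2->3
rows, cols = [0, 0, 1, 2], [1, 2, 3, 3]
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

def bfs_levels(A, source):
    """BFS as iterated matrix-vector products: the next frontier is
    frontier * A, masked by the set of not-yet-visited vertices."""
    n = A.shape[0]
    levels = np.full(n, -1)
    frontier = np.zeros(n)
    frontier[source] = 1.0
    level = 0
    while frontier.any():
        levels[frontier > 0] = level
        # expand one hop, then drop vertices that already have a level
        frontier = (A.T @ frontier) * (levels == -1)
        level += 1
    return levels

print(bfs_levels(A, 0))  # -> [0 1 1 2]
```

The whole traversal logic reduces to one multiplication and one element-wise mask per level, which is exactly the kind of bulk, data-parallel work that maps well onto GPUs.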
One of the challenges here is that real data are often sparse, so the underlying matrices and vectors are also sparse, and, as a result, classical data structures and the respective algorithms are inefficient. Advanced data structures and procedures are therefore required to implement sparse linear algebra, but implementing them efficiently on GPUs is hard due to the irregularity of the workload and of data access patterns. Although such well-known libraries as cuSPARSE, clSPARSE, bhSPARSE, and Cusp show that sparse linear algebra operations can be efficiently implemented for GPUs, implementing GraphBLAS on a GPU is not trivial. First of all, it requires generalized sparse linear algebra, so it is impossible to simply reuse existing libraries, which are almost all specialized for operations over floats with classical element-wise functions. The second problem is specific optimizations, such as masking fusion, which cannot be natively implemented on top of existing kernels. Nevertheless, there are several implementations of GraphBLAS on GPUs, such as GraphBLAST and GBTL, which show that GPU utilization can improve the performance of GraphBLAS-based graph analysis solutions. However, these solutions are not portable across device vendors because they are based on the Nvidia CUDA stack.
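To illustrate what "generalized" means here, below is a minimal Python sketch of a sparse matrix-vector product parameterized by a user-defined semiring: with (min, +, inf) instead of the classical (+, *, 0) it performs one relaxation step of single-source shortest paths. The dictionary-based containers and the `mxv` name are illustrative assumptions for this sketch, not the API of any library mentioned above.

```python
import math

def mxv(matrix, vector, add=min, mul=lambda x, y: x + y, zero=math.inf):
    """Sparse matrix-vector product over a user-defined semiring.
    matrix: {(row, col): weight}, vector: {index: value}.
    Default (min, +, inf) semiring relaxes shortest-path distances."""
    result = {}
    for (i, j), w in matrix.items():
        if j in vector:                      # skip structural zeroes
            candidate = mul(vector[j], w)
            result[i] = add(result.get(i, zero), candidate)
    return result

# A[i, j] = weight of edge j -> i in this illustrative layout
A = {(1, 0): 4.0, (2, 0): 1.0, (1, 2): 2.0}
dist = {0: 0.0}                              # distances known so far
print(mxv(A, dist))                          # -> {1: 4.0, 2: 1.0}
```

A float-only library with hard-coded (+, *) cannot express this substitution of operators, which is why a GraphBLAS implementation cannot simply delegate to cuSPARSE-style kernels.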
Although GraphBLAS is a solid and mature standard with a number of implementations, it has limitations and shortcomings, discussed in a talk given by John R. Gilbert [14]. Among them are the lack of interoperability and introspection, which is an obstacle to integrating GraphBLAS into real-world data analysis pipelines. The implicit-zeroes mechanism and masking, which mix engineering and mathematics, lead to unpredictable memory usage in some cases and keep the API complex both to implement and to use.
Summarizing, there is still no portable and high-performance library that provides generalized linear-algebra-based building blocks for real-world large sparse graph analysis problems. Such a library could potentially solve a number of problems: the complexity of user APIs for implementing particular graph algorithms, GPU performance, and portability issues. Although the GraphBLAS standard is a good reference point for an implementation, it is still possible to design a slightly different API that resolves some of the technical GraphBLAS limitations while remaining coherent with the standard for further cooperation.
The following results were achieved in this work.
• A survey of the field was conducted. Models for graph analysis were presented, and the concept of the linear-algebra-based approach was described in detail with respect to graph traversal and existing solutions. An introduction to the GraphBLAS standard was provided. Existing implementations, frameworks, and the most significant contributions to graph analytics were studied, and their limitations were highlighted. The concept of general-purpose GPU computations was covered, different APIs for GPU programming were presented together with their advantages and disadvantages, and general GPU programming challenges and pitfalls were highlighted.
• The architecture of a library for generalized sparse linear algebra on GPUs was developed. The architecture and library design were based on the project requirements, as well as on the limitations of and experience with existing solutions. The differences from the GraphBLAS standard were stated.
• The library was implemented according to the developed architecture. This covered the core of the library, the interface, data containers, built-in scalar data types, built-in element-wise functions, expression processing, the OpenCL backend functionality, and implementations of common linear algebra operations. Several graph analysis algorithms were implemented using the developed library API.
• A preliminary experimental study of the proposed artifacts was conducted. The obtained results allow us to conclude that the chosen method of library development is a promising way towards high-performance linear-algebra-based graph analysis on a wide family of GPU devices. The proposed solution showed performance comparable to GraphBLAST, reaching up to 36x speedup in some cases, and consistently outperformed SuiteSparse, reaching up to 20x speedup in some cases. The developed library demonstrated portability across GPUs of different vendors. Moreover, the proposed solution achieved acceptable performance on integrated GPUs, in some cases even when compared with highly optimized multicore CPU frameworks.
All in all, there are still plenty of research questions and directions for improvement. Some of them are listed below.
• Performance tuning. There is still space for optimizations: workload balancing should be improved, performance should be tuned on AMD and Intel devices, and more optimized algorithms should be implemented, such as the SpGEMM algorithm proposed by Nagasaka et al. [21] for the general mxm operation.
• Operations. Additional linear algebra operations must be implemented, as well as useful subroutines for filtering, joining, loading, and saving data, and other manipulations involved in typical graph analysis.
• Graph streaming. The next important direction of study is streaming data from CPU to GPU for out-of-memory graph analysis. CuSha adopts data partitioning techniques for processing graphs that do not fit into a single GPU. Modern GPUs have limited VRAM, and even high-end devices allow only a moderate portion of memory to be addressed by a kernel at a time. Thus, manual streaming of data from CPU to GPU is required to support the analysis of extremely large graphs with billions of edges.
• Multi-GPU. Finally, scaling of the library to multiple GPUs must be implemented. Gunrock shows that such an approach can increase overall throughput and speed up the processing of really dense graphs. In combination with streaming, it can be an ultimate solution for large real-world graph analysis.
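As a rough illustration of the mxm baseline that optimized GPU variants such as Nagasaka et al.'s improve upon, here is a minimal Python sketch of Gustavson's row-by-row SpGEMM: each output row is accumulated by scattering scaled rows of the second operand. The dict-of-dicts containers are an illustrative stand-in for a real sparse format such as CSR.

```python
def spgemm(A, B):
    """Gustavson's row-wise sparse matrix-matrix multiply.
    A and B are CSR-like: dicts mapping row index -> {col: value}."""
    C = {}
    for i, a_row in A.items():
        acc = {}                        # accumulator for output row i
        for k, a_ik in a_row.items():   # for each nonzero A[i, k] ...
            for j, b_kj in B.get(k, {}).items():
                # ... scatter the scaled row B[k, :] into the accumulator
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}}
print(spgemm(A, B))  # -> {0: {1: 4.0, 0: 10.0}, 1: {0: 15.0}}
```

The hard part on a GPU is exactly what this sketch hides: the accumulator sizes vary per row, so memory for the result and the work per thread are not known in advance, which motivates the specialized workload-balancing schemes mentioned above.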
[1] Barceló Baeza Pablo. Querying Graph Databases // Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. — New York, NY, USA : Association for Computing Machinery. — 2013. — PODS ’13. — P. 175-188. — Access mode: https://doi.org/10.1145/2463664.2465216.
[2] Buluç Aydin and Gilbert John R. The Combinatorial BLAS: Design, Implementation, and Applications // Int. J. High Perform. Comput. Appl. — 2011. — Nov. — Vol. 25, no. 4. — P. 496-509. — Access mode: https://doi.org/10.1177/1094342011403516.
[3] Yzelman A. N., Di Nardo D., Nash J. M., and Suijlen W. J. A C++ GraphBLAS: specification, implementation, parallelisation, and evaluation. — 2020. — Preprint. Access mode: http://albert-jan.yzelman.net/PDFs/yzelman20.pdf.
[4] Greathouse Joseph L., Knox Kent, Pola Jakub, Varaganti Kiran, and Daga Mayank. clSPARSE: A Vendor-Optimized Open-Source Sparse BLAS Library // Proceedings of the 4th International Workshop on OpenCL. — New York, NY, USA : Association for Computing Machinery. — 2016. — IWOCL ’16. — Access mode: https://doi.org/10.1145/2909437.2909442.
[5] Coimbra Miguel E., Francisco Alexandre P., and Veiga Luis. An analysis of the graph processing landscape // Journal of Big Data. — 2021. — Apr. — Vol. 8, no. 1. — Access mode: https://doi.org/10.1186/s40537-021-00443-9.
[6] Zhang Xiaowang, Feng Zhiyong, Wang Xin, Rao Guozheng, and Wu Wenrui. Context-Free Path Queries on RDF Graphs // CoRR. — 2015. — Vol. abs/1506.00743. — arXiv:1506.00743.
[7] Khorasani Farzad, Vora Keval, Gupta Rajiv, and Bhuyan Laxmi N. CuSha: Vertex-Centric Graph Processing on GPUs // Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing. — New York, NY, USA : Association for Computing Machinery. — 2014. — HPDC ’14. — P. 239-252. — Access mode: https://doi.org/10.1145/2600212.2600227.
[8] Dalton Steven, Bell Nathan, Olson Luke, and Garland Michael. Cusp: Generic Parallel Algorithms for Sparse Matrix and Graph Computations. — 2014. — Version 0.5.0. Access mode: http://cusplibrary.github.io/.
[9] Davis Timothy A. Algorithm 1000: SuiteSparse:GraphBLAS: Graph Algorithms in the Language of Sparse Linear Algebra // ACM Trans. Math. Softw. — 2019. — Dec. — Vol. 45, no. 4. — Access mode: https://doi.org/10.1145/3322125.
[10] Davis Timothy A. and Hu Yifan. The University of Florida Sparse Matrix Collection // ACM Trans. Math. Softw. — 2011. — Dec. — Vol. 38, no. 1. — Access mode: https://doi.org/10.1145/2049662.2049663.
[11] Zhang Qirun, Lyu Michael R., Yuan Hao, and Su Zhendong. Fast Algorithms for Dyck-CFL-Reachability with Applications to Alias Analysis // SIGPLAN Not. — 2013. — June. — Vol. 48, no. 6. — P. 435-446. — Access mode: https://doi.org/10.1145/2499370.2462159.
[12] Fu Zhisong, Personick Michael, and Thompson Bryan. MapGraph: A High Level API for Fast Development of High Performance Graph Analytics on GPUs // Proceedings of Workshop on GRAph Data Management Experiences and Systems. — New York, NY, USA : Association for Computing Machinery. — 2014. — GRADES’14. — P. 1-6. — Access mode: https://doi.org/10.1145/2621934.2621936.
[13] Zhang Peter, Zalewski Marcin, Lumsdaine Andrew, Misurda Samantha, and McMillan Scott. GBTL-CUDA: Graph Algorithms and Primitives for GPUs // 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). — 2016. — P. 912-920.
[14] Gilbert John R. What did the GraphBLAS get wrong? // HPEC GraphBLAS BoF. — 2022. — Access mode: https://sites.cs.ucsb.edu/~gilbert/talks/talks.htm.
[15] Shi Xuanhua, Zheng Zhigao, Zhou Yongluan, Jin Hai, He Ligang, Liu Bo, and Hua Qiang-Sheng. Graph Processing on GPUs: A Survey // ACM Comput. Surv. — 2018. — Jan. — Vol. 50, no. 6. — Access mode: https://doi.org/10.1145/3128571.
... 31 references in total