Тема: Generalized sparse linear algebra library with vendor-agnostic GPUs acceleration
Закажите новую по вашим требованиям
Представленный материал является образцом учебного исследования, примером структуры и содержания учебного исследования по заявленной теме. Размещён исключительно в информационных и ознакомительных целях.
Workspay.ru оказывает информационные услуги по сбору, обработке и структурированию материалов в соответствии с требованиями заказчика.
Размещение материала не означает публикацию произведения впервые и не предполагает передачу исключительных авторских прав третьим лицам.
Материал не предназначен для дословной сдачи в образовательные организации и требует самостоятельной переработки с соблюдением законодательства Российской Федерации об авторском праве и принципов академической добросовестности.
Авторские права на исходные материалы принадлежат их законным правообладателям. В случае возникновения вопросов, связанных с размещённым материалом, просим направить обращение через форму обратной связи.
📋 Содержание
1. Problem statement 7
2. Background of study 8
2.1. Related Work 8
2.2. GraphBLAS concepts 11
2.3. GraphBLAS limitations 14
2.4. GPU computations 16
2.5. GPU architecture 18
2.6. OpenCL concepts 20
2.7. Implementation challenges on GPUs 22
3. Proposed solution description 24
3.1. Design Principles 24
3.2. Architecture overview 25
3.3. Data Containers 26
3.4. Algebraic Operations 27
3.5. Differences with GraphBLAS standard 29
4. Implementation details 31
4.1. Project structure 31
4.2. Compile-time dependencies 32
4.3. Development automation 33
4.4. Library interface 35
4.5. Algorithms registry 38
4.6. Storage formats 41
4.7. OpenCL backend 43
4.8. Linear Algebra Operations 44
4.9. Graph Algorithms 45
4.10. Running example 46
5. Evaluation 48
5.1. Research questions 48
5.2. Evaluation setup 48
5.3. Methodology 49
5.4. Graph algorithms 49
5.5. Dataset 50
5.6. Results Summary 51
6. Results 59
References 62
📖 Введение
Practical real-world graph data tends to be sparse and counts dozens of millions of vertices and billions of edges. Thus, scalable high- performance graph analysis is demanding area and an actual challenge for a research. There is a big number of ways to attack this challenge and the first promising idea is to utilize general-purpose graphic processing units. Such existing solutions, as CuSha and Gunrock show that utilization of GPUs can improve the performance of graph analysis, moreover it is shown that solutions may be scaled to multi-GPU systems. However, low flexibility and high complexity of API are problems of these solutions.
The second promising thing which provides a user-friendly API for high- performance graph analysis algorithms creation is a GraphBLAS API, which provides linear algebra based building blocks to create graph analysis algorithms. The idea of GraphBLAS is based on a well-known fact that linear algebra operations can be efficiently implemented on parallel hardware. Along with that, a graph can be natively represented using matrices: adjacency matrix, incidence matrix, etc. While reference CPU-based implementation of GraphBLAS, SuiteSparse, demonstrates good performance in real-world tasks, GPU-based implementation is challenging.
One of the challenges in this way is that real data are often sparse, thus underlying matrices and vectors are also sparse, and, as a result, classical data structures and respective algorithms are inefficient. So, it is necessary to use advanced data structures and procedures to implement sparse linear algebra, but the efficient implementation of them on GPUs is hard due to the irregularity of workload and data access patterns. Though such well- known libraries as cuSPARSE, clSPARSE, bhSPARSE, Cusp show that sparse linear algebra operations can be efficiently implemented for GPUs, it is not so trivial to implement GraphBLAS on GPU. First of all, it requires generalized sparse linear algebra, thus it is impossible just to reuse existing libraries which are almost all specified for operations over floats with classical element-wise functions. The second problem is specific optimizations, such as masking fusion, which can not be natively implemented on top of existing kernels. Nevertheless, there is a number of implementations of GraphBLAS on GPU, such as GraphBLAST, GBTL, which show that GPUs utilization can improve the performance of GraphBLAS-based graph analysis solutions. But these solutions are not portable across different device vendors because they are based on Nvidia Cuda stack.
Although GraphBLAS is solid and mature standard with a number of implementation, it has limitations and shortcomings discussed in a talk given by John R. Gilbert. Some of them are lack of interoperability and introspection, what is an obstacle on the way of GraphBLAS integration into real-world data analysis pipelines. Implicit zeroes mechanism and masking, which uses mix of engineering and math, leads to unpredictable memory usage in some cases, keeping API complex for both implementation and usage.
Summarizing, there is still no portable and high-performance library, which provides generalized linear-algebra based building blocks for real- world large sparse graph analysis problems. Such a library can potentially solve a number of problems, such as complex user API for particular graph algorithms implementation, GPUs performance and portability issues. Although GraphBLAS standard is a good reference point for implementation, it is still possible to design a bit different API. The solution can solve some of technical GraphBLAS limitations while still being coherent with the standard for further co-operation.
✅ Заключение
• The survey of the field was conducted. Model for a graph analysis were shown. Also, the concept of the linear algebra based approach was described in a great detail with a respect to a graph traversal and existing solutions. Introduction into a GraphBLAS standard was provided. Existing implementations, frameworks and most significant contributions for a graph analytic were studied. Their limitations were highlighted. General-purpose GPU computations concept was covered. Different APIs for GPU programming were presented. Their advantages and disadvantages were covered. General GPU programming challenges and pitfalls were highlighted.
• The architecture of the library for a generalized sparse linear algebra for GPU computations was developed. The architecture and library design was based on a project requirements, as well as on a limitation and experience of the existing solutions. The differences with GraphBLAS standard were stated.
• The library was implemented accordingly to the developed architecture. The core of the library, interface, data containers, built-in scalar data types, built-in element-wise functions, expressions processing, OpenCL backend functionality, common linear algebra operations implementations were covered. Several graph algorithms for a graph analysis were implemented using developed library API.
• The preliminary experimental study of the proposed artifacts was conducted. Obtained results allowed to conclude, that the chosen method of the library development is a promising way to a high-performance graph analysis in terms of the linear algebra on a wide family of GPU devices. The proposed solution showed comparable performance to the GraphBLAST, reaching up to x36 speedup in some cases, and showed constantly better performance, than SuiteS- parse, reaching up to x20 speedup in some cases. The developed library showed a scalability on different device vendors GPUs. Also, the proposed solutions got an acceptable performance on integrated GPUs in some cases even if compared with highly-optimized multicore CPU frameworks.
All in all, there are still a plenty of research questions and directions for improvement. Some of them are listed bellow.
• Performance tuning. There is a still space for optimizations. Better workload balancing must be done. Performance must be improved on AMD and Intel devices. More optimized algorithms must be implemented, such as SpGEMM algorithm proposed by Nagasaka et al. [21] for general mxm operation.
• Operations. Additional linear algebra operations must be implemented as well as useful subroutines for filtering, joining, loading, saving data, and other manipulations involved in typical graphs analysis.
• Graph streaming. The next important direction of the study is streaming of data from CPU to GPU for out-of-memory graphs analysis. CuSha adopt data partitioning techniques for graphs processing which do not fit single GPU. Modern GPUs have a limited VRAM. Even high-end devices allow only a moderate portion of the memory to be addressed by the kernel at the same time. Thus, manual streaming of the data from CPU to GPU is required in order to support analysis of extremely large graphs, which count billions of edges to process.
• Multi-GPU. Finally, scaling of the library to multiple GPUs must be implemented. Gunrock shows, that such approach can increase overall throughput and speedup processing of really dense graph. In connection with a streaming, it can be an ultimate solution for a large real-world graphs analysis.





