Linear algebra is a powerful instrument for solving a wide variety of problems: data is represented as matrices and vectors and analyzed with highly optimized routines. While the matrices arising in many modern applications, e.g., recommender systems [22, 3] and graph analysis [34, 4], contain a large number of elements, most of those elements are zeros. For example, the matrix representing YouTube's social network connectivity contains only 2.31% non-zeros [21]. Such high sparsity incurs both storage and computational inefficiencies: zero elements occupy unnecessarily large storage, and many operations are performed on zeros whose results are known beforehand. The traditional way to address these inefficiencies is to compress the matrix, storing and operating on only the non-zero elements. The prevalence of sparse matrices thus makes compressed matrix representations and sparse linear algebra an effective way of tackling problems in areas including, but not limited to, graph analysis [20], computational biology [9], and machine learning [10].
The GraphBLAS [14] standard defines sparse linear algebra building blocks that express algorithms for the aforementioned areas uniformly in terms of sparse matrix and vector operations over a semiring. These include, for instance, matrix/vector multiplication, element-wise operations (e-wise for short), the Kronecker product, and masking, i.e., taking the subset of elements that satisfies a mask or its complement. These primitives are sufficient to express many algorithms, e.g., PageRank, Breadth-First Search, and Sparse Deep Neural Networks [7].
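As an illustration of these building blocks, consider the following minimal Haskell sketch (Haskell is chosen because the FHW toolchain used later in this work compiles Haskell; the names here are illustrative and not part of the GraphBLAS C API). It defines a semiring abstraction and an element-wise addition over sparse vectors stored as index-sorted association lists of non-zeros:

```haskell
-- A semiring: "addition" with identity `zero`, "multiplication" with `one`.
class Semiring a where
  zero  :: a
  one   :: a
  plus  :: a -> a -> a
  times :: a -> a -> a

-- Boolean semiring (||, &&), used, e.g., for reachability queries.
instance Semiring Bool where
  zero  = False
  one   = True
  plus  = (||)
  times = (&&)

-- Sparse vector: index-sorted association list of non-zeros.
type SpVec a = [(Int, a)]

-- Element-wise "addition": a merge of two sorted lists that
-- touches only the stored non-zero elements.
eWiseAdd :: Semiring a => SpVec a -> SpVec a -> SpVec a
eWiseAdd [] ys = ys
eWiseAdd xs [] = xs
eWiseAdd xs@((i, x) : xs') ys@((j, y) : ys')
  | i < j     = (i, x) : eWiseAdd xs' ys
  | i > j     = (j, y) : eWiseAdd xs ys'
  | otherwise = (i, x `plus` y) : eWiseAdd xs' ys'
```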
However, sparse computations have a low ratio of arithmetic to memory operations, meaning that the main bottleneck of sparse algorithms is the sparse representation itself: it induces pointer chasing and irregular memory accesses. A number of optimizations aimed at reducing the intensity of memory accesses have been identified [46]; the one considered in this work is fusion. Fusion glues several functions into one in order to remove intermediate data structures, namely those that are first constructed and then immediately deconstructed.
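In functional-programming terms, the canonical instance of fusion collapses two list traversals into one, eliminating the intermediate list entirely; a minimal sketch:

```haskell
-- Unfused: `map g` materializes an intermediate list that
-- `map f` immediately consumes and discards.
unfused :: (b -> c) -> (a -> b) -> [a] -> [c]
unfused f g xs = map f (map g xs)

-- Fused: a single traversal, no intermediate list.
fused :: (b -> c) -> (a -> b) -> [a] -> [c]
fused f g xs = map (f . g) xs
```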
An example of fusion is shown in Figure 1, which depicts a masked sparse matrix-sparse vector multiplication. The mask simply selects a subset of the result, but fusing the mask into the multiplication reduces the number of memory accesses by avoiding the construction of an intermediate vector that would store the full result of the multiplication.
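To make this concrete, the following Haskell sketch (with illustrative types and names, not this work's actual implementation) contrasts an unfused masked multiplication, which first materializes the full result and then applies the mask, with a fused one that pushes the mask into the computation:

```haskell
type SpVec a = [(Int, a)]        -- index-sorted non-zeros
type SpMat a = [(Int, SpVec a)]  -- row index paired with a sparse row

-- Dot product of two sorted sparse vectors.
dot :: Num a => SpVec a -> SpVec a -> a
dot xs@((i, x) : xs') ys@((j, y) : ys')
  | i < j     = dot xs' ys
  | i > j     = dot xs ys'
  | otherwise = x * y + dot xs' ys'
dot _ _ = 0

-- Unfused: every row of A*x is computed into an intermediate
-- vector, and masked-out entries are then thrown away.
unfusedMaskedMxV :: Num a => [Int] -> SpMat a -> SpVec a -> SpVec a
unfusedMaskedMxV mask a x =
  let full = [ (i, dot row x) | (i, row) <- a ]  -- intermediate result
  in  [ (i, v) | (i, v) <- full, i `elem` mask ]

-- Fused: the mask is checked before multiplying, so rows outside
-- the mask are never computed and no intermediate vector is built.
fusedMaskedMxV :: Num a => [Int] -> SpMat a -> SpVec a -> SpVec a
fusedMaskedMxV mask a x =
  [ (i, dot row x) | (i, row) <- a, i `elem` mask ]
```

In the fused variant, rows outside the mask are never multiplied at all, mirroring the memory-access savings described above.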
In loop-based programming, fusion stands for joining several loops into one to increase memory locality and reduce the number of required iterations. It is a crucial technique in dense applications and is usually implemented as part of affine transformations and polyhedral analysis [2]. Indexing in sparse applications, however, is not affine.
Some general-purpose solutions based on map/reduce semantics support fusion (e.g., [12]). To support sparse operations, however, they would need to fuse across index arithmetic, which they cannot do. Moreover, neither SuiteSparse [40] nor GraphBLAST [46], the widely used GraphBLAS implementations for CPU and GPU platforms, respectively, has adopted automatic fusion, although both acknowledge the issue.
Furthermore, several implementations of the GraphBLAS standard show decent performance on both GPU [46] and CPU [6] backends, one of them already speeding up graph database queries [28]. However, for tasks that involve sparsity, typical CPUs and GPUs have been shown to be underutilized [8, 21, 24, 36], i.e., their computing units do not reach peak performance, because these devices are general-purpose by design and suffer from the irregular memory accesses that sparsity incurs.
When existing hardware does not provide enough performance, or is energy-inefficient for a particular application, the traditional approach is to design an application-specific processor. This approach has been applied successfully in image processing [15, 25], programmable networks [23], and machine learning [5, 45]. Thus, a possible way to address the underutilization of current hardware on sparse algorithms is to design an application-specific processor that handles sparse operations more efficiently.
In the course of this work, the following results were obtained:
• Approaches to fusion in different areas were studied. It was shown that distillation is a promising choice for automating fusion in sparse linear algebra programs.
• Hardware generation was implemented by combining the distiller with the FHW compiler [11]. Both the distiller and the compiler were refined to provide the needed functionality.
• Support for passing arguments to the main function in FHW was added. The arguments are passed via the AXI-Stream protocol [1], and tree-like structures are supported (see the sketch after this list).
• A testbench was implemented using SystemVerilog generation and TCL scripts. It allows querying execution times and collecting memory usage statistics. The configurations and results of all benchmarks are stored on GitHub. The evaluation showed that fusion by means of distillation succeeds, at the cost of a slight performance degradation from hardware generation. Directions for future work were outlined.
• Two posters were accepted, at ICFP SRC 2021 and at VPT 2022, respectively.
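As referenced in the results above, the sketch below illustrates one way a tree-like argument could, in principle, be flattened into the flat word stream that an AXI-Stream channel [1] carries. The Tree type and the tag encoding are hypothetical; the actual FHW encoding may differ.

```haskell
-- A hypothetical tree-like argument type.
data Tree = Leaf Int | Node Tree Tree

-- Flatten a tree into a stream of words: a constructor tag is
-- emitted first, then the children follow in preorder.
serialize :: Tree -> [Int]
serialize (Leaf n)   = [0, n]                          -- tag 0: leaf + payload
serialize (Node l r) = 1 : serialize l ++ serialize r  -- tag 1: inner node

-- Consume the stream back into a tree, returning the leftover words.
deserialize :: [Int] -> (Tree, [Int])
deserialize (0 : n : rest) = (Leaf n, rest)
deserialize (1 : rest) =
  let (l, rest')  = deserialize rest
      (r, rest'') = deserialize rest'
  in  (Node l r, rest'')
deserialize _ = error "malformed stream"
```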
[1] ARM. AMBA® 4 AXI4-Stream Protocol Specification. — 2010. — URL: https://developer.arm.com/documentation/ihi0051/a/Introduction/About-the-AXI4-Stream-protocol (online; accessed: 2022-06-07).
[2] Acharya Aravind, Bondhugula Uday, Cohen Albert. Effective Loop Fusion in Polyhedral Compilation Using Fusion Conflict Graphs // ACM Trans. Archit. Code Optim. — 2020. — Sep. — Vol. 17, no. 4. — 26 p. — URL: https://doi.org/10.1145/3416510.
[3] Gupta Udit, Wu Carole-Jean, Wang Xiaodong et al. The Architectural Implications of Facebook’s DNN-based Personalized Recommendation. — 2020. — 1906.03109.
[4] Brin Sergey, Page Lawrence. The anatomy of a large-scale hypertextual Web search engine // Computer Networks and ISDN Systems. — 1998. — Vol. 30, no. 1. — P. 107-117. — Proceedings of the Seventh International World Wide Web Conference. URL: http://www.sciencedirect.com/science/article/pii/S016975529800110X.
[5] Cass S. Taking AI to the edge: Google's TPU now comes in a maker-friendly package // IEEE Spectrum. — 2019. — Vol. 56, no. 5. — P. 16-17.
[6] Davis T. A. Graph algorithms via SuiteSparse: GraphBLAS: triangle counting and K-truss // 2018 IEEE High Performance extreme Computing Conference (HPEC). — 2018. — P. 1-6.
[7] Davis Timothy A., Aznaveh Mohsen, Kolodziej Scott. Write Quick, Run Fast: Sparse Deep Neural Network in 20 Minutes of Development Time via SuiteSparse:GraphBLAS // 2019 IEEE High Performance Extreme Computing Conference (HPEC). — 2019. — P. 1-6.
[8] Davis Timothy A., Hu Yifan. The University of Florida Sparse Matrix Collection // ACM Trans. Math. Softw. — 2011. — Vol. 38, no. 1. — 25 p. — URL: https://doi.org/10.1145/2049662.2049663.
[9] Selvitopi Oguz, Ekanayake Saliya, Guidi Giulia et al. Distributed Many-to-Many Protein Sequence Alignment using Sparse Matrices. — 2020. — 2009.14467.
[10] Enabling massive deep neural networks with the GraphBLAS / Jeremy Kepner, Manoj Kumar, Jose Moreira et al. // 2017 IEEE High Performance Extreme Computing Conference (HPEC). — 2017. — Sep. — URL: http://dx.doi.org/10.1109/HPEC.2017.8091098.
[11] FHW Project: High-Level Hardware Synthesis from Haskell Programs / S. Edwards, Martha A. Kim, Richard Townsend et al. — 2019.
[12] Futhark: Purely Functional GPU-Programming with Nested Parallelism and in-Place Array Updates / Troels Henriksen, Niels G. W. Serup, Martin Elsman et al. // SIGPLAN Not. — 2017. — Vol. 52, no. 6. — P. 556-571. — URL: https://doi.org/10.1145/3140587.3062354.
[13] The Future is Big Graphs! A Community View on Graph Processing Systems / Sherif Sakr, Angela Bonifati, Hannes Voigt et al. // arXiv preprint arXiv:2012.06171. — 2020.
[14] The GraphBLAS C API Specification / Aydin Buluc, Timothy Mattson, Scott McMillan et al. // GraphBLAS. org, Tech. Rep. — 2017.
[15] Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines / Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams et al. // SIGPLAN Not. — 2013. — Vol. 48, no. 6. — P. 519-530. — URL: https://doi.org/10.1145/2499370.2462176.