I Introduction
A large class of applications in machine learning, scientific computing, and data analytics involves computations on dense symmetric positive definite (SPD) matrices. Many of these matrices, such as the ones arising in kernel methods in statistical learning [1], Gaussian processes [2], matrix factorizations [3], and integral equations [4], are structured (or low-rank, or data-sparse). Identifying this structure and compressing the low-rank blocks can significantly reduce the storage and computation costs of dense matrix computations. Different low-rank representations have been studied to exploit the structure in dense matrices; amongst them, Hierarchically Semi-Separable (HSS) algorithms are among the most widely used.
This paper presents MatRox, a novel algorithm and implementation that improves the performance of HSS matrix computations on multicore architectures. The HSS algorithm first constructs an HSS matrix $\tilde{A}$ from a dense matrix $A$ using low-rank approximation during a compression phase. One of the components in compression is a binary tree which we call the HSS-tree. As shown in Equation 1, $\tilde{A}$ is composed of a block diagonal matrix $D$ that stores diagonal blocks of $A$ and low-rank matrices $U$ and $B$ that approximate the off-diagonal blocks. The results of compression are used to reduce the computation complexity of operations involving the HSS matrix in an evaluation
phase. This work focuses on SPD matrices, and our evaluation is for the matrix-matrix (and matrix-vector) multiplication shown in Equation 2, where $W \in \mathbb{R}^{N \times k}$:

$\tilde{A} = D + U B U^{*}$  (1)

$Y = \tilde{A} W \approx A W$  (2)
Recent work has investigated efficient implementations of the HSS algorithm, specifically the evaluation phase, on parallel multicore architectures [5, 6, 7]. The HSS-tree represents the computation pattern and the dependencies between matrix sub-blocks during evaluation. To improve the performance of HSS algorithms on parallel processors, most existing HSS implementations, e.g., GOFMM [5] and STRUMPACK [6], use the HSS-tree as the input for dynamic task scheduling. Thus, the parallelism strategies in these libraries differ based on the type of dynamic scheduling used. For example, STRUMPACK builds a queue of tasks by a post-order traversal of the HSS-tree and then uses the OpenMP dynamic task scheduler. MatRox also uses the OpenMP dynamic task scheduler but combines it with a queue of tasks that is built by traversing the HSS-tree level by level.
Even though investigating scheduling algorithms is important for improving the performance of HSS algorithms on parallel architectures, in this work we demonstrate that an efficient data layout, one that takes advantage of the inherent data-sparsity in HSS matrices and the pattern of computation in evaluation, can significantly improve the performance and scalability of HSS evaluation on parallel processors. Also, existing libraries require tuning of parameters such as the HSS-tree depth for performance and portability, which requires repeatedly executing the HSS compression and evaluation phases. We build a theoretical model in MatRox that predicts an optimal depth for the HSS-tree, eliminating tuning overheads for this parameter.
MatRox Overview. MatRox consists of a performance model to analyze the memory bottleneck in HSS algorithms and determine an optimal tree depth ($d_{opt}$) that, coupled with a novel storage format called Compressed Data-sparse Storage (CDS), significantly improves the performance of the HSS algorithm on multicore architectures. The model is used with a heuristic to determine the optimal depth $d_{opt}$. This depth is used in the compression phase of MatRox, where uniform sampling [8] with adaptive compression creates $\tilde{A}$ and stores it in the CDS format. MatRox is shown in Algorithm 1. ModelEval in line 1 of the algorithm uses information from the architecture to create a theoretical model of the evaluation phase and to determine the optimal HSS-tree depth. A user-specified accuracy along with the optimal depth are passed to Compress-to-CDS to generate the compressed matrix. The matrix blocks created at compression are stored in the CDS format to follow the computation pattern of HSS evaluation. The compressed matrix in CDS is then used to efficiently compute the matrix-matrix multiplication in Equation 2.
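A minimal sketch of this flow is shown below; the function names and signatures are illustrative stand-ins, not the actual MatRox API:

```cpp
// Hypothetical driver mirroring Algorithm 1; names/signatures are illustrative.
struct CDSMatrix;  // compressed HSS matrix in the CDS layout (Section III-B)

int  model_eval(long N, long k, int d_max, long r_max,
                double peak_flops, double peak_bw);       // returns d_opt
void compress_to_cds(const double* A, long N, double tol,
                     int d_opt, CDSMatrix* out);          // sampling + ID
void evaluate_with_cds(const CDSMatrix& At, const double* W,
                       long k, double* Y);                // Y = At * W

void matrox(const double* A, long N, const double* W, long k, double tol,
            int d_max, long r_max, double peak_flops, double peak_bw,
            CDSMatrix* At, double* Y) {
  int d_opt = model_eval(N, k, d_max, r_max, peak_flops, peak_bw);
  compress_to_cds(A, N, tol, d_opt, At);  // Compress-to-CDS (Section IV)
  evaluate_with_cds(*At, W, k, Y);        // Equation 2: Y approximates A * W
}
```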
The paper is organized as follows. Section II discusses preliminaries about HSS matrix compression and the evaluation process. Section III-A discusses the ModelEval phase. We then introduce the CDS storage format in Section III-B and discuss why this format follows the computation pattern of HSS evaluation, how it improves the locality of HSS evaluation, and how it enhances its scalability. The Compress-to-CDS phase is discussed in Section IV, where we demonstrate how compression is implemented in MatRox to generate the compressed matrix in CDS to be used in the evaluation. Our contributions are as follows:

A novel storage format called CDS that follows the data sparsity and computation pattern of HSS algorithms to improve the performance and scalability of HSS matrix evaluations on shared memory multicore processors.

A mixed-rank heuristic that finds the best-performing HSS-tree depth for fast HSS evaluation, eliminating tuning overheads. Our heuristic also comes with a theoretical performance model that demonstrates HSS matrix-matrix multiplications are memory-bound.

Implementation of the improved HSS algorithm with the storage format and the mixed-rank heuristic in MatRox. We show that MatRox outperforms GOFMM and STRUMPACK in evaluation on average, and that the end-to-end HSS algorithm (compression and evaluation) is also accelerated with MatRox over both GOFMM and STRUMPACK.
II Preliminaries
This section briefly explains HSS matrices and how an HSS matrix $\tilde{A}$ is built from $A$ to be used for approximating $AW$ in Equation 2. We use notation similar to [6]. To construct $\tilde{A}$, the matrix $A$ is partitioned hierarchically and its off-diagonal blocks are approximated using low-rank matrices, as shown in Figure 1. A binary tree, which we call the HSS-tree, is used to represent this hierarchical partitioning. The depth of the HSS-tree is denoted by $d$. The root node has depth 0, and the depth of the HSS-tree shown in Figure 1 is 2.
The index set $I_i$ belongs to HSS-tree node $i$, and $I_{root}$ is the set for the root node, where $I_{root} = \{1, \ldots, N\}$. If $i$ is an internal node with children $c_1$ and $c_2$ then $I_i = I_{c_1} \cup I_{c_2}$. At the leaves of the HSS-tree, the diagonal blocks $D_i = A(I_i, I_i)$ are stored as dense matrices. The off-diagonal blocks $A_{ij} = A(I_i, I_j)$, with siblings $i$ and $j$, are approximated using low-rank matrices as

$A_{ij} \approx U_i B_{ij} V_j^{*}$  (3)

with tall-skinny matrices $U_i$ and $V_j$ that are the bases for the row and column spaces of $A_{ij}$. Complex conjugation is denoted as $^{*}$. For an internal node $i$ with children $c_1$ and $c_2$, $U_i$ (and, analogously, $V_i$) is defined hierarchically through small translation matrices $R_{c_1}$ and $R_{c_2}$ as

$U_i = \begin{bmatrix} U_{c_1} R_{c_1} \\ U_{c_2} R_{c_2} \end{bmatrix}$  (4)

while, as shown in Figure 1, for a leaf node $i$, $U_i$ and $V_i$ are stored explicitly. Figure 1 shows a two-level HSS partitioning of a matrix along with its HSS-tree, in which $D_i$, $U_i$, and $V_i$ are called generators for node $i$ and $B_{ij}$ is the generator for nodes $i$ and $j$. These generators will be used to construct the matrix $\tilde{A}$ approximated by HSS. We use a symmetric HSS matrix in this work, thus $V_i = U_i$ and $B_{ji} = B_{ij}^{*}$ according to [9].
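For concreteness, a two-level symmetric HSS matrix such as the one in Figure 1 expands as follows; the node numbering here (leaves 1–4, inner nodes 5 and 6) is ours and may differ from the figure:

```latex
\tilde{A} =
\begin{bmatrix}
\begin{bmatrix} D_1 & U_1 B_{12} U_2^{*} \\ U_2 B_{12}^{*} U_1^{*} & D_2 \end{bmatrix}
 & U_5 B_{56} U_6^{*} \\
U_6 B_{56}^{*} U_5^{*} &
\begin{bmatrix} D_3 & U_3 B_{34} U_4^{*} \\ U_4 B_{34}^{*} U_3^{*} & D_4 \end{bmatrix}
\end{bmatrix},
\qquad
U_5 = \begin{bmatrix} U_1 R_1 \\ U_2 R_2 \end{bmatrix},\quad
U_6 = \begin{bmatrix} U_3 R_3 \\ U_4 R_4 \end{bmatrix}.
```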
Compression. The HSS approximation $\tilde{A}$ of the matrix $A$ is constructed during a compression phase. At compression, a randomized sampling algorithm [10] is often used along with a low-rank matrix factorization such as the interpolative decomposition (ID) [11] to compute the generators $D_i$, $U_i$, and $B_{ij}$. We briefly discuss ID and the importance of the rank of approximation here; sampling is elaborated in Section IV. ID approximates a matrix $A$ of rank $r$ by using a subset of $r$ of its columns to interpolate the other columns, such that $\|A - A(:, J)X\| \le \epsilon$, where $J$ consists of $r$ column indices, $X$ is the coefficient matrix, $A(:, J)$ holds the selected columns, and $\epsilon$ is the compression error tolerance. The number of columns in $A(:, J)$ can be adjusted to reduce the compression error. During compression, as demonstrated in Equation 3, the off-diagonal blocks, e.g. $A_{ij}$, are approximated. If ID with rank $r$ is used for this approximation, the generators $U_i$ and $U_j$ will be tall-skinny matrices with $r$ columns and $B_{ij}$ will be of size $r \times r$. In practice, the rank needed to approximate the off-diagonal blocks for an accurate enough HSS approximation is unknown in advance. Adaptive-rank methods [12], where the off-diagonal blocks are approximated with different ranks, or a fixed-rank method, in which all the off-diagonal blocks are approximated with a tuned fixed rank $r$, are used to reach the desired accuracy.

Evaluation. The objective of the evaluation phase is to operate on the compressed matrix $\tilde{A}$ (Algorithm 1). We organize the evaluation phase into four stages based on [4] as follows: (i) Diagonal Multiplication (DM), which performs exact computations on the diagonal blocks associated with the HSS leaf nodes, i.e. $Y_i = D_i W_i$; (ii) Forward Transformation (FT), which operates on the $U$ generators with a tree up-traversal. This stage performs $F_i = U_i^{*} W_i$ for leaf nodes, and for inner nodes $F_i = R_{c_1}^{*} F_{c_1} + R_{c_2}^{*} F_{c_2}$ is computed using the results $F_{c_1}$ and $F_{c_2}$ from the children; (iii) Multiplication Phase (MP), which uses the $B$ generators of leaf and inner nodes to compute $M_i = B_{ij} F_j$; (iv) Back Transformation (BT), which involves computations on the $U$ generators with a tree down-traversal. $R_{c_1} M_i$ and $R_{c_2} M_i$ are computed on an inner node $i$ with children $c_1$ and $c_2$, and the results are added to $M_{c_1}$ and $M_{c_2}$; for a leaf node $i$, $U_i M_i$ is computed and accumulated into the final result $Y_i$. Table I summarizes all the stages.
Notation. Throughout the paper we use the following notation: $N$ is the number of rows (or columns) of $A$ and the number of rows in $W$; $k$ is the number of columns in $W$; $d$ is the HSS-tree depth; $m$ is the leaf node size; $r$ is the rank of approximation; the number of leaf nodes is $2^{d}$; and the number of inner nodes, which are all the nodes except the root and the leaf nodes, is $2^{d} - 2$. The relationship between the depth of an HSS-tree and the other variables is $N = m\,2^{d}$, i.e. $d = \log_2(N/m)$.
Table I: Operations performed in each evaluation stage.

Stages | DM | FT | MP | BT
Leaf nodes | $Y_i = D_i W_i$ | $F_i = U_i^{*} W_i$ | $M_i = B_{ij} F_j$ | $Y_i = Y_i + U_i M_i$
Inner nodes | - | $F_i = R_{c_1}^{*} F_{c_1} + R_{c_2}^{*} F_{c_2}$ | $M_i = B_{ij} F_j$ | $M_{c_t} = M_{c_t} + R_{c_t} M_i,\ t = 1, 2$
III The Evaluation Performance Model and Evaluation with CDS
The objective of HSS algorithms is to reduce the execution time of operations involving structured dense matrices by compressing $A$ and performing the matrix operation on the compressed matrix $\tilde{A}$. Improving the performance of the evaluation phase is the main objective of HSS matrix implementations, and it is also the primary objective of this work. In this section, we propose a novel data layout, called the compressed data-sparse storage format (CDS), to improve the performance and scalability of HSS evaluations for matrix-matrix multiplications on multicore architectures. Prior to discussing CDS, we derive a performance model for the evaluation phase and introduce MatRox's mixed-rank heuristic for finding an optimal HSS-tree depth for faster evaluation.
III-A Performance model and optimal depth
The following discusses the performance model for evaluation and the heuristic used in the ModelEval phase of MatRox for determining the best depth.
A memory-bound evaluation phase. To model the performance and memory access behaviour of the HSS evaluation phase, we use operational intensity (OI) from the roofline model [13], which refers to operations per byte of DRAM traffic. To compute OI, we divide the number of floating point operations by the memory accesses after they have been filtered by the cache hierarchy. Data reuse in the evaluation phase in MatRox only occurs in the FT and BT stages, as shown in Table I, and only on a small part of the data. We account for this reuse in our model by counting the DRAM accesses for these data only once. Table II and the following show how we compute $\mathit{flops}$ and $\mathit{mem}$:
$\mathit{flops} = \mathit{flops}_{DM} + \mathit{flops}_{FT} + \mathit{flops}_{MP} + \mathit{flops}_{BT}$  (5)

$\mathit{mem} = \mathit{mem}_{DM} + \mathit{mem}_{FT} + \mathit{mem}_{MP} + \mathit{mem}_{BT}$  (6)
For simplicity, this analysis assumes all the off-diagonal blocks are approximated with rank $r$. We explain the process of computing $\mathit{flops}$ and $\mathit{mem}$ for FT as an example. At a leaf node $i$, the FT stage computes $F_i = U_i^{*} W_i$, with $2mrk$ floating point operations and $mr + mk + rk$ memory accesses. For an inner node $i$ with children $c_1$ and $c_2$, FT performs $F_i = R_{c_1}^{*} F_{c_1} + R_{c_2}^{*} F_{c_2}$ using the computed $F_{c_1}$ and $F_{c_2}$ from its children; $4r^{2}k$ floating point operations and $2r^{2} + 3rk$ memory accesses occur for each inner node.
Table II: Floating point operations and memory accesses per node in each stage.

Stages | Floating point operations | Memory accesses
DM (per leaf) | $2m^{2}k$ | $m^{2} + 2mk$
FT (per leaf / per inner node) | $2mrk$ / $4r^{2}k$ | $mr + mk + rk$ / $2r^{2} + 3rk$
MP (per node) | $2r^{2}k$ | $r^{2} + 2rk$
BT (per inner node / per leaf) | $4r^{2}k$ / $2mrk$ | $3rk$ / $mk + rk$ (generators counted once, in FT)
To analyze the behavior of the evaluation phase, we obtain $OI_{\min} = P_{\max} / B_{\max}$, which is the minimum operational intensity required to achieve maximum performance on the target architecture, and compare it to the evaluation algorithm's OI. $P_{\max}$ and $B_{\max}$ are the peak floating point performance and the peak memory bandwidth (over all cores) of the architecture, respectively. The algorithm is memory-bound when its OI is less than $OI_{\min}$ and is compute-bound otherwise. As shown in Figure 2, the HSS evaluation phase easily becomes memory-bound with increasing depth. In the following, we use the above to determine the best depth for fast evaluation.
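The per-node counts in Table II translate directly into an OI estimate. The sketch below uses our reconstructed counts (plain GEMM counting, with the $U$/$R$ generators charged to DRAM only once across FT and BT); the exact constants in MatRox's model may differ:

```cpp
// Estimated flops and DRAM words for one HSS evaluation, assuming a fixed
// rank r, leaf size m = N / 2^d, and k right-hand sides (Table II counts).
struct Cost { double flops, words; };

Cost evaluation_cost(long N, int d, long r, long k) {
  const long leaves = 1L << d, inner = leaves - 2, m = N / leaves;
  const double bblocks = 2.0 * leaves - 2;      // one B block per non-root node
  Cost c = {0.0, 0.0};
  c.flops += leaves * 2.0 * m * m * k;          // DM: Y_i = D_i * W_i
  c.words += leaves * ((double)m * m + 2.0 * m * k);
  c.flops += leaves * 2.0 * m * r * k + inner * 4.0 * r * r * k;  // FT
  c.words += leaves * ((double)m * r + (double)m * k + (double)r * k)
           + inner * (2.0 * r * r + 3.0 * r * k);
  c.flops += bblocks * 2.0 * r * r * k;         // MP: M_i = B_ij * F_j
  c.words += bblocks * ((double)r * r + 2.0 * r * k);
  c.flops += inner * 4.0 * r * r * k + leaves * 2.0 * m * r * k;  // BT
  c.words += inner * 3.0 * r * k                // U/R already counted in FT
           + leaves * ((double)m * k + (double)r * k);
  return c;
}

// Roofline check: the phase is memory-bound iff OI < OI_min = Pmax / Bmax.
bool memory_bound(Cost c, double peak_flops, double peak_bw_bytes) {
  const double oi = c.flops / (8.0 * c.words);  // 8 bytes per double
  return oi < peak_flops / peak_bw_bytes;
}
```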
Finding the optimal depth. In the ModelEval phase of MatRox we execute a mixed-rank heuristic, shown in Algorithm 2, that finds an optimal depth, $d_{opt}$, for fast evaluation. Line 5 in Algorithm 2 computes a theoretical execution time for evaluation using the following, based on the roofline model:
$T(d, r) = \max\left(\frac{\mathit{flops}}{P_{\max}},\ \frac{8 \cdot \mathit{mem}}{B_{\max}}\right)$  (7)
The inner loop of Algorithm 2 finds the best-performing depth for a specific fixed-rank HSS-tree with rank $r$ and stores that depth in the array $d_{best}$. The array is populated with the best-performing depth for HSS-trees with different fixed ranks. Finally, in Line 9, the depth that most frequently appears in $d_{best}$ is returned as $d_{opt}$. In the algorithm, $d$ is the depth, $r$ is the rank, $r_{\max}$ is the maximum possible rank of the off-diagonal blocks and is bounded by the dimension of these blocks, and $d_{\max}$ is the maximum possible depth and is bounded by the matrix dimension. Even though MatRox is implemented based on adaptive-rank, because the off-diagonal blocks in adaptive-rank methods are approximated with different ranks, our heuristic uses theoretical times from multiple fixed-rank HSS-trees and then chooses the most frequent best-performing depth amongst all the fixed-rank HSS-trees. This heuristic works well in practice and finds an optimal depth in all our experiments. Figure 3 demonstrates this by showing the experimental time from MatRox's evaluation phase and the theoretical time for four fixed-rank trees. The circled points are the values stored in $d_{best}$ by Algorithm 2 when executed for these ranks. Depth 7 appears most frequently in the array, thus $d_{opt}$ will be 7. As shown in the figure, the experimental results also achieve the best performance with depth 7, which shows that the heuristic closely follows the experiments.
We should note that the cost of Algorithm 2, i.e. the ModelEval stage in MatRox, is negligible because the algorithm only operates on scalars; the matrix data is never required for computations in the model.
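A compact sketch of this heuristic, reusing evaluation_cost from the previous sketch; the rank grid, the bound checks, and the tie-breaking below are our assumptions rather than the exact Algorithm 2:

```cpp
#include <algorithm>
#include <map>

// Mixed-rank heuristic (our sketch of Algorithm 2): for several fixed ranks,
// pick the depth minimizing the roofline time of Equation 7, then return the
// depth that wins most often across ranks.
int model_eval(long N, long k, int d_max, long r_max,
               double peak_flops, double peak_bw) {
  std::map<int, int> votes;
  for (long r = 8; r <= r_max; r *= 2) {            // a sample of fixed ranks
    int best_d = 1;
    double best_t = 1e300;
    for (int d = 1; d <= d_max && (N >> d) >= r; ++d) {  // leaf must fit rank r
      Cost c = evaluation_cost(N, d, r, k);
      double t = std::max(c.flops / peak_flops, 8.0 * c.words / peak_bw);
      if (t < best_t) { best_t = t; best_d = d; }   // best depth for this rank
    }
    ++votes[best_d];
  }
  int d_opt = 1, top = 0;                           // most frequent winner
  for (const auto& v : votes)
    if (v.second > top) { top = v.second; d_opt = v.first; }
  return d_opt;
}
```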
III-B Evaluation with the novel CDS storage format
As discussed in the preliminaries section, the compression of the data-sparse HSS matrix creates submatrices to be accessed during evaluation. Existing implementations of HSS evaluation use a linked list to store these submatrices. However, a linked list does not guarantee consecutive memory accesses and negatively affects the performance of the inherently memory-bound HSS evaluation. In the following, we first present a novel storage format for the evaluation phase with the objective of increasing spatial locality. The computation pattern of the DM, FT, MP, and BT stages in evaluation is then analyzed to demonstrate that evaluation benefits from CDS. MatRox uses this implementation to improve the data locality of the HSS evaluation on parallel architectures. We also demonstrate that CDS improves the scalability of evaluation on multicore architectures.
The CDS storage format: We propose the CDS storage format, in which the HSS matrix, e.g. the one in Figure 1, is converted from a linked-list-based storage to a layout that improves locality for evaluation. Figure 4(a) shows how the generators and the diagonal blocks of the HSS matrix in Figure 1 are stored in the CDS format. As shown, the diagonal blocks $D_i$ and the generators $B_{ij}$ and $U_i$ are stored consecutively in the arrays Dia, BGen, and UGen respectively. Dptr, Bptr, and Uptr point to the starting location of each block in the Dia, BGen, and UGen arrays.
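The layout of Figure 4(a) maps naturally onto flat arrays; a minimal sketch follows, where the field names mirror the text and the rank array and scalar fields are our additions for a self-contained example:

```cpp
#include <vector>

// CDS: all blocks of one kind live contiguously, in the level-by-level order
// they are consumed during evaluation; *ptr[i] gives the offset of node i's
// block inside the packed array (cf. Figure 4(a)).
struct CDSMatrix {
  std::vector<double> Dia;   // leaf diagonal blocks D_i, packed back-to-back
  std::vector<double> UGen;  // U (and R) generators, level by level
  std::vector<double> BGen;  // B generators, level by level
  std::vector<long>   Dptr;  // Dptr[i]: start of D_i within Dia
  std::vector<long>   Uptr;  // Uptr[i]: start of U_i within UGen
  std::vector<long>   Bptr;  // Bptr[i]: start of B_{ij} within BGen
  std::vector<int>    rank;  // adaptive rank of each node's generators
  int depth, leaf_size, k;   // tree depth d, leaf size m, #columns of W
};
```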
The matrix blocks are stored in CDS based on the order of computation to improve spatial locality. By order of computation we are referring to the order in which $D_i$, $U_i$, and $B_{ij}$ are accessed during parallel execution of the evaluation phase. The HSS-tree represents the dependencies between the submatrices of $\tilde{A}$; thus, the execution pattern is a level-by-level traversal of the HSS-tree, and CDS stores the submatrices based on this level-by-level traversal. For example, as shown in Figure 4, the $U$ generators of sibling nodes in Figure 1 are stored in consecutive locations in the UGen array. Spatial locality in the evaluation phase is more important than temporal locality since the small submatrices accessed during evaluation are seldom reused: of all the diagonal blocks and generators accessed during evaluation, only the $U$ generators are reused, because data is reused only in the FT and BT stages.
Figure 4(b) shows how we use CDS in the evaluate-with-CDS stage in MatRox. yoffset, woffset, moffset, and foffset are used for indexing into the Y, W, MP, and FT arrays respectively. The evaluation begins with the DM stage where, as shown in lines 3–4 of Figure 4(b), the diagonal blocks in CDS, i.e. the Dia array, are accessed. The iterations of this loop are independent of each other and the output of the stage is written directly to the final output array Y. The FT stage is shown in lines 6–13, where each matrix $W_i$ is multiplied with the transpose of its corresponding $U_i$ matrix in the UGen array. Since all nodes in a level are independent, all iterations of the inner loop in line 8 run in parallel. Lines 16–17 show the MP stage, where the $B$ generators are accessed from BGen via the Bptr array. The MP stage uses the output of the previous stage, which is stored in the FT array. The iterations of this loop run in parallel. The last stage is BT, where the $U$ matrices are again retrieved from UGen, in reverse order. Since CDS is a contiguous data layout based on the computation order, it works as an ordered ready queue for the scheduler, and thus a generic scheduler such as omp parallel with dynamic scheduling provides load-balanced scheduling.
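A simplified sketch of the first two stages over this layout, using the CDSMatrix sketch above and assuming a fixed rank and row-major blocks (the code in Figure 4(b) handles adaptive ranks and all four stages):

```cpp
// DM and leaf-level FT over CDS. Y and W are N x k; F holds one r x k
// block per leaf. We assume leaf U blocks come first in UGen because CDS
// stores the tree level by level, bottom level first.
void dm_ft(const CDSMatrix& A, const double* W, double* Y, double* F) {
  const int m = A.leaf_size, k = A.k;
  const long leaves = 1L << A.depth;
  const int r = A.rank[0];                    // fixed-rank simplification

  // DM: independent dense leaf blocks, Y_i = D_i * W_i.
  #pragma omp parallel for schedule(dynamic)
  for (long i = 0; i < leaves; ++i) {
    const double* D = &A.Dia[A.Dptr[i]];      // m x m block, contiguous
    for (int a = 0; a < m; ++a)
      for (int c = 0; c < k; ++c) {
        double s = 0.0;
        for (int b = 0; b < m; ++b)
          s += D[a * m + b] * W[(i * m + b) * k + c];
        Y[(i * m + a) * k + c] = s;
      }
  }

  // FT, leaf level: F_i = U_i^T * W_i (U_i is m x r).
  #pragma omp parallel for schedule(dynamic)
  for (long i = 0; i < leaves; ++i) {
    const double* U = &A.UGen[A.Uptr[i]];
    double* Fi = &F[i * (long)r * k];
    for (int a = 0; a < r; ++a)
      for (int c = 0; c < k; ++c) {
        double s = 0.0;
        for (int b = 0; b < m; ++b)
          s += U[b * r + a] * W[(i * m + b) * k + c];
        Fi[a * k + c] = s;
      }
  }
  // Inner FT levels, MP, and BT walk UGen/BGen in the same packed order;
  // omitted here for brevity.
}
```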
Scalability: CDS improves memory access locality as the number of cores increases, sustaining scalability on multicore processors. A large number of main memory accesses can reduce performance on multicore processors [14], and this becomes more critical in memory-bound problems such as the HSS matrix-matrix multiplication in the evaluation phase. As shown in Figure 5, we use the average memory access latency [15] of the evaluation phase for different numbers of cores (1 to 12) to demonstrate that CDS enhances scalability compared to existing data layouts such as [5]. The average memory access latency, which is a measure of locality, is computed from the L1 cache, TLB, and last level cache (LLC) misses using PAPI. Figure 5 shows the direct relation between the average memory access latency improvement in MatRox and the achieved speedup of the evaluation phase compared to GOFMM. The coefficient of determination ($R^2$) is 0.988, which shows a near-perfect linear relation between speedup and average memory access latency. Since CDS follows the HSS-tree level-by-level traversal pattern, the memory access time is reduced as the number of cores scales.
Storage and conversion cost: CDS has low storage and conversion costs. Storage cost refers to the extra memory locations used in CDS, and conversion cost is the time spent converting a linked-list-based storage to the CDS format. The memory needed for storing the $D$, $U$, and $B$ blocks is the same as in existing storage approaches such as [5]. The only storage cost specific to CDS is the memory required to store the pointer arrays Dptr, Bptr, and Uptr, and the storage cost of each is at most the number of nodes in the HSS-tree, i.e. $2^{d+1} - 1$. CDS has a conversion cost that occurs in the compression stage, where the HSS matrix needs to be re-stored. Since the rank is unknown in adaptive-rank compression, MatRox first stores the data in a temporary buffer and, after the rank of each block is known, stores the low-rank matrices in the CDS format. The conversion cost for the $U$ and $B$ generators is proportional to their total sizes. The overhead of this conversion is small and is on average 10.8 percent of the compression time for the evaluated matrices in this paper. More details are in Section V.
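A sketch of this packing step, with the temporary buffer modeled as a vector of per-node vectors already ordered level by level (the actual MatRox buffer layout may differ):

```cpp
#include <cstring>
#include <vector>

// Pack per-node blocks into one contiguous CDS array plus its pointer
// array; a single linear pass, which is why the conversion cost stays low.
void pack_blocks(const std::vector<std::vector<double>>& tmp,
                 std::vector<double>& packed, std::vector<long>& ptr) {
  long off = 0;
  ptr.resize(tmp.size());
  for (std::size_t i = 0; i < tmp.size(); ++i) {
    ptr[i] = off;                             // start of node i's block
    off += (long)tmp[i].size();
  }
  packed.resize(off);
  for (std::size_t i = 0; i < tmp.size(); ++i)
    if (!tmp[i].empty())
      std::memcpy(packed.data() + ptr[i], tmp[i].data(),
                  tmp[i].size() * sizeof(double));
}
```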
IV Compression in MatRox
Compression in HSS algorithms consists of (i) permutation, used to reveal the low-rank structure of the matrix, (ii) randomized sampling, used to reduce the computation overhead, and (iii) low-rank factorization of the subsampled matrices. The permutation and low-rank factorization in MatRox's Compress-to-CDS are the same as in GOFMM. However, MatRox uses a different sampling method, uniform sampling, which is discussed in this section. In MatRox, the compression phase also has an additional step at the end to store the compressed data in CDS; the overhead of this step was discussed in Section III-B as the conversion cost.
In the following we explain the process of uniform sampling in MatRox. As stated in Section II, compression applies ID to the matrix blocks in each node and computes a factorization of rank $r$ by expressing each matrix block through a subset of its columns [11]. These tall-and-skinny blocks, whose row counts grow with the matrix dimension, can lead to a costly ID [6]. Therefore, randomized sampling methods are used a priori to reduce this cost. For a leaf node $i$ and its sibling $j$, consider the matrix block $A(\bar{I_i}, I_i)$, where $\bar{I_i}$ denotes the index set of all rows except those in $I_i$. MatRox samples these matrix rows uniformly at random to generate a sampled matrix $A_s$. The number of rows in $A_s$ depends on the number of samples. This subsampled matrix is then approximated by ID as $A_s \approx A_s(:, J_i) X_i$. By applying the same operation to the sibling node $j$, we get the analogous approximation with column indices $J_j$. Finally, we obtain the generator $B_{ij} = A(J_i, J_j)$, and $B_{ji} = B_{ij}^{*}$ because of the symmetric matrices used in MatRox. The number of samples may vary based on the dataset, but a sample of size twice the number of columns of each block works well in practice. A similar approach is taken for internal nodes.
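A sketch of the row-sampling step for one node, under the stated twice-the-columns heuristic; sampling with replacement and the [lo, hi) index-range interface are our simplifications:

```cpp
#include <random>
#include <vector>

// Draw s = 2 * n_cols row indices uniformly at random from all rows
// outside the node's own index range [lo, hi); ID then runs on the
// s x n_cols sample instead of the full (N - n_cols) x n_cols block.
std::vector<long> sample_rows(long N, long lo, long hi, std::mt19937& gen) {
  const long n_cols = hi - lo, s = 2 * n_cols;
  std::uniform_int_distribution<long> pick(0, N - n_cols - 1);
  std::vector<long> rows(s);
  for (long& row : rows) {
    long x = pick(gen);
    row = (x < lo) ? x : x + n_cols;  // skip over the node's own rows
  }
  return rows;
}
```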
In Figure 6 we use leverage scores to show that uniform sampling performs well for the tested datasets. Following [8], row-leverage scores, the Euclidean norms of the rows of the right singular vectors, can be used as a measure of the importance of rows. A uniform distribution of leverage scores indicates that uniform sampling will perform well [8]. We compute the leverage scores for the rows of random leaf nodes for matrices from both machine learning (susy) and scientific applications (hepmass, K03, K06, K08, and K09). The cumulative sums of the normalized leverage scores are shown in Figure 6. As shown, the row-leverage scores form a close-to-uniform distribution, thus uniform sampling works well in practice. Prior work has also shown that uniform sampling performs well for many real-world datasets [16, 17, 18]; all our experiments, which include 20 different matrices from different domains, also demonstrate this. Uniform sampling in MatRox has a lower overhead than the sampling in STRUMPACK, which uses random projection methods, and in GOFMM, which samples based on nearest-neighbor information. Uniform sampling also does not need full access to all elements of the matrix.
V Experimental Results
We compare the performance of MatRox with STRUMPACK [6] and GOFMM [5], two state-of-the-art parallel implementations of HSS algorithms. For compression, the user-specified error tolerance is 1E-5 and the maximum possible rank with adaptive-rank is the same for all tools. The number of columns in $W$ is $k = 2048$, and the relative error in evaluation is measured in the Frobenius norm, similar to GOFMM:
$\epsilon = \frac{\|\tilde{A}W - AW\|_F}{\|AW\|_F}$  (8)
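For reference, a direct implementation of this error measure, assuming both products are stored contiguously:

```cpp
#include <cmath>

// Relative Frobenius-norm error of Equation 8, given the exact product AW
// and the approximate product AtW, both of size N x k.
double relative_error(const double* AW, const double* AtW, long N, long k) {
  double num = 0.0, den = 0.0;
  for (long i = 0; i < N * k; ++i) {
    const double diff = AtW[i] - AW[i];
    num += diff * diff;
    den += AW[i] * AW[i];
  }
  return std::sqrt(num / den);
}
```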
Our dataset includes 20 matrices, listed in Table IV. The K02 to K16 datasets are from [5] and the rest are kernel matrices generated using real-world datasets from the UCI repository [19]. K02, K03, and K12–K14 are from scientific applications and resemble inverse matrices and Hessian operators from optimization and uncertainty problems. All other matrices are kernel matrices assembled from a kernel function and data points, i.e. $A_{ij} = \mathcal{K}(x_i, x_j)$, in which $\mathcal{K}$ is a kernel function and $x_i, x_j$ are data points from the dataset. K04–K10 are kernel matrices assembled from mostly scientific datasets. Table IV assigns an ID to each matrix. Matrices with IDs 16–20 are generated from covtype, susy, and hepmass; covtype and susy are machine learning datasets. We use a Gaussian kernel [1] with a fixed bandwidth for these datasets. Matrices with IDs 1–18 have the same dimension, while matrices 19 and 20 have different dimensions. To generate matrices 16–20 we randomly sampled from their application data points to fit in memory. GOFMM supports kernel matrices assembled from data points as well as general SPD matrices, thus we use all the datasets for comparison with GOFMM. The kernel function and data points are required inputs to STRUMPACK, thus we compare MatRox with STRUMPACK on datasets 16–20, for which we had access to the data points.
MatRox is implemented in C++ with double precision (DP) and uses OpenMP 4.0 [20] for shared memory parallelism. All code is compiled with icc/icpc 16.0.3. STRUMPACK and GOFMM are installed using their recommended default configurations. We report the median of 5 executions for all experiments. The tested architectures are listed in Table III.
Table III: Tested architectures.

Microarchitecture | Haswell | KNL
Frequency | 2.5 GHz | 1.4 GHz
Physical cores | 12 | 68
Last-level cache | 30 MB | 34 MB
Peak DP performance | 480 Gflop/s | 3264.4 Gflop/s
Peak memory bandwidth | 68 GB/s | 115.2 GB/s
Table IV: Matrix IDs.

Matrices | K02–K16 | covtype | susy | hepmass
ID | 1–15 | 16 | 17 | 18–20
In the following, we first show end-to-end (compression and evaluation) results from MatRox with the optimal depth recommended by the mixed-rank heuristic, and compare them with STRUMPACK and GOFMM at their recommended tuned depths. To demonstrate the performance benefits provided by CDS for evaluation, we then compare MatRox with STRUMPACK and GOFMM using the same HSS-tree depth.
V-A MatRox end-to-end results with optimal depth
Table V provides a comprehensive comparison between MatRox, GOFMM, and STRUMPACK for both performance and accuracy, in which all tools use their recommended configurations. MatRox uses the mixed-rank heuristic to select an optimal depth $d_{opt}$ for evaluation. As shown in the table, $d_{opt}$ varies based on the architecture and the matrix dimensions because of the mixed-rank heuristic. $OI_{\min}$ on KNL and Haswell is 28.3 and 7.1 respectively; thus, based on Figure 2, because the minimum OI needed to reach maximum performance on KNL is higher than on Haswell, the evaluation algorithm obtains its best performance with smaller depths on KNL.
As shown in Table V, MatRox evaluation is on average faster than GOFMM on both Haswell and KNL, and it also obtains speedups over STRUMPACK on both architectures. The performance improvement in the evaluation phase comes from the optimal depth provided by the mixed-rank heuristic and the proposed storage format. Compression in MatRox is also faster than in the other tools because of uniform sampling. For example, compression with MatRox on Haswell is on average 2.2 and 3.5 times faster than GOFMM and STRUMPACK respectively. The end-to-end execution of the HSS algorithm, which includes both compression (C) and evaluation (E), is likewise accelerated with MatRox over both GOFMM and STRUMPACK on Haswell and KNL. The compression times reported for MatRox include the conversion time to CDS.
Table V also shows that the depth recommended by the mixed-rank heuristic does not negatively affect accuracy. As shown in the table, the accuracy of MatRox is close to GOFMM's and is better than STRUMPACK's, which fails to compress matrices 18 and 20 on both KNL and Haswell. MatRox uses the same permutation method [21] as GOFMM and then compresses the off-diagonal blocks with adaptive-rank to reach the user-specified tolerance. STRUMPACK uses a different permutation method, i.e. recursive two-means [22], based on k-means clustering. Thus, the permutation method is more likely to affect accuracy, and the observed difference in depth between tools does not have a noticeable effect on accuracy because adaptive-rank algorithms are used.
Table V: Compression (C) and evaluation (E) times and relative evaluation errors (Err) with each tool's recommended configuration. N/A: STRUMPACK requires the kernel function and data points, which are not available for matrices 1–15.

Haswell:
ID | $d_{opt}$ | MatRox C / E / Err | GOFMM C / E / Err | STRUMPACK C / E / Err
1 | 7 | 0.22 / 0.10 / 2E-3 | 0.42 / 0.35 / 9E-4 | N/A
2 | 7 | 0.45 / 0.08 / 4E-8 | 0.23 / 0.31 / 5E-8 | N/A
3 | 7 | 0.39 / 0.12 / 3E-5 | 1.43 / 0.36 / 4E-5 | N/A
4 | 7 | 0.41 / 0.20 / 2E-4 | 1.47 / 0.44 / 9E-5 | N/A
5 | 7 | 0.81 / 0.44 / 2E-1 | 1.89 / 0.82 / 2E-1 | N/A
6 | 7 | 0.38 / 0.13 / 4E-6 | 1.42 / 0.36 / 3E-6 | N/A
7 | 7 | 0.50 / 0.26 / 4E-5 | 1.54 / 0.49 / 4E-5 | N/A
8 | 7 | 0.46 / 0.26 / 6E-5 | 1.53 / 0.51 / 5E-5 | N/A
9 | 7 | 0.31 / 0.11 / 8E-16 | 1.45 / 0.35 / 1E-15 | N/A
10 | 7 | 0.18 / 0.07 / 7E-5 | 0.29 / 0.31 / 3E-5 | N/A
11 | 7 | 0.21 / 0.11 / 4E-4 | 0.37 / 0.35 / 2E-4 | N/A
12 | 7 | 0.33 / 0.14 / 3E-4 | 0.56 / 0.37 / 3E-4 | N/A
13 | 7 | 0.36 / 0.14 / 3E-4 | 0.43 / 0.38 / 2E-4 | N/A
14 | 7 | 0.90 / 0.60 / 2E+0 | 1.90 / 0.97 / 3E+0 | N/A
15 | 7 | 1.00 / 0.61 / 1E+0 | 2.19 / 0.76 / 1E+0 | N/A
16 | 7 | 0.79 / 0.15 / 7E-6 | 0.60 / 0.54 / 1E-6 | 1.28 / 0.56 / 1E-5
17 | 7 | 0.46 / 0.09 / 4E-9 | 0.55 / 0.46 / 9E-9 | 1.31 / 0.59 / 1E-5
18 | 7 | 1.17 / 0.31 / 9E-5 | 1.76 / 0.99 / 1E-4 | 5.14 / 2.53 / 1E+0
19 | 6 | 0.57 / 0.18 / 7E-5 | 0.79 / 0.50 / 7E-5 | 2.36 / 1.20 / 7E-1
20 | 8 | 3.13 / 0.56 / 1E-4 | 3.65 / 1.97 / 2E-4 | 13.5 / 4.75 / 3E+0

KNL:
ID | $d_{opt}$ | MatRox C / E / Err | GOFMM C / E / Err | STRUMPACK C / E / Err
1 | 5 | 0.99 / 0.16 / 7E-4 | 1.46 / 0.66 / 7E-4 | N/A
2 | 5 | 1.15 / 0.11 / 4E-8 | 0.72 / 0.51 / 5E-8 | N/A
3 | 5 | 1.14 / 0.16 / 4E-5 | 2.35 / 0.58 / 4E-5 | N/A
4 | 5 | 1.35 / 0.21 / 7E-5 | 2.33 / 0.69 / 1E-4 | N/A
5 | 5 | 2.13 / 0.42 / 2E-1 | 3.32 / 0.90 / 2E-1 | N/A
6 | 5 | 1.10 / 0.16 / 6E-6 | 2.33 / 0.56 / 5E-6 | N/A
7 | 5 | 1.51 / 0.28 / 4E-5 | 2.58 / 0.79 / 5E-5 | N/A
8 | 5 | 1.57 / 0.28 / 5E-5 | 2.69 / 0.78 / 6E-5 | N/A
9 | 5 | 1.04 / 0.15 / 9E-16 | 2.36 / 0.56 / 9E-16 | N/A
10 | 5 | 0.64 / 0.10 / 2E-5 | 0.95 / 0.56 / 2E-5 | N/A
11 | 5 | 0.90 / 0.15 / 2E-4 | 0.99 / 0.66 / 1E-4 | N/A
12 | 5 | 1.12 / 0.14 / 2E-4 | 1.15 / 0.63 / 2E-4 | N/A
13 | 5 | 1.11 / 0.14 / 2E-4 | 1.13 / 0.65 / 2E-4 | N/A
14 | 5 | 1.94 / 0.38 / 2E+0 | 3.60 / 0.93 / 3E+0 | N/A
15 | 5 | 2.05 / 0.38 / 1E+0 | 3.97 / 0.88 / 1E+0 | N/A
16 | 5 | 2.69 / 0.31 / 4E-7 | 1.81 / 0.98 / 1E-6 | 45.2 / 2.22 / 1E-5
17 | 5 | 1.61 / 0.21 / 3E-9 | 1.27 / 0.71 / 6E-9 | 45.3 / 2.45 / 1E-5
18 | 5 | 2.47 / 0.41 / 8E-5 | 3.29 / 0.86 / 1E-4 | 97.2 / 8.43 / 2E+0
19 | 4 | 1.78 / 0.32 / 6E-5 | 2.51 / 0.73 / 6E-5 | 32.8 / 4.65 / 7E-1
20 | 6 | 4.78 / 0.52 / 1E-4 | 6.70 / 1.65 / 2E-4 | 261 / 16.4 / 5E+0
V-B The storage format and scalability
To demonstrate the efficiency of CDS, the scalability and performance of the evaluation phase in MatRox are compared to GOFMM and STRUMPACK. To isolate the performance benefits of CDS, an HSS-tree with the same depth is used for all three tools, which ensures that the sizes and numbers of matrix sub-blocks are the same across tools for a fair comparison. Since CDS is used during evaluation, the reported times are for the evaluation stage. We show how CDS outperforms the other two libraries for matrices with different data-sparsity and also demonstrate that CDS leads to a scalable evaluation phase on different architectures.
Figure 7 and Table VI demonstrate the performance benefits of CDS in MatRox compared to GOFMM and STRUMPACK on Haswell and KNL. MatRox benefits from the better memory layout provided by CDS and is faster than both GOFMM and STRUMPACK on both architectures. The performance benefits obtained from CDS depend on the architecture: since KNL has a higher $OI_{\min}$ than Haswell, better speedups are achieved on KNL.
Even though CDS improves the performance of the evaluation phase for all datasets shown in Figure 7 and Table VI, the amount of speedup achieved differs based on the available data-sparsity in the matrix. We compute the average rank of the off-diagonal blocks after the matrix has been compressed (with adaptive-rank) as a measure of the available data-sparsity in the matrix/dataset; the higher the average rank, the less data-sparse the dataset. The highest reported speedups belong to matrices that are very data-sparse, such as matrices 2 and 10 with average ranks of 1 and 2 respectively. Matrices 5, 14, and 15, which have the largest average ranks, in order 211, 253, and 255, show relatively lower speedups. As the compressed matrix becomes sparser, operational intensity decreases and thus memory accesses dominate the execution time. CDS improves the memory access time in MatRox and makes it more efficient for data-sparse matrices.
Figure 8 shows the scalability benefits of CDS for different matrices and architectures. While MatRox scales well on both architectures, GOFMM and STRUMPACK do not scale, or scale weakly, as the number of cores increases, especially on KNL. As explained in Section III-B and in Figure 5, CDS improves spatial locality and reduces the memory access time with an increasing number of cores, while scalability in the other libraries is limited by memory access times.
As discussed in Section III-B, the compression phase of MatRox has an additional stage that converts the linked-list-based storage to CDS, which we call the conversion time. Figure 9 shows the ratio of conversion time to the entire compression time on Haswell. As shown, the conversion cost is small, on average 10.8 percent of the entire compression phase on Haswell; the same trend is observed on KNL. Also, datasets are often compressed once with HSS compression and then used for repeated evaluations, thus this cost amortizes.
VI Related work
Hierarchical matrices are used to approximate matrix computations in almost linear complexity. Hackbusch first introduced $\mathcal{H}$-matrices [23, 24], in which the matrix is partitioned hierarchically using a cluster tree and parts of the off-diagonal blocks are approximated. Later, $\mathcal{H}^2$-matrices were introduced [25], which use nested basis matrices to further reduce the computational complexity of dense matrix computations. Many low-rank formats exist and most are classified as $\mathcal{H}$- or $\mathcal{H}^2$-matrices. For example, Hierarchically Off-Diagonal Low-Rank (HODLR) [26], which approximates all off-diagonal blocks in low rank, is considered a specific subclass of $\mathcal{H}$-matrices. HSS is similar to HODLR but uses nested basis matrices and thus is a subclass of $\mathcal{H}^2$-matrices.

Numerous algorithms have been studied for the compression of hierarchical matrices, some of which include the truncated SVD, rank-revealing QR [27], rank-revealing LU [28], CUR factorization [29], the Nyström method [30], adaptive cross approximation [31], and the interpolative decomposition (ID) [11]. MatRox uses ID in its compression phase. Randomized algorithms [10] are frequently used for faster low-rank approximations during compression. For example, Martinsson [11] uses randomized sampling for the construction of HSS matrices, in which elements are sampled from the matrix with matrix-vector multiplications. ASKIT [21] and GOFMM implement neighbor-based importance sampling, and STRUMPACK [6] uses random projection [8]. Uniform sampling has been shown to perform well for many real-world problems [16, 17] and is used in MatRox.
During evaluation, hierarchical matrices can be used to reduce the complexity of numerous matrix operations. The frontal matrices in multifrontal methods often have a low-rank structure [3], thus hierarchical matrices have been studied to accelerate matrix factorization and to develop fast solvers in work such as [9, 32, 26]. $\mathcal{H}$-matrices are used to build robust preconditioners for GMRES [33]. Ghysels et al. [34] introduce a parallel and fully algebraic preconditioner based on HSS. Other work has improved the complexity of matrix operations such as matrix inversion [35] and matrix-vector/matrix-matrix multiplication [32]. The evaluation phases in STRUMPACK and GOFMM support matrix-vector and matrix-matrix operations, which we use for comparison.
Hierarchical matrix algorithms have been implemented on various platforms ranging from shared memory [6, 5, 7] and distributed memory [36, 12, 37] to manycore architectures such as GPUs [38]. ASKIT and GOFMM provide a parallel fast multipole method on multicore processors. STRUMPACK [6, 12] and Hsolver [39] use HSS for accelerating multifrontal factorization on both distributed architectures and multicore processors. STRUMPACK also supports matrix-matrix multiplication on distributed architectures [12]. Existing parallel codes take advantage of tree parallelism for an efficient parallel implementation. In particular, existing multicore frameworks such as GOFMM propose novel scheduling in combination with tree traversals. MatRox, in contrast, focuses on improving the memory access time of HSS matrix-matrix multiplication on multicore processors by proposing a performance model and a new storage format.
VII Conclusion
We introduce MatRox to improve the performance of HSS algorithms on shared memory multicore architectures. We propose a novel storage format, CDS, to improve the performance and scalability of HSS evaluation. Our mixed-rank heuristic finds the optimal HSS-tree depth for faster HSS evaluation, eliminating the need to tune for the best depth. Uniform sampling is also used for fast compression. Our experiments show that MatRox outperforms state-of-the-art HSS libraries on multicore processors.
References
 [1] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, pp. 1171–1220, 2008.
 [2] S. Börm and J. Garcke, “Approximating Gaussian processes with H²-matrices,” in European Conference on Machine Learning. Springer, 2007, pp. 42–53.
 [3] S. Chandrasekaran, P. Dewilde, M. Gu, and N. Somasunderam, “On the numerical rank of the off-diagonal blocks of Schur complements of discretized elliptic PDEs,” SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 5, pp. 2261–2290, 2010.
 [4] W. Hackbusch, Hierarchical matrices: algorithms and analysis. Springer, 2015, vol. 49.
 [5] C. D. Yu, J. Levitt, S. Reiz, and G. Biros, “Geometry-oblivious FMM for compressing dense SPD matrices,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2017, p. 53.
 [6] P. Ghysels, X. S. Li, F.-H. Rouet, S. Williams, and A. Napov, “An efficient multicore implementation of a novel HSS-structured multifrontal solver using randomized sampling,” SIAM Journal on Scientific Computing, vol. 38, no. 5, pp. S358–S384, 2016.
 [7] R. Kriemann, “Parallel H-matrix arithmetics on shared memory systems,” Computing, vol. 74, no. 3, pp. 273–297, 2005.
 [8] M. W. Mahoney et al., “Randomized algorithms for matrices and data,” Foundations and Trends® in Machine Learning, vol. 3, no. 2, pp. 123–224, 2011.
 [9] J. Xia, S. Chandrasekaran, M. Gu, and X. S. Li, “Fast algorithms for hierarchically semiseparable matrices,” Numerical Linear Algebra with Applications, vol. 17, no. 6, pp. 953–976, 2010.
 [10] N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.
 [11] P.-G. Martinsson, “A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix,” SIAM Journal on Matrix Analysis and Applications, vol. 32, no. 4, pp. 1251–1274, 2011.
 [12] F.-H. Rouet, X. S. Li, P. Ghysels, and A. Napov, “A distributed-memory package for dense hierarchically semiseparable matrix computations using randomization,” ACM Transactions on Mathematical Software (TOMS), vol. 42, no. 4, p. 27, 2016.
 [13] S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
 [14] X.-H. Sun and Y. Chen, “Reevaluating Amdahl's law in the multicore era,” Journal of Parallel and Distributed Computing, vol. 70, no. 2, pp. 183–188, 2010.
 [15] J. L. Hennessy and D. A. Patterson, Computer architecture: a quantitative approach. Elsevier, 2017.
 [16] S. Kumar, M. Mohri, and A. Talwalkar, “Sampling methods for the Nyström method,” Journal of Machine Learning Research, vol. 13, pp. 981–1006, 2012.
 [17] M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford, “Uniform sampling for matrix approximation,” in Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science. ACM, 2015, pp. 181–190.
 [18] A. Gittens and M. W. Mahoney, “Revisiting the Nyström method for improved large-scale machine learning,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 3977–4041, 2016.
 [19] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007.
 [20] OpenMP Architecture Review Board, “OpenMP 4.0 specification,” 2013.
 [21] W. B. March, B. Xiao, C. D. Yu, and G. Biros, “ASKIT: an efficient, parallel library for high-dimensional kernel summations,” SIAM Journal on Scientific Computing, vol. 38, no. 5, pp. S720–S749, 2016.

 [22] E. Rebrova, G. Chavez, Y. Liu, P. Ghysels, and X. S. Li, “A study of clustering techniques and hierarchical matrix formats for kernel ridge regression,” in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2018.
 [23] S. Börm, L. Grasedyck, and W. Hackbusch, “Introduction to hierarchical matrices with applications,” Engineering Analysis with Boundary Elements, vol. 27, no. 5, pp. 405–422, 2003.
 [24] W. Hackbusch, “A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices,” Computing, vol. 62, no. 2, pp. 89–108, 1999.
 [25] W. Hackbusch, B. Khoromskij, and S. Sauter, “On H²-matrices,” in Lectures on Applied Mathematics. Springer, 2000.
 [26] A. Aminfar, S. Ambikasaran, and E. Darve, “A fast block low-rank dense solver with applications to finite-element matrices,” Journal of Computational Physics, vol. 304, pp. 170–188, 2016.
 [27] T. F. Chan, “Rank revealing QR factorizations,” Linear Algebra and its Applications, vol. 88, pp. 67–82, 1987.
 [28] L. Miranian and M. Gu, “Strong rank revealing LU factorizations,” Linear Algebra and its Applications, vol. 367, pp. 1–16, 2003.
 [29] M. W. Mahoney and P. Drineas, “CUR matrix decompositions for improved data analysis,” Proceedings of the National Academy of Sciences, vol. 106, no. 3, pp. 697–702, 2009.
 [30] C. K. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Advances in Neural Information Processing Systems, 2001, pp. 682–688.
 [31] M. Bebendorf and S. Rjasanow, “Adaptive low-rank approximation of collocation matrices,” Computing, vol. 70, no. 1, pp. 1–24, 2003.
 [32] S. Chandrasekaran, M. Gu, and T. Pals, “A fast ULV decomposition solver for hierarchically semiseparable representations,” SIAM Journal on Matrix Analysis and Applications, vol. 28, no. 3, pp. 603–622, 2006.
 [33] B. Engquist and L. Ying, “Sweeping preconditioner for the Helmholtz equation: hierarchical matrix representation,” Communications on Pure and Applied Mathematics, vol. 64, no. 5, pp. 697–735, 2011.
 [34] P. Ghysels, X. S. Li, C. Gorman, and F.-H. Rouet, “A robust parallel preconditioner for indefinite systems using hierarchical matrices and randomized sampling,” in Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 2017, pp. 897–906.
 [35] P.-G. Martinsson and V. Rokhlin, “A fast direct solver for boundary integral equations in two dimensions,” Journal of Computational Physics, vol. 205, no. 1, pp. 1–23, 2005.
 [36] W. B. March, B. Xiao, S. Tharakan, C. D. Yu, and G. Biros, “A kernel-independent FMM in general dimensions,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’15). IEEE, 2015, pp. 1–12.
 [37] W. B. March, B. Xiao, and G. Biros, “ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions,” SIAM Journal on Scientific Computing, vol. 37, no. 2, pp. A1089–A1110, 2015.
 [38] W. B. March, B. Xiao, C. D. Yu, and G. Biros, “An algebraic parallel treecode in arbitrary dimensions,” in Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. IEEE, 2015, pp. 571–580.
 [39] S. Wang, X. S. Li, F.-H. Rouet, J. Xia, and M. V. De Hoop, “A parallel geometric multifrontal solver using hierarchically semiseparable structure,” ACM Transactions on Mathematical Software (TOMS), vol. 42, no. 3, p. 21, 2016.