Cache Blocking and Parallel Runtime Scheduling of Hierarchical Matrices

国立国会図書館永続的識別子: info:ndljp/pid/13332152

資料種別: 博士論文

著者: Deshmukh, Sameer Satish

出版者: -

出版年: 2023-09

資料形態: デジタル

ページ数・大きさ等: -

授与大学名・学位: 東京工業大学,博士(工学)

すべて見る

国立国会図書館での利用に関する注記

本資料は、掲載誌(URI)等のリンク先にある学位授与機関のWebサイトやCiNii Dissertationsから、本文を自由に閲覧できる場合があります。

資料に関する注記

一般注記：: Important scientific problems in electrostatics, acoustics and statistics can be solved using the Boundary Element Method (BEM). The coefficient matri...

書店で探す

障害者向け資料で読む

障害者向け資料を見る（1種類）

書店で探す

障害者向け資料で読む

他サービス
- テキストデータ国立国会図書館デジタルコレクションで確認する

書誌情報

この資料の詳細や典拠（同じ主題の資料を指すキーワード、著者名）等を確認できます。

デジタル

資料種別: 博士論文
タイトル: Cache Blocking and Parallel Runtime Scheduling of Hierarchical Matrices
著者・編者: Deshmukh, Sameer Satish
著者標目: Deshmukh, Sameer Satish
出版年月日等: 2023-09
出版年（W3CDTF）: 2023-09
授与機関名: 東京工業大学
授与年月日: 2023-09-22
授与年月日（W3CDTF）: 2023-09-22
報告番号: 甲第12558号
学位: 博士(工学)
本文の言語コード: eng
対象利用者: 一般
一般注記: Important scientific problems in electrostatics, acoustics and statistics can be solved using the Boundary Element Method (BEM). The coefficient matrix from a BEM discretization results in a dense matrix, leading to O(N^2) and O(N^3) time complexity for the matrix-vector product and direct factorization algorithms, respectively. The underlying geometry from which the dense matrix is generated can be exploited in order to approximate the interactions between far points. These points are expressed in the dense matrix as off-diagonals blocks. Low rank approximation of the off-diagonal blocks of such a structured dense matrix can reduce the time complexity of the matrix-vector product and direct factorization algorithms to $O(N)$ by trading off accuracy for time. The accuracy of the algorithm can be tuned to match the required accuracy of the application.Algorithmic developments of the compression, multiplication and factorization routines of such low rank approximated matrices have led to reduction in the time complexity and better accuracy for a wide variety of problems. However, the implementation of such routines on modern, highly parallel computer architectures for such routines still requires the use of numerical software that was originally written for computation of dense linear algebra. The small, irregular kernels that are prevalent in the algorithms involving low rank approximation result in inefficient execution on modern hardware.In this thesis, we propose improvements to the efficiency of the matrix-vector product and direct factorization algorithms of low rank approximated dense matrices on shared and distributed memory machines. We show that our techniques are applicable to 2D problems for modeling acoustic waves, electrostatics and statistics with acceptable accuracy for the application in question. We target the block low rank and hierarchically semi-separable low rank matrix formats for this study since they have been shown to work well with 2D problems.The first part of this thesis improves the efficiency of the matrix-vector product. The matrix-vector product is useful for problems arising from the wave equation. The low rank matrix multiplication algorithm is a key computational kernel of the matrix vector product. We design efficient cache blocking algorithms by leveraging the Execution-Cache-Memory performance model. Our portable implementation making use of parallel loops written in a high level language with architecture-specific assembly micro-kernels. This preserves the portability of our method and allows us outperform vendor optimized BLAS libraries on the Fujitsu A64FX, Intel Xeon 6148 and AMD EPYC 7502 CPUs. We improve the matrix vector product of low rank matrices by obtaining a 2x performance improvement in the low rank matrix multiplication algorithm on all CPU architectures.The second part of this thesis improves the distributed memory factorization of hierarchical matrices arising out of 2D problems from electrostatics and statistics. Specifically, we show that we achieve up to 2x improvement in weak scaling performance over the state-of-the-art in dense direct factorization on up to 128 nodes of Fugaku. We show that this improvement in speed can be achieved with similar or better accuracy than other state-of-the-art implementations making use of low rank approximation of structured dense matrices. The main reasons behind the improvement can be attributed to the use of the HSS-ULV algorithm with a distributed asynchronous runtime system. The combination of an algorithm needing $O(N)$ computation and communication with an asynchronous distributed runtime system results in better overlap of computation with communication and results in better performance.
identifier:oai:t2r2.star.titech.ac.jp:50672915
国立国会図書館永続的識別子: info:ndljp/pid/13332152
https://dl.ndl.go.jp/pid/13332152
コレクション（共通）: 障害者向け資料
コレクション（障害者向け資料：レベル1）: テキストデータ
コレクション（個別）: 国立国会図書館デジタルコレクション > デジタル化資料 > 博士論文
https://dl.ndl.go.jp/collections/A00014
収集根拠: 博士論文（自動収集）
受理日（W3CDTF）: 2024-02-02T16:16:39+09:00
記録形式（IMT）: application/pdf
オンライン閲覧公開範囲: 国立国会図書館内限定公開
デジタル化資料送信: 図書館・個人送信対象外
遠隔複写可否（NDL）: 可
掲載誌（URI）: http://t2r2.star.titech.ac.jp/cgi-bin/publicationinfo.cgi?q_publication_content_number=CTT100902350
連携機関・データベース: 国立国会図書館 : 国立国会図書館デジタルコレクション
https://dl.ndl.go.jp