本資料は、掲載誌(URI)等のリンク先にある学位授与機関のWebサイトやCiNii Dissertationsから、本文を自由に閲覧できる場合があります。
博士論文
国立国会図書館館内限定公開
収録元データベースで確認する
国立国会図書館デジタルコレクション
デジタルデータあり
Cache Blocking and Parallel Runtime Scheduling of Hierarchical Matrices
- 国立国会図書館永続的識別子
- info:ndljp/pid/13332152
国立国会図書館での利用に関する注記
資料に関する注記
一般注記:
- Important scientific problems in electrostatics, acoustics and statistics can be solved using the Boundary Element Method (BEM). The coefficient matri...
書店で探す
障害者向け資料で読む
書店で探す
障害者向け資料で読む
書誌情報
この資料の詳細や典拠(同じ主題の資料を指すキーワード、著者名)等を確認できます。
デジタル
- 資料種別
- 博士論文
- 著者・編者
- Deshmukh, Sameer Satish
- 出版年月日等
- 2023-09
- 出版年(W3CDTF)
- 2023-09
- 授与機関名
- 東京工業大学
- 授与年月日
- 2023-09-22
- 授与年月日(W3CDTF)
- 2023-09-22
- 報告番号
- 甲第12558号
- 学位
- 博士(工学)
- 本文の言語コード
- eng
- 対象利用者
- 一般
- 一般注記
- Important scientific problems in electrostatics, acoustics and statistics can be solved using the Boundary Element Method (BEM). The coefficient matrix from a BEM discretization results in a dense matrix, leading to O(N^2) and O(N^3) time complexity for the matrix-vector product and direct factorization algorithms, respectively. The underlying geometry from which the dense matrix is generated can be exploited in order to approximate the interactions between far points. These points are expressed in the dense matrix as off-diagonals blocks. Low rank approximation of the off-diagonal blocks of such a structured dense matrix can reduce the time complexity of the matrix-vector product and direct factorization algorithms to $O(N)$ by trading off accuracy for time. The accuracy of the algorithm can be tuned to match the required accuracy of the application.Algorithmic developments of the compression, multiplication and factorization routines of such low rank approximated matrices have led to reduction in the time complexity and better accuracy for a wide variety of problems. However, the implementation of such routines on modern, highly parallel computer architectures for such routines still requires the use of numerical software that was originally written for computation of dense linear algebra. The small, irregular kernels that are prevalent in the algorithms involving low rank approximation result in inefficient execution on modern hardware.In this thesis, we propose improvements to the efficiency of the matrix-vector product and direct factorization algorithms of low rank approximated dense matrices on shared and distributed memory machines. We show that our techniques are applicable to 2D problems for modeling acoustic waves, electrostatics and statistics with acceptable accuracy for the application in question. We target the block low rank and hierarchically semi-separable low rank matrix formats for this study since they have been shown to work well with 2D problems.The first part of this thesis improves the efficiency of the matrix-vector product. The matrix-vector product is useful for problems arising from the wave equation. The low rank matrix multiplication algorithm is a key computational kernel of the matrix vector product. We design efficient cache blocking algorithms by leveraging the Execution-Cache-Memory performance model. Our portable implementation making use of parallel loops written in a high level language with architecture-specific assembly micro-kernels. This preserves the portability of our method and allows us outperform vendor optimized BLAS libraries on the Fujitsu A64FX, Intel Xeon 6148 and AMD EPYC 7502 CPUs. We improve the matrix vector product of low rank matrices by obtaining a 2x performance improvement in the low rank matrix multiplication algorithm on all CPU architectures.The second part of this thesis improves the distributed memory factorization of hierarchical matrices arising out of 2D problems from electrostatics and statistics. Specifically, we show that we achieve up to 2x improvement in weak scaling performance over the state-of-the-art in dense direct factorization on up to 128 nodes of Fugaku. We show that this improvement in speed can be achieved with similar or better accuracy than other state-of-the-art implementations making use of low rank approximation of structured dense matrices. The main reasons behind the improvement can be attributed to the use of the HSS-ULV algorithm with a distributed asynchronous runtime system. The combination of an algorithm needing $O(N)$ computation and communication with an asynchronous distributed runtime system results in better overlap of computation with communication and results in better performance.identifier:oai:t2r2.star.titech.ac.jp:50672915
- 国立国会図書館永続的識別子
- info:ndljp/pid/13332152
- コレクション(共通)
- コレクション(障害者向け資料:レベル1)
- コレクション(個別)
- 国立国会図書館デジタルコレクション > デジタル化資料 > 博士論文
- 収集根拠
- 博士論文(自動収集)
- 受理日(W3CDTF)
- 2024-02-02T16:16:39+09:00
- 記録形式(IMT)
- application/pdf
- オンライン閲覧公開範囲
- 国立国会図書館内限定公開
- デジタル化資料送信
- 図書館・個人送信対象外
- 遠隔複写可否(NDL)
- 可
- 掲載誌(URI)
- 連携機関・データベース
- 国立国会図書館 : 国立国会図書館デジタルコレクション