Show simple item record

dc.contributor.author	Mporas, Iosif
dc.contributor.author	Kelefouras, Vasilios
dc.contributor.author	Kritikakou, Angeliki
dc.contributor.author	Kolonias, Vasilios
dc.date.accessioned	2017-07-10T12:09:54Z
dc.date.available	2017-07-10T12:09:54Z
dc.date.issued	2016-03-01
dc.identifier.citation	Mporas, I, Kelefouras, V, Kritikakou, A & Kolonias, V 2016, 'A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures', Journal of Supercomputing, vol. 72, no. 3, pp. 804-844. https://doi.org/10.1007/s11227-015-1613-7
dc.identifier.issn	0920-8542
dc.identifier.uri	http://hdl.handle.net/2299/18847
dc.description	This is the Accepted Manuscript version of the following article: V. Kelefouras, A. Kritikakou, I. Mporas, V. Kolonias, “A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures”, The Journal of Supercomputing, Vol. 72 (3): 804-844, January 2016. The final published version is available at: https://link.springer.com/article/10.1007%2Fs11227-015-1613-7 © Springer Science+Business Media New York 2016
dc.description.abstract	Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting the scheduling parameter values is a difficult and time-consuming task, since the parameter values depend on each other; this is why they are usually found by search methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented in which the optimum scheduling parameters are found by theoretically decreasing the search space, while the major scheduling sub-problems are addressed together, as one problem and not separately, according to the hardware architecture parameters and the input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. The methodology applies to a wide range of CPU and GPU architectures.	en
dc.format.extent	41
dc.format.extent	3855266
dc.language.iso	eng
dc.relation.ispartof	Journal of Supercomputing
dc.title	A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures	en
dc.contributor.institution	School of Engineering and Technology
dc.description.status	Peer reviewed
rioxxterms.versionofrecord	10.1007/s11227-015-1613-7
rioxxterms.type	Journal Article/Review
herts.preservation.rarelyaccessed	true
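
The abstract above centres on choosing tile sizes and tiling levels for MMM according to the cache parameters. As a rough illustration of the loop-tiling idea only, and not of the authors' methodology, the following C sketch blocks the three MMM loops; the function name mmm_tiled and the tile size TILE are hypothetical, and in the paper's setting the tile size would be derived from the data cache sizes and associativities rather than fixed by hand.

/* Minimal sketch of loop tiling for C = A * B with row-major, n x n matrices.
 * Illustrative only: TILE is an assumed tile size; it would normally be
 * chosen from the data cache size and associativity, not hard-coded. */
#include <stddef.h>

#define TILE 64  /* assumed tile size */

void mmm_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n * n; i++)
        C[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* Multiply one TILE x TILE block at a time so that the
                 * working set of the A, B and C sub-blocks fits in cache. */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        double a = A[i * n + k];  /* reused across the j loop */
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

The i-k-j loop order keeps one element of A in a register across the inner loop and accesses B and C contiguously, which is one way the data reuse mentioned in the abstract can be exposed.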

