Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores

Authors: Blanchard, Pierre; Higham, Nicholas J.; Lopez, Florent; Mary, Théo; Pranesh, Srikara
Source:
ISSN: 1064-8275 ; SIAM Journal on Scientific Computing ; https://hal.science/hal-02491076 ; SIAM Journal on Scientific Computing, 2020, 42 (3), pp.C124-C141. ⟨10.1137/19M1289546⟩.
Subject:
NVIDIA GPU; matrix multiplication; rounding error analysis; floating-point arithmetic; fused multiply-add; tensor cores; LU factorization; [INFO]Computer Science [cs]; [INFO.INFO-NA]Computer Science [cs]/Numerical Analysis [cs.NA]; [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; [MATH]Mathematics [math]; [MATH.MATH-NA]Mathematics [math]/Numerical Analysis [math.NA]
Record type:
article in journal/newspaper
Language:
English

Additional information
- Contributors:
  Department of Mathematics Manchester (School of Mathematics); University of Manchester Manchester; Innovative Computing Laboratory Knoxville (ICL); The University of Tennessee Knoxville; Centre National de la Recherche Scientifique (CNRS); Performance et Qualité des Algorithmes Numériques (PEQUAN); LIP6; Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS)
- Publication information:
  HAL CCSD
  Society for Industrial and Applied Mathematics
- Publication year:
  2020
- Abstract:
  Computing units that carry out a fused multiply-add (FMA) operation with matrix arguments, referred to as tensor units by some vendors, have great potential for use in scientific computing. However, these units are inherently mixed precision and existing rounding error analyses do not support them. We consider a mixed precision block FMA that generalizes both the usual scalar FMA and existing tensor units. We describe how to exploit such a block FMA in the numerical linear algebra kernels of matrix multiplication and LU factorization and give detailed rounding error analyses of both kernels. An important application is to GMRES-based iterative refinement with block FMAs, for which our analysis provides new insight. Our framework is applicable to the tensor core units in the NVIDIA Volta and Turing GPUs. For these we compare matrix multiplication and LU factorization with TC16 and TC32 forms of FMA, which differ in the precision used for the output of the tensor cores. Our experiments on an NVIDIA V100 GPU confirm the predictions of the analysis that the TC32 variant is much more accurate than the TC16 one, and they show that the accuracy boost is obtained with almost no performance loss.
- Relation:
  hal-02491076; https://hal.science/hal-02491076; https://hal.science/hal-02491076v2/document; https://hal.science/hal-02491076v2/file/BlockFMA.pdf
- DOI:
  10.1137/19M1289546
- Rights:
  info:eu-repo/semantics/OpenAccess
- Accession number:
  edsbas.47B8F314
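
Illustrative note on the abstract: the difference between the TC16 and TC32 variants can be sketched with a small emulation. The code below is not taken from the paper and does not run on tensor cores. It assumes a simplified model in which each block of products is formed and summed in binary64 (standing in for an exact block FMA) and only the accumulator is rounded, to binary16 for a TC16-like result or to binary32 for a TC32-like result. The function names, the block size of 4, and the random test data are all hypothetical choices made for illustration.

```python
import math
import random
import struct


def round_fp16(x):
    """Round a binary64 float to the nearest binary16 value (stdlib 'e' format)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def round_fp32(x):
    """Round a binary64 float to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def block_fma_dot(a, b, round_out, block=4):
    """Dot product accumulated one block at a time.

    Assumed model: acc <- round_out(acc + sum of `block` products), with the
    inner sum evaluated in binary64 as a stand-in for exact arithmetic.
    """
    acc = 0.0
    for k in range(0, len(a), block):
        partial = sum(x * y for x, y in zip(a[k:k + block], b[k:k + block]))
        acc = round_out(acc + partial)
    return acc


def relative_errors(n=1024, seed=0):
    """Return (TC16-like error, TC32-like error) for one random dot product."""
    rng = random.Random(seed)
    # Inputs are stored in binary16, as for tensor core operands.
    a = [round_fp16(rng.random()) for _ in range(n)]
    b = [round_fp16(rng.random()) for _ in range(n)]
    # Products of two binary16 values are exact in binary64; fsum adds them
    # with a single final rounding, giving a reference value.
    exact = math.fsum(x * y for x, y in zip(a, b))
    err16 = abs(block_fma_dot(a, b, round_fp16) - exact) / abs(exact)
    err32 = abs(block_fma_dot(a, b, round_fp32) - exact) / abs(exact)
    return err16, err32


if __name__ == "__main__":
    err16, err32 = relative_errors()
    print(f"TC16-like relative error: {err16:.3e}")
    print(f"TC32-like relative error: {err32:.3e}")
```

Under these assumptions the binary32 accumulator should give a much smaller relative error than the binary16 one, which is the qualitative behaviour the abstract reports. The actual error bounds and the hardware measurements are in the paper itself.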