tags : Floating Point, Concurrency, Flynn’s Taxonomy, Machine Learning

Learning resources

Performance

  • Typically measured in floating point operations per second (FLOPS / GFLOPS)
  • Performance is good when the arithmetic intensity (the number of floating point operations per memory access) is high
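As a rough sketch (the FLOP and byte counts below are illustrative assumptions, not measured values), arithmetic intensity is just FLOPs divided by bytes moved:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs per byte of memory traffic; higher means more compute-bound."""
    return flops / bytes_moved

# A dot product of two length-n fp32 vectors does n multiplies + n adds
# = 2n FLOPs, but reads 2n floats (8n bytes): only 0.25 FLOPs per byte.
n = 1_000_000
print(arithmetic_intensity(2 * n, 8 * n))  # 0.25 -> memory-bound, a poor fit for a GPU
```

By contrast, a matrix multiply does O(n^3) FLOPs on O(n^2) data, which is why GPUs shine on it.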

Floating Point support

See Floating Point

  • GPUs support half, single, and double precision
  • Double precision support on GPUs is fairly recent
  • GPU vendors also have their own custom formats and differing levels of support (e.g. Nvidia's TF32)

F32

float32 is very widely used in gaming.

  • A float32 multiplication is really a 24-bit multiplication (the significand is 24 bits), which costs roughly half of a full 32-bit multiplication. So an int32 multiplication is about 2x as expensive as a float32 multiplication.
  • On modern desktop GPUs the gap between float32 and float64 throughput is much larger than that: consumer (GeForce) cards typically run FP64 at 1/32 or 1/64 of the FP32 rate, while datacenter parts narrow it to around 1/2.
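The 24-bit significand can be seen directly from Python's stdlib: round-tripping a value through 32-bit storage with `struct` drops anything beyond 24 bits of precision.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest representable float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 2**24 = 16777216 is the last point where float32 covers every integer exactly;
# 2**24 + 1 has no float32 representation and rounds back down (round-to-even).
print(to_f32(16777216.0))  # 16777216.0
print(to_f32(16777217.0))  # 16777216.0 -- the +1 is lost to the 24-bit significand
```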

Nvidia GPUs

CUDA core

  • Each CUDA core can do one multiply-accumulate (MAC) per clock on FP32 values
  • e.g. x += y * z
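As a sketch, a dot product is just that MAC repeated, which is why GPU throughput is organized around this one operation:

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product as a chain of multiply-accumulates."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y  # one MAC per element; a CUDA core does one of these per clock
    return acc

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```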

Tensor core

  • A Tensor core takes a 4x4 FP16 matrix, multiplies it by another 4x4 FP16 matrix, then adds an FP16 or FP32 4x4 matrix to the product and returns the result as a new matrix (D = A*B + C).
  • Later Tensor cores added support for INT8 and INT4 precision modes for quantization.
  • There are now various architecture generations that Nvidia builds upon, like Turing Tensor cores, Ampere Tensor cores, etc.
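In scalar code, the fused D = A*B + C operation a Tensor core performs looks like this (a plain-Python sketch on 4x4 matrices; real hardware does the whole thing on FP16 inputs in one step):

```python
def tensor_core_op(A, B, C):
    """D = A @ B + C for 4x4 matrices: the single fused op of one Tensor core."""
    n = 4
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j] for j in range(n)]
        for i in range(n)
    ]

I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity
C = [[1.0] * 4 for _ in range(4)]                                   # all ones
D = tensor_core_op(I, I, C)  # I @ I + C: diagonal becomes 2.0, rest stays 1.0
print(D[0])  # [2.0, 1.0, 1.0, 1.0]
```

Large matrix multiplies are tiled into many of these 4x4 fragments, one per Tensor core.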

See Category:Nvidia microarchitectures - Wikipedia

RAM

???

VRAM

  • VRAM capacity determines how large a model (weights plus activations) can fit on the GPU
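A back-of-the-envelope sketch for the weights alone (activations, optimizer state, and KV caches all add more; the 7B parameter count is just an illustrative assumption):

```python
def model_weight_gib(n_params: float, bytes_per_param: int) -> float:
    """Rough VRAM needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 2**30

# A 7B-parameter model stored in FP16 (2 bytes per parameter):
print(round(model_weight_gib(7e9, 2), 1))  # ~13.0 GiB before activations
```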

Frameworks

  • OpenCL: Dominant open GPGPU computing language
  • OpenAI Triton: Language and compiler for parallel programming
  • CUDA: Dominant proprietary framework

More on CUDA

  • Graphics cards support up to a certain CUDA version. E.g. when nvidia-smi shows CUDA 12.1 for my card, that is the highest CUDA version the installed driver supports; it doesn't mean CUDA is installed.
  • So I can install a cudatoolkit at or below that version.
  • The cudatoolkit is separate from the Nvidia driver: you can install the toolkit without the driver (e.g. just to compile code), but actually running on the GPU requires the driver.

Pytorch

  • E.g. to run PyTorch you don't need the cudatoolkit, because prebuilt PyTorch binaries ship their own CUDA runtime and math libraries.
  • A local CUDA toolkit is only used when building PyTorch from source, building custom CUDA extensions, etc.
  • If a PyTorch build targets CUDA 11.7, the wheel still ships that runtime; what the machine needs is an Nvidia driver recent enough to support CUDA 11.7, not the 11.7 toolkit itself.
  • nvcc is the CUDA compiler
  • torchaudio: https://pytorch.org/audio/main/installation.html