tags : Floating Point, Concurrency, Flynn’s Taxonomy, Machine Learning
Learning resources
- Are GPUs For You
- GPU Programming: When, Why and How? documentation
- https://dl.acm.org/doi/pdf/10.1145/3570638
- What Every Developer Should Know About GPU Computing
- What is a flop? | Hacker News
- Course on CUDA Programming
- Can we 10x Rust hashmap throughput? - by Win Wang
- 1. Introduction — parallel-thread-execution 8.1 documentation
- Udacity CS344: Intro to Parallel Programming | NVIDIA Developer
- AUB Spring 2021 El Hajj - YouTube
- How GPU Computing Works | GTC 2021 - YouTube
- The CUDA Parallel Programming Model - 1. Concepts - Fang’s Notebook
- Convolutions with cuDNN – Peter Goldsborough
- https://medium.com/@penberg/demystifying-gpus-for-cpu-centric-programmers-e24934a620f1
Performance
- Typically measured in floating point operations per second: FLOPS (or GFLOPS).
- Performance is good when the number of floating point operations per memory access (the arithmetic intensity) is high; see the sketch below.
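To make arithmetic intensity concrete, here is a minimal back-of-the-envelope sketch in Python (the matrix size is an assumed example):

```python
# Arithmetic intensity of a square float32 matmul C = A @ B:
# ~2*n^3 FLOPs (one multiply + one add per inner-product step)
# vs ~3*n^2 * 4 bytes moved (read A, read B, write C).
n = 4096
flops = 2 * n**3
bytes_moved = 3 * n**2 * 4
print(f"{flops / bytes_moved:.1f} FLOPs per byte")  # ~682.7: heavily compute-bound
```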
Floating Point support
See Floating Point
- GPUs support half, single and double precisions. Double precision support on GPUs is fairly recent.
- GPU vendors have their own formats and differ in which precisions they support.
F32
float32 is very widely used in gaming.
- float32 multiplication is really a 24-bit multiplication (the 23-bit mantissa plus the implicit leading bit), which is about 1/2 the cost of a 32-bit multiplication. So an int32 multiplication is about 2x as expensive as a float32 multiplication.
- On modern desktop GPUs, the difference in performance (FLOPS) between float32 and float64 is close to 4x.
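One way to check the float32/float64 gap on your own card is a rough matmul benchmark. A minimal sketch using PyTorch (matrix size and iteration count are arbitrary assumptions; requires a CUDA-capable GPU):

```python
import time
import torch

def tflops(dtype, n=4096, iters=10):
    # Rough matmul throughput at the given precision.
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b                              # warm-up (cuBLAS init, kernel selection)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    seconds = time.perf_counter() - start
    return 2 * n**3 * iters / seconds / 1e12  # ~2*n^3 FLOPs per matmul

print(f"float32: {tflops(torch.float32):.2f} TFLOPS")
print(f"float64: {tflops(torch.float64):.2f} TFLOPS")
```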
Nvidia GPUs
CUDA core
- Each CUDA core can only do one multiply-accumulate (MAC) on 2 FP32 values.
- e.g. x += x*y (sketched below)
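A minimal PyTorch sketch of that MAC pattern; each element update below is one multiply and one accumulate, the kind of FP32 operation a CUDA core executes (the vector size is an assumed example):

```python
import torch

x = torch.randn(1_000_000, device="cuda")
y = torch.randn(1_000_000, device="cuda")

# One multiply-accumulate per element: x += x * y
x += x * y
```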
Tensor core
- A Tensor core can take a 4x4 FP16 matrix, multiply it by another 4x4 FP16 matrix, then add either a FP16 or FP32 4x4 matrix to the resulting product and return it as a new matrix (see the sketch below).
- Certain Tensor cores added support for INT8 and INT4 precision modes for quantization.
- Now there are various architecture variants that Nvidia builds upon, like Turing Tensor cores, Ampere Tensor cores etc.
See Category:Nvidia microarchitectures - Wikipedia
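The Tensor core operation is essentially D = A*B + C at matrix granularity. A minimal PyTorch sketch of that shape of computation (sizes are assumed; whether Tensor cores are actually used is decided by the cuBLAS/cuDNN libraries underneath):

```python
import torch

# FP16 operands; on Tensor-core-capable GPUs the library typically
# accumulates the FP16 products in FP32 internally.
a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
c = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

d = torch.addmm(c, a, b)  # d = a @ b + c
```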
RAM
???
VRAM
- Memory = how big the model is allowed to be; available VRAM bounds the size of model you can load.
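A rough sketch of that sizing rule: weight memory is parameter count times bytes per parameter (the 7B parameter count is an assumed example; activations, KV cache and optimizer state come on top):

```python
params = 7_000_000_000  # assumed example: a 7B-parameter model
for dtype, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{dtype}: {params * nbytes / 1024**3:.1f} GiB for the weights alone")
```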
Frameworks
- OpenCL: Dominant open GPGPU computing language
- OpenAI Triton: Language and compiler for parallel programming
- CUDA: Dominant proprietary framework
More on CUDA
- Graphics cards support up to a certain CUDA version. E.g. when nvidia-smi is run, my card shows CUDA 12.1; that doesn't mean CUDA is installed, it's just the highest CUDA version the driver supports.
- So I can install a cudatoolkit around that version.
- But the cudatoolkit is separate from the Nvidia driver. You can install the cudatoolkit for your graphics card without having the driver, though nothing will actually run on the GPU until the driver is present.
Pytorch
- E.g. to run PyTorch you don't need the cudatoolkit, because the binaries ship their own CUDA runtime and math libs.
- The local CUDA toolkit will only be used if we build PyTorch from source, build custom CUDA extensions etc.
- If pytorch-cuda is built with cuda11.7, do you need cuda11.7 installed on your machine? Does it not ship the runtime????
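A quick way to check which CUDA runtime a PyTorch build ships with, independent of any system-wide toolkit (a minimal sketch):

```python
import torch

print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True only if a working driver is present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```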
- nvcc is the CUDA compiler.
- torchaudio: https://pytorch.org/audio/main/installation.html