tags : Floating Point, Concurrency, Flynn’s Taxonomy, Machine Learning
Learning resources
- GPUs Go Brrr · Hazy Research
- Are GPUs For You
- We were wrong about GPUs | Hacker News
- GPU Programming: When, Why and How? — GPU programming: why, when and how? documentation
- https://dl.acm.org/doi/pdf/10.1145/3570638
- What Every Developer Should Know About GPU Computing
- What is a flop? | Hacker News
- Can we 10x Rust hashmap throughput? - by Win Wang
- 1. Introduction — parallel-thread-execution 8.1 documentation
- Udacity CS344: Intro to Parallel Programming | NVIDIA Developer
- AUB Spring 2021 El Hajj - YouTube
- How GPU Computing Works | GTC 2021 - YouTube
- Convolutions with cuDNN – Peter Goldsborough
- https://medium.com/@penberg/demystifying-gpus-for-cpu-centric-programmers-e24934a620f1
- CUDA
FAQ
Different kinds of hardware for ML
| Feature | CPU | GPU | APU¹ | TPU | FPGA | ASIC² |
|---|---|---|---|---|---|---|
| Primary Use | General Compute | Graphics, Parallel | Combined CPU+GPU | ML Accelerate (NN/Tensor) | Reconfigurable Logic | Single Task Optimized |
| Architecture | Few Powerful Cores | Many Simple Cores | Mixed CPU/GPU Cores | Matrix/Tensor ASIC | Customizable Logic Grid | Custom Fixed Hardware |
| ML Since | Always | 2000s (GPGPU), 2012 | 2010s (Integrated) | 2015 (Internal), 2018 | Mid-2010s (Accel.) | Mid/Late 2010s (ML) |
| ML Prevalence | System Base, Light ML | Very High (Training) | Moderate (Edge/PC) | Growing (Google Cloud) | Niche (Low Latency) | Growing Fast (Infer.) |
| ML Advantages | Flexible, Sequential | Parallelism, Ecosystem | Balanced, Power/Cost Eff. | Perf/Watt (Matrix), Scale | Customizable, Low Latency | Max Perf/Watt (Task) |
| ML Limits | Poor Parallel | Power, Sparse Data | Shared Resource Limits | Less Flexible, Ecosystem | Complex Dev, HW Skill | Inflexible, High NRE |
| ML Use Cases | Data Prep, Orchestrate | DL Training, Inference | Edge AI, Mixed Loads | Large Scale DL (GCP) | Real-time Inference | High-Vol Inference |
Nvidia GPUs
CUDA core
- Each CUDA core can only do one multiply-accumulate (MAC) on 2 FP32 values
- e.g. `x += x*y`
Tensor core
- A Tensor core can take a `4x4 FP16` matrix, multiply it by another `4x4 FP16` matrix, then add either an `FP16` or `FP32` `4x4` matrix to the resulting product and return it as a new matrix.
- Certain Tensor cores added support for `INT8` and `INT4` precision modes for quantization.
- There are now various architecture variants that Nvidia builds upon, like Turing Tensor cores, Ampere Tensor cores, etc. See Category:Nvidia microarchitectures - Wikipedia
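A minimal PyTorch sketch of that `D = A @ B + C` shape on FP16 inputs. The sizes and values are arbitrary, it assumes an Nvidia GPU is present, and whether it actually hits Tensor cores depends on the card and cuBLAS heuristics.

```python
# Sketch of the D = A @ B + C operation Tensor cores accelerate (FP16 inputs).
# Assumes a CUDA-capable Nvidia GPU; on recent cards cuBLAS typically routes
# half-precision matmuls through Tensor cores.
import torch

dev = torch.device("cuda")
A = torch.randn(4, 4, dtype=torch.float16, device=dev)  # FP16 input
B = torch.randn(4, 4, dtype=torch.float16, device=dev)  # FP16 input
C = torch.randn(4, 4, dtype=torch.float16, device=dev)  # FP16 (could also be FP32) addend

D = torch.addmm(C, A, B)  # D = C + A @ B, returned as a new 4x4 matrix
print(D.dtype, D.shape)   # torch.float16 torch.Size([4, 4])
```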
RAM
???
VRAM
- Memory = how big the model is allowed to be; the weights (plus activations, KV cache, etc.) all have to fit in VRAM. Rough sketch below.
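A back-of-the-envelope sketch (my numbers, hypothetical 7B-parameter model) of how parameter count and precision translate into VRAM for the weights alone:

```python
# Rough VRAM estimate for the weights of a hypothetical 7B-parameter model;
# activations, KV cache and optimizer state all add on top of this.
params = 7e9
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:>9}: {params * nbytes / 2**30:.1f} GiB")
```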
Related topics
Performance
- Typically measured in floating point operations per second, i.e. `FLOPS`/`GFLOPS`
- Performance is good if the number of floating point operations per memory access (the arithmetic intensity) is high; see the sketch below
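A back-of-the-envelope comparison of FLOPs per byte moved for a memory-bound kernel (SAXPY) vs a compute-bound one (dense matmul). The sizes are arbitrary and the byte counts assume FP32 with ideal data reuse, so treat the numbers as rough.

```python
# Arithmetic intensity = FLOPs per byte moved. Sizes are arbitrary; byte
# counts assume FP32 (4 bytes) and ideal reuse.
n = 1 << 20

# SAXPY: y = a*x + y -> 2 FLOPs per element, 3 FP32 accesses (read x, read y, write y)
saxpy = (2 * n) / (3 * 4 * n)
print(f"SAXPY  : {saxpy:.2f} FLOPs/byte (memory bound)")

# Dense matmul of two n x n matrices: ~2*n^3 FLOPs over ~3*n^2 values moved
n = 4096
matmul = (2 * n**3) / (3 * 4 * n**2)
print(f"Matmul : {matmul:.0f} FLOPs/byte (compute bound)")
```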
Floating Point support
See Floating Point
- GPUs support `half`, `single` and `double` precisions
- `double` precision support on GPUs is fairly recent
- GPU vendors have their own formats and levels of support
F32
- float32 is very widely used in gaming.
- float32 multiplication is really a 24-bit multiplication (the mantissa width), which is about 1/2 the cost of a 32-bit multiplication. So an int32 multiplication is about 2x as expensive as a float32 multiplication.
- On modern desktop GPUs, float64 throughput is far below float32: roughly 4x slower on older/prosumer cards and 32-64x slower on current consumer cards. See the sketch below for measuring it yourself.
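A rough sketch for measuring the FP32 vs FP64 matmul gap on your own card. It assumes a CUDA GPU; the matrix size and iteration count are arbitrary, and the result varies a lot between consumer and datacenter cards.

```python
# Rough FP32 vs FP64 matmul throughput comparison on the local GPU.
# Assumes a CUDA device is available.
import time
import torch

def matmul_tflops(dtype, n=4096, iters=10):
    a = torch.randn(n, n, dtype=dtype, device="cuda")
    b = torch.randn(n, n, dtype=dtype, device="cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.perf_counter() - t0) / 1e12  # 2*n^3 FLOPs per matmul

f32 = matmul_tflops(torch.float32)
f64 = matmul_tflops(torch.float64)
print(f"FP32 ~{f32:.1f} TFLOPS, FP64 ~{f64:.1f} TFLOPS, ratio ~{f32 / f64:.0f}x")
```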
Frameworks
- OpenCL: Dominant open GPGPU computing language
- OpenAI Triton: Language and compiler for parallel programming
- CUDA: Dominant proprietary framework
More on CUDA
- Graphics cards support up to a certain CUDA version. E.g. when `nvidia-smi` is run on my card it shows CUDA 12.1; that doesn't mean CUDA is installed, it just means I can install a cudatoolkit around that version.
- But cudatoolkit is separate from the Nvidia driver. You could have cudatoolkit installed for your graphics card without having the driver.
Pytorch
- E.g. to run PyTorch you don't need cudatoolkit because the wheels ship their own CUDA runtime and math libs.
- The local CUDA toolkit will be used if we build PyTorch from source etc.
- If pytorch-cuda is built with cuda11.7, you need cuda11.7 installed on your machine. Does it not ship the runtime????
- `nvcc` is the CUDA compiler
- torchaudio: https://pytorch.org/audio/main/installation.html
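A quick way to see which CUDA runtime the installed PyTorch build ships with, and whether the driver-side libcuda is visible. These are standard torch APIs; no local cudatoolkit is needed for this check.

```python
# Check which CUDA runtime the installed PyTorch build was compiled against,
# and whether the Nvidia driver (libcuda.so) is usable.
import torch

print("PyTorch built with CUDA:", torch.version.cuda)   # runtime bundled with the wheel
print("CUDA available:", torch.cuda.is_available())     # needs the driver's libcuda.so
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("cuDNN :", torch.backends.cudnn.version())
```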
Setting up CUDA on NixOS
- So installing Nvidia drivers is a different game, which has nothing to do with CUDA. Figure that shit out first; that should go in configuration.nix or whatever configures the system.
- Now for the CUDA runtime, there are a few knobs. But most importantly, LD_LIBRARY_PATH should not be set globally. See this: Problems with rmagik / glibc: `GLIBCXX_3.4.32’ not found - #7 by rgoulter - Help - NixOS Discourse
- So install all CUDA stuff in a flake, and we should be good.
- Check versions
  - `nvidia-smi` will give the CUDA driver version
  - After installing `pkgs.cudaPackages.cudatoolkit` you'll have `nvcc` in your path
  - Running `nvcc --version` will give the local CUDA toolkit version
- For the flake:

```nix
postShellHook = ''
  #export LD_DEBUG=libs; # debugging
  export LD_LIBRARY_PATH="${pkgs.lib.makeLibraryPath [
    pkgs.stdenv.cc.cc
    # pkgs.libGL
    # pkgs.glib
    # pkgs.zlib

    # NOTE: for why we need to set it to "/run/opengl-driver", check the following:
    # - This is primarily to get libcuda.so, which is part of the
    #   nvidia kernel driver installation and not part of cudatoolkit
    # - https://github.com/NixOS/nixpkgs/issues/272221
    # - https://github.com/NixOS/nixpkgs/issues/217780
    # NOTE: Instead of using /run/opengl-driver we could use
    #   pkgs.linuxPackages.nvidia_x11, but that would get another
    #   version of libcuda.so which is not compatible with the
    #   original driver, so we need to refer to the stuff
    #   directly installed on the OS
    "/run/opengl-driver"

    # "${pkgs.cudaPackages.cudatoolkit}"
    "${pkgs.cudaPackages.cudnn}"
  ]}"
'';
```
- Other packages
  - Sometimes we need to add these to `LD_LIBRARY_PATH` directly:
    - `pkgs.cudaPackages.cudatoolkit`
    - `pkgs.cudaPackages.cudnn`
    - `pkgs.cudaPackages.libcublas`
    - `pkgs.cudaPackages.cuda_cudart`
    - `pkgs.cudaPackages.cutensor`