tags : Floating Point, Concurrency, Flynn’s Taxonomy, Machine Learning

https://modal.com/gpu-glossary/readme 🌟

Learning resources

FAQ

Different kinds of hardware for ML

| Feature       | CPU                    | GPU                    | APU¹                     | TPU                        | FPGA                      | ASIC²                  |
|---------------|------------------------|------------------------|--------------------------|----------------------------|---------------------------|------------------------|
| Primary Use   | General Compute        | Graphics, Parallel     | Combined CPU+GPU         | ML Accelerate (NN/Tensor)  | Reconfigurable Logic      | Single Task Optimized  |
| Architecture  | Few Powerful Cores     | Many Simple Cores      | Mixed CPU/GPU Cores      | Matrix/Tensor ASIC         | Customizable Logic Grid   | Custom Fixed Hardware  |
| ML Since      | Always                 | 2000s (GPGPU), 2012    | 2010s (Integrated)       | 2015 (Internal), 2018      | Mid-2010s (Accel.)        | Mid/Late 2010s (ML)    |
| ML Prevalence | System Base, Light ML  | Very High (Training)   | Moderate (Edge/PC)       | Growing (Google Cloud)     | Niche (Low Latency)       | Growing Fast (Infer.)  |
| ML Advantages | Flexible, Sequential   | Parallelism, Ecosystem | Balanced, Power/Cost Eff | Perf/Watt (Matrix), Scale  | Customizable, Low Latency | Max Perf/Watt (Task)   |
| ML Limits     | Poor Parallel          | Power, Sparse Data     | Shared Resource Limits   | Less Flexible, Ecosystem   | Complex Dev, HW Skill     | Inflexible, High NRE   |
| ML Use Cases  | Data Prep, Orchestrate | DL Training, Inference | Edge AI, Mixed Loads     | Large Scale DL (GCP)       | Real-time Inference       | High-Vol Inference     |

Nvidia GPUs

CUDA core

  • Each CUDA core can do one multiply-accumulate (MAC) on two FP32 values per cycle
  • eg. x += y * z (see the sketch below)
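As a toy illustration (plain Python, nothing GPU-specific): a dot product is nothing but this MAC repeated, which is why hardware built out of MAC units is so well suited to linear algebra.

  # Hypothetical pure-Python illustration: a dot product is just
  # a chain of multiply-accumulates, one MAC per element pair.
  def dot(xs, ys):
      acc = 0.0
      for x, y in zip(xs, ys):
          acc += x * y  # one MAC: multiply, then accumulate
      return acc

  print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0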

Tensor core

  • A Tensor Core takes a 4x4 FP16 matrix, multiplies it by another 4x4 FP16 matrix, then adds a third 4x4 matrix (FP16 or FP32) to the product and returns the result as a new matrix (see the sketch below).
  • Later Tensor Core generations added INT8 and INT4 precision modes for quantized inference.
  • There are now several architecture generations that Nvidia builds on, like Turing Tensor Cores, Ampere Tensor Cores, etc.
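A minimal sketch of that fused op, D = A·B + C, written with PyTorch (assuming a CUDA build and a Tensor Core capable GPU; PyTorch dispatches half-precision matmuls to Tensor Cores where available):

  import torch

  a = torch.randn(4, 4, device="cuda", dtype=torch.float16)
  b = torch.randn(4, 4, device="cuda", dtype=torch.float16)
  c = torch.randn(4, 4, device="cuda", dtype=torch.float16)

  # The shape of the Tensor Core op: multiply two FP16 4x4 matrices,
  # add a third 4x4 matrix to the product, return a new matrix.
  d = torch.addmm(c, a, b)  # d = c + a @ b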

See Category:Nvidia microarchitectures - Wikipedia

RAM

???

VRAM

  • Memory capacity bounds how big the model is allowed to be: weights, activations, and (during training) optimizer state all have to fit in VRAM (sizing sketch below)
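Back-of-envelope sizing sketch (weights only; activations, KV cache, and optimizer state all add more):

  def weight_gb(n_params, bytes_per_param):
      # VRAM needed just to hold the weights
      return n_params * bytes_per_param / 1024**3

  print(weight_gb(7e9, 2))  # 7B params in FP16 -> ~13 GB
  print(weight_gb(7e9, 4))  # 7B params in FP32 -> ~26 GB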

Performance

  • Typically measured in floating point operations per second (FLOPS, usually quoted as GFLOPS or TFLOPS)
  • Performance is good when the number of floating point operations per memory access (the arithmetic intensity) is high; otherwise the workload is memory-bound (see the sketch below)
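A rough sketch of that ratio for an idealized matmul (each matrix read or written exactly once): square matmuls are compute-bound, skinny matrix-vector-like shapes are memory-bound.

  def arithmetic_intensity(m, k, n, bytes_per_elem=4):
      # FLOPs per byte moved for an (m x k) @ (k x n) FP32 matmul
      flops = 2 * m * k * n  # one multiply + one add per term
      bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
      return flops / bytes_moved

  print(arithmetic_intensity(4096, 4096, 4096))  # ~683 FLOPs/byte: compute-bound
  print(arithmetic_intensity(1, 4096, 4096))     # ~0.5 FLOPs/byte: memory-bound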

Floating Point support

See Floating Point

  • GPUs support half (FP16), single (FP32) and double (FP64) precision (dtype sketch below)
  • Double precision support on GPUs is fairly recent
  • GPU vendors also ship their own formats with varying hardware support, e.g. Nvidia's TF32 on Tensor Cores
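To see the tradeoffs between these formats, PyTorch's torch.finfo reports the width, epsilon and range of each dtype (whether a fast hardware path exists still depends on the GPU generation):

  import torch

  for dtype in (torch.float16, torch.bfloat16, torch.float32, torch.float64):
      info = torch.finfo(dtype)
      print(f"{str(dtype):15} bits={info.bits:2} eps={info.eps:.1e} max={info.max:.1e}")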

F32

float32 is very widely used in gaming.

  • float32 multiplication is really a 24-bit multiplication (the FP32 significand is 24 bits), which is about 1/2 the cost of a full 32-bit multiplication. So an int32 multiplication is about 2x as expensive as a float32 multiplication.
  • On modern GPUs the float32 vs float64 throughput gap varies widely: roughly 2x on datacenter parts, but often 32x-64x on consumer desktop cards, where FP64 units are deliberately cut down (benchmark sketch below)
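A rough benchmark sketch to measure that gap on your own card (assumes a CUDA-enabled PyTorch install; absolute numbers and the fp64 ratio vary a lot by GPU):

  import time
  import torch

  def matmul_tflops(dtype, n=4096, iters=20):
      a = torch.randn(n, n, device="cuda", dtype=dtype)
      b = torch.randn(n, n, device="cuda", dtype=dtype)
      torch.cuda.synchronize()
      start = time.perf_counter()
      for _ in range(iters):
          a @ b
      torch.cuda.synchronize()
      # an n x n matmul costs 2*n^3 FLOPs
      return 2 * n**3 * iters / (time.perf_counter() - start) / 1e12

  print("fp32:", matmul_tflops(torch.float32), "TFLOPS")
  print("fp64:", matmul_tflops(torch.float64), "TFLOPS")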

Frameworks

  • OpenCL: Dominant open GPGPU computing language
  • OpenAI Triton: Language and compiler for parallel programming
  • CUDA: Dominant proprietary framework

More on CUDA

  • Graphics cards support up to a certain CUDA version. Eg. when nvidia-smi is run on my card it shows CUDA 12.1; that’s the highest version the driver supports, it doesn’t mean CUDA is installed
  • So I can install a cudatoolkit around that version.
  • But cudatoolkit is separate from the Nvidia driver. You can install the toolkit and compile with nvcc without the driver; actually running on the GPU still needs the driver.
  • Pytorch

    • Eg. to run PyTorch you don’t need cudatoolkit, because the binaries ship their own CUDA runtime and math libs (see the sanity-check sketch at the end of this section).
    • A local CUDA toolkit is only used if we build PyTorch from source etc.
    • If pytorch-cuda is built with CUDA 11.7, do you need CUDA 11.7 installed on the machine? It does ship the runtime libs; what you actually need is a driver new enough to support CUDA 11.7.
    • nvcc is the CUDA compiler (part of cudatoolkit, not the driver)
    • torchaudio: https://pytorch.org/audio/main/installation.html
  • Setting up CUDA on NixOS

    • Installing the Nvidia driver is a separate game that has nothing to do with CUDA. Figure that out first; it belongs in configuration.nix or whatever configures the system.
    • For the CUDA runtime there are a few knobs, but most importantly LD_LIBRARY_PATH should not be set globally. See this: Problems with rmagik / glibc: `GLIBCXX_3.4.32’ not found - #7 by rgoulter - Help - NixOS Discourse
    • So install all CUDA stuff in a flake, and we should be good.
    • Check versions
      • nvidia-smi shows the CUDA version the driver supports
      • After installing pkgs.cudaPackages.cudatoolkit you’ll have nvcc in your path.
        • Running nvcc --version gives the locally installed CUDA toolkit version
    • For flake
      postShellHook = ''
      #export LD_DEBUG=libs; # debugging
       
      export LD_LIBRARY_PATH="${pkgs.lib.makeLibraryPath [
        pkgs.stdenv.cc.cc
        # pkgs.libGL
        # pkgs.glib
        # pkgs.zlib
       
        # NOTE: for why we need to set it to "/run/opengl-driver", check following:
        # - This is primarily to get libcuda.so which is part of the
        #   nvidia kernel driver installation and not part of
        #   cudatoolkit
        # - https://github.com/NixOS/nixpkgs/issues/272221
        # - https://github.com/NixOS/nixpkgs/issues/217780
        # NOTE: Instead of using /run/opengl-driver we could do
        #       pkgs.linuxPackages.nvidia_x11 but that'd get another
        #   version of libcuda.so which is not compatible with the
        #       original driver, so we need to refer to the stuff
        #       directly installed on the OS
        "/run/opengl-driver"
       
        # "${pkgs.cudaPackages.cudatoolkit}"
        "${pkgs.cudaPackages.cudnn}"
      ]}"
      '';
    • Other packages
      • sometimes we need to add these to LD_LIBRARY_PATH directly
      • pkgs.cudaPackages.cudatoolkit
      • pkgs.cudaPackages.cudnn
      • pkgs.cudaPackages.libcublas
      • pkgs.cudaPackages.cuda_cudart
      • pkgs.cudaPackages.cutensor
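Finally, a small sanity-check sketch to run inside the dev shell: it prints the CUDA runtime version PyTorch was built against, whether the driver’s libcuda.so was found, and the visible GPU.

  import torch

  print("torch built against CUDA:", torch.version.cuda)
  print("CUDA available:", torch.cuda.is_available())  # False => driver/libcuda.so not visible
  if torch.cuda.is_available():
      print("device:", torch.cuda.get_device_name(0))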