tags : Algorithms, Data Engineering, GPGPU, Computer vision
FAQ
Keras vs Pytorch vs TF
- Keras is a high-level framework with less boilerplate, so it is easier to understand and get started with. It used to largely support only TensorFlow as a backend, but it now supports PyTorch as well.
- PyTorch is good for training and development but is harder to put into production.
- TensorFlow is the main alternative to PyTorch.
Classical ML vs DL
- See Why Tree-Based Models Beat Deep Learning on Tabular Data
- The combination can be very useful sometimes, for example transfer learning for working with low-resource datasets/problems: use a deep neural network to go from high-dimensional data to a compact fixed-length vector, basically doing feature extraction. This network is increasingly trained on large amounts of unlabeled data, using self-supervision. Then use a simple classical model like a linear model, Random Forest, or k-nearest-neighbors to build the model for the specialized task of interest, using a much smaller labeled dataset. This is relevant for many tasks around sound, images, and multivariate timeseries, and probably NLP too (not my field). A minimal sketch follows.
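A minimal sketch of this pattern, assuming torchvision and scikit-learn are available; `train_images`, `train_labels`, and `test_images` are hypothetical stand-ins for a small labeled dataset:

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights
from sklearn.neighbors import KNeighborsClassifier

# Pretrained backbone as a fixed feature extractor (high-dimensional image -> 512-d vector).
weights = ResNet18_Weights.DEFAULT
backbone = resnet18(weights=weights)
backbone.fc = torch.nn.Identity()   # drop the classification head, keep the embedding
backbone.eval()
preprocess = weights.transforms()   # the preprocessing the backbone was trained with

@torch.no_grad()
def embed(images):                  # images: iterable of PIL images
    batch = torch.stack([preprocess(img) for img in images])
    return backbone(batch).numpy()  # compact fixed-length feature vectors

# Cheap classical model on top, trained on the small labeled set.
clf = KNeighborsClassifier(n_neighbors=5).fit(embed(train_images), train_labels)
preds = clf.predict(embed(test_images))
```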
What are the steps
Feature transformation
- Transform the inputs to fit your model.
- For LLMs, training requires something more like 100x A100s and a cluster to train on
- DeepSpeed - Wikipedia
- To do backprop you have to accumulate a loss gradient for every model parameter and trace it back to the input. I haven’t calculated it precisely but a reasonable rule of thumb is to reserve VRAM equal to the model size + input size. That is, if you load your model and a single batch onto the GPU and it takes up 12 GB, you will want to reserve another 12 GB or so.
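A back-of-the-envelope helper that encodes this rule of thumb; the 4 bytes per parameter (fp32) and the batch size are assumptions, not measurements:

```python
def estimate_training_vram_gb(n_params, bytes_per_param=4, batch_gb=0.5):
    """Rule of thumb: (model + one batch) on the GPU, then reserve the same again for backprop."""
    forward_gb = n_params * bytes_per_param / 1e9 + batch_gb
    return 2 * forward_gb

print(estimate_training_vram_gb(1.5e9))  # ~1.5B fp32 params -> roughly 13 GB
```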
Weight training
- Aim for high throughput
- If you're going to quantize, do quantization-aware training if needed
- A big reason to use GPUs is the parallelism you get from processing whole minibatches at once during training
- Training is forward pass + backward pass + applying the gradients; a minimal sketch of one step follows
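A minimal sketch of one training step in PyTorch; `model`, `loss_fn`, `train_loader`, and `device` are assumed to already exist:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for inputs, targets in train_loader:                 # minibatches keep the GPU busy
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    outputs = model(inputs)                          # forward
    loss = loss_fn(outputs, targets)
    loss.backward()                                  # backward: gradient for every parameter
    optimizer.step()                                 # apply the gradients
```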
Training steps
- Split the data into training and validation (and possibly test) sets.
Training laws
- Chinchilla data-optimal scaling laws: In plain English
- chinchilla reddit discussion
- Multiple epochs can give us better results, but the scaling laws help us judge whether another run/eval is even needed.
- Training data : model parameter count ratio for a given budget (how much training data should I ideally use for a model of size X?)
- Ratio: roughly 20:1, e.g. 20B training tokens per 1B parameters, for 1 epoch. A quick calculation follows.
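The 20:1 rule as plain arithmetic (a rough guide, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Roughly 20 training tokens per model parameter, for a single epoch."""
    return n_params * tokens_per_param

print(chinchilla_optimal_tokens(1e9))    # 1B params  -> ~20B tokens
print(chinchilla_optimal_tokens(70e9))   # 70B params -> ~1.4T tokens
```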
Serialize the weights
Model and Weights
- Not necessary to save the model and weights separately.
- Many serialization formats and frameworks provide options to save the model and its weights together in a single file.
- We can save separately if needed.
- In PyTorch, the state_dict holds all of this; a short sketch follows.
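A short sketch of the state_dict workflow; `MyModel` and `optimizer` are hypothetical:

```python
import torch

model = MyModel()
torch.save(model.state_dict(), "weights.pt")          # weights only

model2 = MyModel()                                     # the architecture lives in code
model2.load_state_dict(torch.load("weights.pt"))
model2.eval()

# Saving model-related things together is just a bigger dict:
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, "checkpoint.pt")
```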
Serialization Formats
- JSON, HDF5 (Keras), SavedModel (TF, uses Protocol Buffers), ONNX, .pth (PyTorch objects as pickle), safetensors, ckpt (TF), GGML, etc.
- Some formats are more dangerous than others; safetensors and GGML are the nice/safe ones.
- Most of the time these can be converted into one another.
- ckpt, pth, and anything pickle-based can contain malicious code that runs on load.
- .pth and .pt are the same format; .pt is usually recommended because .pth collides with Python's own path-configuration files (the site machinery picks up *.pth files), so use .pt.
- PyTorch has torch.jit.save() and torch.save(): the former saves a TorchScript program that can be loaded by C++ for inference etc., while the latter saves via Python pickle, which is useful for prototyping, research, and training. A short sketch of both follows.
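A short sketch of both paths, with a hypothetical `model` and `example_input`:

```python
import torch

# torch.save(): Python pickle, flexible, good for prototyping/research/training.
torch.save(model, "model.pt")
restored = torch.load("model.pt")

# torch.jit.save(): TorchScript, a self-contained program loadable without Python,
# e.g. from libtorch in C++ for inference.
scripted = torch.jit.trace(model, example_input)
torch.jit.save(scripted, "model_scripted.pt")
# later: torch.jit.load("model_scripted.pt")  (or torch::jit::load(...) in C++)
```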
ONNX vs safetensors
- These are for different use cases: inference (ONNX) and storage (safetensors).
- There are some compatibility issues between the ONNX ecosystem and safetensors.
- ONNX (framework interoperability, inference)
- ONNX is designed to allow framework interoperability. This is a protobuf file.
- ONNX consumes a .onnx file, which is a definition of the network and weights.
- GGML instead just consumes the weights and defines the network in code; GGML is less bloated than ONNX.
- safetensors
- Format for storing tensors safely (as opposed to pickle) and that is still fast (zero-copy).
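A sketch contrasting the two, with a hypothetical PyTorch `model` and `example_input` (safetensors stores plain tensors only, so tied/shared weights may need extra handling):

```python
import torch
from safetensors.torch import save_file, load_file

# ONNX: network definition + weights in one protobuf file, for inference/interop.
torch.onnx.export(model, example_input, "model.onnx")

# safetensors: only the tensors, loaded safely (no pickle) and near zero-copy.
save_file(model.state_dict(), "model.safetensors")
weights = load_file("model.safetensors")
```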
Components of sharing a model
- Model (checkpoints) : The main deal
- Weights: these sometimes come with the model file, sometimes separately
- Tokenizer: the tokenization algorithm/vocabulary the model expects; a sketch of shipping all three follows
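A sketch of shipping all three pieces with the Hugging Face transformers API (using bert-base-uncased purely as an example):

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model.save_pretrained("my_model/")       # config (model definition) + weights
tokenizer.save_pretrained("my_model/")   # vocab + tokenizer config, so users tokenize the same way
```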
Optimize weights
ONNX
- ONNX is a format for storing pretrained models, but it also provides a runtime (ONNX Runtime) that can run quantized models.
- See Convert Transformers to ONNX with Hugging Face Optimum
- There is also the TFLite runtime; ONNX models can be converted to run on it.
Quantization
- Model quantization is a method of reducing the size of a trained model while maintaining accuracy.
- You can quantize for performance, power consumption, or weight/model size; all of it comes from decreasing the precision of the weights.
- There are various quantization schemes, some using more memory than others. E.g. llama.cpp has different quantization methods like q4_2, q4_3, q5_0, q5_1, etc.; there is also GPTQ for 4-bit quantization.
- OpenVINO is a nice tool for doing quantization for Intel CPUs. A minimal PyTorch sketch of post-training quantization follows.
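A minimal post-training quantization sketch using PyTorch dynamic quantization (llama.cpp's q4/q5 schemes and GPTQ are different methods, but the idea of lower-precision weights is the same); `model` is a hypothetical trained float32 module:

```python
import torch

# Linear layers get int8 weights and int8 kernels on CPU; the rest stays float.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(model.state_dict(), "fp32.pt")       # original
torch.save(quantized.state_dict(), "int8.pt")   # roughly 4x smaller for the quantized layers
```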
Resources
- 8-bit Methods for Efficient Deep Learning — Tim Dettmers (University of Washington) - YouTube 🌟
- Basic math related to computation and memory usage for transformers 🌟
- What Is int8 Quantization and Why Is It Popular for Deep Neural Networks?
- Quantization | Papers With Code
- https://inst.eecs.berkeley.edu/~ee290-2/sp20/assets/labs/lab1.pdf
- TensorFlow model optimization | TensorFlow Model Optimization
- Post-training quantization | TensorFlow Lite
- 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes
- Quantization
- Quantization for Neural Networks - Lei Mao’s Log Book
Explanations
- In llama.cpp's naming, q4_0 means 4-bit quantization using scheme variant 0 (blocks of weights share a scale factor).
- q5_1 means 5-bit quantization, scheme variant 1 (each block stores an offset in addition to the scale).
- The K marks the newer "k-quant" methods, and S, M, L denote small, medium, and large variants, i.e. how aggressively the different tensors are quantized.
- fp16 means the model uses 16-bit floating-point precision, which trades some numerical precision for memory and compute efficiency; a tiny sketch follows.
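A tiny sketch of what fp16 looks like in practice, assuming a hypothetical `model`, `inputs`, and a CUDA GPU:

```python
import torch

model_fp16 = model.half().to("cuda")            # weights now take half the memory
with torch.no_grad():
    out = model_fp16(inputs.half().to("cuda"))  # inputs must match the dtype
```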
Other stuff
- Pruning
- Distillation (e.g. Alpaca (small model) is a distillation of GPT-3.5 (big model))
Ship the weights
Inference
There are several inference engines to choose from.
- There are also cloud providers which offer GPUs for inference, e.g. Genesis Cloud and Paperspace, plus the big ones like AWS with Inferentia.
- CPU is often faster (or at least fast enough) for inference, except on very huge models like big CNNs or, I guess, big Transformers, since small workloads don't amortize the GPU copy overhead.
- inference is only doing forward pass
- Sometimes inference may require GPU
- a GPU is good if you have a lot of compute intensity for inference
- Now you are calling your model to do inference on those inputs. If the model is running on CPU, the data goes directly from RAM to the CPU, evaluation happens on the CPU, and the output is saved back to RAM and sent as the response.
- If you are running your model on GPU, the inputs are first copied from RAM to GPU memory, passed from GPU memory to the GPU compute units for processing, the output is saved to GPU memory, and then copied back to RAM to be sent as the response. So you have two extra copy steps (see the sketch after this list).
- Aim for low latency
- Can be a web service.
- “inference endpoint” which is basically an HTTP API that you can send requests to and get back responses.
- or just use it in some way
- Most trained ML models that ship today can generate predictions on a raspberry pi or a cell phone. LLMs still require more hardware for inference, but you’d be surprised how little they need compared to what’s needed for training.
- People usually train on GPU and run inference on CPU. Saves a lot of money.
- Really depends on the model.
- For big models (60B+), inference is possible with Threadripper builds with >256 GB RAM + some assortment of 2x-4x GPUs. BUT it'll be stupid slow, probably MINUTES per token.
- GLM-130B runs on 4x 3090, uses INT4.
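A minimal sketch of the CPU vs GPU paths described above, with a hypothetical `model` and `batch`:

```python
import torch

model.eval()
with torch.no_grad():
    # CPU path: data stays in RAM, eval happens on the CPU.
    cpu_out = model(batch)

    # GPU path: copy inputs RAM -> VRAM, compute, copy the result VRAM -> RAM.
    gpu_model = model.to("cuda")
    gpu_out = gpu_model(batch.to("cuda")).cpu()
```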
Deploying
- TensorFlow Serving (with ONNX, you can train with PyTorch and serve with TF)
- NVIDIA Triton Inference Server (there's also OpenAI's Triton, which is different: a GPU kernel programming language)
Web based
- WebAssembly GGML: the whisper example doesn't use WebGPU, just WASM + CPU
- WebGPT
- POC, uses vanilla JS to make use of WebGPU to run a basic GPT
- [2112.09332] WebGPT: Browser-assisted question-answering with human feedback
- TVM based
- TVM is used to compile the GPU-side code to WebGPU
- Emscripten is used to compile the CPU-side code to WebAssembly.
- Examples
Continuous learning
- You can update a model with new data at any time. Production models are often updated at intervals with monitoring. It is generally possible but not always practical.
- Federated learning - Wikipedia
RL
It's common for agents to spend some fraction of their time exploring and the rest exploiting the environment; a minimal epsilon-greedy sketch follows.
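A minimal epsilon-greedy sketch of that split (a generic illustration, not tied to any library); `q_values` is a hypothetical list of estimated action values for the current state:

```python
import random

def choose_action(q_values, epsilon=0.1):
    if random.random() < epsilon:                      # explore: pick a random action
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best known action
```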
NN
There is a whole "online learning" field that addresses just that; a minimal sketch follows.
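A minimal online-learning sketch using scikit-learn's partial_fit; `stream_of_batches()` and `all_classes` are hypothetical stand-ins for the incoming data:

```python
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")
for X_batch, y_batch in stream_of_batches():
    # classes must be given on the first call so the model knows the full label set
    clf.partial_fit(X_batch, y_batch, classes=all_classes)
```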
Hardware
Chips
generated by chatgpt
CPU
Central Processing Units (CPUs) are the main processing units in a computer. They are designed to handle a wide range of tasks, including running the operating system and applications. However, CPUs are not optimized for parallel computing, which is a key aspect of machine learning.
GPU
Graphics Processing Units (GPUs) were originally designed for rendering graphics, but their parallel architecture makes them well-suited for machine learning tasks that require large-scale parallel processing, such as training deep neural networks. GPUs are particularly useful for matrix operations, which are common in machine learning.
APU
Accelerated Processing Units (APUs) are a combination of CPU and GPU on a single chip. They are designed to provide high-performance computing for a variety of tasks, including machine learning. APUs are well-suited for tasks that require both serial and parallel processing.
TPU
Tensor Processing Units (TPUs) are Google’s custom-designed chips for machine learning. TPUs are optimized for matrix multiplication, which is a key operation in deep neural networks. TPUs are particularly useful for large-scale machine learning tasks, such as training complex models on massive datasets.
FPGA
Field-Programmable Gate Arrays (FPGAs) are chips that can be reprogrammed after manufacturing. FPGAs are highly customizable and can be optimized for specific machine learning tasks. They are particularly useful for tasks that require low latency and high throughput, such as real-time image recognition.
ASIC
Application-Specific Integrated Circuits (ASICs) are custom-designed chips that are optimized for specific tasks. ASICs are highly specialized and can be designed to perform a particular machine learning task very efficiently. ASICs are particularly useful for tasks that require high throughput and low power consumption, such as inference in deep neural networks.
Models
- GPT-J
- gpt-turbo and FLAN-T5-XXL
- RoBERTa
- LLaMA or Stanford Alpaca
Terms I keep hearing
- dropout regularization in a deep neural network
- activations
- weights