tags: Modern AI Stack, Machine Learning, Stable Diffusion
Resources
- Airtable link: MLModels Airtable
- All links: https://github.com/imaurer/awesome-decentralized-llm
- https://github.com/evanmiller/LLM-Reading-List
Fine-tuning base models
- Fine-tuning Alpaca
- ML Blog - Fine-Tune Your Own Llama 2 Model in a Colab Notebook
- Google Colab
- Axolotl (from OpenAccess-AI-Collective) GitHub repo now supports flash attention with QLoRA fine-tunes (see the QLoRA sketch after this list)
- https://github.com/mshumer/gpt-llm-trainer
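A minimal sketch of what a QLoRA fine-tune sets up, using Hugging Face transformers, bitsandbytes, and peft. The base model name, LoRA hyperparameters, and target modules are placeholder assumptions, not the settings used by Axolotl or the linked guides.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base model

# Load the frozen base model in 4-bit NF4 (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Attach small trainable LoRA adapters on top of the frozen 4-bit weights.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # placeholder choice of layers
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable

# Training itself then runs through transformers' Trainer or trl's SFTTrainer
# on an instruction dataset, as in the linked Colab guides.
```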
Tools/libs
- https://github.com/marella/ctransformers (ggml python bindings; usage sketch after this list)
- https://huggingface.co/hkunlp/instructor-xl (embeddings)
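A minimal usage sketch for ctransformers, loading a GGML-quantized LLaMA-family model. The repo name and generation parameters are placeholders, not a recommendation.

```python
from ctransformers import AutoModelForCausalLM

# Point at a local GGML/GGUF file or a Hugging Face repo that ships one.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGML",  # placeholder repo
    model_type="llama",
)

# Generation runs on CPU via the bundled ggml backend.
print(llm("Q: What is QLoRA?\nA:", max_new_tokens=64, temperature=0.7))
```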
Meta
- Base models are not meant to be used directly; they are intended as a starting point for fine-tuning
LLaMA
- LLaMA is not tuned for instruction following like ChatGPT
- llama.cpp story: What is the meaning of hacked? · Issue #33 · ggerganov/llama.cpp · GitHub
Alpaca
- What’s Alpaca-LoRA? A technique for fine-tuning LLaMA using LoRA (low-rank adaptation)
Case studies/Tools
- ReLLM: Exact Structure for Large Language Model Completions
- Context-Free Grammar Parsing with LLMs
- GitHub - microsoft/guidance: A guidance language for controlling large language models.
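These tools all constrain the shape of model completions. A rough sketch of the core idea behind ReLLM-style regex constraints: during greedy decoding, skip any token that would break the target pattern. It uses the third-party `regex` package for partial matching and a small GPT-2 model; this is illustrative only, not the actual ReLLM or guidance API.

```python
import regex  # third-party "regex" package (supports partial matching)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def constrained_generate(prompt: str, pattern: str, max_new_tokens: int = 20) -> str:
    """Greedy decoding where the completion must stay a prefix match of `pattern`."""
    compiled = regex.compile(pattern)
    completion = ""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]
        # Rank tokens by score, take the best one that keeps the completion a
        # valid partial match. (O(vocab) regex checks per step -- fine for a
        # sketch, far too slow for production.)
        for token_id in torch.argsort(logits, descending=True).tolist():
            candidate = completion + tokenizer.decode([token_id])
            if compiled.fullmatch(candidate, partial=True) is not None:
                completion = candidate
                input_ids = torch.cat(
                    [input_ids, torch.tensor([[token_id]])], dim=1)
                break
        if compiled.fullmatch(completion):  # pattern fully satisfied -> stop
            break
    return completion

# e.g. force the completion to look like a simple key=value pair
print(constrained_generate("Output a config line: ", r"[a-z]+=[0-9]+"))
```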
Comparison
Guanaco 7B (llama.cpp)
- CPU
- 1 Thread, CPU: 0.17-0.26 tokens/s
- 11 Threads, 12 vCPU: ~1 token/s
- 21 Threads, 12 vCPU: ~0.3 tokens/s
- 10 Threads, 12 vCPU: ~0.3 tokens/s
- 1 Thread, CPU, cuBLAS: 0.17-0.26 tokens/s
- 9 Threads, CPU, cuBLAS: 5 tokens/s
- GPTQ (GPU)
- ~25 tokens/s
Others
- Someone on the internet
- My llama2 inference performance on desktop RTX 3060
- 5bit quantization (llama.cpp with cuda)
- 47 tokens/sec (21ms/token) for llama7b
- 27 tokens/sec (37ms/token) for llama13b
Resources
- https://github.com/ggerganov/llama.cpp/blob/master/docs/token_generation_performance_tips.md
- https://github.com/oobabooga/text-generation-webui/blob/main/docs/Generation-parameters.md
Techniques
Prompt chaining
- Making language models compositional and treating them as tiny reasoning engines, rather than sources of truth.
- Combines language models with other tools, such as web search, traditional programming functions, and database lookups, to compensate for the model's weaknesses (see the sketch after this list)
- Source: Squish Meets Structure: Designing with Language Models
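A minimal prompt-chaining sketch of those ideas: one model call extracts a tool argument, a plain Python function supplies the fact, and a second call composes the answer. `call_llm` and `lookup_population` are hypothetical placeholders, not part of any library named above.

```python
# Hedged sketch of prompt chaining: the LLM is a small reasoning step in a
# pipeline, with a deterministic tool doing the part it is bad at.

def call_llm(prompt: str) -> str:
    # Placeholder -- swap in your own client (OpenAI API, llama.cpp server, etc.).
    raise NotImplementedError("plug in your LLM client here")

def lookup_population(city: str) -> str:
    # Stand-in for a real tool: a database query, web search, calculator, ...
    populations = {"paris": "about 2.1 million", "tokyo": "about 14 million"}
    return populations.get(city.lower(), "unknown")

def answer(question: str) -> str:
    # Step 1: ask the model only for the tool argument, not the final answer.
    city = call_llm(
        "Extract the city name from this question. "
        f"Reply with the city name only.\n\nQuestion: {question}"
    ).strip()

    # Step 2: run the deterministic tool.
    fact = lookup_population(city)

    # Step 3: let the model compose the final answer from the retrieved fact.
    return call_llm(
        f"Question: {question}\n"
        f"Known fact: the population of {city} is {fact}.\n"
        "Answer the question in one sentence using only the known fact."
    )

# answer("How many people live in Paris?")
```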