tags : Modern AI Stack , Machine Learning, StableDiffusion

Resources

Fine tuning base model

Tools/libs

Meta

  • Base models are not meant to be used directly; they are intended as a starting point for fine-tuning

LLaMA

Alpaca

  • What’s Alpaca-LoRA? A technique for fine-tuning LLaMA using LoRA (low-rank adaptation)
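
A minimal sketch of the LoRA idea behind Alpaca-LoRA, using NumPy rather than the actual `peft`/`transformers` APIs (all names here are illustrative): instead of updating a full weight matrix W, train two small matrices A and B and apply W' = W + B·A, with rank r much smaller than the layer dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 4          # r is the low-rank bottleneck, r << d
W = rng.normal(size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(size=(r, d_in))      # trainable, small
B = np.zeros((d_out, r))            # trainable, init to zero so W' == W at start

def lora_forward(x, W, A, B, scale=1.0):
    """Forward pass with the low-rank update applied on top of frozen W."""
    return x @ (W + scale * (B @ A)).T

x = rng.normal(size=(2, d_in))
# Before training, B is all zeros, so the LoRA path changes nothing:
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)

# Parameter savings: full update vs low-rank update
full_params = d_out * d_in            # 4096
lora_params = r * d_in + d_out * r    # 512
print(full_params, lora_params)
```

This is why LoRA fine-tuning fits on consumer GPUs: only the small A and B matrices receive gradients, while the base model stays frozen.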

Case studies/Tools

Comparison

Guanaco 7B (llama.cpp)

  • CPU
    • 1 thread, CPU: 0.17-0.26 tokens/s
    • 11 threads, 12 vCPU: ~1 token/s
    • 21 threads, 12 vCPU: ~0.3 tokens/s
    • 10 threads, 12 vCPU: ~0.3 tokens/s
    • 1 thread, CPU, cuBLAS: 0.17-0.26 tokens/s
    • 9 threads, CPU, cuBLAS: 5 tokens/s
  • GPTQ (GPU)
    • ~25 tokens/s

Others

  • Someone on internet
    • My llama2 inference performance on desktop RTX 3060
      • 5bit quantization (llama.cpp with cuda)
        • 47 tokens/sec (21ms/token) for llama7b
        • 27 tokens/sec (37 ms/token) for llama13b
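
The two ways of quoting speed above (tokens/s vs ms/token) are reciprocals; a tiny helper makes it easy to cross-check figures like the RTX 3060 numbers (function names here are just for illustration):

```python
def tokens_per_sec_to_ms(tps: float) -> float:
    """Convert throughput (tokens/s) to per-token latency (ms/token)."""
    return 1000.0 / tps

def ms_to_tokens_per_sec(ms: float) -> float:
    """Convert per-token latency (ms/token) back to throughput (tokens/s)."""
    return 1000.0 / ms

# Cross-checking the figures quoted above:
print(round(tokens_per_sec_to_ms(47), 1))  # ~21.3 ms/token (quoted as 21 ms)
print(round(tokens_per_sec_to_ms(27), 1))  # ~37.0 ms/token (quoted as 37 ms)
```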

Resources

Techniques

Prompt chaining

  • Making language models compositional and treating them as tiny reasoning engines, rather than sources of truth.
  • Combine language models with other tools, like web search, traditional programming functions, and database lookups; this compensates for the language model’s weaknesses.
  • Source: Squish Meets Structure: Designing with Language Models
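
A minimal sketch of the chaining idea: each step is a small function, and the output of one feeds the next. The "model" here is a stub and `lookup_db` stands in for a real tool call (database, web search); all names (`fake_llm`, `lookup_db`, `answer_question`) are hypothetical.

```python
FACTS = {"capital of France": "Paris"}  # stand-in for a database/search tool

def fake_llm(prompt: str) -> str:
    """Stub LLM: extracts the question's key phrase (a real model would reason)."""
    return prompt.split("about ")[-1].strip("?")

def lookup_db(key: str) -> str:
    """Tool step: ground the model's output in an external source of truth."""
    return FACTS.get(key, "unknown")

def answer_question(question: str) -> str:
    """Chain: LLM step -> tool step -> composed answer."""
    key = fake_llm(question)   # step 1: model as a tiny reasoning engine
    fact = lookup_db(key)      # step 2: retrieval, not the model, supplies the fact
    return f"{key}: {fact}"    # step 3: compose the grounded answer

print(answer_question("Tell me about capital of France?"))
```

The point is the shape, not the stub: the model decomposes the task, while facts come from tools, which is what keeps it a reasoning engine rather than a source of truth.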