LLMs in Many Flavors
Large language models have become indispensable for natural language tasks, with models like GPT-3, LLaMA, and Mistral pushing the state of the art ever further. However, with great power comes great confusion: how do you actually use these giant models?
Each model has its own format: weights, activations, configurations, etc. Furthermore, models are developed by different groups, with different codebases and assumptions. Trying to use multiple models together, or convert between formats, can be challenging.
This post dives into the jungle of the LLM zoo, surveying the landscape of different models, formats, and codebases. We'll explore:
The de facto standard Transformers library
Different backends for running LLaMA
Quantized and compressed model formats such as GPTQ (via the AutoGPTQ library)
Optimized runtimes like ExLlama
And more
Whether you want to compare models, convert formats, or optimize performance, this post aims to clarify the chaos. I'll provide code snippets for actual format conversions and performance benchmarks. My goal is for you to walk away with a mental map of the LLM ecosystem, empowered to work with any model or backend.
This post contains a wide survey across many models, formats, and techniques from both industry and academia. It's not completely comprehensive, but should provide orientation and jumping-off points to explore areas of interest. So buckle up, and let's organize this zoo!
LLM Zoo
What is a LLaMA?
LLaMA is not an animal: it stands for Large Language Model Meta AI, a family of autoregressive language models developed by Meta AI. It builds upon the same decoder-only Transformer architecture as GPT-3, but with some key distinctions.
Architecture
LLaMA utilizes a transformer architecture with multi-head attention layers, enabling efficient scaling into the tens of billions of parameters. The LLaMA family comprises four models:
LLaMA-7B: 7 billion parameters
LLaMA-13B: 13 billion parameters
LLaMA-33B: 33 billion parameters
LLaMA-65B: 65 billion parameters
Each model in the LLaMA family is a separate foundation model, trained on a large corpus of unlabeled text, which makes them well suited to fine-tuning for a variety of downstream tasks.
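If you have the weights available, a minimal sketch of loading a LLaMA checkpoint with the Transformers library looks like this. It assumes you've been granted access to the gated meta-llama/Llama-2-7b-hf repo, which is used here purely as an illustration; any local LLaMA checkpoint path works the same way.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example repo; swap in any LLaMA checkpoint you have access to locally.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # requires the accelerate package to place layers across devices
)

inputs = tokenizer("The LLaMA family of models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))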
Running LLaMA with Hugging Face
To utilize LLaMA, you don't actually need access to a proprietary API from Meta. You can integrate LLaMA models directly into your applications using the Hugging Face Inference API.
The Inference API lets you call hosted models over HTTP without standing up any infrastructure of your own. It supports text generation, classification, tokenization, and more, across models built in frameworks like PyTorch and TensorFlow.
To use it, you'll need to:
Register for a Hugging Face account
Grab an API token from Account Settings
Authenticate requests with your token
For example, in Python:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf"
headers = {"Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}  # your Hugging Face API token

def query(payload):
    # POST the JSON payload to the hosted model and return the parsed response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Can you please let us know more details about your ",
})
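The payload can also carry generation parameters. As a rough sketch (consult the Inference API documentation for the exact options a given model and task support):

output = query({
    "inputs": "Explain what makes LLaMA different from GPT-3 in one sentence.",
    "parameters": {"max_new_tokens": 100, "temperature": 0.7},  # example options; availability varies by task
})
print(output)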
The API is rate-limited on the free tier, but higher rate limits are available on paid plans. For production workloads, Inference Endpoints provide dedicated infrastructure for serving models.
So don't be constrained to proprietary access. With Hugging Face, you can run models like LLaMA directly in your applications! I'll cover more advanced usage in future posts.
Conclusion
Thanks for reading! I'm cutting this one short, but it's just the start: I'll keep posting updates and more in-depth pieces, ramping up the detail as the series goes along.