LLMs in Many Flavors
Large language models have become indispensable for natural language tasks, with models like GPT-3, LLaMA, and Mistral pushing the state of the art ever further. However, with great power comes great confusion: how do you actually use these giant models?
Each model has its own format: weights, activations, configurations, etc. Furthermore, models are developed by different groups, with different codebases and assumptions. Trying to use multiple models together, or convert between formats, can be challenging.
This post dives into the jungle of the LLM zoo, surveying the landscape of different models, formats, and codebases. We'll explore:
The de facto standard Transformers library
Different backends for running LLaMA
Quantized and compressed model formats such as GPTQ (via the AutoGPTQ library)
Optimized runtimes like ExLlama
And more
Whether you want to compare models, convert formats, or optimize performance, this post aims to clarify the chaos. I'll provide code snippets for actual format conversions and performance benchmarks. My goal is for you to walk away with a mental map of the LLM ecosystem, empowered to work with any model or backend.
This post contains a wide survey across many models, formats, and techniques from both industry and academia. It's not completely comprehensive, but should provide orientation and jumping-off points to explore areas of interest. So buckle up, and let's organize this zoo!
LLM Zoo
What is a LLaMA?
LLaMA is not an animal: it stands for Large Language Model Meta AI, a family of autoregressive language models developed by Meta AI. It builds upon the same decoder-only Transformer architecture as GPT-3, but with some key distinctions.
Architecture
LLaMA utilizes a transformer architecture with multi-head attention layers, enabling efficient scaling into the tens of billions of parameters. The LLaMA family comprises four models:
LLaMA-7B: 7 billion parameters
LLaMA-13B: 13 billion parameters
LLaMA-33B: 33 billion parameters
LLaMA-65B: 65 billion parameters
Each model in the LLaMA family is a separate foundation model, trained on a large corpus of unlabeled text, which makes them well suited to fine-tuning for a variety of downstream tasks.
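If you have the weights available, a minimal sketch of loading a LLaMA checkpoint with the Transformers library looks like this. It assumes you've been granted access to the gated meta-llama/Llama-2-7b-hf repo, which is used here purely as an illustration; any local LLaMA checkpoint path works the same way.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example repo; swap in any LLaMA checkpoint you have access to locally.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",          # requires the accelerate package to place layers across devices
)

inputs = tokenizer("The LLaMA family of models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))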
Running LLaMA with Hugging Face
To utilize LLaMA, you don't actually need access to a proprietary API from Meta. You can integrate LLaMA models directly into your applications using the Hugging Face Inference API.
The Inference API lets you call hosted models over HTTP without standing up any infrastructure of your own. It supports text generation, classification, tokenization, and more, across models built in frameworks like PyTorch and TensorFlow.
To use it, you'll need to:
Register for a Hugging Face account
Grab an API token from Account Settings
Authenticate requests with your token
For example, in Python:
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-70b-chat-hf"
headers = {"Authorization": "Bearer xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"}  # your Hugging Face API token

def query(payload):
    # POST the JSON payload to the hosted model and return the parsed response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "Can you please let us know more details about your ",
})
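The payload can also carry generation parameters. As a rough sketch (consult the Inference API documentation for the exact options a given model and task support):

output = query({
    "inputs": "Explain what makes LLaMA different from GPT-3 in one sentence.",
    "parameters": {"max_new_tokens": 100, "temperature": 0.7},  # example options; availability varies by task
})
print(output)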
The API is rate-limited on the free tier, but higher rate limits are available on paid plans. For production workloads, Inference Endpoints provide dedicated infrastructure for serving models.
So don't be constrained to proprietary access. With Hugging Face, you can run models like LLaMA directly in your applications! I'll cover more advanced usage in future posts.
Conclusion
Thanks for reading! I'm cutting this one short, but it's just the start: I'll keep posting updates and more in-depth pieces, ramping up the detail as the series goes along.