AI Model Formats Explained: Demystifying Llama.cpp, GGUF, GGML, and Transformers

Posted by Aug on November 26, 2023

Abstract:
This post provides a breakdown of a helpful Reddit discussion concerning various AI model formats (GGUF, GGML, safetensors) and associated tools like Llama.cpp and Hugging Face Transformers. The author adds personal notes on GGML’s role in model speed versus quantization and summarizes key takeaways for understanding the complex landscape of local LLM deployment, including different quantization libraries (AutoGPTQ, ExLlama v2) and their supported formats.

Estimated reading time: 3 minutes

The world of running Large Language Models (LLMs) locally on your own computer can feel like navigating a maze of different file formats, tools, and libraries. I recently stumbled upon an excellent Reddit post in the /r/LocalLLaMA community that does a great job of clarifying the roles of things like Llama.cpp, GGUF, GGML, Hugging Face Transformers, and more:

Original Reddit Post: Transformers, Llama.cpp, GGUF, GGML, GPTQ & other animals

I found it super helpful, and I thought I’d share a summary of its key points, along with a couple of my own observations from working with these tools.

One thing I wanted to clarify, which the post touches on, is the relationship between GGML and quantization. In my experience converting a GPT-J float16 model to GGML (as detailed in my previous post on speeding up GPT-J), the conversion itself didn’t quantize the model: it remained float16 but became significantly faster to run. So GGML’s primary benefit, in that case, was speed from a more efficient format, not size reduction from quantization (though GGML can also store quantized models).

With that said, here’s a rundown of the main tools and formats discussed in the Reddit post (a summary I initially had GPT-4 help me distill):

Key Tools, Formats, and Their Relationships:

  1. Llama.cpp (by Georgi Gerganov):

    • This is a very popular tool for running LLMs efficiently on CPUs (and increasingly GPUs).
    • It primarily uses the GGUF file format.
    • It can convert models saved in safetensors format (a secure way to store model weights) into its preferred .gguf format.
    • Notably, Llama.cpp has deprecated support for the older GGML format.
  2. GGUF (Georgi Gerganov Universal Format):

    • This is the successor to GGML, designed by the same author.
    • A big plus: GGUF files usually include all necessary metadata (like tokenizer details, which define how text is broken into pieces for the model) within the single model file. This means you often don’t need separate configuration files like tokenizer_config.json.
    • However, GGUF files generally don’t include the “prompt template” (the specific way you need to structure your input questions for best results). You can usually find these templates on the model’s page on Hugging Face; the llama-cpp-python sketch after this list shows one being applied by hand.
  3. GGML (Georgi Gerganov Model Library):

    • The older format that GGUF replaces.
    • As mentioned, Llama.cpp no longer actively supports it.
    • Some other tools, like koboldcpp (a fork of Llama.cpp), might still use GGML for running .bin files.
  4. Transformers (by Hugging Face):

    • A comprehensive Python library that’s a go-to for many working with AI models.
    • Supports models in safetensors format (both unquantized and quantized using methods like GPTQ).
    • Also supports the older .bin format (usually unquantized).
    • The “Use in Transformers” button you sometimes see on Hugging Face model pages is often an auto-generated template and might not always work perfectly out-of-the-box for every model.
  5. Quantization Libraries & Formats:

    • AutoGPTQ: A library for quantizing models using the GPTQ algorithm (making them smaller and often faster, sometimes with a small trade-off in accuracy). Transformers can use models quantized with AutoGPTQ (see the Transformers sketch after this list).
    • ExLlama v2: Offers a highly optimized way to run LLaMA models quantized with GPTQ. It also supports another quantization method called AWQ (low-bit quantization). It typically uses safetensors for these quantized models.
    • AWQ (Activation-aware Weight Quantization): Another method for low-bit quantization (like INT3/4, meaning 3-bit or 4-bit numbers for weights), also using the safetensors format.
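
To make the Llama.cpp / GGUF side of this concrete, here’s a minimal sketch using the llama-cpp-python bindings. The file name is just a placeholder, and the `[INST] ... [/INST]` prompt template is an assumption (Mistral-instruct style); as noted above, you’d look up the right template on the model’s Hugging Face page.

```python
# Minimal sketch: run a GGUF model with the llama-cpp-python bindings
# (pip install llama-cpp-python). The file name below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical GGUF file
    n_ctx=2048,        # context window size
    n_gpu_layers=0,    # 0 = CPU only; increase to offload layers to a GPU
)

# GGUF bundles the tokenizer metadata, but not the prompt template,
# so the template has to be applied by hand (this one is an assumption).
prompt = "[INST] Explain GGUF in one sentence. [/INST]"

result = llm(prompt, max_tokens=128, temperature=0.7)
print(result["choices"][0]["text"])
```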
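
On the Transformers side, here’s a sketch of loading a GPTQ-quantized safetensors checkpoint. Recent Transformers versions can load GPTQ models directly (with optimum and auto-gptq installed); the repository name below is a made-up placeholder, so substitute a real GPTQ repo from Hugging Face.

```python
# Minimal sketch: load a GPTQ-quantized safetensors model with Transformers.
# Requires `pip install transformers accelerate optimum auto-gptq` and a CUDA GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "someuser/some-model-7B-GPTQ"  # hypothetical GPTQ repo on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("What does quantization trade off?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```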

Other Important Notes from the Discussion:

  • The landscape is complex: There are many frameworks (Transformers, Llama.cpp, etc.), and they don’t all support the same quantization methods or file formats. You often need to match the tool to the specific model format you have (see the small sketch below).
  • safetensors files are generally used with the Transformers library, not directly with Llama.cpp (which prefers GGUF).
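
As a toy illustration of that “match the tool to the format” point, here’s a small sketch that picks a backend from the file extension. It’s deliberately simplistic (and the .bin case really is ambiguous), but it captures the mapping described above.

```python
# Toy sketch: route a model file to a backend based on its extension.
from pathlib import Path

def pick_backend(model_path: str) -> str:
    suffix = Path(model_path).suffix.lower()
    if suffix == ".gguf":
        return "Llama.cpp (e.g. via llama-cpp-python)"
    if suffix == ".safetensors":
        return "Hugging Face Transformers"
    if suffix == ".bin":
        # Ambiguous: could be an old-style Transformers checkpoint or a
        # legacy GGML file (the latter still runs in forks like koboldcpp).
        return "check the model card: Transformers or a legacy GGML runner"
    return "unknown format: check the model card"

print(pick_backend("mistral-7b-instruct.Q4_K_M.gguf"))
print(pick_backend("model.safetensors"))
```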

This Reddit discussion does a great job of highlighting how quickly this field is evolving. Understanding these different formats and tools is key if you’re diving into running LLMs locally. It’s a bit of a learning curve, but resources like that Reddit thread are invaluable!

Next, I will work on integrating a UI for my model.