Abstract:
This post serves as a guide to dramatically improve the inference speed of the GPT-J-6B language model on a local machine. It details the author’s experience converting a Hugging Face `float16` model to the GGML format, which reduced response times from minutes to under 20 seconds. The process covers memory requirements for conversion, the steps to convert the model using `convert-h5-to-ggml.py`, building the necessary `gpt-j` executable from the `ggml` repository, and running the optimized model. A note on the newer GGUF format and an unsuccessful attempt to convert GPT-J to GGUF is also included.
Estimated reading time: 6 minutes
If you’ve ever tried to get a large language model like EleutherAI’s GPT-J-6B (from Hugging Face) to generate text on your own computer, you might have faced the same problem I did. This process, called “inference,” can be incredibly slow. I was waiting 15 to 30 minutes just for the model to load and produce a response to my prompts! This was on my ThinkPad X1 Carbon (Gen 11) with 64GB of RAM, so my computer was powerful enough. This performance was too slow for any practical use.
The GGML Solution: A Massive Speed Boost
Then I discovered a technique that made it much faster: converting the model to the GGML format. GGML is a file format created by Georgi Gerganov (ggerganov). It’s designed to help large AI models run faster, especially on regular computer CPUs.
I converted the standard `float16` version of the GPT-J-6B model to this GGML format. The `float16` version stores its weights as 16-bit floating-point numbers. Importantly, I did this without any further “quantization” (a process that can shrink models but sometimes reduces accuracy). The results were very impressive.
Response times dropped sharply from many minutes to under 20 seconds. The text generation speed was then about 400 milliseconds per token (a token is roughly a word or part of a word). This made a crucial difference for me!
Here’s how I did it, and how you can too.
Step 1: Convert Your GPT-J Model to GGML Format
First, you’ll need to convert the `pytorch_model.bin` file (which is how Hugging Face often stores these models) into the `.bin` GGML format.
Important Note on Memory: This conversion uses a lot of RAM (computer memory). I initially tried it in my WSL (Windows Subsystem for Linux) environment on a laptop with 32GB of RAM, but WSL itself was limited to only 16GB. The conversion program kept stopping suddenly (it showed a “Killed” message). I had success by running the Python conversion script directly in the Windows Command Prompt (cmd.exe). This allowed the process to use the full 32GB of system RAM. Looking at the Task Manager, the script used about 24GB of RAM during the conversion. So, make sure the system you use for conversion has plenty of memory – likely 24GB or more for GPT-J-6B.
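(If you would rather keep the conversion inside WSL, its memory cap can usually be raised with a `.wslconfig` file in your Windows user profile. I ended up using cmd.exe instead, so treat this as an untested pointer rather than something I verified here; the value below is only an example for a 32GB machine.)

```ini
# %UserProfile%\.wslconfig  (run "wsl --shutdown" afterwards so the new limit takes effect)
[wsl2]
memory=28GB
```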
Files You’ll Need:
To run the conversion, you’ll need the main model file (usually `pytorch_model.bin`, or a similar `.bin` or `.safetensors` file if you’re using `float16`) and a few essential configuration files from the Hugging Face repository.
- Go to the Hugging Face model page for GPT-J-6B, for example, EleutherAI/gpt-j-6b.
- Navigate to the “Files and versions” tab.
- Crucially, if you’re using the `float16` version (which I highly recommend for a balance of performance and quality before GGML conversion), select the `float16` branch from the branches dropdown. If you are using a different precision (like `main` for `float32`), select that branch.
- Download the following files into a dedicated folder for your model (e.g., `models/my-gpt-j-float16`):
  - `pytorch_model.bin` (or the equivalent for your chosen precision, like `model-00001-of-00002.bin` if sharded)
  - `config.json`
  - `vocab.json`
  - `added_tokens.json` (if present)
These files give the conversion program information about the model’s structure and the words it uses.
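If you prefer to script the download instead of clicking through the web UI, something like the following should work. This is only a sketch: it assumes a reasonably recent `huggingface_hub` package (for the `local_dir` argument), and the target folder is just the example name used throughout this post.

```python
# Sketch: download the float16-branch files needed for conversion.
# Assumes `pip install huggingface_hub` (recent enough to support local_dir).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="EleutherAI/gpt-j-6b",
    revision="float16",                   # the float16 branch
    local_dir="models/my-gpt-j-float16",  # example folder used in this post
    allow_patterns=[
        "pytorch_model*.bin",             # model weights (may be sharded)
        "config.json",
        "vocab.json",
        "added_tokens.json",
    ],
)
```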
Running the Conversion Script:
Once you have all the necessary files in your model’s folder (e.g., `C:\models\my-gpt-j-float16\` or `~/models/my-gpt-j-float16/`), you’ll use the `convert-h5-to-ggml.py` script from ggerganov’s `ggml` repository.
- Get the script: If you don’t have it, clone the `ggml` repository:

  ```bash
  git clone https://github.com/ggerganov/ggml
  ```

- Run the script: Run the Python script, telling it the location of your model’s folder. The final argument `1` tells the script you’re converting a `float16` model. If you were converting a `float32` model, you’d use `0`.

  ```bash
  python3 ./ggml/examples/gpt-j/convert-h5-to-ggml.py ./models/my-gpt-j-float16 1
  ```

  (Adjust the paths like `./ggml/...` and `./models/...` to match where you cloned `ggml` and where your model folder is located relative to your current directory.)
The output will show the script processing different parts (called “variables”) of the model:
```
Processing variable: transformer.h.27.ln_1.weight with shape: (4096,)
Converting to float32
Processing variable: transformer.h.27.ln_1.bias with shape: (4096,)
Converting to float32
... (many similar lines) ...
Processing variable: lm_head.bias with shape: (50400,)
Converting to float32
Done. Output file: C:\models\my-gpt-j-float16/ggml-model-f16.bin
```
The script will save the converted model as `ggml-model-f16.bin` (if you converted `float16`) inside your model’s folder.
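As a quick sanity check, you can peek at the header the converter wrote. The snippet below is a hypothetical helper based on my reading of the gpt-j example converter at the time, which I recall writing a 4-byte magic value followed by seven 32-bit little-endian integers (n_vocab, n_ctx, n_embd, n_head, n_layer, n_rot, ftype); check your copy of `convert-h5-to-ggml.py` if the numbers look wrong.

```python
# Sketch: inspect the header of a converted GGML gpt-j file.
# Assumed layout (verify against your convert-h5-to-ggml.py): magic, n_vocab,
# n_ctx, n_embd, n_head, n_layer, n_rot, ftype as 32-bit little-endian ints.
import struct

path = "models/my-gpt-j-float16/ggml-model-f16.bin"  # adjust to your output path

with open(path, "rb") as f:
    magic, n_vocab, n_ctx, n_embd, n_head, n_layer, n_rot, ftype = struct.unpack("<8i", f.read(32))

print(f"magic   = {magic:#x}")  # the old GGML examples used 0x67676d6c
print(f"n_vocab = {n_vocab}")   # 50400 for GPT-J-6B (matches lm_head.bias above)
print(f"n_embd  = {n_embd}")    # 4096 for GPT-J-6B
print(f"n_layer = {n_layer}")   # 28 for GPT-J-6B
print(f"ftype   = {ftype}")     # 1 should indicate float16
```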
Step 2: Build the GPT-J Executable
Now that you have your model in GGML format, you need a program to use it. The `ggml` repository includes example code that you can compile (build) into an executable program for this.
Important: You will probably need a Linux-like system (such as Linux itself, macOS, or WSL on Windows) to build this. I did this part in WSL.
- Navigate into the `ggml` directory you cloned earlier.
- Create a build directory and move into it:

  ```bash
  cd ggml
  mkdir build && cd build
  ```

- Use `cmake` to prepare the build (this tool checks your system and prepares the files for building):

  ```bash
  cmake ..
  ```

- Compile the code. The `make` command builds the programs. For example, `make -j4` tells it to use 4 processor cores to build faster (you can change the number 4). We’re specifically building the `gpt-j` tool:

  ```bash
  make -j4 gpt-j
  ```

This will create several executable files in the `ggml/build/bin/` directory, including one named `gpt-j`.
Step 3: Run Inference with Your GGML Model!
Now you’re ready! You can use the `gpt-j` executable you just built to load your converted GGML model and generate text.
Run it from your terminal like this:
```bash
# Example from within the ggml/build/bin directory:
./gpt-j -m ~/models/my-gpt-j-float16/ggml-model-f16.bin -p "What is the meaning of life?"

# Or, if your executable is elsewhere, use the full path to gpt-j:
# ~/ggml/build/bin/gpt-j -m ~/models/my-gpt-j-float16/ggml-model-f16.bin -p "What is the meaning of life?"
```
Replace `~/models/my-gpt-j-float16/ggml-model-f16.bin` with the correct path to your converted model file, and `"What is the meaning of life?"` with your desired prompt.
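If you’d rather drive the binary from Python than retype the command, a thin `subprocess` wrapper is enough. This is a minimal sketch: the two paths are placeholders for wherever you built `gpt-j` and saved the model, and it only relies on the `-m` and `-p` flags shown above.

```python
# Minimal sketch: call the gpt-j binary from Python and capture its output.
import subprocess

GPTJ_BIN = "/home/you/ggml/build/bin/gpt-j"                     # placeholder path
MODEL = "/home/you/models/my-gpt-j-float16/ggml-model-f16.bin"  # placeholder path

def generate(prompt: str) -> str:
    """Run one inference pass and return everything gpt-j printed to stdout."""
    result = subprocess.run(
        [GPTJ_BIN, "-m", MODEL, "-p", prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(generate("What is the meaning of life?"))
```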
You should see the model load, then your prompt, and finally the generated text:
```
> What is the meaning of life?
The meaning of life is to find out what the meaning of life is.

main: mem per token = 15550964 bytes  # Memory used per token
main: load time = 5256.00 ms          # Time to load the model
main: sample time = 33.95 ms          # Time for sampling strategy
main: predict time = 95766.09 ms / 467.15 ms per token  # Total prediction time and per-token speed
main: total time = 101924.70 ms       # Total time for the whole operation
```
These timing numbers show how fast the model runs; the “ms per token” value is the key measure of text generation speed.
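If you benchmark different prompts or builds, you can pull that per-token figure straight out of the summary. A tiny sketch, assuming the timing lines keep the format shown above:

```python
# Sketch: extract the "ms per token" figure from gpt-j's timing summary.
import re

def ms_per_token(output: str) -> float:
    """Return the per-token prediction time in milliseconds."""
    match = re.search(r"([\d.]+)\s*ms per token", output)
    if match is None:
        raise ValueError("no 'ms per token' value found in output")
    return float(match.group(1))

sample = "main: predict time = 95766.09 ms / 467.15 ms per token"
print(ms_per_token(sample))  # 467.15
```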
A Note on GGUF: The Newer Format
It’s worth mentioning that Georgi Gerganov has since developed an even newer model format called GGUF. GGUF is designed to be an improvement on GGML: it offers better model information (metadata), is easier to extend, and provides better compatibility with future model designs. Many tools, like `llama.cpp` (also from ggerganov), now mainly use GGUF.
When I worked on this (in late 2023), GGUF mostly supported Llama-type models. There are scripts to convert GGML models to GGUF, like `convert-llama-ggml-to-gguf.py` in the `llama.cpp` repository.
However, when I tried to convert my GPT-J GGML model to GGUF, I ran into an error. Here’s the command I used and the traceback:
```
C:\projects\llama.cpp>python convert-llama-ggml-to-gguf.py -i c:\Models\gpt4chan_model_float16\ggml-model-f16.bin -o c:\models\gpt4chan_model_float16\gguf-model-f16.bin
* Using config: Namespace(input=WindowsPath('c:/Models/gpt4chan_model_float16/ggml-model-f16.bin'), output=WindowsPath('c:/models/gpt4chan_model_float16/gguf-model-f16.bin'), name=None, desc=None, gqa=1, eps='5.0e-06', context_length=2048, model_metadata_dir=None, vocab_dir=None, vocabtype='spm')
=== WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===
- Note: If converting LLaMA2, specifying "--eps 1e-5" is required. 70B models also need "--gqa 8".
* Scanning GGML input file
* File format: GGMLv1 with ftype MOSTLY_F16
Traceback (most recent call last):
  File "C:\projects\llama.cpp\convert-llama-ggml-to-gguf.py", line 445, in <module>
    main()
  File "C:\projects\llama.cpp\convert-llama-ggml-to-gguf.py", line 419, in main
    offset = model.load(data, 0)  # noqa
  File "C:\projects\llama.cpp\convert-llama-ggml-to-gguf.py", line 181, in load
    offset += vocab.load(data, offset, hp.n_vocab)
  File "C:\projects\llama.cpp\convert-llama-ggml-to-gguf.py", line 85, in load
    assert itemlen < 4096, 'Absurd vocab item length'
AssertionError: Absurd vocab item length
```
The error `AssertionError: Absurd vocab item length` suggested the script was not compatible with my GPT-J model’s vocabulary. This is likely because the script is specifically for Llama models, as its name indicates. So, for GPT-J, GGML was the best option I found at the time to achieve this speedup.
For newer models, particularly Llama-based ones, you should consider GGUF first.
Conclusion: GGML Improves GPT-J Speed
Converting my GPT-J-6B model to GGML significantly increased its speed on my local machine, from painfully slow to impressively fast. If you’re experiencing similar slow performance with GPT-J, I highly recommend trying this conversion process.
A big thank you to ggerganov for creating `ggml` and these tools that make running large models much more accessible!