CPU-only LLM (Vicuna-13B) with Web Search
Using Llama.cpp to run Vicuna-13B (strongest local model), with Google Search
Overview
This project is an experiment to see how far we can push LLMs without GPUs. Here is a quick example conversation that I ran locally on my machine, without a GPU, getting results from Google and integrating them gracefully into the response:
[user] Tell me about the Red Devil Locomotive
Thought: I'm not sure what the user is asking for specifically, so I will search for more information on the "Red Devil Locomotive" and see if that provides any useful details or context.
Search: red devil locomotive
Final Answer: The "Red Devil" is a nickname given to the South African Railways (SAR) Class 26 4-8-4 steam locomotive, which was rebuilt in 1981 by mechanical engineer David Wardale. It was known for its high speed and powerful performance, and was popular among train enthusiasts. The Red Devil was the product of several modifications made to the original Class 25NC locomotive, including changes to the boiler, firebox, and cylinders.
Which is pretty great!
You can run this yourself by:
Cloning https://github.com/glinscott/llama.cpp/tree/vicuna_agent (which is a tiny mod of https://github.com/ggerganov/llama.cpp)
Building llama.cpp
Acquiring Vicuna-13B
Setting up Google Search integration
Running `python3 agent.py`
Let’s dig into what’s going on behind the scenes.
Llama.cpp
Llama.cpp allows you to run LLMs on your CPU. It is slower than GPU inference, but still real-time (340ms per token on my system with 16 threads). Since your desktop system generally has much more RAM available than the GPU (and it’s cheaper RAM!), this is a compelling trade-off for larger models. I’ve been following llama.cpp for a while, and it’s been improving at an incredible pace - the perplexity of quantized models is approaching that of baseline float16 models.
First, we need to get Vicuna-13B, the strongest available LLM we can run with llama.cpp (as measured by https://chat.lmsys.org/?arena as of May 14th - this changes almost every week). Vicuna is very impressive: it improves hugely on the baseline Llama models from Meta, but it is not yet in the same league as ChatGPT (3.5), and is miles away from GPT-4.
Notably, Vicuna-13B has two versions - we will be using the latest, version 1.1 (https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md), which uses the EOS token to denote that the model is done with its output. This is not natively supported by llama.cpp yet, so we need a small patch that searches for “</s>” in the input and replaces it with the actual EOS token.
I chose to run the 8-bit quantized version of the model, as its perplexity basically matches float16, but it is faster and runs in half the RAM: https://github.com/ggerganov/llama.cpp#quantization.
Agent
The agent itself is a basic Python wrapper around llama.cpp that just watches for “Observation:”, forwards the search query to the Google Search API, and then formats the response for the LLM.
Here is what the search output looks like for the above example:
Observation:Title: South African Class 26 4-8-4 - Wikipedia\
Snippet: The South African Railways Class 26 4-8-4 of 1981, popularly known as the Red Devil, is a 4-8-4 steam locomotive which was rebuilt from a Class 25NC ...\
###\
Title: Modifications in Mainline Steam - The Red Devil — Coalition for ...\
Snippet: Apr 28, 2016 ... South African Railways (SAR) Class 26 number 3450 (nicknamed the "Red Devil"), is the product of mechanical engineer David Wardale's 1981 ...\
[end observation]
The model gets the top two search results, with snippets of the page. Still, that’s enough for it to build a reasonable amount of context for a lot of queries. The backslashes are a bit of a hack to allow multi-line inputs to work with llama.cpp.
The parsing code is a bit of a mess, but this is it:
if buffer.startswith("Search:"):
# Reset query
query = buffer[8:]
elif buffer.startswith("Observation:"):
time.sleep(1)
items = search(query)['items']
result = "###\\".join(["Title: %s\\Snippet: %s\\" % (item["title"], item["snippet"]) for item in items[:2]])
result += "\n\n"
p.stdin.write(result.encode())
p.stdin.flush()
buffer = ""
The Prompt
I didn’t spend as much time as I should have here, but it seems to work well enough so far:
Answer the following questions as best you can. You have access to a search engine, which can fill in gaps in your knowledge, or give you information about current events.
If you want to search about something, use the following format:
Search: the query to the search engine
Observation: the result of the search
Use the following format when interacting with the user. Always use
this format, for each question the user asks.
Question: the input question you must answer
Thought: describe how you are going to answer the question
(optionally: query search engine, using Search, as many times as you need)
Final Answer: the final answer to the original input question</s>
User input now begins.
There are a few interesting parts in here.
“Always use this format, for each question the user asks.” Without this, I saw the LLM go off script, usually hallucinating search results for itself and then starting to generate its own user queries.
“Question/Thought/Answer” phrasing. This is borrowed from other agent prompts, and seems to help the agent stay on script.
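As a rough illustration of how the prompt gets used (the exact wiring in agent.py may differ), the text above is fed to the model once at startup, and each user turn is then wrapped in the “Question:” scaffold so the model continues with “Thought:”:

# Hypothetical helper - the name and exact formatting are assumptions.
def wrap_user_input(user_input):
    # Matches the "Question:" line the prompt tells the model to expect; the
    # model should respond with "Thought:", optional "Search:" lines, and
    # eventually "Final Answer:".
    return ("Question: %s\n" % user_input).encode()

# e.g.:
#   p.stdin.write(wrap_user_input("Tell me about the Red Devil Locomotive"))
#   p.stdin.flush()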
Google Search Integration
I signed up for the Google Search API and integrated it following this article; it’s a tiny amount of code in the end. You also need to `pip install google-api-python-client`.
from googleapiclient.discovery import build

def search(search_term, api_key=API_KEY, cse_id=API_CX, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    return res
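For example, with valid API_KEY and API_CX credentials in place, you can sanity-check the response shape like this - these are the same title/snippet fields the agent forwards to the model:

# Requires real Custom Search credentials.
results = search("red devil locomotive")
for item in results["items"][:2]:
    print(item["title"])
    print(item["snippet"])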
This is also the place where we could plug into a vector database or, really, perform any custom actions we want.
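As a minimal sketch of that idea (the names here are hypothetical and not part of agent.py), the “Observation:” branch could dispatch on which tool the model asked for, rather than always hitting Google:

# Hypothetical dispatcher - handle_action() and the "Lookup" action do not
# exist in agent.py; they just illustrate where other tools would slot in.
def handle_action(action, argument):
    if action == "Search":
        items = search(argument)["items"][:2]
        return "###\\".join(
            "Title: %s\\Snippet: %s\\" % (item["title"], item["snippet"])
            for item in items
        )
    elif action == "Lookup":
        # e.g. query a local vector database here instead of the web
        raise NotImplementedError
    raise ValueError("Unknown action: %s" % action)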
Conclusion
This experiment was a lot of fun. It was really easy to hook the model up to search and get some good results. The part that took the most work was getting Python input/output working without buffering (I used GPT-4 to help there :).