Jan 25, 2023

Prompt LLMs for code, not arithmetic

It is now well known that arithmetic is a weak spot for LLMs. If you directly prompt an LLM with a calculation, or even a simple arithmetic word problem, it will often get the answer wrong.

But this paper argues that trying to get an LLM to do calculations misunderstands its strengths. LLMs are meant for understanding language, not calculation. The key idea is to separate the problem into two parts: understanding, and computation (or solving). Let the LLM do what it is good at, which is understanding, and then offload the computation to an actual programming language.
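As a rough illustration of that split, here is a minimal sketch. It assumes the model returns source code defining a `solution()` function, as in the prompts later in this post. The names `solve_with_code` and `fake_model` are hypothetical and not from the paper; `fake_model` is a stand-in for a real LLM call and just returns a canned completion.

```python
from typing import Callable


def solve_with_code(question: str, generate_code: Callable[[str], str]):
    """Ask the model for a Python solution, then let the interpreter compute it."""
    # Step 1 (understanding): the model turns the question into source code.
    prompt = f"Question: {question}\n\nSolution in Python:\n"
    source = generate_code(prompt)

    # Step 2 (computation): run the code in a fresh namespace and call solution().
    # Caution: exec runs arbitrary code, so sandbox it or only run output you trust.
    namespace: dict = {}
    exec(source, namespace)
    return namespace["solution"]()


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns a canned completion."""
    return (
        "def solution():\n"
        "    tennis_balls = 5\n"
        "    bought_balls = 2 * 3\n"
        "    return tennis_balls + bought_balls\n"
    )


result = solve_with_code(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    fake_model,
)
print(result)  # 11
```

The model never does the addition here; it only has to describe the problem in code, and Python does the rest.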

Check out the quick video summary I posted last week, below. Also see the authors’ GitHub for the full source prompts they use.

Let’s see a few examples of how that works.

But first, an example of where LLMs fail. (All examples in this post come from the text-davinci model. The LLM’s answers are demarcated with 🤖💬 and 🤖🛑.)

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis
balls. Each can has 3 tennis balls. How many tennis balls does he have
now?

Answer: Roger started with 5 tennis balls. 2 cans of 3 tennis balls
each is 6 tennis balls. 5 + 6. The answer is 11.

Question: The bakers at the Beverly Hills Bakery baked 200 loaves of
bread on Monday morning. They sold 93 loaves in the morning and 39
loaves in the afternoon. A grocery store returned 6 unsold loaves. How
many loaves of bread did they have left?

Answer:
🤖💬
The bakers started with 200 loaves of bread. They sold 93 loaves in
the morning and 39 loaves in the afternoon. 6 loaves were
returned. 200 - (93 + 39 + 6) = 62 loaves of bread left.
🤖🛑

Instead, we prompt it not to compute the answer itself, but to write Python code that computes the answer.

Question: Roger has 5 tennis balls. He buys 2 more cans of tennis
balls. Each can has 3 tennis balls. How many tennis balls does he have
now?

Solution in Python:
def solution():
  """Roger has 5 tennis balls. He buys 2 more cans of tennis
balls. Each can has 3 tennis balls. How many tennis balls does he have
now?"""
  # Roger started with 5 tennis balls.
  tennis_balls = 5
  # 2 cans of 3 tennis balls each is
  bought_balls = 2 * 3
  # tennis balls. The answer is
  answer = tennis_balls + bought_balls
  return answer

Q: The bakers at the Beverly Hills Bakery baked 200 loaves of bread on
Monday morning. They sold 93 loaves in the morning and 39 loaves in
the afternoon. A grocery store returned 6 unsold loaves. How many
loaves of bread did they have left?

Solution in Python:
🤖💬
def solution():
  """The bakers at the Beverly Hills Bakery baked 200 loaves of bread
on Monday morning. They sold 93 loaves in the morning and 39 loaves in
the afternoon. A grocery store returned 6 unsold loaves. How many
loaves of bread did they have left?"""
  # The bakers started with 200 loaves of bread.
  bread = 200
  # They sold 93 loaves in the morning and 39 loaves in the afternoon.
  sold_bread = 93 + 39
  # A grocery store returned 6 unsold loaves.
  returned_bread = 6
  # The answer is
  answer = bread - sold_bread + returned_bread
  return answer
🤖🛑

You can check that running the Python code gives the correct answer (74).
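To make that check concrete, here is the generated function again (docstring and comments trimmed), next to the expression the model used when asked to answer directly. The variable `direct_answer` is my own label for illustration.

```python
def solution():
    """Bakery problem, as generated by the model in the code-style prompt."""
    bread = 200
    sold_bread = 93 + 39
    returned_bread = 6
    # Returned loaves are added back to the stock.
    answer = bread - sold_bread + returned_bread
    return answer


# The expression from the direct (non-code) answer: the 6 returned loaves
# were subtracted along with the loaves sold.
direct_answer = 200 - (93 + 39 + 6)

print(solution())     # 74
print(direct_answer)  # 62
```

Writing the steps out as named variables makes it easy to see where the direct answer went wrong: it treated the returned loaves as one more sale.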

You can do the same thing with logical problems, not just arithmetic:

Question: say "java" twice and "data" once, and then repeat all of
this three times.

Answer: java java data java java data java java data


Question: Repeat the word duck four times, but halfway through also
say quack

Answer:
🤖💬
duck quack duck duck quack duck
🤖🛑

OK, it failed that one (it says quack twice, and neither one lands at the halfway point). Let’s try the same with code-style prompts.

Question: say java twice and data once, and then repeat all of this
three times.

# solution using Python:

def solution():
    """Question: say java twice and data once, and then repeat all of
    this three times."""
    result = []
    tmp = ["java", "java", "data"]
    for i in range(3):
        result.extend(tmp)
    return " ".join(result)


Question: Repeat the word duck four times, but halfway through also
say quack


# solution using Python:
🤖💬
def solution():
    """Question: Repeat the word duck four times, but halfway through
    also say quack"""
    result = []
    for i in range(4):
        if i == 2:
            result.append("quack")
        result.append("duck")
    return " ".join(result)
🤖🛑
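Running the generated function confirms it does what the question asked. The checks at the end are my own additions for illustration, not part of the model’s output.

```python
def solution():
    """Question: Repeat the word duck four times, but halfway through
    also say quack"""
    result = []
    for i in range(4):
        if i == 2:
            result.append("quack")
        result.append("duck")
    return " ".join(result)


generated = solution()
print(generated)  # duck duck quack duck duck

# Sanity checks: four ducks, with the single quack after the second duck.
words = generated.split()
print(words.count("duck"))   # 4
print(words.index("quack"))  # 2
```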

tl;dr: use LLMs for what they were meant to do, which is to understand and predict natural language, and then leverage that for computation.