BeyondX

Evaluating Multi-Unknown Algebra Problem Reasoning

University of California, Los Angeles
Example of a Multi-Unknown Problem

Introduction

Large Language Models (LLMs) exhibit impressive problem-solving skills across many tasks and domains, but their ability to reason mathematically about problems with multiple unknowns has not been systematically studied.

To bridge this gap, we present BeyondX, an algebra problem benchmark containing problems with more than two unknown variables. It consists of 464 examples, expanded from the ALG514 and DRAW-1K datasets, which involve only one or two unknowns.

Our benchmark poses unique challenges:

  • Our benchmark requires advanced solving skills.
    Solving multi-unknown problems requires accurately formulating a system of equations, rather than simply obtaining answers through step-by-step procedures (see the sketch after this list).
  • Our benchmark requires advanced calculation.
    Solving multi-unknown equation systems involves intricate arithmetic that may not be easily achieved through in-context guidance alone.
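To make the distinction concrete, here is a minimal sketch of the formulate-then-solve workflow, assuming a hypothetical three-unknown word problem (not drawn from BeyondX): the problem is first translated into a system of equations, and the arithmetic is delegated to a symbolic solver such as SymPy.

```python
# Hypothetical three-unknown problem (illustrative, not a BeyondX example):
# "Three numbers sum to 34. The second number is twice the first,
#  and the third is 4 more than the second. Find the three numbers."
from sympy import symbols, Eq, solve

x, y, z = symbols("x y z")

# Formulation step: translate the problem into a system of equations.
equations = [
    Eq(x + y + z, 34),  # the three numbers sum to 34
    Eq(y, 2 * x),       # the second number is twice the first
    Eq(z, y + 4),       # the third is 4 more than the second
]

# Solving step: delegate the calculation to a symbolic solver.
solution = solve(equations, (x, y, z))
print(solution)  # {x: 6, y: 12, z: 16}
```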

Leaderboard on BeyondX

Accuracy scores (%) on BeyondX.

Subsets by number of unknowns: BeyondX_3 (three unknowns), BeyondX_4 (four unknowns), BeyondX_5 (five unknowns).

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

BeyondX Dataset

Overview

BeyondX is a consolidated mathematical reasoning benchmark for multi-unknown algebra problems. In total, BeyondX includes 464 examples automatically generated from two source datasets (ALG514, DRAW-1K). The examples are divided into three subsets: BeyondX_3 (194 examples), BeyondX_4 (158 examples), and BeyondX_5 (112 examples).
You can download the dataset on Hugging Face Datasets.
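Below is a minimal loading sketch using the Hugging Face `datasets` library; the dataset ID, split handling, and number of previewed examples are placeholders, so please check the linked Hugging Face page for the actual values.

```python
from datasets import load_dataset

# NOTE: "UCLA/BeyondX" is a hypothetical dataset ID used for illustration only;
# see the Hugging Face page linked above for the actual repository name.
beyondx = load_dataset("UCLA/BeyondX")

# Preview a few examples from the first available split.
split = list(beyondx.keys())[0]
for example in beyondx[split].select(range(3)):
    print(example)
```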

Examples

One example for each number of unknowns in BeyondX

Visualization

Formulate-and-Solve

Overview

Experiment Results

We develop several prompting methods as our baselines (a sketch of the zero-shot prompt construction follows this list):

  • Zero-shot-CoT: Append "Let's think step by step" to the prompt without any demonstration examples.
  • Plan-and-Solve: Append "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step" to the prompt without any demonstration examples.
  • Few-shot-CoT (Manual): Asks models to generate a natural language reasoning response, guided by manually written demonstration examples.
  • Few-shot-PoT (Manual): Asks models to generate a Python program, guided by manually written demonstration examples, and uses an external Python interpreter to execute the code.
  • Few-shot-EoT (Manual): Asks models to generate an equation-format response, guided by manually written demonstration examples, and uses an external symbolic solver to solve the equations.
  • Few-shot-Declarative (Manual): Asks models to generate a Peano-format response, guided by manually written demonstration examples, and uses an external symbolic solver to execute it.
  • Analogical (Auto): Automatically asks models to self-generate relevant examples and solving steps as demonstrations before proceeding to solve the problem.
  • Auto-Zero-shot-CoT (Auto): Automatically asks models to generate solving steps via Zero-shot-CoT as demonstrations before proceeding to solve the problem.
  • Formulate-and-Solve (Auto): Automatically asks models to generate solving steps, guided by human problem-solving instructions, as demonstrations before proceeding to solve the problem.
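As a concrete illustration of the zero-shot baselines above, here is a minimal sketch of how the two trigger phrases are appended to a problem without any demonstration examples; the `build_prompt` helper and the sample problem text are illustrative, not the exact prompt templates used in the paper.

```python
# Trigger phrases taken from the baseline descriptions above.
ZERO_SHOT_COT = "Let's think step by step"
PLAN_AND_SOLVE = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step"
)

def build_prompt(problem: str, trigger: str) -> str:
    # Zero-shot prompting: no demonstration examples are included.
    return f"Question: {problem}\nAnswer: {trigger}"

problem = "Three numbers sum to 34; the second is twice the first, ..."  # illustrative
print(build_prompt(problem, ZERO_SHOT_COT))
print(build_prompt(problem, PLAN_AND_SOLVE))
```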

We evaluate the performance of each baseline and our method on general algebra datasets (ALG514, DRAW-1K, ASDiv, and HMWP) and on BeyondX.

Results on General-Purpose Models

More Results

Explorer

Explore the outputs of each model on BeyondX

BibTeX

@inproceedings{kao2024beyondx,
  author    = {Kao, Kuei-Chun and Wang, Ruochen and Hsieh, Cho-Jui},
  title     = {Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns?},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024}
}