Large Language Models (LLMs) exhibit impressive problem-solving skills across many tasks and domains, but their ability to reason mathematically about problems with multiple unknowns has not been systematically studied.
To bridge this gap, we present BeyondX, an algebra problem benchmark containing problems with more than two unknown variables. It consists of 464 examples, expanded from the ALG514 and DRAW-1K datasets, which involve only one or two unknowns.
Our benchmark poses its own unique challenges:
Accuracy scores (%) on BeyondX.
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
BeyondX is a consolidated mathematical reasoning benchmark focused on problems with multiple unknowns. In total, BeyondX includes 464 examples automatically generated from two different source datasets (ALG514, DRAW-1K). All data examples are divided into three subsets: BeyondX_3 (194 examples), BeyondX_4 (158), and BeyondX_5 (112).
You can download the dataset on Hugging Face Datasets.
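For example, here is a minimal sketch of loading BeyondX with the Hugging Face datasets library; the repository ID below is a placeholder, so substitute the actual ID from the link above:

from datasets import load_dataset

# Placeholder repository ID; replace with the real BeyondX dataset ID
# from the Hugging Face link above.
beyondx = load_dataset("ORG_NAME/BeyondX")
print(beyondx)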
How to generate multiple-unknown algebra problems via prompting?
Full instructions on how to generate multiple-unknown algebra problems via prompting.
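As a rough illustration only (the prompt wording, model choice, and helper names here are our assumptions, not the exact instruction shown above), the expansion step could be scripted like this:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative expansion prompt (an assumption, not the paper's exact instruction).
EXPAND_PROMPT = (
    "Below is an algebra word problem with two unknowns.\n"
    "Rewrite it as a coherent, solvable problem with three unknowns by\n"
    "introducing one new quantity and one additional linear condition,\n"
    "then state the system of equations and the answer.\n\n"
    "Problem: {problem}"
)

def expand_problem(problem: str) -> str:
    """Ask an LLM to expand a two-unknown problem into a three-unknown one."""
    response = client.chat.completions.create(
        model="gpt-4",  # model choice is an assumption
        messages=[{"role": "user", "content": EXPAND_PROMPT.format(problem=problem)}],
    )
    return response.choices[0].message.content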
Summary of the different math datasets in BeyondX.
One example for each number of unknowns in BeyondX:
Two-unknown (source problem)
Three-unknown
Four-unknown
Five-unknown
Full instructions on how to solve multiple-unknown algebra problems via prompting.
We develop several prompting methods as our baselines; a minimal sketch of one such pipeline appears below.
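As a hedged sketch of the solving half of a formulate-and-solve style pipeline (have the LLM formulate a system of equations, then compute the answer with an external symbolic solver; the example system below is illustrative, and this is not necessarily the authors' exact implementation):

from sympy import Eq, solve, symbols

# Suppose the LLM formulated this three-unknown linear system from a problem.
x, y, z = symbols("x y z")
equations = [
    Eq(x + y + z, 6),
    Eq(2 * x + y - z, 1),
    Eq(x - y + 2 * z, 5),
]

# Solve symbolically instead of trusting the LLM's own arithmetic.
solution = solve(equations, [x, y, z])
print(solution)  # {x: 1, y: 2, z: 3}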
We evaluate the performance of each baseline and our method on general algebra datasets (such as ALG514, DRAW-1K, ASDiv, and HMWP) and BeyondX.
Performance of GPT-3.5 on general algebra datasets with one or two unknowns.
Performance of Gemini-Pro on general algebra datasets with one or two unknowns.
Performance of GPT-4 on general algebra datasets with one or two unknowns.
Performance of GPT-3.5 on BeyondX.
Performance of Gemini-Pro on BeyondX.
Performance of GPT-4 on BeyondX.
Performance of open-source models on BeyondX under Zero-shot-CoT.
Ablation study of Formulate-and-Solve on BeyondX.
Error analysis of Formulate-and-Solve on BeyondX using GPT-3.5.
Performance of GPT-3.5 on common arithmetic datasets.
Explore the outputs of each model on BeyondX
@inproceedings{kao2024beyondx,
  author    = {Kao, Kuei-Chun and Wang, Ruochen and Hsieh, Cho-Jui},
  title     = {Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns?},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024}
}