Large Language Models (LLMs) exhibit impressive problem-solving skills across many tasks and domains, but their ability to reason mathematically about problems with multiple unknowns has not been systematically studied.
To bridge this gap, we present BeyondX, an algebra problem benchmark containing problems with more than two unknown variables. It consists of 464 examples, expanded from the ALG514 and DRAW-1K datasets, which involve only one or two unknowns.
Our benchmark poses its own unique challenges:
Accuracy scores (%) on BeyondX.
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.
BeyondX is a consolidated mathematical reasoning benchmark focused on problems with multiple unknowns. In total, BeyondX includes 464 examples automatically generated from two different source datasets (ALG514, DRAW-1K). All data examples are divided into three subsets: BeyondX_3 (194 examples), BeyondX_4 (158), and BeyondX_5 (112).
You can download the dataset on Hugging Face Datasets.
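For example, here is a minimal sketch of loading BeyondX with the Hugging Face datasets library; the repository ID below is a placeholder, so substitute the actual ID from the link above:

from datasets import load_dataset

# Placeholder repository ID; replace with the real BeyondX dataset ID
# from the Hugging Face link above.
beyondx = load_dataset("ORG_NAME/BeyondX")
print(beyondx)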
How to generate multiple-unknown algebra problems via prompting?
Full instructions on how to generate multiple-unknown algebra problems via prompting.
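As a rough illustration only (the prompt wording, model choice, and helper names here are our assumptions, not the exact instruction shown above), the expansion step could be scripted like this:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative expansion prompt (an assumption, not the paper's exact instruction).
EXPAND_PROMPT = (
    "Below is an algebra word problem with two unknowns.\n"
    "Rewrite it as a coherent, solvable problem with three unknowns by\n"
    "introducing one new quantity and one additional linear condition,\n"
    "then state the system of equations and the answer.\n\n"
    "Problem: {problem}"
)

def expand_problem(problem: str) -> str:
    """Ask an LLM to expand a two-unknown problem into a three-unknown one."""
    response = client.chat.completions.create(
        model="gpt-4",  # model choice is an assumption
        messages=[{"role": "user", "content": EXPAND_PROMPT.format(problem=problem)}],
    )
    return response.choices[0].message.content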
Summary of the different math datasets in BeyondX.
One example for each number of unknowns in BeyondX:
Two-unknown (source problem)
Three-unknown
Four-unknown
Five-unknown
Full instructions on how to solve multiple-unknown algebra problems via prompting.
We develop several prompting methods as our baselines; a minimal sketch of one such pipeline appears below.
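As a hedged sketch of the solving half of a formulate-and-solve style pipeline (have the LLM formulate a system of equations, then compute the answer with an external symbolic solver; the example system below is illustrative, and this is not necessarily the authors' exact implementation):

from sympy import Eq, solve, symbols

# Suppose the LLM formulated this three-unknown linear system from a problem.
x, y, z = symbols("x y z")
equations = [
    Eq(x + y + z, 6),
    Eq(2 * x + y - z, 1),
    Eq(x - y + 2 * z, 5),
]

# Solve symbolically instead of trusting the LLM's own arithmetic.
solution = solve(equations, [x, y, z])
print(solution)  # {x: 1, y: 2, z: 3}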
We evaluate the performance of each baseline and our method on general algebra datasets (such as ALG514, DRAW-1K, ASDiv, and HMWP) and BeyondX.
Performance of GPT-3.5 on general algebra datasets with one or two unknowns.
Performance of Gemini-Pro on general algebra datasets with one or two unknowns.
Performance of GPT-4 on general algebra datasets with one or two unknowns.
Performance of GPT-3.5 on BeyondX.
Performance of Gemini-Pro on BeyondX.
Performance of GPT-4 on BeyondX.
Performance of open-source models on BeyondX under Zero-shot-CoT.
Ablation study of Formulate-and-Solve on BeyondX.
Error analysis of Formulate-and-Solve on BeyondX using GPT-3.5.
Performance of GPT-3.5 on common arithmetic datasets.
Explore the outputs of each model on BeyondX
@inproceedings{kao2024beyondx,
  author    = {Kao, Kuei-Chun and Wang, Ruochen and Hsieh, Cho-Jui},
  title     = {Solving for X and Beyond: Can Large Language Models Solve Complex Math Problems with More-Than-Two Unknowns?},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024}
}