Product of Experts with LLMs:

Boosting Performance on ARC is a Matter of Perspective

Daniel Franzen 1* Jan Disselhoff 1* David Hartmann 2*

1 JGU Mainz 2 Lambda, Inc. * “The ARChitects” Kaggle team members

TL;DR

We boost performance on the ARC reasoning benchmark by using the same LLM in two ways: (1) generating diverse candidate solutions with a depth-first search and (2) combining the scores the same model assigns to these solutions under different perspectives, following a “Product of Experts” (Hinton, 1999) approach. Our method solves 71.6% of tasks at very low cost (~2 cents per task).

High-Level Overview of our Approach: We use a single LLM to generate multiple candidate solutions, then use multiple perspectives on each task to reduce the uncertainty in choosing which solution is best.

Introduction

The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) represents a particularly challenging benchmark in AI research. Unlike traditional ML challenges that reward crystallized knowledge, ARC-AGI focuses on fluid intelligence: the ability to reason about novel problems without specific prior training. Introduced by François Chollet in 2019, the benchmark explicitly measures general reasoning capabilities, making it a critical test for progress toward AGI.

Why ARC-AGI Represents a Key AGI Benchmark

ARC-AGI tests fluid intelligence rather than task-specific skills, making it a necessary benchmark for AGI evaluation.

The “AGI” designation indicates that systems unable to solve these tasks clearly fall short of general intelligence, rather than implying that success demonstrates full AGI capabilities.

Method

Our approach comprises several key components:

  1. Task-Specific Data Augmentations: We apply data augmentations tailored to the unique characteristics of ARC tasks during training, generation, and scoring phases to enhance model robustness (a minimal sketch follows this list).
  2. Depth-First Search Algorithm: A depth-first search (DFS) algorithm is employed on LLM predictions to generate a diverse set of high-probability candidate solutions.
  3. LLM as Generator and Scorer: The same LLM is utilized both to generate candidate solutions and to score them, using output probabilities to select the most promising solutions.
  4. Product of Experts (PoE): We implement a PoE approach to combine multiple model outputs, refining the selection of candidate solutions.
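To make the augmentation component concrete, here is a minimal sketch of the kind of invertible grid transformations described above (rotation, transposition, color substitution). The NumPy-based grid representation and the helper names are illustrative assumptions, not the exact code used in our pipeline.

```python
import numpy as np

def augment_grid(grid: np.ndarray, rot: int, transpose: bool,
                 color_map: np.ndarray) -> np.ndarray:
    """Apply one invertible ARC-style augmentation to a grid of color ids.

    rot       -- number of counter-clockwise 90-degree rotations (0-3)
    transpose -- whether to transpose the grid
    color_map -- a permutation of the color ids 0..9
    """
    g = np.rot90(grid, k=rot)
    if transpose:
        g = g.T
    return color_map[g]  # substitute every cell's color id

# Example: one random augmentation of a small grid (illustrative only).
rng = np.random.default_rng(0)
grid = np.array([[0, 1, 2],
                 [3, 4, 5]])
aug = augment_grid(grid, rot=1, transpose=False,
                   color_map=rng.permutation(10))
print(aug)
```

Because every transformation is invertible, a prediction made in an augmented frame can be mapped back to the original task before scoring.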

Key Insights

Large language models (LLMs) demonstrate surprisingly strong capabilities in solving ARC tasks end-to-end, effectively uncovering complex underlying patterns. Critically, we found that these models excel more at evaluating potential solutions than at generating them directly. Leveraging this insight, we employed a depth-first search to efficiently explore possible solutions and then used the same LLM to reliably identify the most promising candidate.
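The following is a minimal, self-contained sketch of such a probability-guided depth-first search: it expands token continuations one at a time and keeps every completion whose cumulative probability stays above a cutoff. The toy `step` distribution, the cutoff value, and the termination rules are illustrative assumptions, not the actual model interface.

```python
import math

# Toy stand-in for an LLM: returns a next-token distribution given a prefix.
# In the real system this would come from the model's logits (our assumption).
def step(prefix):
    return {0: 0.6, 1: 0.3, 2: 0.1}  # fixed 3-token vocabulary

EOS, MAX_LEN, MIN_LOGP = 2, 4, math.log(0.02)

def dfs(prefix=(), logp=0.0, found=None):
    """Collect all completions whose total probability exceeds exp(MIN_LOGP)."""
    found = [] if found is None else found
    for tok, p in step(prefix).items():
        lp = logp + math.log(p)
        if lp < MIN_LOGP:          # prune: this branch can only get worse
            continue
        if tok == EOS or len(prefix) + 1 == MAX_LEN:
            found.append((prefix + (tok,), lp))
        else:
            dfs(prefix + (tok,), lp, found)
    return found

candidates = dfs()
print(f"{len(candidates)} candidates above the probability cutoff")
```

Because the cumulative log-probability can only decrease as a sequence grows, a branch can be pruned as soon as it falls below the cutoff, which is what makes this search efficient compared to repeated stochastic sampling.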

Specifically, our key findings include:

  1. Enhanced Performance through Data Variation
    • Applying transformations such as rotation, grid transposition, and color substitution significantly boosts LLM performance.
    • Providing multiple perspectives on a task helps overcome inherent limitations, such as those imposed by the autoregressive generation order of LLMs.
  2. Efficient Solution Candidate Generation
    • Standard greedy and stochastic sampling methods were unreliable in generating the correct solution, even with adjusted parameters.
    • Employing a depth-first search guided by the LLM’s evaluation of potential paths greatly improved the chance of finding the correct solution.
  3. Effective Solution Selection
    • Ensemble scoring across diverse augmentations yields more consistent and reliable evaluations.
    • Our “Product-of-Experts” method for combining the scores of different augmentations emerged as the most effective of the aggregation methods we tested for selecting the correct solution (a minimal sketch follows this list).
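As a minimal sketch of this product-of-probabilities aggregation: each augmented view of the task acts as one “expert” that assigns a log-probability to every candidate, and candidates are ranked by the sum of those log-probabilities, i.e. the product of the experts’ probabilities. The scores below are made-up numbers for illustration.

```python
import numpy as np

# log p_a(candidate | task) for 3 augmented views ("experts") x 4 candidates.
# In the real system these come from the LLM's output probabilities.
logp = np.array([[-1.2, -4.0, -2.5, -3.1],   # identity view
                 [-1.0, -3.5, -4.2, -2.9],   # rotated view
                 [-0.8, -5.1, -3.0, -3.3]])  # transposed view

poe_scores = logp.sum(axis=0)       # sum of logs == log of the product
best = int(np.argmax(poe_scores))
print("PoE ranking:", np.argsort(-poe_scores), "-> best candidate:", best)
```

A candidate that scores well under every perspective wins; a candidate that only looks plausible from one view is penalized by the product, which is what makes the aggregation robust.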

Key Figures

The two key plots of our work illustrate the impact of each contribution in isolation.

Performance impact of DFS-based sampling: Number of solutions found by various sampling algorithms as a function of runtime. The different values for each sampling variant are calculated using 1 (identity), 2 (reflections), 4 (rotations), 8 (reflections+rotations) and 16 augmentations. Additionally, colors and the order of examples are randomly permuted in each augmented version of a task. For almost any runtime budget, we find that a DFS variant discovers the most solutions.
Performance impact of Product of Experts selection: Accuracy and coverage of different selection methods as a function of the confidence threshold T. The solid black line shows the proportion of tasks where the correct solution is among the generated candidates. The solid colored lines show what percentage of the tasks would be solved using different aggregation methods (top-2 accuracy), while the dotted lines show how this percentage relates to the black line. Our product of probabilities approach performs best among the tested aggregation methods.
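To illustrate the role of the confidence threshold T, a small sketch: normalize the PoE scores over the candidate set and submit the top-2 candidates only when the best candidate holds a sufficient share of the probability mass, abstaining otherwise. This particular confidence definition and the value of T are illustrative assumptions, not necessarily the exact quantity plotted above.

```python
import numpy as np

def select_top2(poe_scores: np.ndarray, T: float):
    """Return the top-2 candidates, or None to abstain if confidence < T.

    Confidence here is the best candidate's share of the softmax-normalized
    PoE scores -- our illustrative definition for this sketch.
    """
    probs = np.exp(poe_scores - poe_scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    if probs[order[0]] < T:
        return None                  # not confident enough: abstain
    return order[:2].tolist()

print(select_top2(np.array([-3.0, -12.6, -9.7, -9.3]), T=0.5))
```

Raising T trades coverage for accuracy: fewer tasks get an answer, but the answers that are submitted are more often correct.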

Comparative Analysis

Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. Notably, this performance is achieved with a remarkably low inference cost, averaging only around 2 cents per task on readily available hardware.

To contextualize our results, we compare our method with other approaches:

Method        | Score (%) | Open-Source Weights | Inference Cost per Task
Our Method    | 71.6      | yes                 | ~$0.02
Average Human | 60.2      | n/a                 | n/a
OpenAI’s o3   | 82.8      | no                  | ~$17

Table 1: Comparative performance on ARC-AGI evaluation set.

Conclusion

By integrating task-specific data augmentations with a depth-first search algorithm, and by leveraging LLMs as both generators and scorers, our method significantly enhances performance on the ARC-AGI benchmark. This approach underscores the potential of combining strategic data processing with powerful language models to tackle complex abstract reasoning tasks.

Acknowledgements

We sincerely thank Lambda for providing 8xH100 GPUs that enabled rapid iteration of our ideas. This work was supported by the Carl-Zeiss-Stiftung through the “Research Center for Algorithmic Intelligence as an Emergent Phenomenon” and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project 233630050 (TRR 146).

BibTeX

If you would like to cite our work, feel free to use the BibTeX entry below.

@inproceedings{poeforllms2025arc,
  author    = {Daniel Franzen and Jan Disselhoff and David Hartmann},
  title     = {Product of Experts with LLMs: Boosting Performance on ARC is a Matter of Perspective},
  booktitle = {Forty-second International Conference on Machine Learning, {ICML} 2025, Vancouver, Canada},
  publisher = {OpenReview.net},
  year      = {2025},
  url       = {https://openreview.net/forum?id=dsBjxI6l8W},
}