PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers

1 VinUniversity ; 2 University of Illinois, Urbana-Champaign ; 3 University of Notre Dame ; 4 Monash University
* Co-first Authors.
Co-corresponding authors. Correspondence to: khoa.dd@vinuni.edu.vn and binh.nt2@vinuni.edu.vn

Abstract

Scientific peer review is under mounting strain as major machine learning venues face rapidly growing submission volumes, heavier reviewer workloads, and increasingly difficult paper-to-reviewer matching. At the same time, Large Language Models (LLMs) have moved from proofreading aids to automated reviewer agents capable of drafting full scientific critiques. This raises a central question: are LLMs sufficient reviewers for evaluating scientific work, especially when human reviewers themselves operate under severe time pressure?

We introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmark for evaluating both LLM-generated and human reviews across four core duties: depth of analysis, novelty assessment, flaw identification and prioritization, and multi-dimensional constructiveness. Each duty is measured through a dedicated pipeline grounded in argument mining, retrieval-augmented verification, and consensus-based scoring. Across 1,000 papers from ICLR, ICML, and NeurIPS, PRISM shows that LLM reviewers can be strong task-matched specialists, but no single system matches the balanced performance of human reviewers. LLM reviewers are therefore best used as deliberate, human-assisted supplements rather than general-purpose replacements.

1. Introducing PRISM

Major ML venues now receive tens of thousands of submissions, which makes reviewer assignment, workload, and review quality increasingly difficult to manage. LLM reviewers offer scale, but common evaluation methods often rely on surface similarity metrics or broad LLM-as-a-judge scores. These approaches can blur together fluency, factuality, and scientific rigor.

PRISM asks a stricter question:

Does a review provide grounded analysis, calibrated novelty judgment, valid flaw detection, and actionable feedback?

To answer it, PRISM evaluates each manuscript-review pair through four independent and interpretable pipelines. Each pipeline extracts small review units, verifies them against the manuscript or prior literature, and computes metrics from those structured decisions instead of relying on a single opaque judge rating.

Figure: PRISM overview. Each review is decomposed into evidence units, novelty claims, flaw arguments, and atomic comments, then scored by modular evaluator pipelines.

What PRISM Measures

| Dimension | Unit of Analysis | Judge Task | Metric Output |
| --- | --- | --- | --- |
| Depth of Analysis | Argumentative Discourse Units | Classify claims, premises, topics, and grounding level | Premise Ratio, Grounding Score, DoA |
| Novelty Assessment | Verbatim novelty claims | Retrieve prior work and verify literature support | Novelty Score, Support Rate, Strict Support Rate |
| Flaw Identification | Distinct flaw arguments | Verify flaws, merge consensus truth, classify severity | Critical Recall, Minor Recall, nCPS |
| Constructiveness | Atomic Review Comments | Score helpfulness across five dimensions | Mean Constructiveness Score (MCS) |

PRISM uses constrained LLM judging for extraction and labeling. The final scores, however, are computed analytically from labels, retrieval evidence, consensus verification, and the structure of review comments.
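As a minimal sketch of what "computed analytically from labels" can look like, the snippet below derives two quantities from judge labels. The definitions are assumptions made for illustration, since PRISM's exact formulas are not given in this section: a premise ratio taken as the share of argumentative units labeled as premises, and a recall taken as the share of consensus flaws of a given severity that a review raised. The `Unit` schema, function names, and sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One argumentative discourse unit with a judge-assigned label (hypothetical schema)."""
    text: str
    label: str  # "claim" or "premise"


def premise_ratio(units: list[Unit]) -> float:
    """Assumed definition: number of premises divided by all labeled units."""
    if not units:
        return 0.0
    return sum(u.label == "premise" for u in units) / len(units)


def flaw_recall(found: set[str], consensus: dict[str, str], severity: str) -> float:
    """Assumed definition: share of consensus flaws of one severity that the review raised."""
    targets = {flaw_id for flaw_id, sev in consensus.items() if sev == severity}
    if not targets:
        return 0.0
    return len(found & targets) / len(targets)


# Illustrative inputs only; these are not benchmark data.
units = [
    Unit("The ablation is incomplete.", "claim"),
    Unit("Table 3 omits the no-retrieval variant.", "premise"),
    Unit("The baseline comparison is weak.", "claim"),
    Unit("Only one baseline from 2021 is reported.", "premise"),
]
consensus = {"f1": "critical", "f2": "critical", "f3": "minor"}
found = {"f1", "f3"}

print(premise_ratio(units))                       # 0.5
print(flaw_recall(found, consensus, "critical"))  # 0.5
print(flaw_recall(found, consensus, "minor"))     # 1.0
```

The point of the design is that the LLM judge only produces labels; the score itself is plain counting that anyone can recompute and audit.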

2. Methods

3. Results and Analysis

PRISM shows that LLM reviewers tend to specialize in different review responsibilities. No single system dominates all four dimensions.

Key Findings

| Finding | Evidence | Practical Reading |
| --- | --- | --- |
| Humans remain the most balanced baseline | Human DoA = 0.494 | Humans are still needed for calibrated final judgment |
| DeepReview and CycleReviewer nearly match human depth | DoA = 0.483 and 0.484 | Structured reasoning helps LLMs produce substantiated critiques |
| SEA has the strongest novelty grounding | Novelty score = 0.833 vs. human = 0.787 | Retrieval-oriented pipelines help verify literature claims |
| Reviewer2 is the best flaw scanner | Critical Recall = 0.591 vs. human = 0.343 | LLMs can surface missed issues for human reviewers |
| DeepReview gives the most constructive feedback | MCS = 0.634 vs. human = 0.566 | System design matters for solution-oriented critique |

Headline Results

| Evaluation Dimension | Best Automated System | Human Baseline |
| --- | --- | --- |
| Depth of Analysis | CycleReviewer: 0.484; DeepReview: 0.483 | 0.494 |
| Novelty Assessment | SEA: 0.833 | 0.787 |
| Critical Flaw Recall | Reviewer2: 0.591 | 0.343 |
| Minor Flaw Recall | Reviewer2: 0.459 | 0.281 |
| Prioritization | SEA: 0.977 | 0.973 |
| Constructiveness | DeepReview: 0.634 | 0.566 |
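To make the size of each gap easier to read, the following sketch computes the absolute and relative difference between the best automated score and the human baseline. The numbers are copied from the Headline Results table; the `gaps` helper and the choice of the human score as the denominator are illustrative assumptions, not part of PRISM.

```python
# (best automated score, human baseline), copied from the Headline Results table.
headline = {
    "Depth of Analysis": (0.484, 0.494),
    "Novelty Assessment": (0.833, 0.787),
    "Critical Flaw Recall": (0.591, 0.343),
    "Minor Flaw Recall": (0.459, 0.281),
    "Prioritization": (0.977, 0.973),
    "Constructiveness": (0.634, 0.566),
}


def gaps(table: dict[str, tuple[float, float]]) -> dict[str, tuple[float, float]]:
    """Return {dimension: (absolute gap, gap relative to human)}, rounded to 3 decimals."""
    return {
        name: (round(best - human, 3), round((best - human) / human, 3))
        for name, (best, human) in table.items()
    }


for name, (abs_gap, rel_gap) in gaps(headline).items():
    print(f"{name:22s} {abs_gap:+.3f} ({rel_gap:+.1%})")
```

Read this way, critical flaw recall shows the widest margin (roughly 72% above the human baseline), while depth of analysis is the only dimension in the table where the best automated system trails humans.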

Key Takeaways by Dimension

Depth of Analysis: Humans lead (DoA = 0.494) through premise density and a focus on methodology. DeepReview and CycleReviewer nearly match humans (0.483 and 0.484) by generating many grounded premises, while TreeReview falls into surface-level reviewing patterns.

Novelty Assessment: All systems score roughly between 0.75 and 0.83 on evidence grounding. SEA achieves the highest score (0.833, vs. human 0.787), showing that retrieval-oriented pipelines help verify literature claims.

Flaw Detection: Reviewer2 is an exhaustive flaw scanner with Critical Recall of 0.591 (vs. human 0.343). LLMs can surface more candidate issues, but human oversight is needed to manage precision.

Constructiveness: DeepReview achieves the highest score of any reviewer, human or automated (MCS = 0.634 vs. human 0.566). Human reviewers are specific but often stop at diagnosis; DeepReview closes the loop from “this is a problem” to “here is how to fix it.”

Core Insight

Human and LLM reviewers are complementary. Humans excel at balanced judgment and calibration, while LLMs excel at exhaustive scanning and systematic verification. Their union covers more diagnostic ground than either group alone.
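The claim about the union can be illustrated with simple set arithmetic. The flaw identifiers below are hypothetical and chosen only to show the mechanics; they are not drawn from the benchmark, and the `coverage` helper is an assumption for illustration.

```python
# Hypothetical flaw identifiers for a single paper; not benchmark data.
verified_flaws = {
    "missing-ablation", "unclear-notation", "overclaimed-novelty",
    "data-leakage", "no-significance-test", "weak-baseline",
}
human_found = {"missing-ablation", "unclear-notation", "overclaimed-novelty"}
llm_found = {"missing-ablation", "unclear-notation", "data-leakage", "no-significance-test"}


def coverage(found: set[str], reference: set[str]) -> float:
    """Share of the reference flaws that appear in `found`."""
    return len(found & reference) / len(reference)


print(f"human only : {coverage(human_found, verified_flaws):.3f}")
print(f"LLM only   : {coverage(llm_found, verified_flaws):.3f}")
print(f"union      : {coverage(human_found | llm_found, verified_flaws):.3f}")

# Flaws that only one group caught are what make the two complementary.
print("LLM-only finds  :", sorted(llm_found - human_found))
print("human-only finds:", sorted(human_found - llm_found))
```

In this toy case the union covers five of six verified flaws, more than either reviewer alone, because each side catches issues the other misses.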

4. Conclusion

No Single System Wins

| System | Strength | Weakness |
| --- | --- | --- |
| Reviewer2 | Exhaustive flaw scanning (highest recall) | Limited solution provision |
| DeepReview | Constructive feedback (actionable, professional) | Slightly lower flaw recall |
| SEA | Novelty verification (highest literature support) | Lower constructiveness |
| CycleReviewer | Strong analytical depth | High hallucination rate |
| TreeReview | Limited comparative advantage | Surface-level trap (24% effort on formatting) |

Since no single system dominates all four dimensions, the evidence points toward targeted ensemble deployment: these systems are most effective as specialist assistants within a human-led pipeline rather than as autonomous reviewers.
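One way to picture such a deployment is a routing table that sends each review duty to the system whose listed strength matches it, with the final decision reserved for a human. This is only a sketch under stated assumptions: the duty-to-system mapping follows the strengths table above, while the function name, field names, and status values are hypothetical and not part of PRISM.

```python
# Duty-to-specialist routing, following the strengths listed in the table above.
SPECIALISTS = {
    "flaw_identification": "Reviewer2",
    "constructiveness": "DeepReview",
    "novelty_assessment": "SEA",
    "depth_of_analysis": "CycleReviewer",
}


def build_review_plan(paper_id: str) -> dict:
    """Assign each duty to its specialist and leave the verdict to a human reviewer."""
    return {
        "paper_id": paper_id,
        "assistants": [
            # Every automated output is a draft that a human checks before use.
            {"duty": duty, "system": system, "status": "draft_for_human_check"}
            for duty, system in SPECIALISTS.items()
        ],
        "final_decision_by": "human_reviewer",
    }


plan = build_review_plan("paper-0001")
for item in plan["assistants"]:
    print(f'{item["duty"]:20s} -> {item["system"]}')
print("final decision by:", plan["final_decision_by"])
```

Keeping the human check explicit in the plan reflects the weaknesses column: a specialist that is strong on one duty can still be unreliable elsewhere.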

PRISM belongs to a growing family of public benchmarks and reviewer-assistance projects that emphasize inspectable evaluation. Related systems include Reviewer2, SEA, DeepReview, TreeReview, and CycleReviewer.

BibTeX

@article{prism2026,
  title={PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers},
  author={Ngoc Phan and Toan Huynh and Tran Khanh Thanh and Duy A. Nguyen and Nguyen Pham Tuan Anh and Thanh Nguyen and Nitesh V. Chawla and Wray Buntine and Kok-Seng Wong and Khoa D Doan and Binh Nguyen},
  journal={arXiv preprint},
  year={2026}
}