Abstract
Scientific peer review is under mounting strain as major machine learning venues face rapidly growing submission volumes, heavier reviewer workloads, and increasingly difficult paper-to-reviewer matching. At the same time, Large Language Models (LLMs) have moved from proofreading aids to automated reviewer agents capable of drafting full scientific critiques. This raises a central question: are LLMs sufficient reviewers for evaluating scientific work, especially when human reviewers themselves operate under severe time pressure?
We introduce PRISM (Peer Review Intelligence via Structured Multi-dimensional assessment), a benchmark for evaluating both LLM-generated and human reviews across four core duties: depth of analysis, novelty assessment, flaw identification and prioritization, and multi-dimensional constructiveness. Each duty is measured through a dedicated pipeline grounded in argument mining, retrieval-augmented verification, and consensus-based scoring. Across 1,000 papers from ICLR, ICML, and NeurIPS, PRISM shows that LLM reviewers can be strong specialists on individual duties, but no single system matches the balanced performance of human reviewers. LLM reviewers are therefore best used as deliberately targeted supplements under human supervision rather than as general-purpose replacements.
1. Introducing PRISM
Major ML venues now receive tens of thousands of submissions, which makes reviewer assignment, workload, and review quality increasingly difficult to manage. LLM reviewers offer scale, but common evaluation methods often rely on surface similarity metrics or broad LLM-as-a-judge scores. These approaches can blur together fluency, factuality, and scientific rigor.
PRISM asks a stricter question:
Does a review provide grounded analysis, calibrated novelty judgment, valid flaw detection, and actionable feedback?
To answer it, PRISM evaluates each manuscript-review pair through four independent and interpretable pipelines. Each pipeline extracts small review units, verifies them against the manuscript or prior literature, and computes metrics from those structured decisions instead of relying on a single opaque judge rating.
What PRISM Measures
| Dimension | Unit of Analysis | Judge Task | Metric Output |
|---|---|---|---|
| Depth of Analysis | Argumentative Discourse Units | Classify claims, premises, topics, and grounding level | Premise Ratio, Grounding Score, DoA |
| Novelty Assessment | Verbatim novelty claims | Retrieve prior work and verify literature support | Novelty Score, Support Rate, Strict Support Rate |
| Flaw Identification | Distinct flaw arguments | Verify flaws, merge consensus truth, classify severity | Critical Recall, Minor Recall, nCPS |
| Constructiveness | Atomic Review Comments | Score helpfulness across five dimensions | Mean Constructiveness Score |
PRISM uses constrained LLM judging for extraction and labeling. The final scores, however, are computed analytically from labels, retrieval evidence, consensus verification, and the structure of review comments.
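To make "computed analytically" concrete, the sketch below derives two scores from units that a judge has already labeled. Both formulas are assumptions made for illustration, not PRISM's published definitions: Premise Ratio is taken as the share of premises among all argumentative units, and the Mean Constructiveness Score as the mean over comments of each comment's five-dimension average. The function names and label strings are hypothetical.

```python
from statistics import mean

def premise_ratio(adu_labels):
    """Share of units labeled 'premise' among all labeled units (assumed definition)."""
    if not adu_labels:
        return 0.0
    return sum(1 for label in adu_labels if label == "premise") / len(adu_labels)

def mean_constructiveness(comment_scores):
    """Mean over comments of each comment's five-dimension average (assumed definition)."""
    if not comment_scores:
        return 0.0
    return mean(mean(scores) for scores in comment_scores)

# Toy judge output for one review; these values are not benchmark data.
labels = ["claim", "premise", "premise", "claim"]
scores = [
    [1.0, 0.5, 0.5, 0.0, 0.5],  # comment 1, five hypothetical helpfulness dimensions
    [1.0, 1.0, 0.5, 0.5, 1.0],  # comment 2
]

print(premise_ratio(labels))
print(round(mean_constructiveness(scores), 3))
```

The point of the design is that the judge only produces labels; once those are fixed, every score is a deterministic function that can be audited or recomputed without calling a model again.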
2. Methods
3. Results and Analysis
PRISM shows that LLM reviewers tend to specialize in different review responsibilities. No single system dominates all four dimensions.
Key Findings
| Finding | Evidence | Practical Reading |
|---|---|---|
| Humans remain the most balanced baseline | Human DoA = 0.494 | Humans are still needed for calibrated final judgment |
| DeepReview and CycleReviewer nearly match human depth | DoA = 0.483 and 0.484 | Structured reasoning helps LLMs produce substantiated critiques |
| SEA has the strongest novelty grounding | Novelty score = 0.833 vs. human = 0.787 | Retrieval-oriented pipelines help verify literature claims |
| Reviewer2 is the best flaw scanner | Critical Recall = 0.591 vs. human = 0.343 | LLMs can surface missed issues for human reviewers |
| DeepReview gives the most constructive feedback | MCS = 0.634 vs. human = 0.566 | System design matters for solution-oriented critique |
Headline Results
| Evaluation Dimension | Best Automated System | Human Baseline |
|---|---|---|
| Depth of Analysis | CycleReviewer: 0.484; DeepReview: 0.483 | 0.494 |
| Novelty Assessment | SEA: 0.833 | 0.787 |
| Critical Flaw Recall | Reviewer2: 0.591 | 0.343 |
| Minor Flaw Recall | Reviewer2: 0.459 | 0.281 |
| Prioritization | SEA: 0.977 | 0.973 |
| Constructiveness | DeepReview: 0.634 | 0.566 |
Key Takeaways by Dimension
Depth of Analysis: Humans lead (DoA = 0.494) through premise density and methodology focus. DeepReview (0.483) and CycleReviewer (0.484) nearly match that score by generating many grounded premises, while TreeReview falls into surface-level reviewing patterns.
Novelty Assessment: All systems fall within a narrow band of roughly 0.75 to 0.83 for evidence grounding. SEA achieves the highest score (0.833 vs. human 0.787), showing that retrieval-oriented pipelines help verify literature claims.
Flaw Detection: Reviewer2 is an exhaustive flaw scanner with Critical Recall of 0.591 (vs. human 0.343). LLMs can surface more candidate issues, but human oversight is needed to manage precision.
Constructiveness: DeepReview leads by a clear margin (MCS = 0.634 vs. human 0.566). Human reviewers are specific but often stop at diagnosis; DeepReview closes the loop from “this is a problem” to “here is how to fix it.”
Core Insight
Human and LLM reviewers are complementary. Humans excel at balanced judgment and calibration, while LLMs excel at exhaustive scanning and systematic verification. Their union covers more diagnostic ground than either group alone.
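The complementarity claim can be illustrated with plain set arithmetic. The sketch below uses the standard definition of recall against a reference set of verified flaws. The flaw IDs are toy data rather than benchmark results, and PRISM's actual consensus merging and verification steps are not reproduced here.

```python
def recall(found, reference):
    """Fraction of reference flaws that appear in the found set."""
    if not reference:
        return 0.0
    return len(found & reference) / len(reference)

# Hypothetical consensus set of verified flaws for one paper.
consensus = {"F1", "F2", "F3", "F4", "F5"}

human_found = {"F1", "F2"}
# F9 is not in the consensus set: an unverified candidate that a human must filter out.
llm_found = {"F2", "F3", "F4", "F9"}

combined = human_found | llm_found

print("human   ", recall(human_found, consensus))
print("llm     ", recall(llm_found, consensus))
print("combined", recall(combined, consensus))
```

In this toy case the combined set recovers four of the five reference flaws, more than either reviewer alone, while the unverified candidate shows why added recall still needs human screening for precision.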
4. Conclusion
No Single System Wins
| System | Strength | Weakness |
|---|---|---|
| Reviewer2 | Exhaustive flaw scanning (highest recall) | Limited solution provision |
| DeepReview | Constructive feedback (actionable, professional) | Slightly lower flaw recall |
| SEA | Novelty verification (highest literature support) | Lower constructiveness |
| CycleReviewer | Strong analytical depth | High hallucination rate |
| TreeReview | Limited comparative advantage | Surface-level trap (24% effort on formatting) |
Since no single system dominates all four dimensions, the evidence points toward targeted ensemble deployment:
- Use Reviewer2 for exhaustive flaw scanning, as it catches critical issues that human reviewers may miss.
- Use DeepReview for constructive feedback drafting, as it provides actionable and solution-oriented suggestions.
- Use SEA for novelty-grounding checks, as it verifies claims against prior literature effectively.
- Use human reviewers for final judgment, as they provide the most balanced and cognitively aligned assessment.
These systems are most effective as specialist assistants within a human-led pipeline rather than autonomous reviewers.
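One way to picture this deployment is as a routing table that sends each duty to its specialist and always ends with a human decision. The sketch below is a hypothetical illustration of the recommendation above, not part of PRISM. The duty keys and the `build_plan` helper are invented names, and no reviewer system is actually called.

```python
# Duty -> specialist, following the recommendations listed above.
ROUTING = {
    "flaw_scan": "Reviewer2",
    "constructive_feedback": "DeepReview",
    "novelty_check": "SEA",
}
FINAL_JUDGE = "human reviewer"

def build_plan(duties):
    """Return (duty, assignee) steps, always ending with human final judgment."""
    plan = []
    for duty in duties:
        if duty not in ROUTING:
            raise ValueError(f"no specialist configured for duty: {duty}")
        plan.append((duty, ROUTING[duty]))
    # The human step is appended unconditionally so it cannot be configured away.
    plan.append(("final_judgment", FINAL_JUDGE))
    return plan

plan = build_plan(["flaw_scan", "novelty_check"])
for duty, assignee in plan:
    print(f"{duty:<22} -> {assignee}")
```

Keeping the final step outside the routing table reflects the human-led framing: specialists can be swapped or removed, but the closing judgment stays with a person.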
Related Systems
PRISM belongs to a family of public benchmarks and reviewer-assistance projects that emphasize inspectable evaluation. The automated reviewers evaluated here are Reviewer2, SEA, DeepReview, TreeReview, and CycleReviewer.
BibTeX
@article{prism2026,
  title={PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers},
  author={Ngoc Phan and Toan Huynh and Tran Khanh Thanh and Duy A. Nguyen and Nguyen Pham Tuan Anh and Thanh Nguyen and Nitesh V. Chawla and Wray Buntine and Kok-Seng Wong and Khoa D Doan and Binh Nguyen},
  journal={arXiv preprint},
  year={2026}
}
