InquiTree Study: AI Science Agents Degrade Over Time

daily-hour-news · Jun 11, 2026

TL;DR

A new arXiv benchmark, InquiTree, tests LLM agents as iterative scientists and finds that their judgment erodes over long inquiries. Agents also perform worse on papers published after their training cutoff, which suggests that part of their apparent skill is memorization.

Key Points

1. Models scientific inquiry as 'Research Trees' of hypothesis, design, and belief updates
2. Built on a 30-paper pool, with an open 18-paper subset (IT-18) released
3. Agents show 'cognitive tunneling,' losing anomaly detection over long runs
4. Performance drops on post-cutoff papers, separating reasoning from recall
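
To make the 'Research Tree' idea in the first key point concrete, here is a minimal sketch of how such a tree could be represented. Everything in it is an assumption for illustration: the class `ResearchNode`, its fields, the helper functions, and the example hypotheses are hypothetical and are not taken from the InquiTree benchmark or its data format.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ResearchNode:
    """One inquiry step. Hypothetical schema, not the benchmark's format."""
    hypothesis: str
    design: str
    belief_before: float  # assumed: confidence before running the experiment
    belief_after: float   # assumed: confidence after seeing the result
    children: List["ResearchNode"] = field(default_factory=list)

    def add_child(self, child: "ResearchNode") -> "ResearchNode":
        """Attach a follow-up step and return it so calls can be chained."""
        self.children.append(child)
        return child


def walk(node: ResearchNode, depth: int = 0) -> Iterator[Tuple[int, ResearchNode]]:
    """Yield (depth, node) pairs in depth-first order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def max_depth(root: ResearchNode) -> int:
    """Length of the longest chain of follow-up steps below the root."""
    return max(depth for depth, _ in walk(root))


def belief_shift(node: ResearchNode) -> float:
    """How far one step moved the agent's confidence."""
    return node.belief_after - node.belief_before


# Made-up example inquiry with three steps.
root = ResearchNode("Compound lowers enzyme activity", "single-dose assay", 0.5, 0.7)
follow_up = root.add_child(
    ResearchNode("Effect scales with dose", "vary concentration", 0.7, 0.4)
)
follow_up.add_child(
    ResearchNode("Earlier result was a batch artifact", "repeat with new batch", 0.4, 0.6)
)

for depth, node in walk(root):
    print("  " * depth + f"{node.hypothesis} (shift {belief_shift(node):+.1f})")
```

A structure like this would let an evaluator look at judgment at each depth of an inquiry rather than only at the final answer, which is the kind of long-run behavior the study reports on.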

Why It Matters

If agent reasoning decays over long tasks and leans on memorized data, 'AI scientist' claims need far stronger evaluation than one-shot benchmarks.

Quick Facts

InquiTree, AI agents, benchmarks, scientific reasoning, LLM evaluation, arXiv
