🔬 InquiTree Study: AI Science Agents Degrade Over Time
TL;DR
A new arXiv benchmark, InquiTree, tests LLM agents as iterative scientists and finds that their judgment erodes over long inquiries. Agents also perform worse on papers published after their training cutoff, suggesting that part of their apparent skill is memorization.
Key Points
Models scientific inquiry as 'Research Trees' of hypothesis, design, and belief updates
Built on a 30-paper pool, with an open 18-paper subset (IT-18) released
Agents show 'cognitive tunneling,' losing anomaly detection over long runs
Performance drops on post-cutoff papers, separating reasoning from recall
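To make the 'Research Tree' idea concrete, here is a minimal Python sketch of how one inquiry might be stored as a tree whose nodes carry a hypothesis, an experiment design, and an updated belief. The class name `InquiryNode`, its fields, the `depth` helper, and the example hypotheses are all hypothetical illustrations; the benchmark's actual schema is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InquiryNode:
    """One step of an inquiry (hypothetical schema, not InquiTree's own)."""
    hypothesis: str   # what the agent currently proposes
    design: str       # experiment planned to test the hypothesis
    belief: float     # assumed 0..1 confidence after seeing the result
    children: List["InquiryNode"] = field(default_factory=list)

    def add_child(self, child: "InquiryNode") -> "InquiryNode":
        """Attach a follow-up step and return it so calls can be chained."""
        self.children.append(child)
        return child


def depth(node: InquiryNode) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in node.children), default=0)


# Illustrative inquiry: each child refines or replaces its parent's hypothesis.
root = InquiryNode("Drug A lowers marker X", "dose-response assay", 0.6)
follow_up = root.add_child(
    InquiryNode("Effect depends on cell type", "repeat in two cell lines", 0.4)
)
follow_up.add_child(
    InquiryNode("Effect is an assay artifact", "orthogonal readout", 0.7)
)
root.add_child(InquiryNode("Effect saturates at high dose", "extend dose range", 0.5))

print(depth(root))
```

A tree rather than a flat list lets an evaluator measure how deep an inquiry runs, which is the dimension along which the study reports judgment eroding.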
Why It Matters
If agent reasoning decays over long tasks and leans on memorized data, 'AI scientist' claims need far stronger evaluation than one-shot benchmarks.