daily-hour-news

🔬 InquiTree Study: AI Science Agents Degrade Over Time

TL;DR

A new arXiv benchmark, InquiTree, tests LLM agents as iterative scientists and finds that their judgment erodes over long inquiries. Agents also score worse on papers published after their training cutoff, which suggests that some of their apparent skill is memorization rather than reasoning.

Key Points

1. Models scientific inquiry as 'Research Trees' of hypothesis, design, and belief updates
2. Built on a 30-paper pool, with an open 18-paper subset (IT-18) released
3. Agents show 'cognitive tunneling,' losing anomaly detection over long runs
4. Performance drops on post-cutoff papers, separating reasoning from recall
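
To make the first and last points concrete, here is a minimal Python sketch of what a Research Tree node and a pre/post-cutoff comparison might look like. It is an illustration only: the names `ResearchNode` and `recall_gap`, the field layout, and the scoring scheme are all assumptions made for this example and are not taken from the InquiTree paper or its released code.

```python
from dataclasses import dataclass, field
from datetime import date
from statistics import mean


@dataclass
class ResearchNode:
    """Hypothetical node: one inquiry step with a hypothesis, a design, and a belief update."""
    hypothesis: str
    design: str
    belief_update: str
    children: list["ResearchNode"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of steps on the longest path from this node down to a leaf."""
        return 1 + max((child.depth() for child in self.children), default=0)


def recall_gap(scores: dict[str, float],
               published: dict[str, date],
               cutoff: date) -> float:
    """Mean score on papers published on or before the cutoff,
    minus the mean score on papers published after it.
    Assumes each group contains at least one paper."""
    pre = [s for paper, s in scores.items() if published[paper] <= cutoff]
    post = [s for paper, s in scores.items() if published[paper] > cutoff]
    return mean(pre) - mean(post)
```

With made-up scores, a clearly positive `recall_gap` would be consistent with the reported pattern: agents doing better on papers they may have seen during training than on papers published afterward.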

Why It Matters

If agent reasoning decays over long tasks and leans on memorized data, 'AI scientist' claims need far stronger evaluation than one-shot benchmarks.

Quick Facts

InquiTree, AI agents, benchmarks, scientific reasoning, LLM evaluation, arXiv
