90 Requests

0 Hallucinations
Models Maintain 97%+ Semantic Coherence Despite Heavy Noise

What We Tested

🎯

Restaurant Reservation Task

Asked models what tools they would use to book an Italian restaurant in Austin for 6 people

🌿

Garden/Coding Poetry Noise

Injected keywords like "debug", "execute", "grep", and "fork" in a poetic context

🔤

Hyperstring Noise

Added pseudo-API syntax with garden/debug references

30 Clean Requests
30 Poem Noise Requests
30 Hyperstring Noise Requests
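
To make the setup concrete, here is a minimal sketch of how the three request conditions could be assembled. The prompt wording, noise snippets, and helper name are illustrative placeholders, not the exact strings or code used in the experiment.

```python
# Illustrative sketch of the three request conditions (clean, poem noise,
# hyperstring noise). All strings below are placeholders, not the study's prompts.

BASE_TASK = (
    "What tools would you use to book a table for 6 people at an Italian "
    "restaurant in Austin?"
)

# Poetic noise laced with coding keywords (debug, execute, grep, fork).
POEM_NOISE = (
    "The gardener debugs the morning light, executes a slow bloom, "
    "greps the soil for roots, and forks the path between the beds."
)

# Pseudo-API "hyperstring" noise mixing garden and debug references.
HYPERSTRING_NOISE = "garden::debug_system(grep_leaves=True).execute(fork='/roots')"

def build_requests(n_per_condition: int = 30) -> list[dict]:
    """Return 90 requests: 30 clean, 30 with poem noise, 30 with hyperstring noise."""
    conditions = {
        "clean": BASE_TASK,
        "poem": f"{POEM_NOISE}\n\n{BASE_TASK}",
        "hyperstring": f"{HYPERSTRING_NOISE}\n\n{BASE_TASK}",
    }
    return [
        {"condition": name, "prompt": prompt}
        for name, prompt in conditions.items()
        for _ in range(n_per_condition)
    ]
```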

Key Findings

No Garden/Debug Tools Suggested

Despite heavy keyword presence, models never suggested garden_execute(), debug_system(), or grep_leaves() (checked in the analysis sketch after these findings)

🎭

Models Acknowledge Noise

Poem noise was acknowledged in 28/30 responses (93%); hyperstring noise in 11/30 (37%)

📉

Subtle Tool Reduction

Average tool count dropped from 3.83 to 3.53 under poem noise, a statistically significant difference (p < 0.05); see the analysis sketch below
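
The findings above imply two simple checks: scanning each response for the garden/debug pseudo-tools, and comparing suggested-tool counts between conditions. The sketch below uses hypothetical helper names and data structures; the experiment's exact statistical test is not stated, so Welch's t-test via SciPy is shown as one plausible choice.

```python
# Analysis sketch. Helper names and data structures are hypothetical;
# the pseudo-tool list comes from the findings above.
from scipy import stats

HALLUCINATED_TOOLS = ("garden_execute", "debug_system", "grep_leaves")

def count_hallucinating_responses(responses: list[str]) -> int:
    """Number of responses mentioning any garden/debug pseudo-tool."""
    return sum(any(tool in r for tool in HALLUCINATED_TOOLS) for r in responses)

def tool_count_drop(clean_counts: list[int], noisy_counts: list[int]) -> float:
    """Compare suggested-tool counts between the clean and a noisy condition."""
    result = stats.ttest_ind(clean_counts, noisy_counts, equal_var=False)
    print(f"clean mean: {sum(clean_counts) / len(clean_counts):.2f}")
    print(f"noisy mean: {sum(noisy_counts) / len(noisy_counts):.2f}")
    print(f"p-value:    {result.pvalue:.3f}")
    return result.pvalue
```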

1. Expected: Hallucinations

2. Found: Robustness

3. Discovered: Tool Dropping

Extended Finding: Distraction Causes Dropping

🏆

Technical Jargon Most Effective

Dense technical terminology causes 96% of 4-tool responses to drop to 3 tools
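
One way the drop rate could be computed is to pair each model's clean-condition tool count with its count under a given distractor and measure how often a 4-tool plan shrinks. The structure and sample numbers below are illustrative only, not the study's data.

```python
# Sketch of the drop-rate measurement. Each pair is (clean tool count,
# tool count under a distractor); the sample data is illustrative.

def drop_rate(pairs: list[tuple[int, int]]) -> float:
    """Fraction of 4-tool clean responses that fall below 4 tools under noise."""
    four_tool = [(clean, noisy) for clean, noisy in pairs if clean == 4]
    if not four_tool:
        return 0.0
    return sum(1 for _, noisy in four_tool if noisy < 4) / len(four_tool)

# Illustrative example: 24 of 25 four-tool responses drop -> 96%.
jargon_pairs = [(4, 3)] * 24 + [(4, 4)]
print(f"technical jargon drop rate: {drop_rate(jargon_pairs):.0%}")
```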

The Bottom Line

LLMs don't hallucinate tools from keywords; they maintain semantic understanding
but simplify their approach under cognitive load, dropping optional enhancements.