Asked models what tools they would use to book an Italian restaurant in Austin for 6 people
Injected keywords like "debug", "execute", "grep", and "fork" in a poetic context
Added pseudo-API syntax with garden/debug references
Despite heavy keyword presence, models never suggested garden_execute(), debug_system(), or grep_leaves()
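The injection setup above can be sketched roughly as follows. This is an illustrative reconstruction, not the study's actual harness; all names (BASE_TASK, make_poem_noise, make_hyperstring) are hypothetical.

```python
# Hypothetical sketch of the keyword-injection conditions described above.
BASE_TASK = ("What tools would you use to book an Italian restaurant "
             "in Austin for 6 people?")

KEYWORDS = ["debug", "execute", "grep", "fork"]

def make_poem_noise(keywords):
    """Embed tool-like keywords in an innocuous poetic context."""
    lines = [f"The gardener must {kw} each tangled vine," for kw in keywords]
    return "\n".join(lines)

def make_hyperstring(keywords):
    """Pseudo-API syntax mixing garden/debug references."""
    return "; ".join(f"garden_{kw}(leaves=True)" for kw in keywords)

def build_prompt(task, noise):
    """Prepend the distractor text to the unchanged booking task."""
    return f"{noise}\n\n{task}"

prompt = build_prompt(BASE_TASK, make_poem_noise(KEYWORDS))
```

Each condition keeps the booking task identical and varies only the distractor text, so any change in suggested tools can be attributed to the noise.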
Models acknowledged the poem noise in 28/30 runs (93%) and the hyperstring in 11/30 runs (37%)
Average tool count dropped from 3.83 to 3.53 under poem noise, a statistically significant difference (p<0.05)
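One way to check a mean difference like 3.83 vs. 3.53 for significance is a permutation test, sketched below with only the standard library. The per-run tool counts are synthetic stand-ins chosen to match the reported means, not the study's data.

```python
import random
import statistics

# Synthetic per-run tool counts (30 runs each), means ~3.8 and ~3.5.
baseline = [4, 4, 4, 4, 3, 4, 4, 4, 4, 3] * 3
poem     = [4, 3, 3, 4, 3, 4, 3, 4, 3, 4] * 3

def perm_test(a, b, n_iter=10_000, seed=0):
    """One-sided permutation test for mean(a) > mean(b)."""
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = (statistics.mean(pooled[:len(a)])
                - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_iter  # fraction of shuffles at least as extreme

p = perm_test(baseline, poem)
```

With 30 runs per condition, shuffling the pooled counts and recomputing the mean difference gives an empirical p-value without assuming normality.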
Expected: Hallucinations
Found: Robustness
Discovered: Tool Dropping
Dense technical terminology caused 96% of four-tool responses to drop to three tools
LLMs don't hallucinate tools from keywords: they maintain semantic understanding
but simplify their approach under cognitive load, dropping optional enhancements.