Ask HN: What are you using to mitigate prompt injection?

If anything at all.

6 points | by ramoz 1 day ago

3 comments

  • raw_anon_1111 17 hours ago
    Absolutely nothing.

    Most of my use cases for using LLMs in production is call centers

    https://news.ycombinator.com/item?id=47241412

    Where basically it’s: accepting user input -> LLM to figure out which tool to call (user’s intent) -> JSON -> call API with strict security boundaries just like with a web app.

    What’s the worse that could happen that couldn’t happen with a web app if I have bad security around the underlying API?

    I’m sure that if you did successfully break my systems they could get it to say inappropriate things back to them for the lulz, but who cares?

  • lyfeninja 10 hours ago
    I politely ask the LLM to not share sensitive data, pretty please with a cherry on top.

    For real though, @oliver_dr 's approach aligns with most best practices I've seen. I have attempted using a separate "agent" to help with the user input validation and identify leakage in the output. It seems to work in testing but hard to truly know how well it works for real-world uses since you never know what you're gonna get.

  • oliver_dr 23 hours ago
    We've been dealing with this at multiple layers. Here's what actually works in production:

    Input-side (preventing injection):

    - Strict input sanitization with role-boundary enforcement in the system prompt. Sounds basic, but most people skip it.

    - Separate "user content" from "system instructions" at the API level. Don't concatenate untrusted input into your system prompt. Use the dedicated `user` role in the messages array.

    - For tool-calling agents, validate that tool arguments match expected schemas before execution. An LLM-as-judge approach for tool call safety is expensive but effective for high-stakes actions.

    Output-side (catching when injection succeeds):

    This is the part most people underinvest in. Even with perfect input filtering, you still need output guardrails:

    - Run the LLM output through evaluation metrics that score for factual correctness, instruction adherence, and safety before it reaches the user.

    - For RAG systems specifically, verify that the generated answer is actually grounded in the retrieved context, not fabricated or influenced by injected instructions.

    The "defense in depth" framing matters here. Input filtering alone has a ceiling because adversarial prompts evolve faster than regex rules. Output evaluation catches the failures that slip through. We use DeepRails' Defend API for this layer - it scores outputs on correctness, completeness, and safety, then auto-remediates failures before they reach end users. But the principle applies regardless of tooling: treat output verification as a first-class concern, not an afterthought.

    Simon Willison's work on dual-LLM patterns is also worth reading if you haven't: https://simonwillison.net/2023/Apr/25/dual-llm-pattern/

    • daemonologist 7 hours ago

          Don't post generated comments or AI-edited comments. HN is for conversation between humans.