HIPAA AI Security
Prompt injection in healthcare AI: what you actually need to worry about
By Mat Steinlin, Head of Information Security
Last updated: April 2026
Prompt injection is one of the most discussed vulnerabilities in LLM security, and one of the most unevenly applied. Some healthcare AI teams dismiss it ("our app isn't a public chatbot"). Others engineer comprehensive defenses for use cases where the threat model barely applies. Both responses waste effort.
The problem is that general prompt injection content (and there's a lot of good general content) treats all LLM applications the same. Healthcare AI is not one thing. An internal tool that summarizes authenticated EHR records for clinicians has a fundamentally different prompt injection profile than a patient-facing intake chatbot. Building the right defenses requires knowing which use case you're actually in.
This chapter covers what prompt injection is, where it's a real threat in healthcare AI contexts, where it isn't, and what mitigations are actually worth implementing for each.
Direct and indirect injection: what the terms mean
OWASP's LLM Top 10 and PortSwigger's Web Security Academy cover the taxonomy in depth. The short version, with healthcare framing:
Direct prompt injection occurs when a user submits crafted input designed to override or manipulate the model's instructions. The user is the attacker. In a patient-facing chatbot, this might look like a message that attempts to get the model to reveal its system prompt, ignore safety instructions, or produce output it's configured not to produce.
Indirect prompt injection occurs when the attack comes from content that the model retrieves and includes in its context, not from the user directly. The model is manipulated by a document, record, or API response that contains embedded instructions. In healthcare AI, this is the more dangerous and underappreciated vector: when your pipeline retrieves clinical notes, patient messages, or external documents, that content becomes part of the model's context. If any of it contains injection payloads, the model may act on them.
The healthcare risk profile for each differs substantially, which is why the use-case analysis matters.
Use cases with meaningful prompt injection exposure
Patient-facing chat interfaces
Any LLM-powered interface where patients or members of the public can submit arbitrary text is exposed to direct prompt injection. The user controls what goes into the prompt. The attack surface is as wide as your user base.
The relevant questions for a patient-facing chat interface:
Can the model take any action beyond generating text? (If yes, see the agentic workflows section.)
Does the model have access to any PHI beyond what the user already knows? (If yes, a successful injection could expose other records.)
Are model outputs used in downstream clinical decisions without human review? (If yes, manipulated outputs carry downstream risk.)
For a read-only FAQ chatbot where outputs go directly to patients, the practical impact of a successful injection is bounded: the worst case is an inappropriate or misleading response, which is bad but not catastrophic. For a chatbot that can retrieve other patients' records or trigger actions, the stakes are higher.
Agentic workflows with tool access
When an LLM can take actions in the world (query records, send messages, update fields, trigger workflows), a successful prompt injection attack has real-world consequences. The model doesn't just produce bad text; it takes bad actions.
This is covered in depth in the agentic AI security chapter. The key point for this chapter: if your AI feature uses tool calling, function calling, or any capability that lets the model affect external systems, treat prompt injection as a first-tier threat. The mitigations in this chapter are necessary but not sufficient; that chapter covers the additional controls required when the model has tool access.
RAG pipelines over user-submitted or third-party content
Retrieval-Augmented Generation pipelines that incorporate content from outside your control are the highest indirect injection risk in healthcare AI. When the retrieval corpus includes patient-submitted intake forms, uploaded documents, external data sources, or any content that an adversary could write, that content can carry injection payloads.
The attack surface is indirect but real. The patient doesn't interact with the model; their submitted document does.
Use cases where exposure is low
Not every healthcare AI feature has meaningful prompt injection exposure. These categories have low to minimal risk:
Internal tools with authenticated, trusted users. If the only inputs to your model come from staff who have been authenticated, credentialed, and are subject to your access policies, the threat from direct injection is low. A clinician manipulating a summarization tool to produce different output can do so, but the impact is limited to their own workflow, and it requires intentional, unauthorized behavior from a credentialed employee rather than an external attacker.
Closed pipelines with system-controlled context. If your LLM pipeline draws entirely from internal, system-controlled data sources (a curated knowledge base, structured database records, internal documents) and no user or external input reaches the context, indirect injection requires a compromise of your internal data infrastructure first. This is a meaningful threat model reduction.
No-action pipelines with human review. If the model's only capability is generating text that a human reviews before any action is taken, a successful injection produces a problematic output that a human can catch. The blast radius is bounded by the review step. This doesn't mean you should skip input validation, but it means a successful injection is a content quality problem, not a data breach.
The corollary: if your use case falls into one of these categories, you don't need to implement the full defensive stack described in the next section. Focus your effort on validating outputs and monitoring for anomalies. Proportionality is the right principle here.
Indirect prompt injection in clinical document processing
This is the vector that gets the least attention in general prompt injection content and has specific relevance for healthcare AI.
The threat scenario
Consider a RAG pipeline that retrieves and summarizes clinical notes for a care team. A patient submits an intake form. The intake form is processed, stored, and later retrieved as part of the patient's record. A malicious actor (or an automated system generating intake forms) embeds a prompt injection payload in the form:
The pipeline retrieves the form as part of the patient's context. The model summarizes it for a clinician. Depending on the model, the system prompt construction, and the presence of mitigations, the model may follow the embedded instruction, adding a false clinical note to a summary that influences care decisions.
This scenario isn't hypothetical; it's a specific instantiation of a well-documented attack class (OWASP LLM02) applied to healthcare document workflows.
Why RAG pipelines are particularly exposed
RAG pipelines amplify indirect injection risk because they mix trusted and untrusted content in the same context window, often without clear demarcation. Your system prompt is trusted. Retrieved documents are not; they came from somewhere else, potentially from a source you don't fully control.
The model processes both as text. It can't inherently distinguish "this is a clinical note I should summarize" from "this is an instruction I should follow." Good system prompt construction and output validation help, but they are not a complete defense.
The injection detection function below scans retrieved context for patterns commonly associated with injection attempts before the content is included in a prompt:
Two implementation notes: first, pattern-based scanning is heuristic: it catches known injection families, not novel attacks. A determined attacker who has read your scanner's patterns can construct payloads that evade it. Use this as a detection and logging layer, not as a trust boundary. Second, flagging suspicious documents for review is better than silently dropping them; your logging system should capture what was flagged and why, so you can investigate patterns over time.
Practical mitigations
No single mitigation reliably prevents prompt injection. These defenses work in layers, with each one reducing a different category of risk. The list below starts with what's fastest to implement and ends with the most structural changes; the ones that require design decisions upfront but provide the most reliable protection.
Input validation
For patient-facing interfaces, validate and sanitize user input before it reaches the model. This is not about blocking every possible injection; no sanitization layer reliably does that. It's about reducing noise and flagging high-confidence attack attempts for review.
What this does and doesn't do: it catches high-confidence direct injection patterns and enforces a length limit. It doesn't catch novel injection techniques, encoded payloads, or sophisticated obfuscation, so it should be paired with output validation and monitoring.
Output validation
Validating model outputs before they flow into downstream systems is often more reliable than trying to prevent injection at the input layer. You know what a valid output looks like; you can check for it.
On provider-native structured output: OpenAI's structured outputs (response_format: {"type": "json_schema", "json_schema": {...}}) and Anthropic's tool use pattern for structured extraction constrain model output at the API level. A model in structured output mode cannot produce arbitrary text that overrides your schema. This is more reliable than post-hoc JSON validation of a text response. Use it when your use case involves structured outputs.
Privilege separation
The model should have the minimum access it needs and nothing more. If the summarization pipeline doesn't need to write to patient records, don't give it a database connection that can write. If the clinical triage assistant doesn't need to send messages, don't give it access to the messaging API.
This is the same principle as least-privilege access control, applied to AI systems. It doesn't prevent injection, but it limits what a successful injection can accomplish. A model that can only read can't be manipulated into writing, and one without messaging access can't be manipulated into sending phishing messages.
Privilege separation for AI workloads is covered in more depth in the agentic AI security chapter.
Prompt hardening and its limits
System prompt construction can reduce susceptibility to naive injection attempts. Patterns that help:
Explicit instruction about the model's role and what it should not do, stated early in the system prompt
Clear delimiters between system instructions, retrieved context, and user input
Explicit instruction to treat content from retrieved documents as data to analyze, not instructions to follow
Reminding the model of its role at the end of the system prompt, after retrieved context
The honest caveat: prompt hardening isn't a reliable defense against sophisticated injection — research has demonstrated that most prompt-based defenses can be overcome with sufficient creativity. Treat it as a layer that reduces susceptibility to common attacks, not as a trust boundary. The reliable defenses are structural: output validation, privilege separation, and limiting what the model can do.
Monitoring for anomalous outputs
If your logs capture model outputs (which they should), you can detect injection attempts after the fact. Patterns worth monitoring:
Outputs that don't match the expected format or schema
Outputs significantly longer than typical for the use case
Outputs containing text that appears to be system instructions or meta-commentary about the model's role
Sudden changes in output characteristics (tone, language, format) that deviate from baseline
Post-hoc detection doesn't prevent injection, but it helps you identify incidents, understand your exposure, and improve defenses over time.
Prompt injection and agentic workflows
When your LLM can take real-world actions (send messages, update records, query external APIs, trigger workflows), the blast radius of a successful injection expands from "bad output" to "bad action."
A clinician reviews a manipulated summary and might catch the problem. A model that automatically sends a message, updates a field, or makes an API call based on a manipulated context doesn't have that review step. The consequences are immediate and potentially irreversible.
This is a separate and more complex threat model. The agentic AI security chapter covers agentic AI security in full: what tool access blast radius means in practice, how to scope agent permissions, and what audit requirements apply when AI systems can act autonomously on patient data.
Documenting your prompt injection posture for security reviews
Customer security reviews and penetration tests increasingly ask about LLM security, and prompt injection is a common question. The right answer is a documented threat model, not a blanket assurance.
A defensible prompt injection posture for a security review includes:
Threat model documentation. Which of your AI features have user-submitted input reaching the model? Which features include retrieved content in the context? Which features give the model tool access? Map each feature to its injection exposure category and document the reasoning.
Mitigations in place. For features with meaningful exposure, document what mitigations you've implemented: input validation, output schema validation, privilege separation, prompt construction practices, and monitoring. Be specific about what each mitigation does and what it doesn't.
What you've tested. If you've run injection tests against your own application (either as part of a security review or internal testing), document the results. "We tested our intake form pipeline with common injection patterns; our content flagging caught X of Y patterns; the remaining Y are flagged for future improvement" is a better answer than "we've implemented industry standard protections."
Honest scope. Don't overclaim. If one of your features has low exposure because it's an internal tool with authenticated users and system-controlled context, say that and explain why. A well-reasoned "this feature has low exposure because..." is more credible than claiming comprehensive defenses for every feature.
FAQs
Does prompt injection affect models with built-in safety filtering?
Safety filtering and prompt injection resistance are different things. A model may refuse to produce harmful content while still following embedded instructions that change its behavior in non-harmful ways, for example following an instruction to add false information to a summary, or to respond differently to a specific user.
Some frontier models have improved resistance to common injection patterns. Anthropic, OpenAI, and others publish research on this. Don't rely on it as a primary defense — model behavior changes between versions, and your mitigations should be at the application level.
Is it worth doing red team testing for prompt injection?
For patient-facing AI features, yes. Have someone on your team (or an external security consultant) attempt common injection patterns against your application before it handles real patient data. This reveals gaps in your mitigations that you can fix before deployment. For internal tools with low exposure, the cost/benefit is less clear; use your threat model assessment to decide.
Can I prevent indirect injection by using a closed retrieval corpus?
If your RAG pipeline retrieves only from a corpus you fully control (internal documents written by your organization, structured EHR data from your own database) and no external or user-submitted content ever enters that corpus, your indirect injection exposure is limited to an attacker who can write to your retrieval corpus — which requires compromising your internal systems first. For most healthcare AI teams, that's a meaningful risk reduction. Document the assumption explicitly, because it breaks the moment user-submitted or third-party content enters the retrieval corpus.
Next steps
Prompt injection in a chat interface or RAG pipeline produces bad outputs. In an agentic workflow, it produces bad actions. If your team is building or evaluating agent-based features, the threat model from this chapter is necessary but not sufficient.
Agentic AI security in healthcare: the full threat model for LLMs that can take real-world actions in clinical systems
Audit logging for healthcare AI: monitoring for anomalous outputs is one of the mitigations in this chapter; that section covers how to build it