HIPAA AI Security
Audit logging for healthcare AI: compliance baseline vs. security operations
By Mat Steinlin, Head of Information Security
Last updated: April 2026
One company we worked with had diligently logged every LLM interaction for months. When a customer security review asked for records covering a three-week period from the prior quarter, they couldn't produce them. Their logging infrastructure had a silent failure (a misconfigured log drain, unnoticed), and the logs from that window didn't exist. The fix was straightforward. The disclosure to their compliance team was not.
This is the thing about audit logging that most teams get wrong: they build logging and assume it's working. They treat it as a one-time infrastructure task, not an ongoing operational concern. And they implement what HIPAA requires without thinking about what makes logs useful when something actually goes wrong.
This chapter covers both dimensions. The compliance requirements are the floor; the security operations layer is what makes logging worth having.
For HIPAA retention requirements, retention periods, and what auditors check, see Audit log retention. For the basic compliance logging implementation, see HIPAA-Compliant AI. This chapter builds on those foundations.
What HIPAA requires vs. what security requires
45 CFR 164.312(b) requires "hardware, software, and/or procedural mechanisms that record and examine activity in information systems that contain or use electronic protected health information." For LLM interactions, that means logging activity involving PHI: the prompts, the responses, timestamps, user attribution, and which model was used.
That is the compliance minimum. HIPAA does not specify log format, anomaly detection, alerting thresholds, log analysis tooling, or what to do when logging fails. Those are your decisions.
Security logging goes further than compliance logging in two directions. First, it captures metadata beyond the HIPAA-required fields: operational signals that are not required for a compliance audit but are necessary for detecting and investigating incidents. Second, it treats the logging infrastructure itself as a system that needs monitoring, not just a passive recorder.
The practical difference: a team with compliance-only logging can satisfy an auditor asking "did this request occur?" They often can't answer "why did API costs spike 300% overnight?" or "which system was responsible for the anomalous request volume on March 15?" Security logging closes that gap.
What to log for LLM interactions
The compliance minimum
The HIPAA-required fields for LLM audit logging are covered in HIPAA-Compliant AI: prompt content, response, timestamp, user or system attribution, and model used. These fields establish an auditable record that PHI was handled.
Security metadata worth adding
The fields below are not required by HIPAA. They are required for making logs actionable when something goes wrong.
Request duration (duration_ms): Unusually long LLM requests are an anomaly signal. A request that normally takes 800ms taking 45 seconds can indicate prompt injection causing the model to generate an unexpectedly large response, a misconfigured retry policy, or upstream rate limiting that your application is silently retrying. Without duration data, you can't distinguish these cases.
Token counts (input_tokens, output_tokens): Token counts are both a cost signal and an abuse signal. A sudden spike in average input tokens suggests someone is sending unusually large context, which could be a bug, an adversarial prompt embedding a large payload, or a developer testing with production-scale data in a development environment. Token count anomalies are often the first signal of a problem, appearing before cost alerts trigger.
Response status (status, finish_reason): Track whether each request succeeded, was rate-limited, was refused by the model's safety filters, or errored. A sudden increase in safety filter refusals on a particular scope can indicate prompt injection probing: someone testing inputs systematically to find what the model will respond to.
Scope and key attribution (scope, key_id): Covered in depth in API key management. Without scope attribution in your logs, you cannot answer "which system made these requests" during an investigation. These fields are required for the logs to be useful for anything beyond basic compliance.
De-identification flag (deidentified): A boolean indicating whether PHI de-identification was applied before the request was sent. This becomes relevant during an incident investigation: if de-identification was active, the prompts in your logs contain tokens rather than raw PHI, which changes the scope of your breach notification analysis.
Source identifier (source_ip, service_name): The application or service that initiated the request, and where it came from. For internal tools and service-to-service calls, a service name is more useful than an IP. For patient-facing features where requests originate from end users, the application name plus user attribution is typically sufficient.
The full logging structure:
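A sketch in Python, combining the compliance minimum with the security metadata above. The dataclass and `log_llm_request` helper are illustrative, not a prescribed schema; `writer` is whatever storage backend you use.

```python
import uuid
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class LLMAuditRecord:
    # Compliance minimum (45 CFR 164.312(b))
    request_id: str      # correlates this entry with general application logs
    timestamp: str       # ISO 8601, UTC
    user_id: str         # user or system attribution
    model: str
    prompt: str          # raw PHI, or tokens if de-identification was applied
    response: str
    # Security metadata (not HIPAA-required; needed for investigations)
    scope: str
    key_id: str
    status: str          # e.g. "success", "rate_limited", "error"
    finish_reason: str   # e.g. "stop", "content_filter", "length"
    duration_ms: int
    input_tokens: int
    output_tokens: int
    deidentified: bool   # True if PHI was tokenized before the request left
    service_name: str    # originating service, for service-to-service calls
    source_ip: str

def log_llm_request(writer, **fields) -> str:
    """Build an audit record, hand it to a storage writer, and return the
    request_id so the caller can reference it (PHI-free) elsewhere."""
    record = LLMAuditRecord(
        request_id=str(uuid.uuid4()),
        timestamp=datetime.now(timezone.utc).isoformat(),
        **fields,
    )
    writer(asdict(record))
    return record.request_id
```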
One note on the request_id return value: your general application logs should include this identifier (without any PHI) so that during an investigation you can correlate a specific event in your application timeline to the corresponding PHI-containing audit log entry. This keeps PHI out of general logs while maintaining traceability.
Log storage: encryption and access control
Logs contain PHI. This is where many teams create a compliance gap without realizing it: they build diligent logging, then store logs in plaintext S3, an unencrypted Elasticsearch cluster, or a shared log aggregator that doesn't have appropriate access controls. The audit trail exists, but the storage fails the Security Rule's encryption and access control requirements.
Two requirements apply:
Encryption at rest. All PHI must be encrypted at rest under 45 CFR 164.312(a)(2)(iv). For S3, this means server-side encryption. AWS S3 offers several server-side encryption options (SSE-S3, SSE-KMS, SSE-C); for HIPAA workloads, SSE-KMS using a customer-managed key is the right choice: it encrypts data using a key you control in AWS KMS, and access to the KMS key can be audited independently.
Access controls. PHI audit logs shouldn't be accessible to everyone who has access to your general application infrastructure. S3 bucket policies, IAM roles, and KMS key policies should collectively restrict log access to the systems and individuals with a legitimate operational need.
Configuring a HIPAA-appropriate S3 log bucket in Python using boto3:
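A sketch under stated assumptions: the bucket name and KMS key ARN are placeholders, and your account will likely layer bucket policies and IAM roles on top of this baseline.

```python
def kms_encryption_config(kms_key_arn: str) -> dict:
    """Default-encryption rule: SSE-KMS with a customer-managed key."""
    return {
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": kms_key_arn,
            },
            "BucketKeyEnabled": True,  # reduces per-request KMS costs
        }]
    }

def create_audit_log_bucket(bucket: str, kms_key_arn: str) -> None:
    import boto3  # imported here so the pure helper above has no AWS dependency
    s3 = boto3.client("s3")
    s3.create_bucket(Bucket=bucket)
    # Encrypt every object at rest with the customer-managed KMS key
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=kms_encryption_config(kms_key_arn),
    )
    # Block all public access paths
    s3.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    # Versioning protects the audit trail against overwrite and deletion
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
```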
For log ingestion, a Python logging handler that writes directly to S3:
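A minimal handler sketch. The S3 client is injected so the handler can be exercised with a stub; the key layout and bucket name are illustrative. Note that write failures are surfaced through handleError rather than silently dropped, which matters for the silent failure problem discussed later in this chapter.

```python
import json
import logging
from datetime import datetime, timezone

class S3AuditHandler(logging.Handler):
    """Writes each audit record as one SSE-KMS-encrypted S3 object."""

    def __init__(self, s3_client, bucket: str, kms_key_arn: str):
        super().__init__()
        self.s3 = s3_client
        self.bucket = bucket
        self.kms_key_arn = kms_key_arn

    def emit(self, record: logging.LogRecord) -> None:
        try:
            entry = (record.msg if isinstance(record.msg, dict)
                     else {"message": record.getMessage()})
            key = "llm-audit/{}/{}.json".format(
                datetime.now(timezone.utc).strftime("%Y/%m/%d"),
                entry.get("request_id", record.created),
            )
            self.s3.put_object(
                Bucket=self.bucket,
                Key=key,
                Body=json.dumps(entry).encode("utf-8"),
                ServerSideEncryption="aws:kms",  # per-object SSE-KMS
                SSEKMSKeyId=self.kms_key_arn,
                ContentType="application/json",
            )
        except Exception:
            # A dropped PHI audit write is a compliance event; handleError
            # prints to stderr by default, so wire it into your alerting
            self.handleError(record)
```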
For AWS CloudWatch Logs, encryption with a customer-managed KMS key can be enabled on the log group directly. CloudWatch is better for operational access (fast query with CloudWatch Insights); S3 is better for long-term retention at scale.
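For example, a sketch using the boto3 CloudWatch Logs client (injected here for testability; the group name and retention window are illustrative):

```python
def create_encrypted_log_group(logs_client, name: str, kms_key_arn: str,
                               retention_days: int = 30) -> None:
    """Create a CloudWatch log group encrypted with a customer-managed KMS key.

    The KMS key policy must grant the CloudWatch Logs service permission
    to use the key, or creation fails."""
    logs_client.create_log_group(logGroupName=name, kmsKeyId=kms_key_arn)
    # Short-term operational tier: expire entries after the operational window
    logs_client.put_retention_policy(logGroupName=name,
                                     retentionInDays=retention_days)
```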
Short-term vs. long-term retention
LLM audit logs serve two distinct purposes with different access patterns and retention requirements. Conflating them in a single storage tier creates either a cost problem (retaining six years of data in a fast-access store) or an operational problem (investigators cannot quickly query logs that are in cold storage).
Short-term operational logs (7–30 days): Fast access for debugging, configuration verification, and prompt optimization. Developers legitimately need to see recent requests to understand model behavior, verify de-identification is working, or diagnose unexpected outputs. CloudWatch Logs or a similar queryable store works well here. Access controls should still be strict (these logs contain PHI), but the access pattern is more frequent.
Long-term compliance logs (6+ years per HIPAA; verify your retention policy): Archival access for audits and breach investigations. The access pattern is infrequent and typically batch: "retrieve all requests from scope X between dates Y and Z." S3 with Glacier tiering is cost-effective. Fast query is less important than durability, encryption, and assured retention.
The routing pattern: your logging infrastructure writes to both tiers. Short-term logs are queryable immediately. Long-term logs are archived via a drain from the short-term store or written directly to S3.
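As a minimal sketch, the direct dual-write variant can be expressed as a composed writer; the tier implementations are whatever your stack provides (a CloudWatch handler, the S3 handler above, etc.):

```python
def dual_tier_writer(short_term_write, long_term_write):
    """Compose a writer that routes each audit record to both retention tiers.

    short_term_write: fast queryable store (e.g. CloudWatch Logs, 7-30 days).
    long_term_write: encrypted archive (e.g. S3 with Glacier tiering, 6+ years).
    Failures in either tier propagate rather than being silently swallowed."""
    def write(record: dict) -> None:
        short_term_write(record)
        long_term_write(record)
    return write
```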
For full retention requirements, including what HIPAA specifies vs. what individual state laws may require, see Audit log retention.
Using logs as an anomaly detection signal
Most teams build logging for audits and investigations: things that happen after a problem is discovered. The same logs can surface problems before they become incidents, if you define what anomalous looks like.
Five signals worth monitoring in LLM audit logs:
Volume spikes on a single scope. A production scope that normally handles 200 requests per hour suddenly handling 8,000 is either a runaway loop, an adversarial usage pattern, or a misconfigured client retrying aggressively. Any of these is worth alerting on. The threshold depends on your baseline; start by alerting when any scope exceeds 5x its rolling hourly average.
Unexpected models in production logs. If your production allowlist covers three specific models and your logs show a fourth, something bypassed your controls. This means either an unreviewed code change or an infrastructure misconfiguration. Either way, it should not appear silently in a compliance log two weeks later.
Sudden increase in safety refusals. A spike in finish_reason: content_filter responses on a patient-facing feature can indicate systematic prompt injection probing: someone sending crafted inputs to find what the model will respond to. One or two refusals per day is normal. A hundred in an hour is not.
Off-hours access by production scopes. LLM activity at 3am from a patient-facing feature that has no legitimate overnight traffic is worth investigating. This signal generates false positives for global teams and background jobs, so calibrate carefully, but for clearly bounded use cases it is a reliable anomaly indicator.
Token count outliers. Individual requests with input token counts far above your application's normal distribution can indicate unusually large prompts being injected or a developer inadvertently sending a full document as context.
A basic threshold alerting implementation using a sliding window:
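A sketch of the two checks the chapter recommends starting with (volume spikes and model violations). The thresholds, warm-up period, and alert strings are illustrative starting points, not tuned values.

```python
import time
from collections import defaultdict, deque

class AuditLogMonitor:
    """Sliding-window anomaly checks over audit log entries."""

    def __init__(self, allowed_models, window_s=3600, baseline_hours=24,
                 spike_factor=5.0, min_baseline_s=3 * 3600):
        self.allowed_models = set(allowed_models)
        self.window_s = window_s
        self.baseline_s = baseline_hours * 3600
        self.spike_factor = spike_factor
        self.min_baseline_s = min_baseline_s  # warm-up before spike checks
        self.events = defaultdict(deque)      # scope -> request timestamps

    def observe(self, scope, model, ts=None):
        """Record one audit log entry; return any alerts it triggers."""
        ts = time.time() if ts is None else ts
        alerts = []
        if model not in self.allowed_models:
            alerts.append(f"model_violation: {model} on scope {scope}")
        q = self.events[scope]
        q.append(ts)
        while q and q[0] < ts - self.baseline_s:  # drop events past the horizon
            q.popleft()
        recent = sum(1 for t in q if t >= ts - self.window_s)
        baseline = len(q) - recent                # events before current window
        baseline_span_s = (ts - self.window_s) - q[0]
        if baseline > 0 and baseline_span_s >= self.min_baseline_s:
            hourly_rate = baseline / (baseline_span_s / 3600)
            if recent > self.spike_factor * hourly_rate:
                alerts.append(
                    f"volume_spike: {scope} at {recent}/hr "
                    f"vs baseline {hourly_rate:.1f}/hr"
                )
        return alerts
```

The warm-up guard (min_baseline_s) matters: without a few hours of baseline, the rolling average is meaningless and the check fires on normal morning ramp-up.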
These checks are intentionally simple. The goal is detecting obvious problems fast, not building a machine learning anomaly detection pipeline. Start with volume spikes and model violations; add refusal rate monitoring once you have a baseline for what normal looks like on your specific features.
The silent failure problem
The incident described at the start of this chapter (three weeks of missing logs discovered during a security review) isn't unusual. Logging infrastructure fails silently more often than it fails loudly: the application keeps running, requests keep succeeding, and the only indication something is wrong is the absence of records, which you only notice when you need them.
Silent failure modes in LLM logging:
A logging handler throws an exception on write, catches it silently, and drops the record
An S3 upload fails due to a permissions change after a rotation and the error is swallowed
A disk fills up on a log aggregator host and new records are silently dropped
A configuration deployment updates the log destination without updating the encryption key reference, causing writes to fail
A library update changes a field name in the response object the logging wrapper depends on, causing null values to be logged without an error
The OWASP Logging Cheat Sheet recommends treating logging failures as significant events, not background noise. For PHI audit logging, a logging failure is a compliance event.
How to verify your logging is working
The verification approach: write a synthetic test request on a schedule, then confirm the corresponding log entry appears in your storage backend. If it does not appear within a defined window, alert.
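A sketch of the canary check. The write_entry and find_entry callables wrap your own stack (for example, the logging handler and an S3 lookup); both names, and the timing defaults, are illustrative.

```python
import time
import uuid

def run_logging_canary(write_entry, find_entry,
                       timeout_s=60.0, poll_s=5.0) -> bool:
    """Write a synthetic, PHI-free audit entry, then confirm it appears in
    the storage backend. Returns False if the caller should alert."""
    request_id = f"canary-{uuid.uuid4()}"
    write_entry({
        "request_id": request_id,
        "scope": "logging-canary",
        "prompt": "[synthetic canary request - contains no PHI]",
        "deidentified": True,
    })
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if find_entry(request_id):
            return True
        time.sleep(poll_s)
    return False
```

Run this on a schedule (a cron job or scheduled Lambda works) and page on a False result.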
This canary approach catches the failure modes that silently drop records. It doesn't catch misconfigured encryption (the record writes but may not be encrypted correctly); verify encryption configuration separately when making infrastructure changes.
How to alert on logging failures
At minimum: alert on the canary check failure. For more comprehensive coverage:
Alert if the logging handler's error count exceeds zero in a rolling window (configure your log aggregator to track handler exceptions)
Alert if the volume of audit log entries falls below the expected minimum for production scopes during business hours
Alert on S3 write errors from your logging handler (surface these explicitly rather than swallowing them)
Log drain setup and long-term storage options
A log drain exports log data from your short-term store to long-term archival. This is a distinct operation from the initial log write. Your application writes to a fast-access store; the drain moves records to archival storage automatically.
S3 via Amazon Kinesis Data Firehose
For teams on AWS, Kinesis Data Firehose is the standard pattern for buffered S3 delivery with encryption:
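A sketch of creating a Direct PUT delivery stream with boto3. The ARNs, prefix, and buffering values are placeholders; the IAM role must grant Firehose write access to the bucket and use of the KMS key.

```python
def s3_destination_config(role_arn: str, bucket_arn: str,
                          kms_key_arn: str) -> dict:
    """Extended S3 destination: buffered, compressed, KMS-encrypted delivery."""
    return {
        "RoleARN": role_arn,          # IAM role Firehose assumes for delivery
        "BucketARN": bucket_arn,
        "Prefix": "llm-audit/",
        "CompressionFormat": "GZIP",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "EncryptionConfiguration": {
            "KMSEncryptionConfig": {"AWSKMSKeyARN": kms_key_arn}
        },
    }

def create_audit_log_drain(stream_name: str, role_arn: str,
                           bucket_arn: str, kms_key_arn: str) -> None:
    import boto3
    firehose = boto3.client("firehose")
    firehose.create_delivery_stream(
        DeliveryStreamName=stream_name,
        DeliveryStreamType="DirectPut",  # the application writes records directly
        ExtendedS3DestinationConfiguration=s3_destination_config(
            role_arn, bucket_arn, kms_key_arn
        ),
    )
```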
Langfuse
Langfuse is an open-source LLM observability platform that provides a queryable interface over your audit log data, useful for prompt debugging, cost analysis, and short-term operational access. It is not a long-term compliance archive on its own; combine it with S3 for retention.
Langfuse uses OpenTelemetry for log ingestion, which means you can configure it as a drain destination without changing your logging implementation, as long as your logs emit in OTEL format. Beta support for Langfuse as a drain destination is available in Aptible AI Gateway.
SIEM integration
For larger teams with an existing SIEM (Splunk, Sumo Logic, Elastic SIEM), routing LLM audit logs into the SIEM provides unified visibility across your security data. The primary consideration is cost: LLM logs, especially prompt and response content, have high per-record size. Filter carefully to avoid ingesting more than you need for security monitoring; route full logs to S3 and send a reduced field set to the SIEM.
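The split-routing idea can be sketched as a field filter; the field names follow the logging structure described earlier in this chapter, and the exact set you forward is your call.

```python
# Fields safe to route to the SIEM: operational metadata only,
# no prompt or response content
SIEM_FIELDS = {
    "request_id", "timestamp", "scope", "key_id", "model", "status",
    "finish_reason", "duration_ms", "input_tokens", "output_tokens",
    "deidentified", "service_name", "source_ip",
}

def route_record(record: dict, s3_write, siem_write) -> None:
    """Full record (including PHI content) to S3; reduced set to the SIEM."""
    s3_write(record)
    siem_write({k: v for k, v in record.items() if k in SIEM_FIELDS})
```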
FAQs
Do I need to log every LLM request or only the ones that involve PHI?
Audit logs serve two compliance functions. The first is the audit control standard under 45 CFR 164.312(b), which requires recording and examining activity in systems that contain or use PHI. The second is the access control standard under 45 CFR 164.312(a): logs are how you demonstrate that only authorized personnel accessed PHI. If a request does not involve PHI (a purely synthetic or test request with no patient data), logging it is not strictly required under either standard.
In practice, filtering reliably by PHI presence at log time is difficult. A model summarizing patient notes might receive a prompt that includes a session ID but no direct PHI; the response, however, contains PHI-derived content. Log everything that flows through your PHI-handling systems. The storage cost is low relative to the investigation and compliance risk of a logging gap.
How long do we need to keep LLM audit logs?
HIPAA requires documentation to be retained for six years from the date of its creation or the date it was last in effect, whichever is later. For audit logs, that means six years from when the log was created. Some states impose longer retention requirements; verify applicable state law for your patient population. For the full retention framework, see Audit log retention.
Can we store LLM audit logs in the same place as our application logs?
Not if your application logs aren't encrypted at rest and appropriately access-controlled. LLM audit logs contain PHI. They need encryption and access controls appropriate for PHI. If your general application log store already meets those standards, co-location is technically acceptable, but the access controls must treat the PHI audit logs as a restricted subset; not everyone with access to application logs should have access to logs containing patient data.
What if a developer needs to see logs for debugging?
Give developers access to short-term operational logs for scopes they own, with PHI fields visible only to authorized roles. In practice: developers can see request metadata (scope, model, duration, status, token counts) and a sanitized version of the prompt, with full PHI access gated behind an explicit access request that is itself logged. Limiting PHI access to what is necessary for a legitimate purpose is not just a best practice: it’s required under the HIPAA Privacy Rule's Minimum Necessary standard (45 CFR 164.502(b)). Gating full log PHI access behind an explicit, logged request is a direct implementation of that requirement.
The de-identification flag in the logging schema above is relevant here. If de-identification was applied, the prompt in the log contains tokens rather than PHI; a developer can debug the model interaction without seeing raw patient data. This is one operational argument for implementing de-identification even when a BAA is in place.
What happens if our logging fails during a request?
Decide in advance whether to allow or block requests when logging is unavailable, and document that decision. The right choice depends on your use case: for patient-facing features where clinical workflows depend on a timely response, blocking requests because logging is unavailable may cause more harm than completing the request with an alerting gap. For internal tooling with no immediate clinical dependency, blocking requests when logging is unavailable is the more conservative choice.
More consequential than the allow-or-block decision is what a logging gap means after the fact. A gap discovered during a security review or audit may trigger breach notification obligations under the HIPAA Breach Notification Rule (45 CFR Part 164 Subpart D). The Rule requires notification when unauthorized PHI access cannot be ruled out, and a missing log record cannot establish that PHI was not improperly accessed during that window. The four-factor risk assessment still applies, but without logs, the analysis defaults toward notification.
Whatever your allow-or-block policy, make it explicit and document it. An undocumented policy that defaults to allowing requests is not a defensible audit position. "We had a documented policy to allow requests when logging was unavailable for availability reasons, with active alerting on logging failures to detect and close gaps promptly" is.
Next steps
Logging provides the audit trail. De-identification reduces what's in that trail, limiting PHI exposure at the provider and reducing blast radius if your log storage is ever compromised. The two controls work together.
PHI de-identification as a security control: how to reduce what PHI reaches the model and ends up in your logs
API key management and scope isolation: the scope attribution that makes logs useful for investigation
Shadow AI in healthcare: what happens outside your sanctioned logging infrastructure