AI Vendor Compliance Checklist: SLAs, Audit Logging & Security

AI Vendor Compliance Checklist: SLAs, Audit Logging & Security Evaluating an AI vendor with a standard SaaS security questionnaire is like auditing a nuclear plant with a fire extinguisher inspection form. The instruments are wrong for the hazard.

AI systems make autonomous decisions, interact with sensitive data at inference time, and introduce attack surfaces — prompt injection, data leakage, retrieval poisoning — that traditional vendor reviews were never designed to catch. According to the IBM Cost of a Data Breach Report 2025, 13% of organizations experienced attacks that specifically targeted AI models or applications, and the average breach cost remains $4.44M. Those numbers shift the calculus on vendor due diligence considerably.

This checklist covers three specific domains — SLAs, audit logging, and security controls — with pass/fail criteria that security teams, procurement leads, and compliance officers can apply before signing any AI vendor contract.

Key Takeaways

AI vendor SLAs must cover inference pipeline performance, agent task latency, and model deprecation notice — not just API uptime
Audit logs must be decision-level and replayable, with field parity across inference and tool/agent invocations
Certifications are a baseline — require runtime threat detection for AI-specific attack vectors with vendor-disclosed catch rates
Any vendor accessing PHI requires a BAA that explicitly addresses AI-specific risks, not generic HIPAA template language
Vague demo-only disclosures, force majeure clauses covering model provider outages, and scope-limited certifications are automatic red flags

Before You Begin: Prerequisites for a Meaningful Vendor Review

Don't schedule a demo until you have documents in hand. Vendors who resist pre-contract disclosure aren't protecting proprietary information — they're revealing their control posture.

Request these before any evaluation meeting:

SOC 2 Type II report with explicit scope clarification (which deployment modes are covered?)
Sample audit log entries for both LLM inference events and tool/agent invocations
SLA documentation including penalty structures and credit caps
Data residency disclosures, including failover behavior
Completed security questionnaire addressing AI-specific risks

Know your internal context before you start:

Checklist requirements escalate significantly depending on your environment:

Which regulations apply — HIPAA, GDPR, EU AI Act, SEC, NIST AI RMF each carry different logging and control obligations
What data the AI vendor touches — PHI, PII, financial transaction data, and proprietary IP each raise the stakes
How the system operates — an autonomous trading agent with API execution rights demands far deeper scrutiny than a document summarization tool

A procurement team evaluating simple summarization needs a different checklist depth than one evaluating an agentic system with real-world tool calls. The sections below are calibrated for both.

SLA Requirements: What to Demand in Writing

Most AI vendor SLAs are standard SaaS templates repurposed with a "model" find-and-replace — measuring API endpoint availability and nothing else. A vendor can report 99.9% uptime while the underlying inference pipeline is degraded, returning slow, inconsistent, or incorrect outputs that still count as "available" under a generic uptime SLA.

Before negotiating specifics, anchor to the math: 99.9% uptime means approximately 8.7 hours of permitted annual downtime; 99.95% cuts that to roughly 4.4 hours. Set your threshold based on deployment criticality, not the vendor's default offering.

Uptime, Latency, and Agentic Performance

AI-specific SLA targets must cover:

Uptime commitments for the full inference pipeline, separate from the customer-facing portal
p95 inference latency per request type (not average latency, which masks outliers)
Agent task completion time for multi-step agentic workflows
Tool call response time for autonomous agent actions
API rate limit definitions tied explicitly to the uptime SLA — rate limiting during peak load is a service reduction and must not hide in a separate terms section

On model versioning: LLM providers update models without notice in ways that silently break downstream workflows. Require a minimum 90-day deprecation notice for production models, and negotiate for a version lock option where available.

OpenAI commits to at least 6 months notice for generally available models, while Anthropic provides at least 60 days for publicly released models. Google blocks new access one month before retirement. Use these as your minimum benchmarks — and require that any shorter emergency notice periods are contractually defined and limited to genuine security or compliance events.

Model deprecation notice comparison across OpenAI Anthropic and Google AI platforms

Penalty Structures and Termination Rights

Metric commitments only matter if non-compliance carries real consequences. A credible penalty structure requires tiered credits, an escalation mechanism, and a clear exit path.

What to require:

Tiered service credits tied to specific metric shortfalls (example: 5% credit per 0.1% uptime shortfall below threshold, 10% if response-time SLA is missed by more than 50%)
Escalating penalties for repeated failures within a rolling 90-day window
Termination-for-cause rights after repeated SLA breaches — without this, credits are the vendor's only consequence

Immediate red flags in SLA language:

Credit caps set at 10–20% of monthly fees with no termination rights
"Best effort" language with no financial consequence attached
Force majeure clauses that cover the vendor's own upstream model provider outages (this is non-negotiable — if their model provider goes down, that's their supply chain problem, not an act of God)
Automated ticket confirmations accepted as satisfying the SLA "response" requirement

Audit Logging Requirements: What Evidence-Grade Looks Like

Traditional application logs tell you that an API call happened. AI audit logs need to tell you what the model decided, why, what action it triggered, and whether that decision could be reconstructed six months later during a regulatory review.

That's the difference between event-level logging and decision-level logging. In regulated environments and agentic deployments, event-level logs won't survive audit scrutiny.

What AI Audit Logs Must Capture

Required fields for a compliant LLM inference log entry:

Field	Purpose
Timestamp	Regulatory traceability
Request ID / Trace ID	Incident reconstruction
User/agent identity	Attribution for agentic actions
Application and environment	Scope isolation
Model provider and version	Reproducibility
Input/output token counts	Cost and anomaly detection
Guardrail rules evaluated	Policy enforcement evidence
Policy decision (allow/restrict/deny)	Compliance artifact
Latency	Performance SLA verification

Field parity between LLM inference events and tool/agent invocations is required. If a vendor can answer "who called this model" but not "who invoked this tool call," the audit trail has a compliance gap that will surface during regulatory review or incident forensics.

PromptHalo's audit logs, for example, capture every decision alongside its reason, the acting agent or passport identity, session and tenant context, and timestamp. That coverage extends across inference, tool calls, and agent-to-agent handoffs within the same append-only log.

Replayability requirements:

Logs must be queryable and structured — not just stored
Exportable in SIEM-compatible formats (JSON, CEF, Syslog)
Sufficient to reconstruct the full decision sequence of an agentic workflow for post-incident response or regulatory reporting

AI audit log compliance requirements replayability and SIEM export specifications checklist

Framework Mapping and Tamper-Evidence

Audit logs that exist but can't be referenced to a recognized framework are compliance documentation that won't hold up under audit.

Require framework mapping to at least one of:

OWASP LLM Top 10 (2025) — specifically LLM01 Prompt Injection, LLM02 Sensitive Information Disclosure, LLM03 Supply Chain Vulnerabilities, LLM05 Improper Output Handling
NIST AI RMF — particularly MEASURE 2.4 (production monitoring) and MEASURE 2.8 (transparency and accountability documentation)
EU AI Act Article 12 — which mandates automatic event recording for high-risk AI systems, with a minimum 6-month retention period

Framework mapping establishes what your logs prove. Tamper-evidence determines whether regulators will trust the logs themselves.

Tamper-evidence requirements:

Logs must be immutable once written — append-only architecture with integrity verification
Cryptographic hashing or equivalent confirmation that no post-hoc modification occurred
Customer-configurable retention periods per log type (auth events, inference logs, tool invocations, admin actions)
Clear documentation of what happens to logs at retention expiry

Security Controls: Runtime and Certification Checklist

Security Certifications and Deployment Architecture

Baseline certifications to require:

SOC 2 Type II — verify scope explicitly. A certificate covering the vendor's marketing portal doesn't cover your agentic deployment
ISO 27001 — useful for information security management, but not AI-specific
ISO/IEC 42001 — the AI management system standard; increasingly relevant for AI-specific governance
HITRUST AI Security Certification — for healthcare deployments; available as a standalone or paired assessment
CSA AI Controls Matrix coverage — vendor-neutral, aligned with NIST AI RMF and ISO 42001

Note what certifications don't cover: SOC 2 and ISO 27001 have no native criteria for prompt injection, jailbreaks, or retrieval poisoning. Treat them as a starting point, not a comprehensive signal.

Deployment architecture transparency:

For regulated environments — fintech, healthcare, government — shared multi-tenant SaaS with vendor-managed log aggregation is typically insufficient. Require clarity on:

Whether deployment is shared SaaS, VPC-isolated, or split-plane (data plane in customer environment, metadata only crossing to vendor control plane)
Whether prompt content, model responses, and audit logs leave your cloud account
How the architecture holds during failover or maintenance windows

Runtime Threat Detection and AI-Specific Attack Coverage

Traditional certifications don't address the AI attack surface. This is where many vendor evaluations fall short.

AI-native threats requiring runtime coverage:

Prompt injection — adversarial inputs that alter model behavior or bypass instructions
Jailbreaks — attempts to push models outside their intended behavior constraints
Data leakage — sensitive information surfaced through model outputs
Retrieval poisoning — poisoned content in RAG pipelines carrying hidden instructions
Out-of-scope tool and API calls — autonomous agents invoking capabilities beyond their intended authority

Five AI-native attack vectors requiring runtime security coverage prompt injection to tool abuse

Critical enforcement question: Does enforcement happen inline before the model executes (pre-call), or only after (post-call logging)? With post-call detection, the attack has already executed by the time your system responds.

Pre-call enforcement — deciding allow, restrict, challenge, deny, or monitor before the model runs — is the architectural distinction that separates runtime security from after-the-fact logging. PromptHalo operates inline on every inference, tool call, and agent-to-agent handoff, enforcing each decision in under 100ms before execution.

When evaluating vendors, ask for specifics on both methodology and measured performance:

Detection methodology (ML-based vs. rule-based vs. hybrid)
Published catch rate and false positive rate benchmarks

PromptHalo's ML-based detection operates at above 95% catch rate with under 5% false positives, compared to roughly 35% catch rates and 15–20% false positives typical of rule-based approaches. In production environments handling sensitive transactions or regulated data, that difference translates directly to breach exposure.

Data Handling, PHI Access, and Subcontractor Disclosure

Encryption and key management:

AES-256 at rest; TLS 1.2+ in transit (minimum)
Clarify whether encryption keys are customer-controlled or vendor-managed — the distinction is critical for regulated data
Confirm data residency: where data is physically stored, under which jurisdiction, and whether geographic restrictions hold during failover

PHI and BAA requirements:

Under HIPAA, any vendor that creates, receives, maintains, or transmits ePHI — including encrypted ePHI they cannot read — is a business associate. A BAA is mandatory before sharing PHI.

Standard BAA templates don't address AI-specific risks. Push for explicit contractual language covering:

Whether the vendor can use PHI to train global models
Whether AI hallucinations that expose PHI qualify as reportable incidents
Breach notification timelines — HIPAA's 60-day outer limit is a ceiling, not a target; push for 24–72 hours contractually

Subcontractor disclosure:

AI vendors rely on cloud hosting providers, model providers, and data processing partners. Before signing, ask for a complete subprocessor list with data access descriptions for each. Fourth-party relationships carry the same compliance obligations: a gap in your vendor's supply chain is a gap in your compliance posture.

How to Score Vendor Responses and Spot Red Flags

Pass/Fail Scoring Framework

Score vendors on two conditions before the first demo:

Passes: Written, evidence-backed responses to all three domains (SLAs, audit logging, security) provided proactively
Fails: First meaningful disclosure arrives in a slide deck — any demo-only capability is unverified until it appears in writing

Red Flag Contract Clauses — AI-Specific

Force majeure covering the vendor's own upstream model provider outages
Automated ticket confirmation counted as SLA "response"
Model deprecation clauses with no minimum notice period
BAA templates that don't address AI hallucinations or training data usage
Service credit caps below 20% of monthly fees with no termination rights
Certifications whose scope covers a deployment mode different from the one you're buying

According to Stanford Law School's 2025 analysis, 92% of AI vendors claim broad data usage rights, while only 17% commit to full regulatory compliance. That gap is where your exposure lives — and where contract review pays for itself.

AI vendor compliance gap showing 92 percent data usage claims versus 17 percent regulatory commitment

Green Flag Responses

Published SLA targets with financial penalties accepted in writing, pre-contract
Independently verifiable audit logs with documented export capabilities
Runtime enforcement with disclosed catch rates and false positive benchmarks
A BAA that explicitly addresses AI-specific PHI risks
A subcontractor list that covers fourth-party data access without requiring a separate discovery request

Frequently Asked Questions

What is an SLA in AI?

In AI, a Service Level Agreement covers inference pipeline availability, model performance, agent task latency, and model version stability — not just API uptime. The financial penalties and termination rights tied to those commitments are what make an SLA enforceable rather than aspirational.

What role do audit logs play in AI governance?

AI audit logs create the evidence trail regulators, security teams, and incident responders need to verify what a model decided, why, and what action it triggered. Decision-level logs mapped to OWASP LLM Top 10 and NIST AI RMF convert logging from a monitoring tool into a compliance artifact that holds up under regulatory scrutiny.

What agreement is required for AI vendors who access PHI?

A Business Associate Agreement (BAA) is mandatory under HIPAA before sharing PHI with any AI vendor. Standard templates must be extended to cover AI-specific risks: training data usage, hallucination-triggered disclosures, and breach notification timelines shorter than HIPAA's 60-day maximum.

What security certifications should an AI vendor hold?

SOC 2 Type II and ISO 27001 are the baseline — confirm the certification scope covers your actual deployment mode, not just the vendor's general infrastructure. Also evaluate coverage against OWASP LLM Top 10 attack categories and consider ISO/IEC 42001 for AI-specific risk that traditional certifications don't address.

How do you evaluate AI vendor SLA penalties?

A credible penalty structure includes tiered service credits tied to specific metric shortfalls, escalating credits for repeated failures within a rolling window, and termination-for-cause rights. An SLA with no financial consequences is a marketing document — verify that penalties are binding before signing.

What are the biggest red flags in an AI vendor compliance review?

The top three: certifications scoped to a deployment mode you're not buying, force majeure clauses that cover the vendor's own model provider outages, and audit logs stored in vendor-managed infrastructure with no customer export rights. Any one of these should pause the evaluation until resolved in writing.