Model Risk and AI: Deploying Claude under SR 11-7

In a bank, you can't ship an AI workflow the way a startup ships a feature. The moment a system informs a decision, your model-risk-management (MRM) function treats it as a model — and the supervisory expectations in SR 11-7 apply. The good news: those expectations are a clear blueprint for doing it right.

Yes, your LLM is a "model"

SR 11-7 defines a model broadly: a quantitative method, system, or approach that processes input data into estimates used for business decisions. A Claude-powered system that drafts a credit memo, dispositions an AML alert, or summarizes a filing to inform a decision falls squarely inside that definition. Treating it as anything less is the fastest way to get a launch blocked or, worse, to fail an exam after it's live.

That classification has immediate consequences: the system belongs in the model inventory, it gets a risk tier based on materiality and complexity, and the rigor of everything that follows — validation depth, monitoring frequency, documentation — scales with that tier. A drafting assistant whose every output is reviewed by an analyst will tier lower than an alert-triage system feeding decisions at volume. Getting the tier right early sets expectations for everyone downstream.

What model-risk teams will ask

LLMs are newer than the statistical models SR 11-7 was written for, but the questions are the same. Expect to provide:

Intended use and limitations. A documented statement of what the system is for, where it must not be used, and its known failure modes.
Development standards. How prompts, retrieval, and tools were built and tested, with data lineage you can trace.
Independent validation. Evaluation against a representative benchmark by someone other than the builder, with results you can reproduce.
Ongoing monitoring. Live quality and drift checks once it's in production, not a one-time sign-off.
Human oversight and change control. Clear accountability for outputs and a controlled process for changing prompts or models.

Translating SR 11-7's three validation elements

Validation under SR 11-7 rests on three elements, and each has a concrete LLM equivalent. Conceptual soundness becomes the documented case for the architecture: why retrieval-grounded generation fits this task, how prompts and tools are designed, what the known failure modes are, and why the controls address them. Outcomes analysis becomes benchmark evaluation: a golden set of real cases with agreed correct outcomes, error analysis by category rather than a single accuracy number, and comparison against the incumbent process — human baseline included. Ongoing monitoring becomes production telemetry: input-mix shift, grounding and citation rates, override and escalation frequency, and scheduled human review of sampled outputs.
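
To make "error analysis by category rather than a single accuracy number" concrete, here is a minimal Python sketch. It assumes each golden-set case already carries an agreed outcome, the system's outcome, and the incumbent process's outcome; the `GoldenCase` fields, the category labels, and the function name are illustrative, not part of SR 11-7 or any library.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One benchmark case. All field names are hypothetical."""
    case_id: str
    category: str   # error-analysis bucket chosen by the validator
    expected: str   # agreed correct outcome
    system: str     # outcome produced by the LLM system
    baseline: str   # outcome produced by the incumbent (human) process


def error_rates_by_category(cases):
    """Return per-category case counts and error rates for system and baseline."""
    totals = Counter()
    errors = defaultdict(Counter)
    for c in cases:
        totals[c.category] += 1
        # A bool adds as 0 or 1, so these count mismatches against the agreed outcome.
        errors[c.category]["system"] += c.system != c.expected
        errors[c.category]["baseline"] += c.baseline != c.expected
    return {
        cat: {
            "n": n,
            "system_error_rate": errors[cat]["system"] / n,
            "baseline_error_rate": errors[cat]["baseline"] / n,
        }
        for cat, n in totals.items()
    }


if __name__ == "__main__":
    # Toy data for illustration only; not real results.
    golden = [
        GoldenCase("c1", "thin-file", "approve", "approve", "approve"),
        GoldenCase("c2", "thin-file", "decline", "approve", "decline"),
        GoldenCase("c3", "restructured", "approve", "approve", "decline"),
    ]
    for category, stats in error_rates_by_category(golden).items():
        print(category, stats)
```

Reporting the baseline rate next to the system rate keeps the comparison against the incumbent process visible in the same table, which is usually what a validator wants to see first.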

Validators can't inspect a frontier model's internals — and don't need to. Behavioral validation against your documented intended use is the standard the framework actually asks for: evidence the system performs as claimed on your task, under your controls, with its limitations stated and tested.

Vendor models are still your models

Using a third-party model doesn't outsource the model risk; SR 11-7 is explicit that vendor models get the same discipline. In practice that means pinning to specific model versions rather than floating on "latest," validating behavior at the version you deploy, and re-running your evaluation suite before adopting an upgrade. It also means the contractual layer is part of the risk file: data-use commitments (for example, that your data isn't used for training), availability terms, and change-notification expectations all belong in the documentation your validators review.
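
One way to enforce version pinning is a pre-deployment check like the sketch below. It rests on two assumptions you should confirm against your vendor's documentation and your own inventory: that a pinned identifier ends in an eight-digit snapshot date, and that the inventory records the exact identifier that was validated. The model identifiers shown are placeholders, not real model names.

```python
import re

# Assumption: pinned identifiers end in a dated snapshot suffix such as "-20250101".
SNAPSHOT_SUFFIX = re.compile(r"-\d{8}$")


def pinning_problems(deployed_model_id: str, validated_model_id: str) -> list[str]:
    """Return the reasons a deployment config fails the pinning policy (empty if none)."""
    problems = []
    if not SNAPSHOT_SUFFIX.search(deployed_model_id):
        problems.append("model id is not pinned to a dated snapshot")
    if deployed_model_id != validated_model_id:
        problems.append("deployed model id differs from the validated one")
    return problems


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    print(pinning_problems("example-model-latest", "example-model-20250101"))
    print(pinning_problems("example-model-20250101", "example-model-20250101"))
```

Running a check like this in the deployment pipeline means an upgrade cannot go live until the inventory's validated identifier has been updated, which in turn requires the evaluation suite to have been re-run.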

Change control for a model you don't retrain

With an LLM system, most change doesn't come from retraining — it comes from edits to prompts, retrieval configuration, tools, and model versions. Treat each of those as a model change: proposed, tested against the regression benchmark, approved by someone other than the author, and recorded with its evaluation results. That one habit converts "we tweaked the prompt last Tuesday" — an examiner's red flag — into a clean, evidenced change history.
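
The proposed, tested, approved, recorded sequence can be captured in a small record type plus a release gate. The sketch below is illustrative: the `ChangeRecord` fields, the set of change kinds, and the pass-rate threshold are assumptions to adapt to your own change-management policy.

```python
from dataclasses import dataclass
from typing import Optional

# The four sources of change named above; extend to match your own system.
CHANGE_KINDS = {"prompt", "retrieval", "tool", "model_version"}


@dataclass(frozen=True)
class ChangeRecord:
    """One proposed change. All field names are hypothetical."""
    change_id: str
    kind: str
    author: str
    approver: Optional[str]                 # None until someone signs off
    regression_pass_rate: Optional[float]   # None until the benchmark has been run


def release_blockers(change: ChangeRecord, min_pass_rate: float) -> list[str]:
    """Return the reasons this change may not be released (empty if none)."""
    blockers = []
    if change.kind not in CHANGE_KINDS:
        blockers.append(f"unknown change kind: {change.kind}")
    if change.regression_pass_rate is None:
        blockers.append("no regression benchmark results recorded")
    elif change.regression_pass_rate < min_pass_rate:
        blockers.append("regression pass rate below threshold")
    if change.approver is None:
        blockers.append("change has not been approved")
    elif change.approver == change.author:
        blockers.append("approver must be someone other than the author")
    return blockers


if __name__ == "__main__":
    tweak = ChangeRecord("chg-001", "prompt", "alice", "alice", 0.97)
    print(release_blockers(tweak, min_pass_rate=0.95))
```

Persisting each record together with its blocker list at release time gives you the change log as a by-product of the gate, rather than as a document someone has to reconstruct later.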

What the exam file should contain

Inventory entry and tier. The system, its owner, its materiality rationale.
Intended-use statement. Scope, exclusions, and documented limitations.
Validation report. Conceptual soundness, benchmark results, reproducible method.
Monitoring evidence. Dashboards, thresholds, and the review cadence actually happening.
Change log. Every prompt, retrieval, and version change with pre-deployment test results.
Oversight records. Who approved what, and the sampling reviews that back delegated autonomy.

Start with assist, not autonomy

The lowest-risk, fastest-to-approve pattern is augmentation: Claude gathers context, drafts a recommendation with its rationale and citations, and a qualified person decides. That keeps a human firmly accountable, produces exactly the evidence MRM needs, and lets you expand the system's autonomy later — once it has a track record your second line of defense trusts.
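
In code, the augmentation pattern amounts to keeping the draft and the decision as separate records, with the decision always attributed to a named person. The sketch below assumes the drafting step has already produced a recommendation; every type, field, and function name in it is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Recommendation:
    """The system's draft. All field names are hypothetical."""
    case_id: str
    proposed_outcome: str
    rationale: str
    citations: tuple  # source references backing the rationale


@dataclass(frozen=True)
class Decision:
    """The accountable human decision, stored separately from the draft."""
    case_id: str
    outcome: str
    decided_by: str
    overrode_draft: bool
    decided_at: datetime


def record_decision(draft: Recommendation, reviewer: str, outcome: str) -> Decision:
    """Record a reviewer's decision on a draft; the draft alone never decides."""
    if not reviewer.strip():
        raise ValueError("a named reviewer is required")
    if not draft.citations:
        raise ValueError("draft has no citations; send it back rather than decide on it")
    return Decision(
        case_id=draft.case_id,
        outcome=outcome,
        decided_by=reviewer,
        overrode_draft=(outcome != draft.proposed_outcome),
        decided_at=datetime.now(timezone.utc),
    )
```

The `overrode_draft` flag is what makes a track record measurable: aggregated over time, it shows how often reviewers disagree with the system, which is the kind of evidence a later case for expanded autonomy would rest on.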

Build it this way and "the model-risk team won't approve it" stops being the reason your AI never ships.
