In a bank, you can't ship an AI workflow the way a startup ships a feature. The moment a system informs a decision, your model-risk-management (MRM) function treats it as a model, and the supervisory expectations in SR 11-7, the Federal Reserve's 2011 guidance on model risk management (adopted by the OCC as Bulletin 2011-12), apply. The good news: those expectations are a clear blueprint for doing it right.
SR 11-7 defines a model broadly: a quantitative method that processes inputs into estimates used for business decisions. A Claude-powered system that drafts a credit memo, dispositions an AML alert, or summarizes a filing for a decision falls squarely inside that definition. Treating it as anything less is the fastest way to get a launch blocked — or worse, to fail an exam after it's live.
That classification has immediate consequences: the system belongs in the model inventory, it gets a risk tier based on materiality and complexity, and the rigor of everything that follows — validation depth, monitoring frequency, documentation — scales with that tier. A drafting assistant whose every output is reviewed by an analyst will tier lower than an alert-triage system feeding decisions at volume. Getting the tier right early sets expectations for everyone downstream.
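LLMs are newer than the statistical models SR 11-7 was written for, but the questions are the same. Expect to provide the same kinds of evidence: validation results, vendor documentation, and a controlled change history.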
Validation under SR 11-7 rests on three elements, and each has a concrete LLM equivalent. Conceptual soundness becomes the documented case for the architecture: why retrieval-grounded generation fits this task, how prompts and tools are designed, what the known failure modes are, and why the controls address them. Outcomes analysis becomes benchmark evaluation: a golden set of real cases with agreed correct outcomes, error analysis by category rather than a single accuracy number, and comparison against the incumbent process — human baseline included. Ongoing monitoring becomes production telemetry: input-mix shift, grounding and citation rates, override and escalation frequency, and scheduled human review of sampled outputs.
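As an illustration of "error analysis by category rather than a single accuracy number," here is a minimal sketch. The `GoldenCase` fields, the category labels, and the outcome values are all hypothetical; a real golden set would use your own case taxonomy and agreed outcomes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    category: str   # hypothetical label, e.g. "structuring" or "sanctions"
    expected: str   # outcome agreed by reviewers
    predicted: str  # outcome produced by the system under test


def error_rates_by_category(cases):
    """Return {category: (errors, total, error_rate)} for a golden set."""
    totals = Counter(c.category for c in cases)
    errors = Counter(c.category for c in cases if c.predicted != c.expected)
    return {
        cat: (errors[cat], n, errors[cat] / n)
        for cat, n in sorted(totals.items())
    }


# Illustrative data only; not real cases or results.
cases = [
    GoldenCase("a1", "structuring", "escalate", "escalate"),
    GoldenCase("a2", "structuring", "escalate", "close"),
    GoldenCase("a3", "sanctions", "escalate", "escalate"),
    GoldenCase("a4", "sanctions", "close", "close"),
]
report = error_rates_by_category(cases)
for category, (bad, total, rate) in report.items():
    print(f"{category}: {bad}/{total} errors ({rate:.0%})")
```

A per-category breakdown like this shows a validator where the system fails, which a single blended accuracy figure would hide. The same report run on the incumbent human process gives the baseline comparison.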
Validators can't inspect a frontier model's internals — and don't need to. Behavioral validation against your documented intended use is the standard the framework actually asks for: evidence the system performs as claimed on your task, under your controls, with its limitations stated and tested.
Using a third-party model doesn't outsource the model risk; SR 11-7 is explicit that vendor models get the same discipline. In practice that means pinning to specific model versions rather than floating on "latest," validating behavior at the version you deploy, and re-running your evaluation suite before adopting an upgrade. It also means the contractual layer is part of the risk file: data-use commitments (your data isn't used for training), availability terms, and change-notification expectations all belong in the documentation your validators review.
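The upgrade discipline described above can be sketched as a simple gate. Everything here is an assumption made for illustration: the `EvalResult` fields, the model identifiers, the rule that an identifier containing "latest" is a floating alias, and the one-point regression tolerance. Your own evaluation suite and thresholds would replace them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model_version: str    # identifier of the version that was evaluated
    accuracy: float       # share of golden-set cases answered correctly
    grounded_rate: float  # share of outputs whose citations checked out


def upgrade_blockers(baseline, candidate, max_regression=0.01):
    """List the reasons a candidate version should not replace the baseline."""
    blockers = []
    # Assumed convention: floating aliases contain the word "latest".
    if "latest" in candidate.model_version:
        blockers.append("candidate is a floating alias, not a pinned version")
    if candidate.accuracy < baseline.accuracy - max_regression:
        blockers.append("accuracy regressed beyond tolerance")
    if candidate.grounded_rate < baseline.grounded_rate - max_regression:
        blockers.append("grounded rate regressed beyond tolerance")
    return blockers


# Hypothetical identifiers and scores, not real models or measurements.
baseline = EvalResult("example-model-2025-01-01", accuracy=0.94, grounded_rate=0.97)
candidate = EvalResult("example-model-2025-06-01", accuracy=0.95, grounded_rate=0.93)
print(upgrade_blockers(baseline, candidate))
```

Returning the list of reasons, rather than a bare pass or fail, gives the validator something to file with the upgrade decision.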
With an LLM system, most change doesn't come from retraining — it comes from edits to prompts, retrieval configuration, tools, and model versions. Treat each of those as a model change: proposed, tested against the regression benchmark, approved by someone other than the author, and recorded with its evaluation results. That one habit converts "we tweaked the prompt last Tuesday" — an examiner's red flag — into a clean, evidenced change history.
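One way to make that habit enforceable is a change record that refuses approval until the benchmark has passed and refuses self-approval. This is a minimal sketch: `ChangeRecord`, its fields, and the `CHANGE_KINDS` identifiers are hypothetical names chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional

# The four change sources named in the text; identifiers are illustrative.
CHANGE_KINDS = {"prompt", "retrieval_config", "tool", "model_version"}


@dataclass
class ChangeRecord:
    change_id: str
    kind: str
    author: str
    description: str
    benchmark_passed: Optional[bool] = None  # None until the benchmark has run
    approver: Optional[str] = None

    def __post_init__(self) -> None:
        if self.kind not in CHANGE_KINDS:
            raise ValueError(f"unknown change kind: {self.kind}")

    def record_benchmark(self, passed: bool) -> None:
        self.benchmark_passed = passed

    def approve(self, approver: str) -> None:
        if self.benchmark_passed is not True:
            raise ValueError("regression benchmark must pass before approval")
        if approver == self.author:
            raise PermissionError("author cannot approve their own change")
        self.approver = approver

    @property
    def deployable(self) -> bool:
        return self.benchmark_passed is True and self.approver is not None


change = ChangeRecord("CHG-001", "prompt", "alice", "Tighten citation instructions")
change.record_benchmark(passed=True)
try:
    change.approve("alice")  # self-approval is rejected
except PermissionError as exc:
    print(f"rejected: {exc}")
change.approve("bob")
print(change.deployable)
```

In practice the record would also carry the evaluation results themselves and be written to durable storage; the point of the sketch is only that the ordering and the separation of author and approver are checked by the system, not by convention.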
The lowest-risk, fastest-to-approve pattern is augmentation: Claude gathers context, drafts a recommendation with its rationale and citations, and a qualified person decides. That keeps a human firmly accountable, produces exactly the evidence MRM needs, and lets you expand the system's autonomy later — once it has a track record your second line of defense trusts.
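The augmentation pattern can be sketched as two records: the draft the system produces and the decision a person makes. All names here (`Draft`, `Decision`, `decide`, the case and reviewer identifiers) are hypothetical, and the rule that a draft without citations is sent back is an assumption added for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple


@dataclass(frozen=True)
class Draft:
    case_id: str
    recommendation: str
    rationale: str
    citations: Tuple[str, ...]  # sources the draft relies on


@dataclass(frozen=True)
class Decision:
    case_id: str
    decided_by: str        # the accountable human reviewer
    outcome: str
    overrode_draft: bool   # True when the reviewer disagreed with the draft
    decided_at: datetime


def decide(draft: Draft, reviewer: str, outcome: str) -> Decision:
    """Record a human decision on a system-generated draft."""
    # Assumed control: an uncited draft is not eligible for a decision.
    if not draft.citations:
        raise ValueError("draft has no citations; return it for rework")
    return Decision(
        case_id=draft.case_id,
        decided_by=reviewer,
        outcome=outcome,
        overrode_draft=(outcome != draft.recommendation),
        decided_at=datetime.now(timezone.utc),
    )


draft = Draft(
    "ALERT-42",
    "close",
    "Activity matches the documented payroll pattern.",
    ("kyc_profile.pdf#p3",),
)
decision = decide(draft, reviewer="analyst_17", outcome="escalate")
print(decision.decided_by, decision.outcome, decision.overrode_draft)
```

Because every decision names a reviewer and flags disagreement with the draft, the override frequency that monitoring needs falls out of the decision log directly.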
Build it this way and "the model-risk team won't approve it" stops being the reason your AI never ships.