NIST AI RMF: Govern, Map, Measure, Manage in Practice
NIST’s AI Risk Management Framework is easier to apply when you treat it as four operational questions: who owns the model, what can go wrong, how do you prove it’s behaving, and how do you respond when it doesn’t? For a deployed LLM, "Measure" means more than accuracy—it means tracking jailbreak success rates, hallucination frequency, policy violations, latency, drift, and abuse signals against real production traffic.
CVE-2024-3094 sat in XZ Utils for weeks before Andres Freund noticed SSH on his Debian box was suddenly taking an extra 500 milliseconds. That’s the kind of boring, operational signal security people actually catch; it’s also the kind of signal most AI teams ignore when they say they are “monitoring” an LLM in production. NIST’s AI Risk Management Framework is useful precisely because it forces you to stop hand-waving and answer four questions that map cleanly to deployed models: who owns this thing, what can it break, how do you prove it is behaving, and what do you do when it isn’t?
Govern: Put a human name on the model, not a steering committee
If no one can tell you who can approve a prompt template change at 2 a.m., you do not have governance; you have a slide deck. In practice, “Govern” should assign a named model owner, a security reviewer, and an incident lead with authority to disable the model, rotate keys, or revert a system prompt without waiting for a quarterly committee meeting. That matters because the failure mode for LLMs is usually not a dramatic exploit; it is slow, procedural drift where product, legal, and ML each assume someone else is watching the blast radius.
The useful part of governance is not policy prose. It is making sure the model registry includes the exact base model, fine-tune version, system prompt hash, retrieval sources, and tool permissions. If you cannot reconstruct which version of GPT-4o, Claude, or Llama 3 was exposed to which customers on which day, your postmortem will be fiction. And yes, that includes vendor-managed models; “we use Azure OpenAI” is not an ownership model.
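A registry entry along these lines is easy to sketch. This is a minimal illustration, not a standard schema: the field names, the example model tag, and the owner are assumptions, but the core idea is real, hash the system prompt so you can prove exactly which prompt was live on which day.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRegistryEntry:
    # Illustrative fields; adapt to whatever registry you actually run.
    base_model: str             # e.g. a pinned vendor model tag, never "latest"
    finetune_version: str       # internal fine-tune / adapter tag
    system_prompt_sha256: str   # fingerprint of the prompt, not the prompt itself
    retrieval_sources: tuple    # corpora the model can read from
    tool_permissions: tuple     # tools the model is allowed to invoke
    owner: str                  # a named human, not a steering committee


def hash_system_prompt(prompt: str) -> str:
    """Stable fingerprint so a postmortem can tie a transcript to a prompt version."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


entry = ModelRegistryEntry(
    base_model="gpt-4o-2024-08-06",            # hypothetical pinned version
    finetune_version="support-bot-v7",         # hypothetical internal tag
    system_prompt_sha256=hash_system_prompt("You are a support assistant..."),
    retrieval_sources=("kb-public", "tickets-redacted"),
    tool_permissions=("jira:read",),
    owner="jane.doe",
)
```

The frozen dataclass is deliberate: a registry entry should be an immutable record of what was deployed, with changes landing as new entries rather than in-place edits.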
Map: Trace the model’s attack surface before it ships to Slack and Jira
“Map” is where most teams get lazy and write down “users interact with the chatbot.” That is not mapping. A deployed LLM usually has at least four attack surfaces: the prompt channel, the retrieval layer, the tool layer, and the output channel. Each one has different abuse paths. Prompt injection is obvious in chat, but the nastier cases show up when the model can call Jira, ServiceNow, GitHub, or a payment API and turn a poisoned document into an action.
This is where people should stop pretending all risk is jailbreaks. A model that summarizes internal tickets can leak secrets through retrieval. A model that drafts code can propagate insecure defaults into a repo. A model that answers customer support questions can be manipulated into policy violations that look like “helpfulness” until Legal reads the transcript. If you are using OpenAI, Anthropic, or an on-prem Llama stack behind LangChain, map the permissions exactly the way you would map an AWS role: read, write, execute, and exfiltrate. Anything less is cosplay.
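The AWS-role analogy can be made concrete with a connector inventory. The connector names and verb assignments below are hypothetical; the technique is simply forcing every integration into the same four verbs and flagging anything that can act or leak.

```python
# Hypothetical connector inventory. Each integration gets classified with the
# same four verbs you would use for an AWS role: read, write, execute, exfiltrate.
CONNECTORS = {
    "jira":       {"read", "write"},       # can create and modify tickets
    "github":     {"read", "write"},       # can push suggested patches
    "payments":   {"execute"},             # can trigger refunds
    "sharepoint": {"read", "exfiltrate"},  # broad corpus doubles as a leak channel
    "kb-public":  {"read"},                # public docs only; low blast radius
}


def high_risk(connectors: dict[str, set[str]]) -> list[str]:
    """Anything that can act (write/execute) or leak (exfiltrate) needs review."""
    risky = {"write", "execute", "exfiltrate"}
    return sorted(name for name, verbs in connectors.items() if verbs & risky)
```

Run against the sample inventory, everything except the public knowledge base gets flagged, which is the point: the review queue should be driven by permissions, not by which connector feels scary.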
A contrarian point: the model itself is often not the highest-risk component. The glue code is. I have seen more damage from sloppy tool routing, weak allowlists, and overbroad retrieval than from any exotic jailbreak. Security teams love to test prompt injection because it feels novel; the real breakage is usually a forgotten connector with access to far too much.
Measure: Accuracy is the least interesting metric in production
If your AI dashboard only shows “accuracy,” you are measuring the wrong thing. For a deployed LLM, “Measure” should include jailbreak success rate, hallucination frequency on known-answer prompts, policy violation rate, refusal rate, tool-call error rate, latency p95, retrieval hit quality, and drift in both inputs and outputs. You also want abuse signals: repeated prompt variants, token spikes, long-context stuffing, and attempts to elicit system prompts or hidden policies.
Use real traffic, not a lab benchmark that looks pretty in a deck. A model that scores well on a static eval can still fall apart when users paste in malformed JSON, multilingual prompts, or 40-page PDFs full of irrelevant junk. The point is to measure behavior under the same mess your production system sees at 4 p.m. on a Friday. If you are not sampling live prompts and outputs, tagging them by risk class, and comparing them against a baseline, you are not measuring—you are guessing.
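The sample-tag-compare loop can be sketched in a few lines. The risk-tag names and baseline thresholds below are assumptions, not a standard taxonomy; the shape of the check, live rates measured against an agreed baseline, is what matters.

```python
from collections import Counter

# Hypothetical tagged sample of production interactions; each record carries
# whatever risk tags your triage pipeline assigned to it.
SAMPLE = [
    {"tags": set()},
    {"tags": {"jailbreak_success"}},
    {"tags": {"hallucination"}},
    {"tags": set()},
    {"tags": {"policy_violation", "jailbreak_success"}},
]

# Agreed baseline rates (illustrative numbers, set these from your own history).
BASELINE = {"jailbreak_success": 0.05, "hallucination": 0.10, "policy_violation": 0.02}


def rates(sample: list[dict]) -> dict[str, float]:
    """Observed rate per risk tag over the sampled traffic."""
    counts = Counter(tag for rec in sample for tag in rec["tags"])
    return {tag: counts[tag] / len(sample) for tag in BASELINE}


def regressions(sample: list[dict]) -> list[str]:
    """Tags whose live rate now exceeds the baseline: these page someone."""
    return sorted(tag for tag, rate in rates(sample).items() if rate > BASELINE[tag])
```

In the toy sample, all three tags regress, which is exactly the output you want feeding the "throttle, quarantine, or shut it off" decision described below.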
This is also where vendor claims go to die. “Guardrails” are not a metric. “Safe by design” is not a metric. If a prompt-injection suite like Garak or Microsoft PyRIT, or your own red-team scripts, can get the model to reveal hidden instructions 12% of the time, that is a number. If your moderation layer blocks 98% of obvious abuse but misses indirect prompt injection through retrieved documents, that is a number too. Numbers are useful because they let you decide whether to throttle, quarantine, or shut the thing off.
Manage: Build response playbooks before the model starts freelancing
“Manage” is the part everyone claims they’ll do “once we have enough data.” That is backwards. You need response playbooks before launch: revoke tool access, disable retrieval, swap to a safer model, force read-only mode, or fall back to a human queue. If the model can create tickets in Jira or trigger actions in ServiceNow, your containment plan should include a way to cut those permissions without taking the whole product offline.
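A pre-approved playbook is essentially a lookup table from trigger to containment action, written down before launch. The trigger names below are assumptions (your detection pipeline defines its own), but the fail-safe default, unknown condition goes to a human, is the load-bearing design choice.

```python
from enum import Enum, auto


class Containment(Enum):
    REVOKE_TOOLS = auto()       # cut tool permissions, keep chat alive
    DISABLE_RETRIEVAL = auto()  # sever the retrieval layer
    SWAP_MODEL = auto()         # fall back to a safer, dumber model
    READ_ONLY = auto()          # answers only, no actions
    HUMAN_QUEUE = auto()        # route affected workflows to people


# Illustrative mapping from detected condition to pre-approved actions.
PLAYBOOK = {
    "tool_abuse":       [Containment.REVOKE_TOOLS, Containment.READ_ONLY],
    "retrieval_leak":   [Containment.DISABLE_RETRIEVAL],
    "model_regression": [Containment.SWAP_MODEL, Containment.HUMAN_QUEUE],
}


def respond(trigger: str) -> list[Containment]:
    """Unknown triggers fail safe: everything goes to the human queue."""
    return PLAYBOOK.get(trigger, [Containment.HUMAN_QUEUE])
```

The actions map directly onto permission flips (revoke tool scopes, drop retrieval sources), which is what lets you contain an incident without taking the whole product offline.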
The response plan should also cover abuse patterns that are not technically “security incidents” until they become expensive. For example: a customer discovers that repeated prompt variation can bypass a policy filter; a support bot starts hallucinating refund promises; a coding assistant begins suggesting vulnerable patterns copied from training data; or an internal assistant leaks snippets from a SharePoint index it should never have touched. These are not theoretical. They are the sort of incidents that get buried under “product quality” until someone from audit asks why the transcript says otherwise.
The standard advice says to “continuously improve” the model. Fine. But sometimes the right move is to make it dumber. Strip tool access, narrow the retrieval corpus, reduce context length, or move high-risk workflows back to deterministic software. A brittle but bounded system is easier to defend than a clever one with a wide-open mouth and admin rights.
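"Make it dumber" can be a single deterministic function that derives a bounded config from the full one by removing capabilities. The config field names and limits here are illustrative, assuming a deployment config shaped roughly like this.

```python
# Hypothetical full-capability deployment config.
FULL = {
    "tools": ["jira:write", "payments:execute"],
    "retrieval_corpora": ["all-sharepoint", "kb-public"],
    "max_context_tokens": 128_000,
}


def degrade(config: dict) -> dict:
    """Derive a bounded config by stripping capability, not adding cleverness."""
    bounded = dict(config)
    bounded["tools"] = []                         # strip all tool access
    bounded["retrieval_corpora"] = ["kb-public"]  # narrow to the safest corpus
    bounded["max_context_tokens"] = min(config["max_context_tokens"], 8_000)
    return bounded
```

Keeping the degraded config as a pure function of the full one means the bounded mode is always deployable and always in sync, rather than a stale hand-edited copy that drifts until the day you need it.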
The Bottom Line
Treat NIST AI RMF as an operating model, not a compliance poster. Assign one owner per model, inventory every tool and retrieval path, and measure live jailbreaks, hallucinations, policy violations, latency, and drift against production traffic. If you cannot disable tool use or retrieval in minutes, you do not have an incident response plan; you have a hope.