
Published: June 3, 2026
The FDA's February 3, 2026 premarket cybersecurity guidance expects fuzz testing as part of the security testing required under Section 524B, scoped to the device's threat model and architecture views, with documented methodology, coverage, findings, and tester qualifications. Generic fuzzers (raw AFL++ against a TCP port, off-the-shelf BLE fuzzers, internet-of-things scanners) almost never satisfy that bar on a medical device because the protocols are stateful, framed, checksummed, and negotiated. A defensible fuzz program builds a harness per protocol: HL7 v2 over MLLP, DICOM with association negotiation, BLE ATT/GATT under MTU constraints, MQTT with TLS and persistent sessions, CoAP with block-wise transfer, and proprietary binary protocols reconstructed from pcaps and firmware. AI accelerates grammar synthesis, seed corpus expansion, and crash triage, but it does not replace the human-authored stateful orchestration, the hardware-in-the-loop setup, or the named tester signing the report.
Most fuzz testing in medical-device submissions is one of two things: a port scanner with -fuzz in the name, or an off-the-shelf BLE tool run for an afternoon. Neither survives a Section 524B review when the reviewer asks "what protocol states did you cover, and where is the coverage signal?"
Key takeaways
- Fuzz testing is required under the Feb 2026 guidance, scoped to the threat model and documented to the same evidence bar as the rest of security testing.
- Generic fuzzers fail on medical protocols because of statefulness, framing, checksums, association negotiation, MTU constraints, and block-wise transfer.
- The defensible model is one harness per protocol, with explicit grammar sources, seed corpora, transport setup, and coverage signal.
- AI helps with grammar synthesis from pcaps and specs, seed expansion, and crash deduplication. It does not handle stateful orchestration or hardware-in-the-loop reliably.
- Reviewers want harness source, seed corpus stats, coverage reports, crash logs, and fixed-vs-accepted disposition tied back to the threat model and security risk file.
- The deliverable belongs in your SPDF and feeds your CAPA and VEX workflows after launch.
What the FDA actually expects for fuzz testing evidence
Section 524B and the Feb 3, 2026 final guidance, "Cybersecurity in Medical Devices: Quality System Considerations and Content of Premarket Submissions," pull fuzz testing into the same security-testing envelope as vulnerability and penetration testing. In practice, reviewers look for:
- Scope tied to the threat model and architecture views - which interfaces, which protocols, which trust boundaries.
- Methodology - the fuzzing approach (mutation, generation, coverage-guided, protocol-aware), the harness, the seed corpus, and the runtime environment.
- Duration and intensity - executions, total runtime, throughput, coverage achieved.
- Findings and disposition - crashes, hangs, undefined behavior, with each one tracked to a fix, a compensating control, or a documented accepted risk.
- Tester independence and qualifications - the same bar as the rest of the pen test report, with a named, qualified signatory.
"We ran AFL++ for an hour on the management port" is not a methodology. Neither is "the BLE scanner found no issues." Reviewers want to see a harness per protocol, with a defensible reason it exercises the threat-model paths.
Why generic fuzzers fail on medical protocols
Off-the-shelf fuzzers were designed for parsers and file formats. Medical devices speak negotiated, stateful, framed protocols over transports that punish naive injection.
- Stateful handshakes. DICOM requires an Association establishment (A-ASSOCIATE-RQ/AC, presentation context negotiation) before any P-DATA flows. HL7 v2 over MLLP has start/end framing and an ACK/NAK loop. MQTT requires CONNECT/CONNACK before anything else and keeps a persistent session. A stateless mutator never gets past the handshake.
- Checksums and length prefixes. Proprietary binary protocols, BLE GATT extensions, and many serial protocols include CRCs, length prefixes, and sequence numbers. Bit-flipping the payload invalidates the framing and the device drops the frame before the parser ever sees it.
- MTU and fragmentation. BLE ATT defaults to a 23-byte MTU; anything larger has to be negotiated per connection, and attribute values are capped at 512 bytes. CoAP uses block-wise transfer (Block1/Block2) for any payload larger than a single datagram. Fuzz frames that ignore these boundaries are truncated or rejected before they reach the parser.
- Association and presentation contexts. DICOM requires the right SOP Class UID and transfer syntax in the negotiated presentation context. Send a C-STORE for a context that was not accepted and the device discards it - no parser ever runs.
- TLS and authentication wrappers. MQTT and many HL7 v2 deployments are wrapped in TLS with mutual auth. Without a working TLS termination in the harness, the fuzzer never reaches the protocol logic.
- Hardware-in-the-loop reality. BLE GATT on a real device requires a controlled radio, a paired and bonded central, and the device in the right power and connection state. CAN/CANopen fuzzing requires bus access and termination. None of this is push-a-button.
A harness has to handle the wrapper so the fuzzer can spend its budget on the part of the protocol that actually contains bugs.
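As one illustration of "handling the wrapper," the sketch below frames an HL7 v2 message in MLLP (start block 0x0B, end block 0x1C followed by 0x0D) and mutates a single field while leaving segment and field separators intact. The seed message and the `mutate_field` strategy are illustrative assumptions, not a recommended corpus or mutator; a real harness would also send the frame and read the ACK/NAK.

```python
import random

# MLLP framing bytes: start block, then end block followed by carriage return.
START_BLOCK = b"\x0b"
END_BLOCK = b"\x1c\x0d"


def mllp_wrap(hl7_message: bytes) -> bytes:
    """Wrap an HL7 v2 message in MLLP framing."""
    return START_BLOCK + hl7_message + END_BLOCK


def mllp_unwrap(frame: bytes) -> bytes:
    """Strip MLLP framing, rejecting incomplete frames."""
    if not (frame.startswith(START_BLOCK) and frame.endswith(END_BLOCK)):
        raise ValueError("incomplete MLLP frame")
    return frame[len(START_BLOCK):-len(END_BLOCK)]


def mutate_field(hl7_message: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen field with a run of 'A' bytes.

    Segments are split on CR and fields on '|'. The segment identifier
    (index 0 after splitting) is left alone so the message still reaches
    the segment parser.
    """
    segments = hl7_message.split(b"\r")
    seg_idx = rng.randrange(len(segments))
    fields = segments[seg_idx].split(b"|")
    if len(fields) > 1:
        field_idx = rng.randrange(1, len(fields))
        fields[field_idx] = b"A" * rng.randint(1, 300)
    segments[seg_idx] = b"|".join(fields)
    return b"\r".join(segments)


# Hypothetical seed message, for illustration only.
SEED = (
    b"MSH|^~\\&|SENDER|FAC|RECEIVER|FAC|20260101120000||ADT^A01|MSG0001|P|2.5\r"
    b"PID|1||12345^^^HOSP||DOE^JOHN"
)

rng = random.Random(0)  # fixed seed so the mutation can be replayed
frame = mllp_wrap(mutate_field(SEED, rng))
```

Because the framing is applied after the mutation, every test case arrives as a well-formed MLLP frame and the fuzzing budget goes to the HL7 parser rather than the frame reader.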
Harness generation per protocol
The table below summarizes the practical, FDA-defensible starting points for the protocols medical devices most commonly expose. Tooling choice is a starting point - the harness, not the framework, is the deliverable.
| Protocol | Transport | Statefulness | Grammar source | Tooling start | Seed corpus | Coverage signal |
|---|---|---|---|---|---|---|
| HL7 v2 | MLLP / TCP (often TLS) | Framed, ACK/NAK loop | v2.x message profiles, conformance statement | boofuzz with MLLP wrapper | ADT, ORU, ORM, MDM exemplars from spec | ACK rate, parser exceptions, log delta |
| DICOM | TCP (DUL), optional TLS | Association negotiation, presentation contexts | DICOM Part 5/7/8, IODs, SOP Classes | pynetdicom + boofuzz, custom mutator | Valid C-STORE/C-FIND per accepted context | Association success rate, SCP error responses |
| BLE ATT / GATT | BLE link layer + L2CAP | Paired/bonded, MTU-negotiated | Service/characteristic enumeration, device spec | Defensics BLE, SweynTooth-style harness, custom on nRF52/ESP | Valid characteristic writes, indication round-trips | Disconnect rate, link supervision timeouts, watchdog resets |
| MQTT | TCP, almost always TLS | Session, QoS state, retained messages | MQTT 3.1.1 / 5.0 spec, broker schema | boofuzz or libFuzzer with TLS shim | Valid CONNECT, SUBSCRIBE, PUBLISH per topic tree | CONNACK reasons, broker logs, ACL violations |
| CoAP | UDP, optional DTLS | Block-wise transfer, observe | RFC 7252 + 7959, device resource map | AFL++ persistent on a CoAP library wrapper | Valid GET/POST per resource, multi-block PUT | Response code distribution, DTLS handshake errors |
| Proprietary binary | Serial, USB-CDC, sub-GHz RF, CAN | Custom framing, checksums, sequence | Reverse-engineered from pcaps + firmware | Custom harness over boofuzz blocks or libFuzzer | Captured exemplars, mutated within framing constraints | CRC accept rate, command ACKs, controller reboots |
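
To make the MQTT row concrete, here is a minimal sketch of the handshake pieces a harness needs before it can fuzz anything: an MQTT 3.1.1 CONNECT packet with the variable-length Remaining Length field, and a reader for the CONNACK return code that can serve as part of the coverage signal. The function names are hypothetical, and TLS, authentication, and the socket itself are deliberately left out.

```python
import struct


def encode_remaining_length(length: int) -> bytes:
    """Encode the MQTT variable-length 'Remaining Length' field."""
    if not 0 <= length <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        digit = length % 128
        length //= 128
        if length > 0:
            digit |= 0x80  # continuation bit: more length bytes follow
        out.append(digit)
        if length == 0:
            return bytes(out)


def mqtt_string(text: str) -> bytes:
    """Length-prefixed UTF-8 string as used throughout MQTT."""
    data = text.encode("utf-8")
    return struct.pack("!H", len(data)) + data


def build_connect(client_id: str, keepalive_s: int = 60) -> bytes:
    """Build a minimal MQTT 3.1.1 CONNECT packet (clean session, no auth)."""
    variable_header = (
        mqtt_string("MQTT")               # protocol name
        + b"\x04"                         # protocol level 4 = MQTT 3.1.1
        + b"\x02"                         # connect flags: clean session only
        + struct.pack("!H", keepalive_s)  # keep-alive in seconds
    )
    body = variable_header + mqtt_string(client_id)
    return b"\x10" + encode_remaining_length(len(body)) + body


def connack_return_code(packet: bytes) -> int:
    """Extract the return code from a 4-byte MQTT 3.1.1 CONNACK."""
    if len(packet) != 4 or packet[0] != 0x20 or packet[1] != 0x02:
        raise ValueError("not a CONNACK packet")
    return packet[3]


connect_packet = build_connect("fuzz")
```

A harness would send `connect_packet` at the start of every iteration, confirm a zero return code, and only then hand control to the mutator for SUBSCRIBE or PUBLISH traffic.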
A few harness-design rules that apply across all of these
- Make the harness stateful in code, not in luck. Encode the handshake explicitly so every iteration starts from a known state.
- Treat checksums and length prefixes as fix-ups, not fuzz targets. Mutate the payload, then recompute the wrapper so the device parses what you sent.
- Define an oracle that the device can signal. A crash log, a watchdog reset, an MQTT disconnect reason, a DICOM SCP error class - anything stronger than "the connection closed."
- Record everything. Seed corpus, exact mutations, transport-level captures, target state at the time of the finding. This is what the reviewer asks for.
- Run on real hardware when the threat model says so. Emulating an RTOS target on x86 is fine for triage, not sufficient as the final coverage statement.
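The fix-up rule is easiest to see on a toy frame. The sketch below assumes a hypothetical proprietary layout (sync byte, 16-bit payload length, sequence byte, payload, CRC-32 trailer) and compares a naive bit flip on the finished frame with a mutation applied to the payload before the wrapper is recomputed. The layout, the `SYNC` value, and the use of `zlib.crc32` are assumptions for illustration; a real device's framing has to come from its specification or from reverse engineering.

```python
import struct
import zlib

SYNC = 0xA5  # hypothetical sync byte


def build_frame(seq: int, payload: bytes) -> bytes:
    """Wrap a payload: sync, length, sequence, payload, CRC-32 trailer."""
    header = struct.pack("!BHB", SYNC, len(payload), seq & 0xFF)
    crc = zlib.crc32(header + payload) & 0xFFFFFFFF
    return header + payload + struct.pack("!I", crc)


def parse_frame(frame: bytes) -> tuple[int, bytes]:
    """Device-side view: validate sync, length, and CRC before parsing."""
    if len(frame) < 8:
        raise ValueError("short frame")
    sync, length, seq = struct.unpack("!BHB", frame[:4])
    if sync != SYNC:
        raise ValueError("bad sync")
    if len(frame) != 4 + length + 4:
        raise ValueError("length mismatch")
    (crc,) = struct.unpack("!I", frame[-4:])
    if crc != zlib.crc32(frame[:-4]) & 0xFFFFFFFF:
        raise ValueError("bad CRC")
    return seq, frame[4:-4]


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Flip a single bit in a byte string."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


payload = b"\x01\x02SET_RATE=10"  # hypothetical command payload
good = build_frame(7, payload)

# Naive approach: flip a payload bit in the finished frame. The CRC no
# longer matches, so the frame is dropped before the command parser runs.
naive = flip_bit(good, 8 * 5)

# Fix-up approach: mutate the payload first, then rebuild length and CRC.
mutated_payload = flip_bit(payload, 8)
fixed = build_frame(7, mutated_payload)
```

Calling `parse_frame(naive)` raises `ValueError`, while `parse_frame(fixed)` returns the mutated payload, which is the case that actually exercises the command parser.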
Where AI legitimately helps (and where it doesn't)
Used well, AI compresses the slow parts of fuzz harness work:
- Grammar synthesis from pcaps and specs. LLMs are genuinely useful at turning a packet capture and a PDF spec into a first-draft message grammar (boofuzz blocks, ASN.1 sketch, Kaitai struct), which a human then corrects.
- Seed corpus expansion. Generating varied-but-valid HL7 v2 messages, DICOM objects per IOD, MQTT topic trees, and CoAP resource sets.
- Crash deduplication and triage. Clustering thousands of crashes by stack trace, register state, and message prefix, then proposing a candidate root cause for human review.
- Mutator authoring. Drafting custom mutators for libFuzzer or AFL++ that understand the protocol's framing.
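Crash deduplication does not need a model to get started. The sketch below shows a conventional baseline that AI-assisted clustering can build on: normalize each stack trace by stripping addresses and frame indices, hash the top few frames, and group crashes by that signature. The trace format, crash IDs, and function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

ADDRESS = re.compile(r"0x[0-9a-fA-F]+")
FRAME_INDEX = re.compile(r"^#\d+\s*")
WHITESPACE = re.compile(r"\s+")


def crash_signature(stack_trace: str, top_frames: int = 3) -> str:
    """Hash the top frames of a trace after removing run-specific noise."""
    frames = []
    for line in stack_trace.splitlines():
        line = ADDRESS.sub("", line).strip()  # addresses vary between runs
        line = FRAME_INDEX.sub("", line)      # drop "#0", "#1", ...
        line = WHITESPACE.sub(" ", line)
        if line:
            frames.append(line)
    key = "\n".join(frames[:top_frames])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def cluster_crashes(crashes: dict[str, str]) -> dict[str, list[str]]:
    """Group crash IDs by signature."""
    clusters = defaultdict(list)
    for crash_id, trace in crashes.items():
        clusters[crash_signature(trace)].append(crash_id)
    return dict(clusters)


# Hypothetical traces, for illustration only.
CRASHES = {
    "crash-001": "#0 0x08001a2c in parse_obx (msg.c:88)\n"
                 "#1 0x08001f00 in parse_segment (msg.c:140)\n"
                 "#2 0x08002210 in mllp_rx (mllp.c:52)",
    "crash-002": "#0 0x08101a2c in parse_obx (msg.c:88)\n"
                 "#1 0x08101f00 in parse_segment (msg.c:140)\n"
                 "#2 0x08102210 in mllp_rx (mllp.c:52)",
    "crash-003": "#0 0x08003344 in parse_pid (msg.c:201)\n"
                 "#1 0x08001f00 in parse_segment (msg.c:140)\n"
                 "#2 0x08002210 in mllp_rx (mllp.c:52)",
}

clusters = cluster_crashes(CRASHES)
```

Here the first two crashes collapse into one cluster because they differ only in load address, and the third stays separate. Each cluster still needs a human to confirm root cause and reachability.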
Where AI consistently falls down on medical-device fuzzing:
- Stateful orchestration on real targets. Driving an Association establishment, a BLE pairing/bonding, an MQTT session restoration, or a CAN/CANopen NMT state across many iterations is brittle even for experienced engineers and worse for an agent.
- Hardware-in-the-loop control. Power cycling the device under test, recovering from a watchdog reboot, re-pairing a BLE central, restoring a UART logger - all manual or scripted human work.
- Crash triage on embedded targets. Without source, symbols, or a reliable debugger, root-causing a watchdog reset on an STM32 takes a human at the bench reading the trace buffer, not an LLM looking at a log line.
- Threat-model-aligned scope. AI does not know which characteristics actually carry safety-relevant writes on your specific device. The scoping decision is human.
The defensible pattern mirrors the AI pen testing pattern: human-led, AI-augmented, with a named tester signing for scope, coverage, and findings. See our companion piece on where AI penetration testing fails an FDA reviewer for the longer treatment.
Process flow: spec to VEX
From spec or pcap to CAPA and VEX
- STEP 01 (Human + AI): Spec and pcap to grammar. LLM drafts message grammar from RFCs, conformance statements, and captures; engineer corrects framing and constraints.
- STEP 02 (Human): Harness authoring. Engineer encodes state machine, transport, TLS/DTLS, checksum fix-ups, and the device-side oracle.
- STEP 03 (Human + AI): Seed corpus. AI expands valid exemplars per IOD, topic tree, characteristic, or resource; engineer prunes and labels.
- STEP 04 (Human): Campaign execution. Run on real target, hardware-in-the-loop where required, with crash capture, power cycle automation, and coverage logging.
- STEP 05 (Human + AI): Triage and dedup. AI clusters crashes by signature and proposes root cause; engineer confirms exploitability and severity.
- STEP 06 (Human): Reproducer. Minimum-viable repro script and target state captured for the submission and for the dev team to fix against.
- STEP 07 (Human): CAPA and VEX. Findings flow into security risk file, CAPA, and post-launch VEX statements with not_affected, affected, or fixed disposition.
- STEP 08 (Human signs): Report and signature. Named, qualified tester signs the report with methodology, coverage, findings, and disposition for the submission.
Deliverables a reviewer wants to see
Per protocol fuzzed, the submission package should include:
- Harness source (or a clear pointer to it in your DHF), with the state machine, framing, and oracle visible.
- Seed corpus statistics - count, sources, generation method, and which exemplars exercise which threat-model paths.
- Coverage report - basic block, branch, or protocol-state coverage as appropriate. "We ran it for X hours" is not coverage.
- Crash log and dedup summary - unique crashes, signatures, repro reliability, and whether each crash is reachable from the device's exposed interfaces.
- Disposition - fixed, mitigated by a compensating control, or accepted with rationale traced to the security risk file.
- Tester qualifications - the same independence and credentials statement that anchors the rest of the pen test report.
- Tooling and environment - exact tool versions, target image hash, hardware bench description, AI tooling and model versions if used.
This is the package that survives an Additional Information (AI) request, a Major Deficiency, or a Hold letter. Without it, the reviewer cannot evaluate whether the fuzz testing actually exercised the threat model - and "we ran a fuzzer" is exactly the phrasing that earns a deficiency.
How Blue Goat Cyber runs this
Our medical-device pen testing engagements include protocol-aware fuzz harnesses per declared interface, built against your threat model and architecture views, run on real hardware where the model requires it, with a named tester owning scope and signing the report. AI is used where it earns its keep - grammar drafting, seed expansion, crash clustering - and disclosed in the report.
If you are scoping a 510(k), De Novo, or PMA submission and want fuzz testing that will not invite an Additional Information (AI) request or a Major Deficiency, book a strategy session.
Frequently asked questions
Does the FDA require fuzz testing for medical devices?
Section 524B and the February 3, 2026 final premarket cybersecurity guidance require security testing scoped to the threat model, with fuzz testing called out as part of the expected toolkit alongside vulnerability and penetration testing. The agency does not mandate a specific fuzzer, framework, or duration. It does expect a defensible methodology, coverage tied to the threat model, documented findings, and a named, qualified tester. For Class II and III connected devices, omitting fuzz testing is a frequent driver of Additional Information (AI) requests and Major Deficiencies.
How is medical-device fuzzing different from IT or web fuzzing?
Medical devices speak negotiated, framed, stateful protocols (DICOM, HL7 v2 over MLLP, BLE GATT, MQTT, CoAP, proprietary serial, CAN/CANopen) often over wireless or hardware transports, on embedded targets without crash dumps or symbols. Web and IT fuzzing typically targets stateless HTTP request parsers with rich tooling around them. The implication is that medical-device fuzz programs spend most of their effort on the harness, the transport, the state machine, and the oracle, not on the fuzzer engine itself.
Premarket vs postmarket fuzzing - what changes?
In premarket, fuzz testing is a controlled, scoped exercise that produces evidence for the submission. In postmarket, the same harnesses become regression assets you re-run when a third-party component issues a CVE, when a firmware update lands, or when a customer reports anomalous behavior. The premarket harness is the asset that pays for itself across the product's lifecycle - especially when it feeds your VEX workflow with reproducible "not affected" or "affected" determinations.
How do you fuzz BLE GATT without rooted phones?
The defensible setup is a controlled BLE central running on a dev kit (nRF52840, ESP32, Adafruit BLEFriend) or a Linux host with BlueZ and a known controller, with the device under test paired and bonded under your harness's control. Rooted phones or off-the-shelf "BLE fuzzers" are useful for triage but not as the primary evidence path - they leave too many uncontrolled variables for the reviewer. SweynTooth-style harnesses, Defensics BLE suites, or custom firmware on a dev board all produce evidence that survives review.
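Whatever central you choose, the harness has to size its writes to the negotiated ATT MTU. The sketch below shows the arithmetic: a Write Request carries at most ATT_MTU - 3 bytes of value (1-byte opcode plus 2-byte handle), and longer values are split into Prepare Write chunks of at most ATT_MTU - 5 bytes (opcode, handle, 2-byte offset). The `PreparedWrite` type and function names are hypothetical; sending the chunks and the final Execute Write is left to whatever BLE stack the bench uses.

```python
from dataclasses import dataclass

ATT_DEFAULT_MTU = 23        # default ATT_MTU on an LE link
WRITE_REQ_OVERHEAD = 3      # opcode (1) + attribute handle (2)
PREPARE_WRITE_OVERHEAD = 5  # opcode (1) + handle (2) + value offset (2)


@dataclass(frozen=True)
class PreparedWrite:
    offset: int
    value: bytes


def fits_single_write(value: bytes, att_mtu: int = ATT_DEFAULT_MTU) -> bool:
    """True if the value fits in one ATT Write Request at this MTU."""
    return len(value) <= att_mtu - WRITE_REQ_OVERHEAD


def chunk_for_prepare_write(
    value: bytes, att_mtu: int = ATT_DEFAULT_MTU
) -> list[PreparedWrite]:
    """Split a long value into Prepare Write chunks with their offsets."""
    if att_mtu < ATT_DEFAULT_MTU:
        raise ValueError("ATT_MTU cannot be below the 23-byte default")
    chunk_len = att_mtu - PREPARE_WRITE_OVERHEAD
    return [
        PreparedWrite(offset, value[offset:offset + chunk_len])
        for offset in range(0, len(value), chunk_len)
    ]


# A 50-byte fuzz value at the default MTU needs three queued writes.
chunks = chunk_for_prepare_write(bytes(50))
```

Logging the chunk offsets alongside each mutation keeps the transport-level capture and the harness record in agreement, which helps when a reviewer asks how a finding was delivered.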
Can you fuzz an RTOS target with libFuzzer or AFL++?
Sometimes, with caveats. You can compile vulnerable parser code paths (CoAP, MQTT, HL7) for the host with the device's RTOS stubbed out and run libFuzzer or AFL++ persistent-mode against them. That gives you fast coverage on the parser logic. It does not exercise the device's actual transport, timing, interrupts, or memory layout. The defensible posture is to use host fuzzing for triage and parser depth, and hardware-in-the-loop fuzzing on the real target to satisfy the threat-model coverage statement.
How long should a fuzz campaign run per protocol?
Long enough to plateau on coverage, not a fixed clock. Practical campaigns on a mature parser converge in tens of millions of executions; on a custom binary protocol, they may run for days. The reviewer-friendly statement is coverage-based ("branch coverage plateaued at X% after N hours, no new crashes in M hours") rather than time-based ("we ran for a week"). Document both.
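A coverage-based stopping statement can be derived mechanically from campaign logs. The sketch below assumes periodic samples of branch coverage and unique-crash count, then reports how long it has been since either one moved. The `Sample` fields, the 24-hour quiet window, and the sample values are illustrative assumptions, not recommended thresholds or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    hours: float                # elapsed campaign time
    branch_coverage_pct: float
    unique_crashes: int


def plateau_summary(samples: list[Sample], quiet_hours: float) -> dict:
    """Report time since the last coverage gain and the last new crash."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples, key=lambda s: s.hours)
    best_cov = ordered[0].branch_coverage_pct
    most_crashes = ordered[0].unique_crashes
    last_cov_gain = last_new_crash = ordered[0].hours
    for s in ordered[1:]:
        if s.branch_coverage_pct > best_cov:
            best_cov, last_cov_gain = s.branch_coverage_pct, s.hours
        if s.unique_crashes > most_crashes:
            most_crashes, last_new_crash = s.unique_crashes, s.hours
    end = ordered[-1].hours
    since_cov = end - last_cov_gain
    since_crash = end - last_new_crash
    return {
        "total_hours": end,
        "final_coverage_pct": best_cov,
        "hours_since_coverage_gain": since_cov,
        "hours_since_new_crash": since_crash,
        "plateaued": since_cov >= quiet_hours and since_crash >= quiet_hours,
    }


# Hypothetical campaign log, for illustration only.
LOG = [
    Sample(1, 41.0, 0), Sample(6, 58.5, 2), Sample(12, 63.2, 3),
    Sample(24, 64.0, 3), Sample(48, 64.0, 3), Sample(72, 64.0, 3),
]
summary = plateau_summary(LOG, quiet_hours=24)
```

The resulting dictionary maps directly onto the report sentence: coverage, total runtime, and hours without a new crash.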
What about hardware-in-the-loop fuzzing for CAN, USB, or serial?
Hardware-in-the-loop is required when the threat model includes those interfaces, which it usually does on any motorized or multi-board device. The harness includes the bus or interface adapter (CANable, PCAN, Kvaser; Facedancer or GreatFET for USB; controlled UART), a power-cycle relay, a watchdog or log capture, and an oracle defined in terms the device emits. Pure software fuzzing of a CAN parser on x86 is fine for triage; it does not substitute for the bench evidence.
Does AI grammar synthesis from pcaps actually work?
For well-known protocols (MQTT, CoAP, HL7 v2, DICOM), yes - LLMs produce usable first-draft grammars from the spec plus a representative capture, fast. For proprietary binary protocols, AI is helpful at the byte-grouping stage and weak at recovering state machines and checksum semantics. Expect a human engineer to spend real time on the framing, the CRC/length fix-ups, and the state transitions before the harness produces signal.
How do we tie fuzz findings to the security risk file?
Each unique finding gets a row in the security risk file with the threat it relates to, the vulnerability, the patient-safety impact, the control (fix, mitigation, or compensating control), and the residual risk. This is the AAMI TIR57 pattern. Fuzz findings that do not have a clean traceability line to a threat are the ones reviewers flag - either the threat model is incomplete or the finding's relevance is not justified. Both are fixable; both are easier to fix before submission than after.
What does "fixed vs accepted" disposition look like for fuzz findings?
For each unique crash or anomalous behavior: fixed (code change verified by re-fuzzing the same seed and corpus), mitigated (compensating control such as input length cap, rate limit, or watchdog reset with documented residual risk), or accepted (rationale tied to threat model and patient-safety impact, signed off). Reviewers do not expect zero findings; they expect every finding to have a disposition they can evaluate. Untriaged crashes left in the report are the worst possible outcome.
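One way to keep untriaged crashes out of a report is to check the findings list before it is signed. The sketch below models the three dispositions described above and flags findings that lack a disposition, a threat reference, a re-fuzz verification for fixes, or a rationale for mitigated and accepted items. The record fields and identifiers are hypothetical and would need to match your own security risk file.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Disposition(Enum):
    FIXED = "fixed"
    MITIGATED = "mitigated"
    ACCEPTED = "accepted"


@dataclass
class Finding:
    finding_id: str
    threat_id: Optional[str]            # row in the threat model, if any
    disposition: Optional[Disposition]  # None means untriaged
    rationale: str = ""
    refuzz_verified: bool = False       # fix re-tested with same seed/corpus


def review_gaps(findings: list[Finding]) -> dict[str, list[str]]:
    """Return the problems a reviewer would flag, keyed by finding ID."""
    gaps: dict[str, list[str]] = {}
    for f in findings:
        problems = []
        if f.disposition is None:
            problems.append("untriaged: no disposition")
        if not f.threat_id:
            problems.append("no traceability to a threat")
        if f.disposition is Disposition.FIXED and not f.refuzz_verified:
            problems.append("fix not verified by re-fuzzing")
        if (
            f.disposition in (Disposition.MITIGATED, Disposition.ACCEPTED)
            and not f.rationale.strip()
        ):
            problems.append("missing rationale")
        if problems:
            gaps[f.finding_id] = problems
    return gaps


# Hypothetical findings, for illustration only.
FINDINGS = [
    Finding("FZ-001", "T-12", Disposition.FIXED, refuzz_verified=True),
    Finding("FZ-002", "T-07", Disposition.ACCEPTED),
    Finding("FZ-003", None, None),
]
gaps = review_gaps(FINDINGS)
```

In this example FZ-001 passes, FZ-002 is flagged for a missing rationale, and FZ-003 is flagged as both untriaged and untraceable.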
Related reading
- Does the FDA Accept AI Penetration Testing for Medical Devices?
- SBOM Diffing and CVE Correlation for Postmarket Medical Devices
- A Step-by-Step Guide to Threat Modeling Connected and Implantable Medical Devices
- VEX Mistakes That Trigger FDA Deficiencies
- CAN Bus and CANopen Vulnerabilities in Medical Devices