Fuzz Harness Generation for Medical

By Christian Espinosa, MBA, CISSP

Founder & CEO · Blue Goat Cyber

Published: June 3, 2026

Key Takeaways

Fuzz testing is required under the Feb 2026 guidance, scoped to the [threat model](/services/threat-modeling-services "medical device threat modeling") and documented to the same evidence bar as the rest of security testing.
Generic fuzzers fail on medical protocols because of statefulness, framing, checksums, association negotiation, MTU constraints, and block-wise transfer.
The defensible model is one harness per protocol, with explicit grammar sources, seed corpora, transport setup, and coverage signal.
AI helps with grammar synthesis from pcaps and specs, seed expansion, and crash deduplication. It does not handle stateful orchestration or hardware-in-the-loop reliably.
Reviewers want harness source, seed corpus stats, coverage reports, crash logs, and fixed-vs-accepted disposition tied back to the threat model and security risk file.
The deliverable belongs in your SPDF and feeds your CAPA and VEX workflows after launch.

TL;DR

The FDA's February 3, 2026 premarket cybersecurity guidance expects fuzz testing as part of the security testing required under Section 524B, scoped to the device's threat model and architecture views, with documented methodology, coverage, findings, and tester qualifications. Generic fuzzers (raw AFL++ against a TCP port, off-the-shelf BLE fuzzers, internet-of-things scanners) almost never satisfy that bar on a medical device because the protocols are stateful, framed, checksummed, and negotiated. A defensible fuzz program builds a harness per protocol: HL7 v2 over MLLP, DICOM with association negotiation, BLE ATT/GATT under MTU constraints, MQTT with TLS and persistent sessions, CoAP with block-wise transfer, and proprietary binary protocols reconstructed from pcaps and firmware.

Why this matters

The FDA's Cybersecurity in Medical Devices: Quality Management System Considerations and Content of Premarket Submissions (Feb 3, 2026 final guidance) made cybersecurity documentation a gating criterion for clearance under Section 524B of the FD&C Act. Reviewers now apply this guidance to fuzz harness generation for medical devices the same way they apply software lifecycle expectations from IEC 62304 and security risk-management expectations from AAMI TIR57 and ANSI/AAMI SW96:2023.

Gaps in this area are the single most common driver of first-cycle cybersecurity Additional Information (AI) requests. The FDA's FY2024 CDRH performance reports show cybersecurity is among the top deficiency categories cited in 510(k) and PMA AI letters, behind only software documentation and clinical evidence. Treating it as a checklist exercise rather than a design-controlled engineering artifact is what creates the gap.

What the FDA actually expects for fuzz testing evidence

Section 524B and the Feb 3, 2026 final guidance Cybersecurity in Medical Devices: Quality Management System Considerations and Content of Premarket Submissions pull fuzz testing into the same security-testing envelope as vulnerability and penetration testing. In practice, reviewers look for:

Scope tied to the threat model and architecture views - which interfaces, which protocols, which trust boundaries.
Methodology - the fuzzing approach (mutation, generation, coverage-guided, protocol-aware), the harness, the seed corpus, and the runtime environment.
Duration and intensity - executions, total runtime, throughput, coverage achieved.
Findings and disposition - crashes, hangs, undefined behavior, with each one tracked to a fix, a compensating control, or a documented accepted risk.
Tester independence and qualifications - the same bar as the rest of the pen test report, with a named, qualified signatory.

"We ran AFL++ for an hour on the management port" is not a methodology. Neither is "the BLE scanner found no issues." Reviewers want to see a harness per protocol, with a defensible reason it exercises the threat-model paths.

Why generic fuzzers fail on medical protocols

Off-the-shelf fuzzers were designed for parsers and file formats. Medical devices speak negotiated, stateful, framed protocols over transports that punish naive injection.

Stateful handshakes. DICOM requires an Association establishment (A-ASSOCIATE-RQ/AC, presentation context negotiation) before any P-DATA flows. HL7 v2 over MLLP has start/end framing and an ACK/NAK loop. MQTT requires CONNECT/CONNACK before anything else and keeps a persistent session. A stateless mutator never gets past the handshake.
Checksums and length prefixes. Proprietary binary protocols, BLE GATT extensions, and many serial protocols include CRCs, length prefixes, and sequence numbers. Bit-flipping the payload invalidates the framing and the device drops the frame before the parser ever sees it.
MTU and fragmentation. BLE ATT operates at a 23-byte default MTU, negotiable up to ~512. CoAP uses block-wise transfer (Block1/Block2) for any payload larger than a single datagram. Fuzz frames that ignore these boundaries are silently truncated.
Association and presentation contexts. DICOM requires the right SOP Class UID and transfer syntax in the negotiated presentation context. Send a C-STORE for a context that was not accepted and the device discards it - no parser ever runs.
TLS and authentication wrappers. MQTT and many HL7 v2 deployments are wrapped in TLS with mutual auth. Without working TLS termination in the harness, the fuzzer never reaches the protocol logic.
Hardware-in-the-loop reality. BLE GATT on a real device requires a controlled radio, a paired and bonded central, and the device in the right power and connection state. CAN/CANopen fuzzing requires bus access and termination. None of this is push-a-button.

A harness has to handle the wrapper so the fuzzer can spend its budget on the part of the protocol that actually contains bugs.
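
As a small illustration of "handling the wrapper", here is a minimal sketch of the MLLP layer for an HL7 v2 harness. The framing bytes (0x0B start block, 0x1C 0x0D end block) are the standard MLLP delimiters; the function names are hypothetical, and the ACK parser assumes the default "|" field separator rather than reading it from MSH-1.

```python
from typing import Optional

MLLP_START = b"\x0b"      # <VT> start-of-block
MLLP_END = b"\x1c\x0d"    # <FS><CR> end-of-block


def mllp_wrap(hl7_message: bytes) -> bytes:
    """Frame an HL7 v2 message so the listener accepts it as a complete block."""
    return MLLP_START + hl7_message + MLLP_END


def mllp_unwrap(frame: bytes) -> bytes:
    """Strip MLLP framing; reject anything that is not one complete frame."""
    if not (frame.startswith(MLLP_START) and frame.endswith(MLLP_END)):
        raise ValueError("not a complete MLLP frame")
    return frame[len(MLLP_START):-len(MLLP_END)]


def ack_code(hl7_ack: bytes) -> Optional[str]:
    """Return MSA-1 (for example AA, AE, AR) from an ACK, or None if absent.

    Assumption: segments end in <CR> and fields are separated by '|'.
    """
    for segment in hl7_ack.split(b"\r"):
        fields = segment.split(b"|")
        if fields[0] == b"MSA" and len(fields) > 1:
            return fields[1].decode("ascii", errors="replace")
    return None
```

The fuzzer mutates only the bytes passed to `mllp_wrap`, and the value returned by `ack_code` gives the harness a stronger signal than "the connection closed".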

Harness generation per protocol

The table below summarizes the practical, FDA-defensible starting points for the protocols medical devices most commonly expose. Tooling choice is a starting point - the harness, not the framework, is the deliverable.

Protocol	Transport	Statefulness	Grammar source	Tooling start	Seed corpus	Coverage signal
HL7 v2	MLLP / TCP (often TLS)	Framed, ACK/NAK loop	v2.x message profiles, conformance statement	boofuzz with MLLP wrapper	ADT, ORU, ORM, MDM exemplars from spec	ACK rate, parser exceptions, log delta
DICOM	TCP (DUL), optional TLS	Association negotiation, presentation contexts	DICOM Part 5/7/8, IODs, SOP Classes	pynetdicom + boofuzz, custom mutator	Valid C-STORE/C-FIND per accepted context	Association success rate, SCP error responses
BLE ATT / GATT	BLE link layer + L2CAP	Paired/bonded, MTU-negotiated	Service/characteristic enumeration, device spec	Defensics BLE, SweynTooth-style harness, custom on nRF52/ESP	Valid characteristic writes, indication round-trips	Disconnect rate, link supervision timeouts, watchdog resets
MQTT	TCP, almost always TLS	Session, QoS state, retained messages	MQTT 3.1.1 / 5.0 spec, broker schema	boofuzz or libFuzzer with TLS shim	Valid CONNECT, SUBSCRIBE, PUBLISH per topic tree	CONNACK reasons, broker logs, ACL violations
CoAP	UDP, optional DTLS	Block-wise transfer, observe	RFC 7252 + 7959, device resource map	AFL++ persistent on a CoAP library wrapper	Valid GET/POST per resource, multi-block PUT	Response code distribution, DTLS handshake errors
Proprietary binary	Serial, USB-CDC, sub-GHz RF, CAN	Custom framing, checksums, sequence	Reverse-engineered from pcaps + firmware	Custom harness over boofuzz blocks or libFuzzer	Captured exemplars, mutated within framing constraints	CRC accept rate, command ACKs, controller reboots

The fuzzer is the engine. The harness - framing, state, transport, oracle - is the deliverable.
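
To make the CoAP row concrete, the following is a sketch of how a harness might split a mutated payload into a multi-block PUT. The Block option layout (NUM, M bit, SZX, with block size 2^(SZX+4)) follows RFC 7959; the function names are hypothetical, and building the rest of the CoAP message (header, token, Uri-Path) is assumed to happen elsewhere.

```python
from typing import List, Tuple


def block_option_value(num: int, more: bool, szx: int) -> bytes:
    """Encode a Block1/Block2 option value as NUM | M | SZX (RFC 7959)."""
    if not 0 <= szx <= 6:
        raise ValueError("SZX must be 0..6 (7 is reserved)")
    if not 0 <= num < (1 << 20):
        raise ValueError("block number must fit in 20 bits")
    value = (num << 4) | (int(more) << 3) | szx
    # CoAP uint options use the fewest bytes possible; zero encodes as empty.
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, "big")


def split_block1(payload: bytes, szx: int) -> List[Tuple[bytes, bytes]]:
    """Return (block_option_value, chunk) pairs for a block-wise request."""
    size = 1 << (szx + 4)  # SZX 0..6 maps to 16..1024 bytes
    chunks = [payload[i:i + size] for i in range(0, len(payload), size)] or [b""]
    last = len(chunks) - 1
    return [
        (block_option_value(n, n < last, szx), chunk)
        for n, chunk in enumerate(chunks)
    ]
```

Mutating the full payload first and splitting afterwards keeps every datagram well-formed at the transfer layer, so the mutation reaches the resource handler instead of being rejected during reassembly.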

A few harness-design rules that apply across all of these

Make the harness stateful in code, not in luck. Encode the handshake explicitly so every iteration starts from a known state.
Treat checksums and length prefixes as fix-ups, not fuzz targets. Mutate the payload, then recompute the wrapper so the device parses what you sent.
Define an oracle that the device can signal. A crash log, a watchdog reset, an MQTT disconnect reason, a DICOM SCP error class - anything stronger than "the connection closed."
Record everything. Seed corpus, exact mutations, transport-level captures, target state at the time of the finding. This is what the reviewer asks for.
Run on real hardware when the threat model says so. Emulating an RTOS target on x86 is fine for triage, not sufficient as the final coverage statement.
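
The fix-up rule can be sketched as follows. The frame layout here is purely hypothetical (2-byte big-endian length, payload, CRC-32 trailer over length plus payload) and stands in for whatever the reverse-engineered protocol actually uses; `parse_frame` is a stand-in for the device-side framing check, not real firmware behavior.

```python
import random
import struct
import zlib


def build_frame(payload: bytes) -> bytes:
    """Recompute the wrapper (length prefix and CRC-32) around a payload."""
    header = struct.pack(">H", len(payload))
    crc = zlib.crc32(header + payload) & 0xFFFFFFFF
    return header + payload + struct.pack(">I", crc)


def parse_frame(frame: bytes) -> bytes:
    """Stand-in for the device's framing check; returns the payload."""
    if len(frame) < 6:
        raise ValueError("short frame")
    body, trailer = frame[:-4], frame[-4:]
    (length,) = struct.unpack(">H", body[:2])
    if len(body) - 2 != length:
        raise ValueError("length mismatch")
    if struct.unpack(">I", trailer)[0] != zlib.crc32(body) & 0xFFFFFFFF:
        raise ValueError("bad CRC")
    return body[2:]


def mutate_payload(payload: bytes, rng: random.Random) -> bytes:
    """Flip one random bit in the payload only; the wrapper is never touched."""
    data = bytearray(payload)
    if data:
        pos = rng.randrange(len(data))
        data[pos] ^= 1 << rng.randrange(8)
    return bytes(data)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the exact mutation is reproducible
    seed = b"\x01\x10SET_RATE=5"
    mutated = mutate_payload(seed, rng)
    # The mutated payload survives the framing check because the
    # length and CRC were recomputed after mutation.
    print(parse_frame(build_frame(mutated)) == mutated)
```

Flipping a bit in the already-framed bytes would instead fail the CRC check and never reach the command parser, which is the failure mode described for generic fuzzers. Seeding the random generator also supports the "record everything" rule, since any mutation can be regenerated from the seed.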

Where AI legitimately helps (and where it doesn't)

Used well, AI compresses the slow parts of fuzz harness work:

Grammar synthesis from pcaps and specs. LLMs are genuinely useful at turning a packet capture and a PDF spec into a first-draft message grammar (boofuzz blocks, ASN.1 sketch, Kaitai struct), which a human then corrects.
Seed corpus expansion. Generating varied-but-valid HL7 v2 messages, DICOM objects per IOD, MQTT topic trees, and CoAP resource sets.
Crash deduplication and triage. Clustering thousands of crashes by stack trace, register state, and message prefix, then proposing a candidate root cause for human review.
Mutator authoring. Drafting custom mutators for libFuzzer or AFL++ that understand the protocol's framing.
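
Before any model is involved, crash deduplication usually starts from a deterministic baseline. The sketch below is that non-AI baseline, not an AI clustering method: it buckets crashes by a hash of the top stack frames with code offsets stripped. The crash record shape (`id`, `frames`) and the `symbol+0xNN` frame format are assumptions made for illustration.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List

_OFFSET = re.compile(r"\+0x[0-9a-fA-F]+$")  # trailing "+0x1c" style offsets


def signature(frames: List[str], depth: int = 3) -> str:
    """Hash the top `depth` frames, ignoring offsets within each function."""
    top = [_OFFSET.sub("", frame.strip()) for frame in frames[:depth]]
    return hashlib.sha256("|".join(top).encode("utf-8")).hexdigest()[:12]


def bucket(crashes: List[dict], depth: int = 3) -> Dict[str, List[str]]:
    """Group crash IDs by stack signature."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for crash in crashes:
        buckets[signature(crash["frames"], depth)].append(crash["id"])
    return dict(buckets)
```

Each bucket, together with one representative input, is then a reasonable unit to hand to a model for a candidate root cause, with the engineer confirming the result.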

Where AI consistently falls down on medical-device fuzzing:

Stateful orchestration on real targets. Driving an Association establishment, a BLE pairing/bonding, an MQTT session restoration, or a CAN/CANopen NMT state across many iterations is brittle even for experienced engineers and worse for an agent.
Hardware-in-the-loop control. Power cycling the device under test, recovering from a watchdog reboot, re-pairing a BLE central, restoring a UART logger - all manual or scripted human work.
Crash triage on embedded targets. Without source, symbols, or a reliable debugger, root-causing a watchdog reset on an STM32 takes a human at the bench reading the trace buffer, not an LLM looking at a log line.
Threat-model-aligned scope. AI does not know which characteristics actually carry safety-relevant writes on your specific device. The scoping decision is human.

The defensible pattern mirrors the AI pen testing pattern: human-led, AI-augmented, with a named tester signing for scope, coverage, and findings. See our companion piece on where AI penetration testing fails an FDA reviewer for the longer treatment.

Process flow: spec to VEX

Fuzz program flow

From spec or pcap to CAPA and VEX

Legend: Human owns, AI accelerates.

Step 01 (Human + AI): Spec & pcap to grammar. LLM drafts message grammar from RFCs, conformance statements, and captures; engineer corrects framing and constraints.
Step 02 (Human): Harness authoring. Engineer encodes state machine, transport, TLS/DTLS, checksum fix-ups, and the device-side oracle.
Step 03 (Human + AI): Seed corpus. AI expands valid exemplars per IOD, topic tree, characteristic, or resource; engineer prunes and labels.
Step 04 (Human): Campaign execution. Run on real target, hardware-in-the-loop where required, with crash capture, power cycle automation, and coverage logging.
Step 05 (Human + AI): Triage & dedup. AI clusters crashes by signature and proposes root cause; engineer confirms exploitability and severity.
Step 06 (Human): Reproducer. Minimum-viable repro script and target state captured for the submission and for the dev team to fix against.
Step 07 (Human): CAPA & VEX. Findings flow into security risk file, CAPA, and post-launch VEX statements with not_affected, affected, or fixed disposition.
Step 08 (Human signs): Report & signature. Named, qualified tester signs the report with methodology, coverage, findings, and disposition for the submission.

Flow: Spec / PCAP → Grammar → Harness → Corpus → Campaign → Triage → Reproducer → CAPA / VEX

Deliverables a reviewer wants to see

Per protocol fuzzed, the submission package should include:

Harness source (or a clear pointer to it in your DHF), with the state machine, framing, and oracle visible.
Seed corpus statistics - count, sources, generation method, and which exemplars exercise which threat-model paths.
Coverage report - basic block, branch, or protocol-state coverage as appropriate. "We ran it for X hours" is not coverage.
Crash log and dedup summary - unique crashes, signatures, repro reliability, and whether each crash is reachable from the device's exposed interfaces.
Disposition - fixed, mitigated by a compensating control, or accepted with rationale traced to the security risk file.
Tester qualifications - the same independence and credentials statement that anchors the rest of the pen test report.
Tooling and environment - exact tool versions, target image hash, hardware bench description, AI tooling and model versions if used.

This is the package that survives an Additional Information (AI) request, a Major Deficiency, or a Hold letter. Without it, the reviewer cannot evaluate whether the fuzz testing actually exercised the threat model - and "we ran a fuzzer" is exactly the phrasing that earns a deficiency.
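
Parts of this package can be generated mechanically rather than written by hand. The sketch below assembles seed corpus statistics and a target image hash into a JSON-serializable record; the field names and the manifest shape are hypothetical, not a format the FDA prescribes, and tool versions are passed in by the caller rather than detected.

```python
import hashlib
import json
from collections import Counter
from typing import Dict, Iterable, Tuple


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def corpus_stats(seeds: Iterable[Tuple[str, bytes]]) -> dict:
    """Summarize a corpus given (source_label, seed_bytes) pairs."""
    seeds = list(seeds)
    sizes = [len(seed) for _, seed in seeds]
    return {
        "count": len(seeds),
        "unique": len({sha256_hex(seed) for _, seed in seeds}),
        "by_source": dict(Counter(label for label, _ in seeds)),
        "min_bytes": min(sizes, default=0),
        "max_bytes": max(sizes, default=0),
    }


def evidence_manifest(protocol: str, target_image: bytes,
                      tool_versions: Dict[str, str],
                      seeds: Iterable[Tuple[str, bytes]]) -> dict:
    """Bundle per-protocol environment and corpus facts for the report."""
    return {
        "protocol": protocol,
        "target_image_sha256": sha256_hex(target_image),
        "tool_versions": dict(tool_versions),
        "seed_corpus": corpus_stats(seeds),
    }


if __name__ == "__main__":
    seeds = [("spec", b"A"), ("spec", b"BB"), ("pcap", b"A")]
    manifest = evidence_manifest("HL7v2/MLLP", b"firmware-bytes",
                                 {"boofuzz": "x.y.z"}, seeds)
    print(json.dumps(manifest, indent=2))
```

Generating the record at campaign start ties the reported corpus and image hash to what was actually run, instead of relying on a description written afterwards.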

How Blue Goat Cyber runs this

Our medical-device pen testing engagements include protocol-aware fuzz harnesses per declared interface, built against your threat model and architecture views, run on real hardware where the model requires it, with a named tester owning scope and signing the report. AI is used where it earns its keep - grammar drafting, seed expansion, crash clustering - and disclosed in the report.

If you are scoping a 510(k), De Novo, or PMA submission and want fuzz testing that will not invite an Additional Information (AI) request or a Major Deficiency, book a strategy session.

FAQ

Does the FDA require fuzz testing for medical devices?

Section 524B and the February 3, 2026 final premarket cybersecurity guidance require security testing scoped to the threat model, with fuzz testing called out as part of the expected toolkit alongside vulnerability and penetration testing. The agency does not mandate a specific fuzzer, framework, or duration. It does expect a defensible methodology, coverage tied to the threat model, documented findings, and a named, qualified tester. For Class II and III connected devices, omitting fuzz testing is a frequent driver of Additional Information (AI) requests and Major Deficiencies.

How is medical-device fuzzing different from IT or web fuzzing?

Medical devices speak negotiated, framed, stateful protocols (DICOM, HL7 v2 over MLLP, BLE GATT, MQTT, CoAP, proprietary serial, CAN/CANopen) often over wireless or hardware transports, on embedded targets without crash dumps or symbols. Web and IT fuzzing typically targets stateless HTTP request parsers with rich tooling around them. The implication is that medical-device fuzz programs spend most of their effort on the harness, the transport, the state machine, and the oracle, not on the fuzzer engine itself.

Premarket vs postmarket fuzzing - what changes?

In premarket, fuzz testing is a controlled, scoped exercise that produces evidence for the submission. In postmarket, the same harnesses become regression assets you re-run when a third-party component issues a CVE, when a firmware update lands, or when a customer reports anomalous behavior. The premarket harness is the asset that pays for itself across the product's lifecycle - especially when it feeds your VEX workflow with reproducible "not affected" or "affected" determinations.

How long should a fuzz campaign run per protocol?

Long enough to plateau on coverage, not a fixed clock. Practical campaigns on a mature parser converge in tens of millions of executions; on a custom binary protocol, they may run for days. The reviewer-friendly statement is coverage-based ("branch coverage plateaued at X% after N hours, no new crashes in M hours") rather than time-based ("we ran for a week"). Document both.
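
One way to make the plateau statement reproducible is to compute it from logged coverage samples. This is a sketch under stated assumptions: `samples` holds cumulative covered-branch counts taken at a fixed interval (for example hourly), and "plateau" is defined here as a gain of at most `min_gain` branches across `window` consecutive intervals. Both the definition and the function name are illustrative choices, not a regulatory threshold.

```python
from typing import Optional, Sequence


def plateau_index(samples: Sequence[int], window: int,
                  min_gain: int = 0) -> Optional[int]:
    """Return the first sample index where coverage stops growing.

    A plateau starts at index i when the coverage gained between sample i
    and sample i + window is no more than min_gain. Returns None if the
    campaign has not plateaued within the recorded samples.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    for i in range(len(samples) - window):
        if samples[i + window] - samples[i] <= min_gain:
            return i
    return None


if __name__ == "__main__":
    hourly = [100, 180, 230, 250, 251, 251, 251, 251]
    start = plateau_index(hourly, window=3)
    if start is None:
        print("no plateau yet, keep running")
    else:
        print(f"plateau began at hour {start} with {hourly[start]} branches")
```

The same function applied to a cumulative unique-crash count gives the "no new crashes in M hours" half of the statement.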

How do we tie fuzz findings to the security risk file?

Each unique finding gets a row in the security risk file with the threat it relates to, the vulnerability, the patient-safety impact, the control (fix, mitigation, or compensating control), and the residual risk. This is the AAMI TIR57 pattern. Fuzz findings that do not have a clean traceability line to a threat are the ones reviewers flag - either the threat model is incomplete or the finding's relevance is not justified. Both are fixable; both are easier to fix before submission than after.

What does "fixed vs accepted" disposition look like for fuzz findings?

For each unique crash or anomalous behavior: fixed (code change verified by re-fuzzing the same seed and corpus), mitigated (compensating control such as input length cap, rate limit, or watchdog reset with documented residual risk), or accepted (rationale tied to threat model and patient-safety impact, signed off). Reviewers do not expect zero findings; they expect every finding to have a disposition they can evaluate. Untriaged crashes left in the report are the worst possible outcome.
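
A pre-submission check for untriaged or untraceable findings could look like the sketch below. The three dispositions come from the answer above; the `Finding` fields, the ID formats, and the gap messages are hypothetical and would need to match your own security risk file.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Disposition(Enum):
    FIXED = "fixed"
    MITIGATED = "mitigated"
    ACCEPTED = "accepted"


@dataclass
class Finding:
    finding_id: str
    threat_id: Optional[str]            # row in the threat model, if traced
    disposition: Optional[Disposition]  # None means untriaged
    rationale: str = ""


def report_gaps(findings: List[Finding]) -> List[str]:
    """List every finding a reviewer could not evaluate as written."""
    gaps = []
    for f in findings:
        if f.disposition is None:
            gaps.append(f"{f.finding_id}: untriaged (no disposition)")
        if not f.threat_id:
            gaps.append(f"{f.finding_id}: no threat-model traceability")
        needs_rationale = f.disposition in (Disposition.MITIGATED,
                                            Disposition.ACCEPTED)
        if needs_rationale and not f.rationale.strip():
            gaps.append(f"{f.finding_id}: {f.disposition.value} without rationale")
    return gaps
```

Running a check like this before the report is signed means an empty result is the release condition, and any gap it lists is cheaper to close before submission than after.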

About the author

Christian Espinosa, CISSP, Founder, Blue Goat Cyber. Christian leads a team focused exclusively on medical device cybersecurity for FDA premarket submissions and postmarket compliance. Read more about Christian.
