
Published: June 3, 2026
The FDA's February 3, 2026 premarket cybersecurity guidance does not explicitly accept or reject AI-performed penetration testing. It does require that security testing, including penetration testing, be credible, scoped to the threat model, performed by independent and qualified personnel, and documented with methods, scope, duration, and results. An AI agent running unscoped tooling against a medical device cannot satisfy those criteria on its own. The defensible posture today is human-led, AI-augmented: a qualified tester owns the scope, the threat model alignment, the report, and the signature, while AI accelerates reconnaissance, payload variation, fuzz harness generation, SBOM and CVE correlation, and report drafting. A pure-AI pen test submitted to support a 510(k), De Novo, or PMA is a deficiency letter waiting to happen.
Key takeaways
- The Feb 2026 guidance and Section 524B do not name AI testers. They require credibility, independence, qualified personnel, and documented methodology.
- AI is genuinely useful for reconnaissance, fuzzing, SBOM and CVE correlation, payload variation, and report drafting.
- AI alone fails on five fronts: accountable tester qualifications, device-specific clinical and protocol context, reproducibility and chain of custody, threat model alignment, and patient safety implications of false negatives.
- AI also cannot perform hardware testing. JTAG and SWD probing, firmware extraction, glitching, side-channel analysis, RF testing, and bench setup all require a human at a physical bench with calibrated tooling.
- The model that holds up in an FDA submission is human-led, AI-augmented, with a named tester signing the report.
- Procurement should ask any vendor pitching "AI penetration testing" a short list of pointed questions before signing.
What the FDA actually requires
The Feb 3, 2026 final guidance, "Cybersecurity in Medical Devices: Quality System Considerations and Content of Premarket Submissions," and Section 524B of the FD&C Act both require, as part of the Secure Product Development Framework (SPDF):
- Security testing that includes vulnerability testing, penetration testing, and security assessment of unresolved anomalies
- Testing scoped to the device's threat model and architecture views
- Documentation of methods, scope, duration, tooling, findings, and tester qualifications
- Testing independent of the development team, sufficient to demonstrate the adequacy of cybersecurity controls
Neither the statute nor the guidance says "humans only." Neither says "AI is acceptable." The bar is credibility and evidence. That distinction is the whole story.
Where AI legitimately accelerates a medical device pen test
Used well, AI compresses time on the parts of an engagement that are mechanical, repetitive, or pattern-matching:
- Reconnaissance and attack surface mapping across firmware, mobile companion apps, cloud back ends, and RF interfaces
- SBOM diffing and CVE correlation, including chained vulnerability identification across components
- Fuzz harness generation for HL7, DICOM, BLE GATT, MQTT, CoAP, and proprietary binary protocols
- Payload variation and mutation for abuse and misuse case testing
- Static and dynamic analysis triage, deduplicating findings and mapping them to CWE
- Report drafting, traceability matrices, and mapping findings to the threat model and to FDA-expected deliverables
A qualified tester who refuses to use these tools in 2026 is leaving real coverage on the table.
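To make the SBOM and CVE correlation step concrete, here is a minimal sketch of the matching logic, assuming a simplified SBOM of name and version pairs and a local advisory list. The component names, advisory IDs, and the `correlate` helper are hypothetical placeholders, not real CVEs or a real tool's API. Real engagements would work from CycloneDX or SPDX documents and a vulnerability feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    version: str


@dataclass(frozen=True)
class Advisory:
    advisory_id: str              # placeholder ID, not a real CVE
    component: str
    affected_versions: frozenset  # exact version strings, for simplicity


def correlate(sbom, advisories):
    """Return (component, advisory) pairs where name and version both match."""
    hits = []
    for comp in sbom:
        for adv in advisories:
            if adv.component == comp.name and comp.version in adv.affected_versions:
                hits.append((comp, adv))
    return hits


# Hypothetical SBOM and advisory data, for illustration only.
sbom = [
    Component("example-tls", "1.2.0"),
    Component("example-rtos", "4.1.3"),
    Component("example-ble-stack", "2.0.1"),
]
advisories = [
    Advisory("ADV-0001", "example-tls", frozenset({"1.1.0", "1.2.0"})),
    Advisory("ADV-0002", "example-ble-stack", frozenset({"1.9.0"})),
]

candidates = correlate(sbom, advisories)
for comp, adv in candidates:
    # A match is only a candidate; a tester still has to confirm exploitability.
    print(f"{comp.name} {comp.version}: {adv.advisory_id} (needs human validation)")
```

The output is a candidate list, not a finding list. Whether a matched component is reachable and exploitable on the device is the tester's call.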
Where AI-only testing fails an FDA reviewer
There are five places a pure-AI pen test breaks down in a 510(k), De Novo, or PMA review.
1. Accountable tester qualifications
The guidance expects named personnel with documented competency. "An autonomous agent ran the scan" is not a qualification statement a reviewer can evaluate. There is no resume for a model, no continuing education, no signature on the report that means anything in a regulatory context.
2. Device-specific clinical and protocol context
Medical devices are bespoke. Class II and III devices carry clinical workflow abuse cases, IEC 62304 software safety classifications, IEC 60601 essential performance considerations, custom RF stacks, proprietary serial and BLE protocols, and hazard-to-exploit chains that only matter in the context of the device's intended use. An LLM agent that does not understand the device's clinical workflow will miss the exploits that actually create patient harm and over-report the ones that do not.
3. Reproducibility and chain of custody
FDA reviewers and notified bodies want testing they can re-examine. Nondeterministic agent traces, hidden prompts, and undisclosed model versions undercut that. Pen test reports need repeatable steps, defined tooling versions, and a clear evidence trail. AI assistance is fine; AI as the sole black box is not.
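One way to support that evidence trail is a manifest that pins tool and model versions and hashes every artifact, including the prompts given to any AI agent. The sketch below is an illustration under stated assumptions: the field names, file names, and version strings are hypothetical, and the guidance does not prescribe this format.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(tool_versions, model_versions, artifacts):
    """Build an evidence manifest; artifacts maps an artifact name to raw bytes."""
    return {
        "tool_versions": dict(sorted(tool_versions.items())),
        "model_versions": dict(sorted(model_versions.items())),
        "artifacts": {name: sha256_hex(data) for name, data in sorted(artifacts.items())},
    }


def verify(manifest, artifacts):
    """Return names of artifacts that are missing or whose hash has changed."""
    return [
        name
        for name, digest in manifest["artifacts"].items()
        if name not in artifacts or sha256_hex(artifacts[name]) != digest
    ]


# Hypothetical evidence captured during an engagement.
artifacts = {
    "uart_capture.bin": b"\x01\x02\x03",
    "agent_prompt.txt": b"enumerate BLE services",
}
manifest = build_manifest(
    tool_versions={"example-fuzzer": "0.0.0"},           # placeholder version
    model_versions={"example-llm": "version-placeholder"},
    artifacts=artifacts,
)
print(json.dumps(manifest, indent=2))

# Simulate an artifact that was altered after the manifest was written.
tampered = {**artifacts, "uart_capture.bin": b"\x01\x02\x04"}
mismatches = verify(manifest, tampered)
print("changed or missing:", mismatches)
```

Recording the prompt and model version alongside captures does not make an agent deterministic, but it lets a reviewer see exactly what was run and confirm the evidence has not changed since.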
4. Threat model alignment
The guidance ties testing directly to the threat model. Pen testing must exercise the STRIDE elements, attack paths, and abuse cases identified in the threat model and architecture views. The threat model is device-specific, written by humans, and not something an AI agent can infer from a binary alone. Without alignment, the test exercises generic attacker behavior and leaves device-specific paths untested.
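Alignment can be checked mechanically once the threat model exists. The following sketch assumes each threat has an ID and each test case lists the threat IDs it exercises. The IDs, test names, and helper functions are hypothetical, and the threat model itself still has to come from the humans who know the device.

```python
def untested_threats(threat_ids, test_cases):
    """Return threat IDs that no test case claims to exercise, in input order."""
    covered = {tid for case in test_cases for tid in case["threat_ids"]}
    return [tid for tid in threat_ids if tid not in covered]


def unmapped_tests(threat_ids, test_cases):
    """Return names of test cases that map to no known threat ID."""
    known = set(threat_ids)
    return [case["name"] for case in test_cases if not set(case["threat_ids"]) & known]


# Hypothetical threat model entries and test cases, for illustration only.
threat_model = {
    "TM-01": "Spoofed BLE pairing",
    "TM-02": "Tampered firmware update",
    "TM-03": "Debug port memory read",
}
test_cases = [
    {"name": "ble-pairing-replay", "threat_ids": ["TM-01"]},
    {"name": "generic-port-scan", "threat_ids": []},
    {"name": "swd-readout", "threat_ids": ["TM-03"]},
]

gaps = untested_threats(threat_model, test_cases)
generic = unmapped_tests(threat_model, test_cases)
print("threats with no test:", gaps)
print("tests with no threat:", generic)
```

A test with no mapped threat is the generic attacker behavior described above, and a threat with no test is a device-specific path left unexercised.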
5. Patient safety implications of false negatives
In conventional IT, a missed vulnerability is a finding for next quarter. In a Class II or III device, a missed vulnerability can become patient harm. False-negative rates that are acceptable in commercial pen testing are not acceptable when the consequence is a hazard. The tolerance for missed coverage is lower, and a human is accountable for meeting it.
The hardware problem AI cannot solve
This is the objection prospects raise most often, and it is the hardest one for AI-only vendors to answer: AI does not have hands. A meaningful medical device pen test is not a pure software exercise. It involves physical work on physical hardware, and that work cannot be outsourced to a model.
A representative hardware testing workflow on a connected Class II device might include:
- Bench setup for the device under test: power supplies, isolation transformers, signal generators, patient simulators, RF shielding, and Faraday enclosures so BLE and proprietary RF tests do not bleed into adjacent equipment.
- Enclosure teardown and identification of test points, debug headers, and unpopulated pads.
- Hardware reconnaissance with a multimeter, logic analyzer, and oscilloscope to identify UART, SPI, I2C, JTAG, and SWD interfaces and to determine voltage levels and pinouts.
- Debug port exploitation including JTAG and SWD probing with hardware tools such as a Bus Pirate, J-Link, or Black Magic Probe to attempt halt, memory read, and firmware extraction.
- Chip-off and in-circuit firmware extraction from SPI flash, eMMC, or microcontroller internal flash when debug interfaces are locked.
- Glitching and fault injection (voltage and clock) to bypass secure boot, read-out protection, or debug fuses on the MCU.
- Side-channel measurement for power and electromagnetic analysis against cryptographic implementations.
- RF and wireless testing of BLE, NFC, MedRadio, proprietary 2.4 GHz, sub-GHz, and inductive links using SDRs and purpose-built radio tools (HackRF, BladeRF, Ubertooth, Proxmark) and protocol-aware tooling.
- Peripheral and accessory abuse including malicious cables, rogue chargers, USB and serial fuzzing of cradles and programmers.
- Tamper response validation against the controls claimed in the threat model and labeling.
None of this happens in a chat window. Every step requires a tester physically present with the device, the right instruments, a calibrated bench, and the experience to read what the instruments are showing. An AI agent cannot solder, cannot probe a test pad, cannot set up a Faraday cage, cannot decide that the suspicious trace on the oscilloscope is worth chasing, and cannot stop testing when the device shows a thermal or electrical fault that risks damaging the unit under test.
The FDA does not require hardware testing in every case. It does require testing scoped to the threat model and architecture views. For any device with physical attack surface, the threat model will identify those interfaces, and a credible pen test must exercise them. A vendor that cannot demonstrate a hardware bench, calibrated tooling, and named hardware testers cannot deliver that coverage, regardless of how sophisticated the AI tooling around it is.
This is also where AI is most useful in support of a human hardware tester: parsing datasheets for an unfamiliar MCU, identifying flash chip families from package markings, generating fuzzing wordlists for an extracted protocol, decoding captured RF frames, and drafting the writeup. Leverage for the human at the bench. Not a replacement for the bench.
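As an illustration of that kind of support work, the sketch below generates mutated test frames from a frame the tester captured at the bench. It assumes a simple random byte-replacement strategy and a fixed RNG seed so the same inputs can be regenerated later. The frame contents and function names are hypothetical, and a real harness would be protocol-aware.

```python
import random


def mutate(seed_frame: bytes, rng: random.Random, max_changes: int = 3) -> bytes:
    """Return a copy of seed_frame with 1..max_changes bytes overwritten at random.

    An overwrite can land on the same value, so a case may match the seed frame.
    """
    if not seed_frame:
        raise ValueError("seed_frame must not be empty")
    frame = bytearray(seed_frame)
    for _ in range(rng.randint(1, max_changes)):
        pos = rng.randrange(len(frame))
        frame[pos] = rng.randrange(256)
    return bytes(frame)


def generate_cases(seed_frame: bytes, count: int, rng_seed: int):
    """Generate a repeatable list of mutated frames from one captured frame."""
    rng = random.Random(rng_seed)  # fixed seed so the run can be reproduced
    return [mutate(seed_frame, rng) for _ in range(count)]


# Hypothetical captured frame; the layout is made up for this example.
seed_frame = bytes([0xA5, 0x01, 0x04, 0x10, 0x20, 0x30, 0x40, 0x7E])

cases = generate_cases(seed_frame, count=5, rng_seed=1234)
for case in cases:
    print(case.hex())
```

Recording the seed frame and the RNG seed in the report means every generated input can be recreated. Sending those frames to a device on the bench remains the tester's job.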
The model that works: human-led, AI-augmented
The defensible posture for an FDA-regulated pen test today:
- Human-owned scope anchored to the threat model, architecture views, and intended use.
- AI-accelerated execution across recon, fuzzing, SCA, payload generation, and triage.
- Human-driven exploitation and chaining of findings into clinically meaningful attack paths.
- Human-authored report with explicit disclosure of AI tooling used, model versions, and what humans verified.
- Named, qualified testers with documented independence from the development team, signing the report.
This is how the Feb 2026 guidance reads in practice. The human is accountable; AI is leverage.
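The split between AI-generated and human-verified work can be enforced in the reporting pipeline. This is a minimal sketch, assuming each finding records its source and the name of the tester who reproduced it. The `Finding` fields, tester name, and finding titles are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    title: str
    source: str                         # "ai" or "human"
    verified_by: Optional[str] = None   # tester who reproduced the finding


def reportable(findings):
    """Keep only findings that a named human has verified."""
    return [f for f in findings if f.verified_by]


def disclosure(findings):
    """Summarize how findings were produced, for the report's AI disclosure."""
    verified = reportable(findings)
    return {
        "ai_generated_human_verified": sum(f.source == "ai" for f in verified),
        "human_found": sum(f.source == "human" for f in verified),
        "held_back_unverified": len(findings) - len(verified),
    }


# Hypothetical findings, for illustration only.
findings = [
    Finding("Unauthenticated BLE characteristic write", "ai", verified_by="J. Tester"),
    Finding("Possible default credential on service port", "ai"),
    Finding("SWD readout protection bypass", "human", verified_by="J. Tester"),
]

report = reportable(findings)
summary = disclosure(findings)
print([f.title for f in report])
print(summary)
```

Unverified AI output never reaches the report in this sketch, and the summary gives the tester a starting point for the disclosure of AI tooling.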
Who does what across a medical device pen test
A practical, phase-by-phase view of where humans, AI augmentation, and AI-only approaches land in a regulated engagement. Green columns show what the FDA requires a human to own. Amber shows where AI legitimately accelerates the human's work. Red shows where AI-only engagements fail review.
Human vs AI across the engagement
- PHASE 01 (Human): Scope & threat-model alignment. Qualified tester scopes against architecture views, data flows, and intended use.
- PHASE 02 (Human + AI): Recon & SBOM/CVE correlation. AI accelerates enumeration and CVE chaining; tester prioritizes and validates.
- PHASE 03 (Human only): Hardware bench & firmware extraction. JTAG/SWD, glitching, side-channel, RF. Physical work no AI can perform.
- PHASE 04 (Human + AI): Fuzz harnesses & payload generation. LLM drafts harnesses and mutations; tester targets them at threat-model paths.
- PHASE 05 (Human): Exploitation, chaining & clinical impact. Tester drives exploitation, judges patient-safety impact, and signs off.
- PHASE 06 (Human signs): Report, traceability & signature. AI drafts sections; named, qualified tester authors, edits, and signs the report.
Green = human-owned. Amber = AI-accelerated, human-validated.
The table below maps the same phases to what fails in an AI-only model.
| Engagement phase | Human-led required | AI-augmented recommended | AI-only not FDA-ready |
|---|---|---|---|
| Scoping against threat model & architecture views | Tester + threat modeler | LLM summarizes threat model artifacts | Cannot reliably infer device-specific scope |
| Reconnaissance & attack surface mapping | Tester reviews and prioritizes | LLM agents accelerate enumeration | Surface coverage, no prioritization |
| Hardware bench, JTAG/SWD, firmware extraction | Tester at calibrated bench | Physical work, not applicable | Not possible |
| Glitching, side-channel, RF testing | Tester with specialized instruments | Physical work, not applicable | Not possible |
| Fuzz harness & abuse case generation | Tester designs against threat model | LLM drafts harnesses, mutations, payloads | Generic harnesses, miss device-specific paths |
| SBOM / SCA / CVE correlation | Tester validates exploitability | LLM correlates and chains findings | High false-positive rate, no validation |
| Exploitation & vulnerability chaining | Tester drives, signs off on impact | LLM suggests chains for tester review | Nondeterministic, not reproducible |
| Clinical workflow abuse cases | Tester with med-device domain expertise | LLM helps draft scenarios | No clinical context, misses patient-harm paths |
| Reporting & traceability to threat model | Tester authors, signs | LLM drafts sections for tester edit | Unsigned, unattributable, fails FDA evidence bar |
| Independence & qualifications statement | Named tester credentials in report | Not applicable | No accountable signatory |
A short checklist for vendor selection
Before signing with any firm pitching "AI penetration testing" for a regulated medical device, ask:
- Who is the named, qualified tester signing the report, and what are their credentials?
- How is the test scoped to the device's threat model and architecture views?
- What AI tools and model versions are used, and for which steps?
- Which findings are AI-generated and which are human-verified before reporting?
- How is the testing reproducible, and what evidence is preserved for the FDA or a notified body?
- How does the firm handle abuse and misuse cases that require clinical workflow understanding?
- Does the firm have a hardware bench, calibrated instruments, and named hardware testers for JTAG, SWD, firmware extraction, glitching, side-channel, and RF work?
- What is the firm's posture on false negatives in safety-critical contexts?
If the vendor cannot answer those clearly, the pen test will not hold up under FDA review.
How Blue Goat Cyber runs this
Our medical device pen testing engagements are human-led and AI-augmented. A qualified tester scopes the engagement against your threat model, owns the exploitation and chaining, signs the report, and discloses AI tooling used. AI handles the parts that benefit from it, and humans handle the parts the FDA expects humans to handle. The deliverable is built around the five FDA-required report elements: independence, scope, duration, methods, and results.
If you are evaluating AI-only pen test vendors, or you have an upcoming 510(k), De Novo, or PMA submission and want a credible, defensible pen test, let's scope it.
Frequently asked questions
Can AI do penetration testing for medical devices?
AI can do parts of it. Reconnaissance, SBOM and CVE correlation, fuzz harness generation, payload mutation, finding triage, and report drafting are all reasonable AI workloads. The pieces AI cannot do on a regulated medical device include scoping the test to the threat model and architecture views, performing hardware work at a bench (JTAG, SWD, firmware extraction, glitching, side-channel, RF), exploiting findings into clinically meaningful attack paths, and signing a report as a qualified, independent tester. A medical device pen test that delegates those pieces to AI will not hold up under FDA review.
Does the FDA accept AI penetration testing?
The FDA's February 3, 2026 premarket cybersecurity guidance does not explicitly accept or reject AI-performed penetration testing. It requires that security testing be credible, scoped to the device's threat model, performed by independent and qualified personnel, and documented with methods, scope, duration, tooling, findings, and results. A human-led, AI-augmented engagement meets that bar. An AI-only engagement does not, because there is no named qualified tester, no reproducibility guarantee, and no way to evidence threat-model-aligned coverage.
What's the difference between automated penetration testing and AI penetration testing?
Automated penetration testing usually refers to scripted, scanner-driven workflows that run a fixed playbook (Nessus, OpenVAS, commercial autonomous platforms). AI penetration testing typically adds an LLM agent on top to plan, vary payloads, and chain findings. Both have a place inside a medical device pen test, and both share the same limitations: neither performs hardware work, neither owns the threat model, and neither can sign an FDA-grade report. Treat them as tools inside a human-led engagement, not as substitutes for one.
Which penetration testing platforms combine human experts with AI?
The credible model in regulated medical device work is a boutique firm with named medical device security testers who use AI tooling for the parts that benefit from it (recon, fuzzing, SCA, triage, drafting) and human testers for scoping, hardware work, exploitation, and reporting. Ask vendors directly which steps are AI-driven, which are human-driven, what tools and model versions are used, and who signs the report. If they cannot answer, the engagement is not FDA-ready.
Will AI replace medical device penetration testers?
Not for FDA-regulated work in any near-term timeframe. AI will continue to compress the mechanical parts of an engagement, which is a good thing. It will not replace the named human accountable for scope, hardware testing, clinical workflow understanding, and a signed report, because those are the things the FDA, notified bodies, and patients rely on. The realistic trajectory is fewer hours spent on triage and reporting, more hours spent on bench work and exploitation, and the same human accountability at the top of the document.
Can AI perform hardware testing on a medical device?
No. Hardware testing requires a physical bench, calibrated instruments, and a tester present with the device. JTAG and SWD probing, firmware extraction from SPI flash or eMMC, voltage and clock glitching, side-channel measurement, and RF testing across BLE, NFC, MedRadio, and proprietary stacks all need hands and hardware. AI can help interpret datasheets, parse captured frames, and draft writeups around the work, but it cannot perform the work.
Is an AI-only pen test cheaper than a human-led engagement?
It looks cheaper on the proposal. It is more expensive in the FDA submission. The cost of an AI-only pen test surfaces later as deficiency letters, additional cycles, delayed clearance, and a second (human) engagement to fix the gaps the first one left. For a Class II or III device, a single missed clearance cycle dwarfs the savings from skipping a credible engagement.
Does an AI-only pen test increase the risk of a cybersecurity deficiency letter?
Yes, materially. The most common deficiency patterns the FDA flags (missing threat-model alignment, unscoped testing, unclear methodology, untraceable findings, no named independent tester, and absent hardware or RF coverage) all map directly to the gaps an AI-only engagement leaves behind. The fix in a response letter is usually to commission the human-led pen test that should have happened in the first place.
How long does a human-led, AI-augmented medical device pen test take?
For a typical connected SiMD with a mobile companion and cloud backend, two to four weeks of active testing is realistic. AI-augmented workflows compress triage, reporting, and SCA work but do not change the time required for bench testing, exploitation, or threat-model-aligned scoping. A three-day "AI pen test" on a complex connected device is a red flag, not a feature.
What should a medical device manufacturer ask AI pen test vendors first?
Three questions tend to settle it quickly: (1) who is the named, qualified tester signing the report; (2) which engagement steps are performed by humans at a bench versus by AI agents; (3) how the engagement is scoped to the device's threat model and architecture views. If the vendor cannot answer those three clearly, the engagement will not hold up under FDA review, regardless of how impressive the underlying AI platform is.