We Open-Sourced the BIP-39 Scanning Engine. Here Is What Is Inside.


We just open-sourced the complete scanning engine that powers the BIP39 Recovery Tool. Here is why that matters, what is inside, and what you can do with it.

If you have ever lost a BIP-39 seed phrase, you already know the feeling. That specific, slow-dawning horror when you realize the twelve or twenty-four words that stand between you and your wallet are gone. Not deleted in the abstract sense. Gone from your memory, gone from the notebook you were sure you kept, gone from whatever document you half-remember creating on a laptop you may or may not still own.

The BIP39 Recovery Tool exists to answer a single question: are those words still somewhere on a storage device, as raw bytes, waiting to be found? In most cases where a user ever had the phrase on a machine at all, the answer is yes. Files get deleted. Browser caches get cleared. But deleting a file usually removes only the filesystem’s record of it, and bytes are stubborn. They linger in sectors the filesystem stopped tracking long ago, until something actually overwrites them.

Today we published the core engine that does that work. Not a demo. Not a sanitized subset. The real thing, under MIT, on GitHub, with every line of code visible.

github.com/mmediasoftwarelab/BIP39RecoveryTool-public

This post explains what is in the repository, how the engine works at a technical level, why we published it, and who it is for.

What Is in the Repository

The repository contains eighteen C++17 source files. No UI code. No installer logic. No licensing system. Just the six components that do the actual work of finding seed phrases on storage devices, plus the BIP-39 wordlist in three header-only formats optimized for different use cases.

The components are:

  • BIP39Checksum – spec-compliant cryptographic validator. Takes a word sequence, maps it to the BIP-39 bitstream, SHA-256 hashes the entropy, and compares checksum bits. False positive rate for random word sequences: between one in 16 (12 words, 4 checksum bits) and one in 256 (24 words, 8 checksum bits).
  • Bip39Sequence – sliding-window extractor. Tokenizes raw bytes on non-alpha boundaries and pulls every consecutive BIP-39 word run of exactly 12, 15, 18, 21, or 24 words. Handles overlapping windows so nothing is missed at sequence boundaries.
  • LowLevelScanner – raw disk engine. Opens a physical device with CreateFileW, reads every byte in configurable block sizes (default 1 MB), runs the two-phase detection pipeline, and emits Qt signals for each hit.
  • Scanner – filesystem walker. Recurses through directories with QDirIterator, runs a fast pre-filter before invoking full sequence extraction on candidates.
  • ScanWorker – Qt thread wrapper. Moves the scanner onto a background thread using Qt’s moveToThread pattern, manages stop/pause state with std::atomic<bool>, and writes a CSV on completion.
  • DriveUtils – device resolution. Converts a Windows drive letter to its physical device path using DeviceIoControl with IOCTL_STORAGE_GET_DEVICE_NUMBER.

Everything compiles against Qt 6.9 with MinGW 64-bit on Windows 10 or 11. The checksum and SHA-256 components have no Qt dependency at all and can be used in any C++17 project.

How the Detection Pipeline Works

This is the part that tends to interest engineers, so let’s go deeper.

When the raw disk scanner opens a physical drive, it does not know anything about the filesystem. It does not care whether the device is NTFS, FAT32, exFAT, or completely unformatted. It reads sectors sequentially in 1 MB blocks and runs every block through a two-phase detection pipeline.

Phase 1 – Word Pair Detection

The first phase is a fast heuristic. For each block, the scanner searches for any occurrence of a BIP-39 word followed by another BIP-39 word within a short proximity window. The word match is boundary-checked: a match is only valid if the characters immediately before and after the word are delimiters (whitespace, punctuation, or null bytes). This prevents partial matches inside longer tokens.

If a word pair is found, the block is flagged and proceeds to phase 2. If no pair is found, the block is skipped entirely. This two-stage approach means the expensive sequence extraction only runs on blocks that have already shown evidence of BIP-39 content, which keeps the scan fast across the overwhelming majority of blocks on a real drive that contain no relevant data at all.

Phase 2 – Sequence Extraction and Checksum Validation

The second phase is where the cryptographic work happens. Bip39Sequence::extract() tokenizes the block on non-alphabetic boundaries and applies a sliding window to find every consecutive BIP-39 word sequence of a valid length (12, 15, 18, 21, or 24 words). A 13-word run produces two candidates: offset 0 through 11, and offset 1 through 12. Nothing is missed at the edges.
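The sliding window is easy to get subtly wrong, so here is a hedged sketch of one way to do it with the standard library. The names tokenize and extractCandidates are hypothetical, matching is case-sensitive by assumption, and the real Bip39Sequence::extract() may differ in details such as how it handles runs longer than 24 words.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_set>
#include <utility>
#include <vector>

using Phrase = std::vector<std::string>;

// Split on every byte that is not an ASCII letter.
std::vector<std::string> tokenize(std::string_view data) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : data) {
        const bool alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        if (alpha) {
            current.push_back(c);
        } else if (!current.empty()) {
            tokens.push_back(std::move(current));
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(std::move(current));
    return tokens;
}

// For each maximal run of wordlist words, emit every window whose length
// is a valid BIP-39 phrase length (12, 15, 18, 21, or 24 words).
std::vector<Phrase> extractCandidates(std::string_view data,
                                      const std::unordered_set<std::string>& words) {
    static constexpr std::array<std::size_t, 5> kLengths{12, 15, 18, 21, 24};
    const std::vector<std::string> tokens = tokenize(data);
    std::vector<Phrase> out;
    std::size_t runStart = 0;
    for (std::size_t i = 0; i <= tokens.size(); ++i) {
        const bool isWord = i < tokens.size() && words.count(tokens[i]) > 0;
        if (isWord) continue;
        const std::size_t runLen = i - runStart;  // run is tokens[runStart, i)
        for (std::size_t len : kLengths) {
            for (std::size_t off = 0; off + len <= runLen; ++off) {
                const auto first =
                    tokens.begin() + static_cast<std::ptrdiff_t>(runStart + off);
                out.emplace_back(first, first + static_cast<std::ptrdiff_t>(len));
            }
        }
        runStart = i + 1;  // next run starts after the non-word token
    }
    return out;
}
```

With this sketch a 13-word run yields two 12-word candidates, and a 15-word run yields four 12-word candidates plus one 15-word candidate. Each candidate would then go on to checksum validation.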

Each candidate sequence is then passed to BIP39Checksum::validate(), which implements the full checksum algorithm from the BIP-39 specification, using SHA-256 as defined in FIPS 180-4:

  1. Each word is mapped to its 11-bit index in the 2048-word canonical list.
  2. The indices are concatenated into a single bitstream, MSB-first.
  3. The bitstream is split: the first (N*11 - CS) bits are entropy; the last CS bits are the checksum, where CS = N/3 for standard phrase lengths.
  4. SHA-256 is computed over the entropy bytes.
  5. The first CS bits of the hash are compared to the extracted checksum bits.

A sequence only triggers a match event if it passes that cryptographic check. A false positive requires a random sequence of BIP-39 words whose SHA-256 checksum bits happen to align by chance. With CS checksum bits the probability is 1 in 2^CS: 1 in 16 for 12-word phrases, falling to 1 in 256 for 24-word phrases. In practice, on a real storage device, long runs of consecutive wordlist words are rare to begin with, so every match the scanner reports warrants serious attention.
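The five steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the code of BIP39Checksum::validate(): the function name checksumMatches is hypothetical, it takes word indices rather than words, and the SHA-256 implementation is passed in as a callback (the hypothetical Sha256Fn type) so the example stays short and self-contained.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical hash hook: returns the 32-byte SHA-256 digest of `data`.
using Sha256Fn =
    std::function<std::array<std::uint8_t, 32>(const std::vector<std::uint8_t>&)>;

// indices: each word's position (0..2047) in the canonical wordlist.
bool checksumMatches(const std::vector<std::uint16_t>& indices,
                     const Sha256Fn& sha256) {
    const std::size_t n = indices.size();
    if (n < 12 || n > 24 || n % 3 != 0) return false;  // 12, 15, 18, 21, 24

    // Steps 1-2: concatenate the 11-bit indices, MSB-first, one bit per slot.
    std::uint8_t bits[264] = {};  // 24 * 11 = 264 maximum bits
    std::size_t pos = 0;
    for (std::uint16_t idx : indices) {
        if (idx >= 2048) return false;
        for (int b = 10; b >= 0; --b)
            bits[pos++] = static_cast<std::uint8_t>((idx >> b) & 1u);
    }

    // Step 3: split into entropy and checksum. The entropy length is
    // always a multiple of 32 bits, so it packs into whole bytes.
    const std::size_t cs = n / 3;
    const std::size_t ent = n * 11 - cs;
    std::vector<std::uint8_t> entropy(ent / 8, 0);
    for (std::size_t i = 0; i < ent; ++i)
        entropy[i / 8] |= static_cast<std::uint8_t>(bits[i] << (7 - i % 8));

    // Steps 4-5: hash the entropy, compare the first cs bits of the digest.
    const std::array<std::uint8_t, 32> digest = sha256(entropy);
    for (std::size_t i = 0; i < cs; ++i) {
        const std::uint8_t hashBit =
            static_cast<std::uint8_t>((digest[i / 8] >> (7 - i % 8)) & 1u);
        if (hashBit != bits[ent + i]) return false;
    }
    return true;
}
```

A quick sanity check for any implementation is the all-zero 128-bit test vector from the BIP-39 specification: eleven repetitions of "abandon" (index 0) followed by "about" (index 3) must validate, while twelve repetitions of "abandon" must not.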

The Wordlist Strategy

One detail that is easy to overlook but matters in a hot scanning path: the wordlist is provided in three separate formats.

bip39_wordlist.h     - std::vector<QString>      (Qt UI, Scanner, Bip39Sequence)
bip39_wordlist_std.h - std::vector<std::string>  (BIP39Checksum, non-Qt code)
bip39_wordlist_raw.h - std::vector<QByteArray>   (LowLevelScanner raw-byte matching)

All three contain the same canonical 2048-word BIP-39 English list. The split exists to avoid conversion overhead in tight loops. When you are scanning a 1 TB drive in 1 MB blocks, the cost of converting between string types on every candidate is not free. Keeping the raw scanner working with QByteArray directly and the checksum validator working with std::string directly eliminates a class of unnecessary allocations in the hot path.

Zero Network Calls – and Why That Is Non-Negotiable

The air-gap property is not a marketing claim. It is an architectural constraint that was designed in from the start, and it is verifiable by anyone who can run grep.

grep -rn "http\|socket\|network\|QNetworkAccess\|curl\|telemetry" *.h *.cpp

You will find nothing. No QNetworkAccessManager. No curl. No socket(). No HTTP endpoints. No analytics calls. No license check that reaches out to a server. The scan begins on your CPU and ends with a CSV on your disk. Nothing else is involved.

Consider why this matters specifically for this use case. A tool that scans storage devices for seed phrases is, by definition, a tool that will find seed phrases if they exist. It runs on machines that may contain – or may have once contained – active wallet credentials. If that tool were making network calls of any kind, every security-conscious user would be right to wonder what those calls contain. The question would be unanswerable without source code access.

With source code access, the question answers itself. There is nothing to wonder about.

The commercial product that wraps this engine, available at mmediasoftwarelab.com, inherits this property. The UI, the installer, and the licensing system sit on top of this engine. The engine does not change because a wrapper exists around it.

Why We Published It

Publishing production source code for a commercial product is not a default decision. It requires a reason. Ours comes down to one observation: security tools that handle sensitive data without auditable source code are asking for a level of trust they have not earned.

The landscape of cryptocurrency recovery tools is, to be charitable, mixed. There are legitimate tools built by serious engineers. There are also tools that exist specifically to harvest the credentials they claim to recover. The average user cannot distinguish between them by looking at a website or reading a description. A binary is a black box. A signed binary from an unknown publisher is still a black box.

Publishing the engine under MIT changes that calculus. A security researcher can audit every line. A developer can build from source and run the compiled binary they built themselves, without trusting our installer at all. An advanced user who is not comfortable running a commercial binary on a machine that might hold wallet data can verify the air-gap claim and the checksum logic before they commit to running anything.

We also published it because the BIP-39 recovery problem is one where the ecosystem benefits from good reference implementations. The BIP-39 specification is clear and publicly documented. But a production-grade implementation that handles raw disk I/O, boundary-safe word matching, overlapping sequence windows, and thread-safe pause and resume – one that has been tested against real drives and real recovery scenarios – is a different artifact from reading the spec and writing something from scratch. If other developers working on wallet recovery tooling can build on this rather than reinventing it, that is a better outcome for everyone who ever loses access to a wallet.

Technical Decisions Worth Noting

A few implementation choices in this codebase are worth calling out explicitly, because they reflect specific tradeoffs that are not always obvious from reading the code alone.

Stack-Allocated Bit Array in BIP39Checksum

The checksum validator allocates its working bit array on the stack:

uint8_t bits[264] = {};  // 24 * 11 = 264 maximum bits

This is deliberate. The validator is called for every candidate sequence that passes phase 1, which on a large drive can be thousands of times per second. Heap allocation on every call would be measurably slower and would create unnecessary pressure on the allocator. The maximum size is bounded by the BIP-39 spec (24 words times 11 bits = 264 bits), so the stack allocation is safe and the size is known at compile time.

Atomic Stop Flag in the Hot Path

The scan loop uses std::atomic<bool> stopRequested for its exit condition rather than a mutex-guarded flag. The tradeoff is intentional. A mutex check on every block iteration would serialize the scan thread with any thread that needs to signal a stop, which introduces unnecessary latency in the common case where the scan is running normally. An atomic flag is cheap to read, lock-free, and sufficient for the single-writer, single-reader pattern here.

Pause and resume use a different mechanism: QMutex combined with QWaitCondition. When the scan is paused, the worker thread blocks on the wait condition and burns no CPU. When resume is called, the condition is signaled and the thread wakes immediately. This is the correct pattern for a condition that is expected to hold for extended periods, as opposed to the stop flag which is expected to be checked and then acted on quickly.
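For readers outside Qt, the same split can be sketched with standard-library primitives, with std::mutex and std::condition_variable standing in for QMutex and QWaitCondition. The class name ScanControl, the function runScan, and the extra atomic paused flag used as a lock-free fast path are assumptions of this example, not a description of ScanWorker's actual members.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>

class ScanControl {
public:
    void requestStop() {
        {   // Set under the lock so a worker about to wait cannot miss it.
            std::lock_guard<std::mutex> lk(m_);
            stopRequested_.store(true);
        }
        cv_.notify_all();
    }
    void pause() {
        std::lock_guard<std::mutex> lk(m_);
        paused_.store(true);
    }
    void resume() {
        {
            std::lock_guard<std::mutex> lk(m_);
            paused_.store(false);
        }
        cv_.notify_all();
    }
    // Hot path: a single atomic read, no lock.
    bool shouldStop() const { return stopRequested_.load(); }

    // Blocks without burning CPU while paused; returns early on stop.
    void waitIfPaused() {
        if (!paused_.load()) return;  // fast path when running normally
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !paused_.load() || stopRequested_.load(); });
    }

private:
    std::atomic<bool> stopRequested_{false};
    std::atomic<bool> paused_{false};
    std::mutex m_;
    std::condition_variable cv_;
};

// Hypothetical scan loop: processes blocks until done or stopped.
// Returns the number of blocks actually processed.
template <typename BlockFn>
std::size_t runScan(ScanControl& ctl, std::size_t blockCount, BlockFn processBlock) {
    std::size_t done = 0;
    for (std::size_t i = 0; i < blockCount; ++i) {
        ctl.waitIfPaused();
        if (ctl.shouldStop()) break;
        processBlock(i);
        ++done;
    }
    return done;
}
```

In the common case the loop touches only atomics. The mutex is taken only when a pause is actually in effect, which is the situation where blocking is the desired behavior anyway.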

Two-Phase Detection and the Cost of False Positives

The word-pair heuristic in phase 1 is designed to be fast and to have a low miss rate, not a low false positive rate. Its job is to avoid running the expensive checksum validation on blocks that obviously contain no BIP-39 content. False positives at this stage are fine – they just mean the block proceeds to phase 2, which then correctly rejects them. False negatives at this stage would mean missed seed phrases, which is the one failure mode the tool cannot tolerate.

The checksum validation in phase 2 is where precision matters. A false positive at that stage produces a match event that the user investigates and finds contains no real wallet. A false negative at that stage means a genuine seed phrase was found and discarded. For a run that already consists entirely of wordlist words, the cryptographic checksum check keeps the false positive rate between 1 in 16 (12 words) and 1 in 256 (24 words), which in practice means essentially every flagged match is worth examining.

Who This Is For

There are several distinct groups who will find this repository useful, and they are looking for different things.

Developers building wallet recovery tooling can integrate these components directly. BIP39Checksum and the SHA-256 implementation have no Qt dependency and can be dropped into any C++17 project. Bip39Sequence works with the QString wordlist, so it needs Qt but not Windows. LowLevelScanner requires Windows and Qt but handles all the raw I/O complexity so you do not have to. The wordlist headers are ready to use without modification.

Security researchers who want to audit recovery tools before recommending them, or who are investigating what a tool does before running it on a sensitive machine, now have a complete implementation to read. The air-gap claim is verifiable. The checksum algorithm is documented with spec references. The word-matching boundary logic is explicit and testable.

Users of the commercial product who want to understand exactly what the software they are running does to their storage device now have that ability. You do not have to trust our word that the tool does not exfiltrate data. You can read the code that handles your data and verify it yourself.

Cryptocurrency professionals and institutional users who need to evaluate tools for use in a professional context, whether that is a recovery service, a law firm handling estate disputes over crypto assets, or a forensic team, now have a basis for that evaluation that goes beyond vendor claims.

Advanced users who are not comfortable running a commercial installer on a machine that might still hold wallet data can build from source. The build instructions are in the README. Add the headers and source files to a Qt .pro file, compile, and you have a binary you built yourself from code you read yourself.

The Commercial Product

The engine in this repository is one part of a larger application. The BIP39 Recovery Tool available at mmediasoftwarelab.com wraps this engine in a Windows application with a full UI, drive selection, progress reporting, result management, and an installer that handles deployment correctly on modern Windows systems.

The commercial product is built for users who need results, not users who want to integrate a library. If you lost a seed phrase, are not a developer, and need the best chance of recovering it without touching a compiler, the commercial tool is the right choice. It runs the same detection pipeline published here, surfaces every match with context, and handles the operational complexity of scanning a physical drive safely.

If you are a developer, a researcher, or someone who prefers to build from source, this repository gives you everything you need to run the core engine without the commercial wrapper.

Both paths lead to the same engine. That is the point.

Getting Started

The repository is at github.com/mmediasoftwarelab/BIP39RecoveryTool-public. It requires Qt 6.9 or later with the MinGW 64-bit toolchain on Windows 10 or 11. To integrate into an existing Qt project:

HEADERS += scanner.h scanworker.h lowlevelscanner.h driveutils.h \
           bip39_checksum.h bip39_sequence.h sha256.h \
           bip39_wordlist.h bip39_wordlist_raw.h bip39_wordlist_std.h

SOURCES += scanner.cpp scanworker.cpp lowlevelscanner.cpp driveutils.cpp \
           bip39_checksum.cpp bip39_sequence.cpp sha256.cpp

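The checksum validator and SHA-256 implementation have no Qt dependency. If you only need BIP-39 validation in a non-Qt C++17 project, four files (bip39_checksum.h, bip39_checksum.cpp, sha256.h, sha256.cpp) plus the bip39_wordlist_std.h wordlist header are sufficient.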

LowLevelScanner and DriveUtils require Windows system headers (windows.h, winioctl.h). All other components are cross-platform.

The LICENSE is MIT. Use it, fork it, integrate it, build on it. Attribution appreciated but not required.

What Comes Next

The engine is stable and production-tested. The commercial application has been running this code in real recovery scenarios. That said, there are areas we expect to develop over time: broader platform support is a natural direction given that the checksum and sequence components are already cross-platform, and contributions from the community that improve performance or extend language support beyond the English wordlist are welcome.

If you find a bug, open an issue. If you find a security concern, contact us directly at support@mmediasoftwarelab.com before publishing. We take both seriously.

The release is tagged v1.0.0. The code in that tag is the code running in the commercial product today. Future changes to the engine will be reflected here.

The BIP39 Recovery Tool is available at mmediasoftwarelab.com. The engine source is at github.com/mmediasoftwarelab/BIP39RecoveryTool-public. MIT licensed. Questions and audit findings welcome.
