IDL Guesser: Recovering Instruction Layouts from Closed-Source Solana Programs

April 4, 2025

The Solana ecosystem is vibrant, but many programs are not yet open source. Statistics compiled by Syndica in February 2024 show that nearly 50% of Solana's top 100 programs by compute units publish an Interface Definition Language (IDL) file. However, this number drops to just 20% for the top 1000 programs. Furthermore, even published IDLs aren't always reliable: it's not uncommon for a published IDL to be outdated and no longer match the deployed on-chain program.

As auditors and security researchers, when we identify an interesting pattern that could lead to vulnerabilities, we often look for similar weaknesses in other programs. However, without source code or accurate IDLs, this process is often limited to basic GitHub searches, which frequently yield unmaintained projects.

Most Solana programs are written in Rust and compiled to Solana Bytecode Format (sBPF), an eBPF-based format. Reverse-engineering compiled Rust is challenging, and the sBPF-related reverse-engineering toolchain is still developing. This opacity hinders malicious actors, but it also slows down whitehat hackers and security researchers who are trying to identify and responsibly disclose vulnerabilities.

To analyze any closed-source Solana program – whether dynamically or statically – the fundamental prerequisite is understanding how to interact with it. This means knowing its instructions, the accounts each instruction requires, and the properties of those accounts (like signer or writable status).

To address these challenges, our security researcher Qi Qin led the effort and developed a prototype tool called IDL Guesser. This tool aims to automatically recover instruction definitions, required accounts (including signer/writable flags), and parameter information directly from closed-source Solana program binaries.

This post outlines the approach behind IDL Guesser and discusses potential areas for future improvement.

Leveraging Anchor Patterns for Reverse Engineering

Since a large portion of Solana development utilizes the Anchor framework – and the concept of an IDL originates there – IDL Guesser currently focuses specifically on Anchor-based programs. Anchor significantly simplifies development by providing macros and helper functions for common tasks and checks. Crucially, this leads to predictable, standardized code structures in the compiled output, which we can leverage for our analysis through pattern matching.

For debugging purposes, the Anchor CLI even offers an anchor expand command, which reveals the code generated by its macros. Examining this expanded code provides valuable insight into the patterns Anchor employs, guiding our reverse-engineering efforts.

Entrypoint and Dispatch Logic

Consider the structure of a typical Anchor program's entrypoint after macro expansion. The program first deserializes the raw input. It performs basic checks (such as verifying the program_id and ensuring instruction_data is at least 8 bytes long) before entering the dispatch function. Inside dispatch, the first 8 bytes of instruction_data are interpreted as the instruction discriminator. This discriminator determines which specific instruction handler function should be executed.

According to Anchor's documentation, this 8-byte discriminator is derived from the instruction's namespace and name (e.g., global:initialize_config) by taking the first 8 bytes of its SHA-256 hash. Instead of trying to extract these raw discriminator bytes from the compiled code (which can be complex), IDL Guesser takes a simpler approach: it focuses on extracting the instruction names first, and then calculates the corresponding discriminators using Anchor's standard hashing method.
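The derivation described above can be sketched in a few lines of Python. This is a minimal illustration, not IDL Guesser's actual code: the helper names are hypothetical, and the PascalCase-to-snake_case conversion is a simplification that assumes plain names without digits or acronyms. Note that the handler logs the name in PascalCase (InitializeConfig), while the hash preimage uses the snake_case form (initialize_config).

```python
import hashlib
import re


def to_snake_case(name: str) -> str:
    """Convert a logged PascalCase name to snake_case (simplified: no digit/acronym handling)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def discriminator(logged_name: str, namespace: str = "global") -> bytes:
    """First 8 bytes of SHA-256 over "<namespace>:<snake_case_name>"."""
    preimage = f"{namespace}:{to_snake_case(logged_name)}"
    return hashlib.sha256(preimage.encode()).digest()[:8]


def split_instruction_data(data: bytes):
    """Mirror the dispatch step: the first 8 bytes select the handler, the rest are arguments."""
    if len(data) < 8:
        raise ValueError("instruction data shorter than the 8-byte discriminator")
    return data[:8], data[8:]


# Build a discriminator -> name lookup table from recovered handler names.
recovered_names = ["Initialize", "InitializeConfig"]
lookup = {discriminator(name): name for name in recovered_names}

disc, args = split_instruction_data(discriminator("InitializeConfig") + b"\x01\x02")
print(lookup[disc], args.hex())
```

With such a table, raw on-chain instruction data can be mapped back to a recovered instruction name without ever locating the discriminator constants in the binary.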

Identifying Instruction Handlers

How do we find the instruction names and their corresponding handler functions? Anchor provides another helpful pattern at the beginning of each instruction handler.

Anchor inserts a sol_log call at the start of each handler, logging the instruction's name (e.g., "Instruction: InitializeConfig") for log-parsing purposes. This log message provides a distinct signature we can search for in the compiled binary.

Figure: The sol_log syscall sequence used to log the instruction name InitializeConfig.

At the assembly level (as shown above), this log call often translates into specific lddw, mov64, and call instructions set up to invoke the sol_log syscall with the instruction name string.

By identifying these patterns, IDL Guesser can reliably locate the entry points of instruction handlers and extract their names.
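As a rough illustration of this kind of pattern matching, the sketch below scans raw bytecode for an lddw r1 / mov64 r2 / call sequence and reads the referenced string. It relies on the standard eBPF encoding (8-byte slots, lddw spanning two slots) and on several simplifying assumptions that do not hold for a real ELF file: the lddw immediate is treated as a direct offset into a rodata blob (no relocations or section mapping), the three instructions are assumed to be adjacent and in this order, and the call target is not checked. The function names are hypothetical.

```python
import struct

LDDW, MOV64_IMM, CALL_IMM = 0x18, 0xB7, 0x85  # standard eBPF opcodes
LOG_PREFIX = b"Instruction: "


def decode(code: bytes):
    """Decode each 8-byte slot into (opcode, dst_reg, imm)."""
    slots = []
    for off in range(0, len(code) - 7, 8):
        opcode, regs, _offset, imm = struct.unpack_from("<BBhi", code, off)
        slots.append((opcode, regs & 0x0F, imm))
    return slots


def find_instruction_names(code: bytes, rodata: bytes):
    """Return (slot_index, name) for every matching log-call sequence."""
    slots = decode(code)
    found = []
    for i in range(len(slots) - 3):
        (op0, dst0, lo), (_, _, hi), (op2, dst2, length), (op3, _, _) = slots[i:i + 4]
        if (op0, dst0) != (LDDW, 1) or (op2, dst2) != (MOV64_IMM, 2) or op3 != CALL_IMM:
            continue
        if length <= 0:
            continue
        # lddw carries the low 32 bits in its first slot and the high 32 bits in the second.
        addr = ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)
        text = rodata[addr:addr + length]
        # Rust strings are not NUL-terminated, so the length in r2 delimits the name.
        if len(text) == length and text.startswith(LOG_PREFIX):
            found.append((i, text[len(LOG_PREFIX):].decode("ascii", "replace")))
    return found
```

The length loaded into r2 matters because adjacent string literals are stored back to back; scanning rodata for "Instruction: " alone would not reveal where each name ends.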

Extracting Account Information

Immediately following the initial logging and parameter deserialization (which often gets inlined by the compiler), the handler typically calls a corresponding try_accounts function. This function is responsible for parsing and validating the accounts required by the instruction.

Figure: The call to the try_accounts function (sub_662B0 here) following the parameter deserialization.

Examining the Accounts struct and the generated try_accounts function for initialize_config shows that try_accounts performs several crucial actions:

- It iterates through the expected accounts, attempting to parse them based on their type (Account, Signer, Program, etc.). If parsing fails within a nested try_accounts call (like for funder), Anchor conveniently attaches the account name (e.g., "funder") to the error message. This allows us to extract account names by looking for these specific error-handling patterns.
- It applies constraint checks (mut, signer, has_one, seeds, owner, rent exemption, etc.) based on the attributes defined in the Accounts struct. Importantly, each constraint violation typically maps to a unique Anchor ErrorCode (e.g., ConstraintMut is 2000, ConstraintSigner is 2002).

Figure: The check for sufficient account keys for accounts that need to be initialized, branching to error 3005 (AccountNotEnoughKeys) on failure.

Figure: Constraint-check conditional jumps leading to specific error codes such as 2000 (ConstraintMut) or 2002 (ConstraintSigner).

By analyzing the control-flow graph (CFG) of the try_accounts function, specifically by following the "happy path" (successful execution), IDL Guesser can piece together the required accounts:

- The order of account processing reveals the expected sequence of accounts.
- String literals associated with error messages (like "funder") reveal account names.
- Specific error codes reached on failure paths (like 2000, 2002, 2005) indicate the constraints applied to each account (mutable, signer, rent-exempt, etc.).
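To make the three observations above concrete, here is a small sketch of how they could be combined into IDL-like account entries. It assumes the CFG walk has already produced the parse order and a list of (account name, error code) pairs; that input format, the function name, and the output field names are hypothetical. The code table is a subset of Anchor's constraint ErrorCode values.

```python
# Subset of Anchor's ErrorCode values for account constraints.
CONSTRAINT_CODES = {
    2000: "ConstraintMut",
    2001: "ConstraintHasOne",
    2002: "ConstraintSigner",
    2003: "ConstraintRaw",
    2004: "ConstraintOwner",
    2005: "ConstraintRentExempt",
    2006: "ConstraintSeeds",
}


def build_accounts(parse_order, checks):
    """Combine account order and per-account error codes into IDL-like entries.

    parse_order: account names in the order try_accounts parses them.
    checks: (account_name, error_code) pairs seen on failure branches.
    """
    accounts = {
        name: {"name": name, "writable": False, "signer": False, "constraints": []}
        for name in parse_order
    }
    for name, code in checks:
        entry = accounts[name]
        entry["constraints"].append(CONSTRAINT_CODES.get(code, f"Unknown({code})"))
        entry["writable"] |= code == 2000  # a ConstraintMut check implies a writable account
        entry["signer"] |= code == 2002    # a ConstraintSigner check implies a signer
    return list(accounts.values())


accounts = build_accounts(
    ["config", "funder", "system_program"],
    [("config", 2000), ("config", 2005), ("funder", 2000), ("funder", 2002)],
)
for account in accounts:
    print(account)
```

Entries built this way also make simple automated checks possible, such as flagging an account that looks like an authority but never reaches a signer check.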

Extracting Parameters

While extracting instruction names and account details relies on relatively distinct patterns (log strings, error codes, specific function calls), recovering information about instruction parameters is harder.

Anchor doesn't typically generate detailed error messages specific to individual parameter deserialization failures. This means the original parameter names defined in the Rust source code are usually lost during compilation. Furthermore, the code responsible for sequentially deserializing parameters from the ix_data slice is often optimized and inlined by the compiler, making reliable assembly-level pattern matching very difficult.

A more promising future direction might involve symbolic execution to determine the expected byte-length and potentially the type of each parameter by analyzing how ix_data is consumed.

However, the IDL Guesser prototype adopts a simpler alternative based on dynamic analysis: since Solana instruction data is usually short due to transaction size limits, we can iteratively probe the handler function. By gradually increasing the length of mocked input data and observing changes in the execution trace (e.g., passing a previously failing check), we can infer the boundaries, and potentially the types, of parameters as new deserialization steps succeed.
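The probing idea can be illustrated with a toy model. In the sketch below, mock_handler is a hypothetical stand-in for executing the real handler in an sBPF VM and measuring how far deserialization progressed; the real tool observes execution traces rather than a simple counter. The probe only recovers sizes of fixed-width fields; variable-length types such as strings or vectors would need extra handling.

```python
def mock_handler(ix_data: bytes, layout=(8, 1, 4)) -> int:
    """Stand-in for the real handler: return how many fields deserialized successfully.

    ix_data is the instruction data after the 8-byte discriminator.
    The default layout models three fixed-width fields (e.g. u64, u8, u32).
    """
    consumed = 0
    fields_ok = 0
    for size in layout:
        if len(ix_data) - consumed < size:
            break  # deserialization would fail here
        consumed += size
        fields_ok += 1
    return fields_ok


def probe_field_sizes(run, max_len: int = 64):
    """Grow zero-filled input one byte at a time and record where progress increases."""
    sizes = []
    last_boundary = 0
    progress = run(b"")
    for n in range(1, max_len + 1):
        new_progress = run(bytes(n))
        if new_progress > progress:
            sizes.append(n - last_boundary)  # bytes consumed by the newly parsed field
            last_boundary = n
            progress = new_progress
    return sizes


print(probe_field_sizes(mock_handler))
```

Each increase in progress marks a field boundary, so the differences between consecutive boundaries give the field widths, which in turn hint at candidate types.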

The same iterative feedback loop is also used to reconstruct the internal layout of accounts. We do not elaborate on the details here, but interested readers can explore the relevant implementation in the source code. The technique could also potentially be used to verify or refine the recovered instruction and account information.

Implementation Details

With these patterns identified, the implementation involves disassembling the sBPF bytecode and performing pattern matching on the instruction sequences and CFG. We built upon the existing solana-sbpf project (see static_analysis.rs), which provides a foundation for Solana program analysis.

We made some modifications to the base static analysis implementation. The original version tended to split basic blocks excessively after every function call. We adjusted this to create larger, more manageable blocks. Additionally, we added special handling for syscalls like abort or panic. These changes result in a more precise CFG that simplifies the pattern matching process.
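The effect of these adjustments can be shown on a toy instruction list. The sketch below is not the solana-sbpf implementation; it uses a hypothetical (mnemonic, argument) representation and computes basic-block start indices so that ordinary calls do not end a block, while jumps, exit, and calls to non-returning syscalls (abort and sol_panic_ here) do.

```python
NORETURN_SYSCALLS = {"abort", "sol_panic_"}


def block_starts(insns):
    """Return sorted indices at which a basic block begins.

    insns: list of (mnemonic, argument) tuples, where jumps carry a target
    index and calls carry the callee name.
    """
    leaders = {0}
    for i, (op, arg) in enumerate(insns):
        if op in ("ja", "jcond"):
            leaders.add(arg)      # jump target starts a block
            leaders.add(i + 1)    # so does the instruction after the jump
        elif op == "exit" or (op == "call" and arg in NORETURN_SYSCALLS):
            leaders.add(i + 1)    # control never falls through
        # Any other call returns to the next instruction, so the block continues.
    return sorted(index for index in leaders if index < len(insns))


program = [
    ("call", "sol_log_"),  # 0: ordinary syscall, does not split the block
    ("mov", None),         # 1
    ("jcond", 4),          # 2: conditional jump to the happy path
    ("call", "abort"),     # 3: failure path, never returns
    ("mov", None),         # 4
    ("exit", None),        # 5
]
print(block_starts(program))
```

Keeping the logging call and the following instructions in one block means a pattern such as "log, deserialize, call try_accounts" can be matched within a single block rather than across several.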

The full implementation is open-sourced and available at the IDLGuesser repository.

The current code handles many common scenarios but also includes logic for some corner cases, such as UncheckedAccount and Sysvar accounts.

The try_accounts logic for these is often inlined by the compiler, creating patterns similar to init accounts. However, challenges remain, particularly with multiple consecutive UncheckedAccount instances or more complex structures like optional accounts and nested account contexts, which this prototype doesn't fully handle yet.

Conclusion

IDL Guesser demonstrates a viable approach to recovering essential structural information – instructions with their corresponding account and parameter information – from closed-source, Anchor-based Solana programs by leveraging the framework's code generation patterns and simple dynamic analysis. While the prototype has limitations and may require manual reverse-engineering and cross-referencing with on-chain data in complex cases, it successfully recovers IDL-like information for a significant number of programs.

We have found this capability helpful: it enables broader analysis of transaction data and facilitates automated scanning for basic vulnerabilities related to account constraints (e.g., missing signer checks). By shedding light on the inner workings of closed-source programs, we hope tools like IDL Guesser can contribute to the effort of safeguarding the Solana ecosystem.
