A few days ago, Zellic released an updated version of their smart contract dataset. Like the previous dataset, the new version, all-ethereum-contracts, contains all historical smart contract deployments up to a recent cutoff (February 2025 in this case). The only difference is that it now provides the raw bytecode instead of the original source code. Let's use this new dataset to create a simple static analysis script that extracts ABI selectors from the bytecode.
You can read more about function selectors here and event selectors here. The ABI spec of Solidity (which Vyper also uses) can be read here.
There are multiple existing solutions already (in no specific order):
Some of them use static analysis and some use dynamic analysis to infer the selectors from the bytecode. Some also try to infer the full ABI JSON, but in this post we focus only on retrieving the selectors. That said, most implementations just use a pre-image dictionary to resolve the full ABI from the selector (the same technique can be used with our version).
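To make the pre-image dictionary idea concrete, here is a minimal sketch. A function selector is the first 4 bytes of the Keccak-256 hash of the canonical signature, and an event topic is the full 32-byte hash. The pure-Python Keccak below is only there to keep the example dependency-free (a real tool would use a vetted library), and the helper names `function_selector`, `event_topic` and `build_lookup` are made up for this illustration.

```python
MASK64 = (1 << 64) - 1


def _rol64(value, shift):
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def _keccak_f1600(lanes):
    lfsr = 1
    for _ in range(24):
        # theta
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4] for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota (round constants generated with the LFSR from the spec)
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def keccak256(data: bytes) -> bytes:
    rate = 136  # bytes absorbed per block for a 256-bit digest
    # Original Keccak padding 0x01 ... 0x80 (NIST SHA3-256 uses 0x06 instead of 0x01).
    padded = bytearray(data) + b"\x01" + b"\x00" * (-(len(data) + 1) % rate)
    padded[-1] |= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    for off in range(0, len(padded), rate):
        block = padded[off:off + rate]
        for i in range(rate // 8):
            lanes[i % 5][i // 5] ^= int.from_bytes(block[8 * i:8 * i + 8], "little")
        lanes = _keccak_f1600(lanes)
    return b"".join(lanes[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


def function_selector(signature: str) -> str:
    return keccak256(signature.encode()).hex()[:8]


def event_topic(signature: str) -> str:
    return keccak256(signature.encode()).hex()


def build_lookup(signatures):
    # Pre-image dictionary: selector -> human readable signature.
    return {function_selector(s): s for s in signatures}


lookup = build_lookup(["name()", "transfer(address,uint256)", "transferFrom(address,address,uint256)"])
print(lookup.get("23b872dd"))
```

Any selector that is not in the dictionary simply stays unresolved, which is why the quality of the signature list matters more than the lookup itself.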
The log selector logic works similarly for both Solidity and Vyper. The same is true for the function selectors, but the jump table where they are used differs a lot between the compilers. Vyper recently did a write-up on how their constant-time jump tables work. Solidity hasn't really done any write-ups as far as I know, but if you look inside the source code you can see that they mention some patterns.
Instead of looking at what the developers tell us, we will look at the bytecode. Since we know the ABI ahead of time, we can see where and how the selectors are placed within the bytecode. We then slide a window over the opcodes to find the recurring patterns around each selector, pick the most common patterns, and build a script around them that we can evaluate.
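As a rough sketch of that idea, assuming nothing beyond the EVM opcode encoding: disassemble the bytecode, replace known selectors and jump targets with placeholders, and slide a fixed-size window over the result. The names `disassemble`, `template` and `windows` are hypothetical, the opcode table only covers what this toy dispatcher needs, and the bytecode is hand-assembled for the example.

```python
NAMES = {0x00: "STOP", 0x10: "LT", 0x11: "GT", 0x14: "EQ", 0x15: "ISZERO", 0x18: "XOR",
         0x56: "JUMP", 0x57: "JUMPI", 0x5B: "JUMPDEST", 0x80: "DUP1", 0x81: "DUP2"}


def disassemble(code: bytes):
    """Return (pc, name, immediate) tuples, skipping over PUSH data."""
    ops, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32 carry 1..32 immediate bytes
            size = op - 0x5F
            ops.append((pc, f"PUSH{size}", code[pc + 1:pc + 1 + size]))
            pc += 1 + size
        else:
            ops.append((pc, NAMES.get(op, f"OP_{op:02X}"), b""))
            pc += 1
    return ops


def template(ops, known_selectors):
    """Replace PUSH immediates with placeholders so that patterns generalize."""
    jumpdests = {pc for pc, name, _ in ops if name == "JUMPDEST"}
    tokens = []
    for _, name, imm in ops:
        if not name.startswith("PUSH"):
            tokens.append(name)
        elif imm.hex() in known_selectors:
            tokens.append("<func_selector>")
        elif int.from_bytes(imm, "big") in jumpdests:
            tokens.append(f"{name} <jumpdest>")
        else:
            tokens.append(f"{name} <data>")
    return tokens


def windows(tokens, size):
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# DUP1 PUSH4 a9059cbb EQ PUSH2 000c JUMPI STOP JUMPDEST STOP
code = bytes.fromhex("8063a9059cbb1461000c57005b00")
tokens = template(disassemble(code), {"a9059cbb"})
print(windows(tokens, 4)[0])
```

Counting how often each window string occurs across many contracts is then enough to surface the dominant dispatcher shapes.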
Sadly, there isn't any good bytecode dataset for doing this that includes all major compiler versions for both Solc and Vyper, or at least I don't know of any that isn't outdated. Therefore I created one by sampling block intervals from the all-ethereum-contracts dataset and deduplicating on the provided bytecode hash. The compiler version is then fetched from Etherscan based on the address.
We then ended up with the following
We postprocess this with the verified contract response from Etherscan to get out the ABI selectors from the returned ABI.
            
from evm import get_opcodes_from_bytecode, PushOpcode, JUMP_DEST
from collections import defaultdict
from copy import deepcopy
from tqdm import tqdm
import random
import glob
import json
import os
WINDOW_SIZE = 16
MAX_PATTERNS = 10
FOLDER_PATH = os.environ.get("FOLDER_PATH")
assert FOLDER_PATH is not None
patterns = {
    "solc": defaultdict(int),
    "vyper": defaultdict(int),
}
def transform_opcodes_window(bytecode, opcodes, selectors, index):
    """Template the window starting at `index`: selectors, jump targets and other PUSH data become placeholders."""
    opcodes_window = opcodes[index : index + WINDOW_SIZE]
    # Position of the first selector inside the window (inf when there is none).
    min_index = float("inf")
    for offset, op in enumerate(opcodes_window):
        if not isinstance(op, PushOpcode):
            continue
        op_args_int = int.from_bytes(bytes.fromhex(op.args), byteorder="big")
        if op.args in selectors["functions"]:
            opcodes_window[offset] = "<func_selector>"
            min_index = min(offset, min_index)
        elif op.args in selectors["events"]:
            opcodes_window[offset] = "<log_selector>"
            min_index = min(offset, min_index)
        # Bounds-check against the bytecode (not the opcode list) before indexing into it.
        elif op_args_int < len(bytecode) and bytecode[op_args_int] == JUMP_DEST:
            # Keep the PUSH width in the token, e.g. "PUSH2 <jumpdest>".
            opcodes_window[offset] = f"{op.name} <jumpdest>"
        else:
            opcodes_window[offset] = f"{op.name} <data>"
    return opcodes_window, min_index
def main():
    for file in tqdm(glob.glob(os.path.join(FOLDER_PATH, "**/*.json"))):
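        with open(file, "r") as f:
            data = json.load(f)
            bytecode = data["bytecode"]
            # removeprefix (Python 3.9+) only drops a literal "0x"; lstrip("0x") would also eat leading zero digits.
            bytecode = bytes.fromhex(bytecode.removeprefix("0x"))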
            selectors = data["selectors"]
            if selectors is None:
                continue
        compiler = data["compiler"]["kind"]
        opcodes = get_opcodes_from_bytecode(bytecode)
        for index, _ in enumerate(opcodes):
            opcodes_window, min_index = transform_opcodes_window(
                bytecode, opcodes, selectors, index
            )
            if min_index == float("inf"):
                continue
            opcodes_window_og = deepcopy(opcodes_window)
            while len(opcodes_window) > 2 and (
                "<func_selector>" in opcodes_window
                or "<log_selector>" in opcodes_window
            ):
                current_window = " ".join(list(map(str, list(opcodes_window))))
                patterns[compiler][current_window] += 1
                opcodes_window = opcodes_window[:-1]
            opcodes_window = opcodes_window_og
            while len(opcodes_window) > 2 and (
                "<func_selector>" in opcodes_window
                or "<log_selector>" in opcodes_window
            ):
                current_window = " ".join(list(map(str, list(opcodes_window))))
                patterns[compiler][current_window] += 1
                opcodes_window = opcodes_window[1:]
    compiler_patterns = {}
    for compiler in patterns:
        compiler_patterns[compiler] = []
        for pattern in sorted(
            list(patterns[compiler].keys()),
            key=lambda x: patterns[compiler][x],
            reverse=True,
        ):
            for v in compiler_patterns[compiler]:
                # If it's a subset, let's skip.
                if v in pattern or pattern in v:
                    break
            else:
                compiler_patterns[compiler].append(pattern)
            if len(compiler_patterns[compiler]) > MAX_PATTERNS:
                break
    print(json.dumps(compiler_patterns, indent=4))
if __name__ == "__main__":
    main()
            
        
Then we get out the following patterns:
EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
DUP1 <func_selector> EQ PUSH2 <jumpdest>
<func_selector> EQ PUSH2 <jumpdest>
EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ
DUP1 <func_selector> EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data>
PUSH1 <data> MSTORE PUSH1 <data> PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data> JUMPI
MSTORE PUSH1 <data> PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data> JUMPI
There is one obvious pattern that we are missing: when the selector is 0x00000000, the compiler will usually optimize the check into an ISZERO comparison. We also did not see cases where the function selectors are split with GT, another optimization for larger contracts with many functions.
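The naive pattern model itself is not listed in this post, so the following is only a sketch of what such a matcher could look like, built from the two dominant shapes above: `<func_selector> EQ PUSH2 <jumpdest> JUMPI` for solc and `<func_selector> DUP2 XOR PUSH2 <data> JUMPI` for Vyper. The function names are hypothetical, only PUSH4 immediates are considered, and the ISZERO and GT cases mentioned above are deliberately not handled.

```python
PUSH2, PUSH4, EQ, XOR, DUP2, JUMPI = 0x61, 0x63, 0x14, 0x18, 0x81, 0x57

SOLC_TAIL = (EQ, PUSH2, JUMPI)          # <func_selector> EQ PUSH2 <jumpdest> JUMPI
VYPER_TAIL = (DUP2, XOR, PUSH2, JUMPI)  # <func_selector> DUP2 XOR PUSH2 <data> JUMPI


def walk(code: bytes):
    """Yield (opcode, immediate) pairs, skipping over PUSH data."""
    pc = 0
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        yield op, code[pc + 1:pc + 1 + size]
        pc += 1 + size


def extract_function_selectors(code: bytes):
    ops = list(walk(code))
    found = []
    for i, (op, imm) in enumerate(ops):
        if op != PUSH4 or len(imm) != 4:
            continue
        following = tuple(o for o, _ in ops[i + 1:i + 5])
        if following[:3] == SOLC_TAIL or following == VYPER_TAIL:
            selector = imm.hex()
            if selector not in found:
                found.append(selector)
    return found


solc_like = bytes.fromhex("8063a9059cbb1461001057806306fdde031461002057")
vyper_like = bytes.fromhex("5b6323b872dd811861003057")
print(extract_function_selectors(solc_like), extract_function_selectors(vyper_like))
```

Matching on the opcodes after the PUSH4 is what keeps precision high: a random 4-byte constant is rarely followed by a comparison and a conditional jump.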
        
Note 1: We did not test gigahorse in this comparison because of its long execution time. Happy to retry if there is a config I can tune to get a response quicker.
Note 2: This isn't an entirely fair evaluation, as some of these tools do more than just extract the selectors and therefore have additional complexity.
Note 3: This blog post was originally written with a focus on the function selectors. The post has since been updated to include more info on the log selectors, but this benchmark still only covers the function selectors.
| Rank | Model | F1-Score | Recall | Precision | 
|---|---|---|---|---|
| 🥇 1 | Evmmole | 0.9785 | 0.9588 | 0.9990 | 
| 🥈 2 | sevm | 0.8980 | 0.8157 | 0.9989 | 
| 🥉 3 | Our naive pattern model | 0.7986 | 0.6655 | 0.9983 | 
| 4 | whatsabi | 0.7986 | 0.6655 | 0.9983 | 
| 5 | heimdall | 0.7886 | 0.6514 | 0.9989 | 
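For reference, this is how the three columns relate for a single contract, as a minimal sketch. How the scores are aggregated over the whole dataset is not described here, so treat the per-contract set comparison as an assumption. The F1 score is the harmonic mean of precision and recall, which can be checked against the rows in the table.

```python
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted: set, expected: set):
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall, f1_score(precision, recall)


# A tool that finds two of the three real selectors and nothing else.
print(precision_recall_f1({"a9059cbb", "06fdde03"}, {"a9059cbb", "06fdde03", "23b872dd"}))
# The naive pattern model row: precision 0.9983 and recall 0.6655.
print(round(f1_score(0.9983, 0.6655), 4))
```

High precision with lower recall, as in most rows above, means a tool rarely reports a wrong selector but misses some real ones.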
Obviously the dynamic analysis approaches beat the static analysis approaches. However, our naive implementation is still able to get a pretty good F1-score.
The opcode patterns around a selector should also be learnable by a simple neural network, which should then (hopefully) improve on our existing naive approach.
            
import torch
WINDOW_SIZE = 5
class SelectorDetector(torch.nn.Module):
    def __init__(self, vocab_size, classes):
        super(SelectorDetector, self).__init__()
        self.head =  torch.nn.Sequential(
            torch.nn.Embedding(vocab_size, 128),
        )
        self.body = torch.nn.Sequential(
            torch.nn.Linear(128, 256),
            torch.nn.BatchNorm1d(WINDOW_SIZE),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.3),
            torch.nn.Linear(256, 128),
            torch.nn.Sigmoid(),
            torch.nn.Linear(128, classes + 1),
        )
        self.apply(self._init_weights)
    def _init_weights(self, module):
        if isinstance(module, torch.nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight)
            torch.nn.init.zeros_(module.bias)
        elif isinstance(module, torch.nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0, std=0.1)
    def forward(self, X):
        out = self.body(self.head(X))
        return out.mean(dim=1)
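The featurization is not shown here, so the following is only one plausible way to produce the input the model expects: integer token ids with shape (batch, WINDOW_SIZE). The vocabulary layout, the padding token and the centered window are all assumptions made for this sketch.

```python
WINDOW_SIZE = 5
PAD, UNK = "<pad>", "<unk>"


def build_vocab(sequences):
    """Assign an integer id to every opcode name seen in the training sequences."""
    vocab = {PAD: 0, UNK: 1}
    for sequence in sequences:
        for token in sequence:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_windows(tokens, vocab):
    """Return one window of WINDOW_SIZE ids centered on each token position."""
    half = WINDOW_SIZE // 2
    padded = [PAD] * half + list(tokens) + [PAD] * half
    return [
        [vocab.get(token, vocab[UNK]) for token in padded[i:i + WINDOW_SIZE]]
        for i in range(len(tokens))
    ]


tokens = ["DUP1", "PUSH4", "EQ", "PUSH2", "JUMPI"]
vocab = build_vocab([tokens])
print(encode_windows(tokens, vocab)[1])  # the window centered on PUSH4
```

Wrapping the result in `torch.tensor(...)` gives an integer tensor that can be fed to the embedding layer, with `vocab_size` set to `len(vocab)`.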
    
            
        
Now let's evaluate it and look at the results:
| Rank | Model | F1-Score | Recall | Precision | 
|---|---|---|---|---|
| 🥇 1 | Evmmole | 0.9785 | 0.9588 | 0.9990 | 
| 🥈 2 | Our torch model | 0.9283 | 0.9429 | 0.9142 | 
| 🥉 3 | sevm | 0.8980 | 0.8157 | 0.9989 | 
| 4 | Our naive pattern model | 0.7986 | 0.6655 | 0.9983 | 
| 5 | whatsabi | 0.7986 | 0.6655 | 0.9983 | 
| 6 | heimdall | 0.7886 | 0.6514 | 0.9989 | 
Nice! We are now only behind a dynamic analysis solution, not bad.
There are similar patterns for logs, although they are much less regular than the ones for function selectors.
JUMPI CALLER PUSH20 <data> AND PUSH32 <selector> CALLVALUE PUSH1 <data> MLOAD
PUSH2 <data> JUMPI CALLER PUSH20 <data> AND PUSH32 <selector> CALLVALUE PUSH1 <data>
CALLER PUSH20 <data> AND PUSH32 <selector> CALLVALUE PUSH1 <data> MLOAD DUP1
PUSH20 <data> AND PUSH32 <selector> CALLVALUE PUSH1 <data> MLOAD DUP1 DUP3
MLOAD DUP9 DUP2 MSTORE PUSH32 <selector> DUP7 ADDRESS SWAP3
DUP9 DUP2 MSTORE PUSH32 <selector> DUP7 ADDRESS SWAP3 LOG3
DUP2 MSTORE PUSH32 <selector> DUP7 ADDRESS SWAP3 LOG3 PUSH2 <data>
PUSH1 <data> MLOAD DUP9 DUP2 MSTORE PUSH32 <selector> DUP7 ADDRESS
CALLVALUE GT ISZERO PUSH2 <data> JUMPI PUSH32 <selector> CALLER CALLVALUE
GT ISZERO PUSH2 <data> JUMPI PUSH32 <selector> CALLER CALLVALUE PUSH1 <data>
PUSH1 <data> MLOAD CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH1 <data> MSTORE
MLOAD CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH1 <data> MSTORE PUSH1 <data>
POP SSTORE PUSH1 <data> MLOAD CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD
SSTORE PUSH1 <data> MLOAD CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH1 <data>
PUSH2 <data> MSTORE PUSH32 <selector> PUSH1 <data> PUSH2 <data> LOG1 STOP JUMPDEST
PUSH1 <data> CALLDATALOAD PUSH2 <data> MSTORE PUSH32 <selector> PUSH1 <data> PUSH2 <data> LOG1
JUMPDEST CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH2 <data> MSTORE PUSH2 <data>
PUSH2 <data> JUMP JUMPDEST CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH2 <data>
JUMP JUMPDEST CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH2 <data> MSTORE
PUSH2 <data> PUSH2 <data> JUMP JUMPDEST CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD
There are a few things to note from the patterns above. The compiler might choose to optimize to a PUSH{N..32} if the hash of the topic has leading zeros. Depending on the compiler, it might also optimize it to be a CODECOPY. There is also no guarantee that the
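Since the log patterns are this irregular, a looser heuristic is one way to get started. The sketch below is my own simplification and not the method used for the benchmark: it reports every PUSH32 immediate that is followed by a LOG1 to LOG4 within a small number of instructions. The `lookahead` value is an arbitrary choice, and topics pushed with a shorter PUSH or copied with CODECOPY are missed, as noted above.

```python
PUSH32 = 0x7F
LOG1, LOG4 = 0xA1, 0xA4  # LOG0 (0xA0) carries no topic, so it is ignored


def walk(code: bytes):
    """Yield (opcode, immediate) pairs, skipping over PUSH data."""
    pc = 0
    while pc < len(code):
        op = code[pc]
        size = op - 0x5F if 0x60 <= op <= 0x7F else 0
        yield op, code[pc + 1:pc + 1 + size]
        pc += 1 + size


def extract_log_topic_candidates(code: bytes, lookahead: int = 16):
    ops = list(walk(code))
    found = []
    for i, (op, imm) in enumerate(ops):
        if op != PUSH32 or len(imm) != 32:
            continue
        upcoming = ops[i + 1:i + 1 + lookahead]
        if any(LOG1 <= nxt <= LOG4 for nxt, _ in upcoming):
            topic = imm.hex()
            if topic not in found:
                found.append(topic)
    return found


transfer_topic = "ddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"
# PUSH32 <topic> PUSH1 0 PUSH1 0 LOG1
code = bytes.fromhex("7f" + transfer_topic + "60006000a1")
print(extract_log_topic_candidates(code))
```

Because any 32-byte constant near a LOG qualifies, this trades precision for recall and is best combined with a pre-image dictionary of known event signatures.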
There are rollups that support both the EVM and a non-EVM runtime, and that allow the two to communicate. Examples are Stylus from Arbitrum and EraVM from ZKsync. One question that came to mind is whether these non-EVM binaries use the same selector logic or handle it in a different way.
Stylus is a WASM runtime that allows users to write smart contracts in traditional programming languages (C++, Rust, Go, etc.) instead of Solidity. Since it's WASM, all related tooling, such as wasm-objdump, just works.
from wasm_tob import (
    decode_module,
    format_instruction,
    format_lang_type,
    format_mutability,
    SEC_DATA,
    SEC_ELEMENT,
    SEC_GLOBAL,
    SEC_CODE,
    decode_bytecode,
    INSN_ENTER_BLOCK,
    INSN_LEAVE_BLOCK,
)
def pad_hex(val):
    if len(val) % 2 == 0:
        return val
    return val.replace("0x", "0x0")
def format_instruction(insn, data_sections):
    text = insn.op.mnemonic
    if not insn.imm:
        return text
    def format_isnt(text, x):
        # Constants may be decoded as signed ints; mask to the unsigned view because that is how the selectors are encoded
        if text == "i32.const":
            return pad_hex(hex(x & 0xFFFFFFFF))
        if text == "i64.const":
            return pad_hex(hex(x & 0xFFFFFFFFFFFFFFFF))
        return x
    
    args = [
        getattr(insn.op.imm_struct, x.name).to_string(format_isnt(text, getattr(insn.imm, x.name)))
        for index, x in enumerate(insn.op.imm_struct._meta.fields)
    ]
    base = text + ' ' + ', '.join(args)
    if text == "i64.load":
        load_section = int(args[1], 16)
        data = None
        for i in data_sections:
            if i[0] <= load_section < i[1]:
                delta = load_section - i[0]
                data = i[2][delta:delta+8].hex()
        if data is None:
            return base
        return base + f"# data: {data}"
    else:
        return base 
def disas(raw):
    mod_iter = iter(decode_module(raw))
    header, header_data = next(mod_iter)
    data_sections = []
    
    for cur_sec, cur_sec_data in mod_iter:
        if cur_sec_data.id == SEC_DATA:
            for idx, entry in enumerate(cur_sec_data.payload.entries):
                offset = entry.offset[0].imm.value
                data = entry.data.tobytes()
                data_sections.append((
                    offset, offset + len(data), data
                ))
    mod_iter = iter(decode_module(raw))
    header, header_data = next(mod_iter)
    
    for cur_sec, cur_sec_data in mod_iter:
        if cur_sec_data.id == SEC_CODE:
            code_sec = cur_sec_data.payload
            for i, func_body in enumerate(code_sec.bodies):
                print('{x} sub_{id:04X} {x}'.format(x='=' * 35, id=i))
                indent = 0
                raw = func_body.code.tobytes()
                for cur_insn in decode_bytecode(raw):
                    if cur_insn.op.flags & INSN_LEAVE_BLOCK:
                        indent -= 1
                    print('  ' * indent + format_instruction(cur_insn, data_sections))
                    if cur_insn.op.flags & INSN_ENTER_BLOCK:
                        indent += 1
if __name__ == "__main__":
    with open('example_stylus_pencil.wasm', 'rb') as f:
        raw = f.read()
    disas(raw)
                
                
If we run the script above on an example ERC20 Stylus contract and grep for some of the selectors, we can see that the function selectors and log selectors are in the binary.
            
# name() 
python3 decompile.py | grep "0x06fdde03" -A 3
    i32.const '0x06fdde03'
    i32.ne
    br_if 18
    get_local 1
# transferFrom(address,address,uint256)
python3 decompile.py  | grep "0x23b872dd" -A 3  
    i32.const '0x23b872dd'
    i32.eq
    br_if 9
    get_local 2
# Approval(address,address,uint256)
# (because it is loaded in chunks, I only grep the last 8 bytes, but you can see it chained together)
python3 decompile.py  | grep "5b200ac8c7c3b925" -A 16  
    i64.load 0, 0x8648# data: 5b200ac8c7c3b925
    i64.store 3, 0
    get_local 1
    i32.const '0x10'
    i32.add
    i32.const '0x00'
    i64.load 0, 0x8640# data: dd0314c0f7b2291e
    i64.store 3, 0
    get_local 1
    i32.const '0x08'
    i32.add
    i32.const '0x00'
    i64.load 0, 0x8638# data: d14f71427d1e84f3
    i64.store 3, 0
    get_local 1
    i32.const '0x00'
    i64.load 0, 0x8630# data: 8c5be1e5ebec7d5b
# Transfer(address,address,uint256)
# (because it is loaded in chunks, I only grep the last 8 bytes, but you can see it chained together)
python3 decompile.py  | grep "28f55a4df523b3ef" -A 16
    i64.load 0, 0x8628# data: 28f55a4df523b3ef
    i64.store 3, 0
    get_local 1
    i32.const '0x10'
    i32.add
    i32.const '0x00'
    i64.load 0, 0x8620# data: 952ba7f163c4a116
    i64.store 3, 0
    get_local 1
    i32.const '0x08'
    i32.add
    i32.const '0x00'
    i64.load 0, 0x8618# data: 69c2b068fc378daa
    i64.store 3, 0
    get_local 1
    i32.const '0x00'
    i64.load 0, 0x8610# data: ddf252ad1be2c89b
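Given that the function selectors show up as plain `i32.const` immediates, a cheap check that skips the disassembler is to search the raw binary for the encoded instruction. WASM encodes `i32.const` as opcode 0x41 followed by a signed LEB128 integer, so selectors at or above 0x80000000 have to be converted to their negative 32-bit value first. This is only a sketch with hypothetical helper names: a raw byte search can produce false positives, and it misses constants that are loaded from the data section, as the log selectors above are.

```python
def sleb128(value: int) -> bytes:
    """Encode a signed integer as LEB128, the varint format WASM uses."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, so negative values converge to -1
        sign_bit_set = bool(byte & 0x40)
        done = (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


def i32_const_pattern(selector: int) -> bytes:
    # Reinterpret the unsigned 4-byte selector as a signed i32.
    signed = selector - (1 << 32) if selector >= (1 << 31) else selector
    return b"\x41" + sleb128(signed)


def contains_selector(wasm: bytes, selector: int) -> bool:
    return i32_const_pattern(selector) in wasm


print(i32_const_pattern(0x06FDDE03).hex())
```

Running `contains_selector` over a list of candidate selectors from a pre-image dictionary gives a quick first pass before doing any real control-flow analysis.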
                
EraVM is a custom instruction set and it has its own compiler built on LLVM. The compiler luckily comes with a decompiler. With some help from Deepwiki, the same logic was ported over to a Python script.
            
import re
from typing import Dict, Tuple, List
class EraVMDecoder:
    def __init__(self):
        opcodes = [
            ('<invalid>', 0, 'direct'), ('nop', 1, 'nop'), ('add', 25, 'arith_comm'), 
            ('sub', 73, 'arith_ncomm'), ('mul', 169, 'arith_comm'), ('div', 217, 'arith_ncomm'),
            ('jump', 313, 'jump'), ('xor', 319, 'arith_comm'), ('and', 367, 'arith_comm'), 
            ('or', 415, 'arith_comm'), ('shl', 463, 'arith_ncomm'), ('shr', 559, 'arith_ncomm'),
            ('rol', 655, 'arith_ncomm'), ('ror', 751, 'arith_ncomm'), ('addp', 847, 'arith_ptr'),
            ('subp', 895, 'arith_ptr'), ('pack', 943, 'arith_ptr'), ('shrnk', 991, 'arith_ptr'),
            ('call', 1039, 'direct'), ('this', 1040, 'direct'), ('par', 1041, 'direct'),
            ('code', 1042, 'direct'), ('meta', 1043, 'direct'), ('ergs', 1044, 'direct'),
            ('sp', 1045, 'direct'), ('ldvl', 1046, 'direct'), ('stvl', 1047, 'direct'),
            ('stpub', 1048, 'direct'), ('inctx', 1049, 'direct'), ('lds', 1050, 'direct'),
            ('sts', 1051, 'direct'), ('callf', 1057, 'farcall'), ('calld', 1061, 'farcall'),
            ('callm', 1065, 'farcall'), ('ret', 1069, 'direct'), ('retl', 1070, 'direct'),
            ('rev', 1071, 'direct'), ('revl', 1072, 'direct'), ('pnc', 1073, 'direct'),
            ('pncl', 1074, 'direct'), ('ldm.h', 1075, 'heap'), ('stm.h', 1077, 'heap'),
            ('ldm.st', 1096, 'static'), ('stm.st', 1100, 'static')
        ]
        
        self.opcode_map = {op[1]: (op[0], op[2]) for op in opcodes}
        self.sorted_opcodes = sorted([op[1] for op in opcodes], reverse=True)
        
        self.src_modes = ['reg', 'sp_pop', 'sp_rel', 'stack_abs', 'imm', 'code']
        self.dst_modes = ['reg', 'sp_push', 'sp_rel', 'stack_abs']
        self.conditions = ['none', 'gt', 'lt', 'eq', 'ge', 'le', 'ne', 'gtlt']
        
        self.code_ref_regex = re.compile(r'code\[(?:r[0-9]+\+)?([0-9]+)\]')
    def decode_instruction(self, data: bytes) -> Dict:
        """Decode 8-byte EraVM instruction."""
        if len(data) != 8:
            raise ValueError("Instructions must be 8 bytes")
        
        ins = int.from_bytes(data, 'big')
        imm1 = (ins >> 48) & 0xFFFF
        imm0 = (ins >> 32) & 0xFFFF
        dst1 = (ins >> 28) & 0xF
        dst0 = (ins >> 24) & 0xF
        src1 = (ins >> 20) & 0xF
        src0 = (ins >> 16) & 0xF
        pred = (ins >> 13) & 0x7
        opcode = ins & 0x7FF
        
        base, src_mode, dst_mode, flags = self._analyze_opcode(opcode)
        name = self.opcode_map.get(base, ('<unknown>', 'direct'))[0]
        
        return {
            'mnemonic': name,
            'src0_reg': f'r{src0}', 'src1_reg': f'r{src1}',
            'dst0_reg': f'r{dst0}', 'dst1_reg': f'r{dst1}',
            'imm0': imm0, 'imm1': imm1,
            'predicate': self.conditions[pred] if pred < 8 else f'pred{pred}',
            'src_mode': self.src_modes[src_mode] if 0 <= src_mode < 6 else 'none',
            'dst_mode': self.dst_modes[dst_mode] if 0 <= dst_mode < 4 else 'none',
            'raw_opcode': opcode, 'base_opcode': base,
            **flags
        }
    def _analyze_opcode(self, opcode: int) -> Tuple[int, int, int, Dict]:
        """Analyze opcode to extract base instruction and operand modes."""
        base = next((op for op in self.sorted_opcodes if op <= opcode), 0)
        if base not in self.opcode_map:
            return 0, -1, -1, {}
        
        delta = opcode - base
        encoding = self.opcode_map[base][1]
        src_mode = dst_mode = -1
        flags = {}
        
        if delta > 0:
            if encoding == 'nop':
                dst_mode, src_mode = delta % 4, (delta // 4) % 6
            elif encoding == 'arith_comm':
                flags['set_flags'] = bool(delta % 2)
                dst_mode, src_mode = (delta // 2) % 4, (delta // 8) % 6
            elif encoding == 'arith_ncomm':
                flags.update({
                    'swap': bool(delta % 2),
                    'set_flags': bool((delta // 2) % 2)
                })
                dst_mode, src_mode = (delta // 4) % 4, (delta // 16) % 6
            elif encoding == 'arith_ptr':
                flags['swap'] = bool(delta % 2)
                dst_mode, src_mode = (delta // 2) % 4, (delta // 8) % 6
            elif encoding == 'jump':
                src_mode = delta % 6
            elif encoding == 'farcall':
                flags.update({
                    'is_shard': bool(delta % 2),
                    'is_static': bool((delta // 2) % 2)
                })
        
        return base, src_mode, dst_mode, flags
    def stringify_instruction(self, decoded: Dict, constants: Dict = None) -> str:
        """Convert decoded instruction to assembly string."""
        constants = constants or {}
        
        mnemonic = decoded.get('mnemonic', '<unknown>')
        if decoded.get('swap'): mnemonic += '.s'
        if decoded.get('set_flags'): mnemonic += '!'
        if decoded.get('predicate', 'none') != 'none':
            mnemonic += f".{decoded['predicate']}"
        
        operands = self._format_operands(decoded, constants)
        
        comments = []
        if 'code_ref_comment' in decoded:
            comments.append(decoded['code_ref_comment'])
        
        result = f"{mnemonic:<8} {', '.join(operands)}" if operands else mnemonic
        if comments:
            result += f" # {', '.join(comments)}"
        
        return result.strip()
    def _format_operands(self, d: Dict, constants: Dict) -> List[str]:
        """Format operands based on instruction type."""
        mnemonic = d.get('mnemonic', '').split('.')[0]
        operands = []
        
        if mnemonic in ['add', 'sub', 'mul', 'div', 'and', 'or', 'xor', 'shl', 'shr', 'rol', 'ror']:
            operands.append(self._format_src_operand(d, 0, constants))
            operands.append(d.get('src1_reg', 'r0'))
            operands.append(self._format_dst_operand(d, 0))
        
        elif mnemonic in ['this', 'par', 'code', 'meta', 'ergs', 'sp', 'ldvl', 'stvl']:
            operands.append(d.get('dst0_reg', 'r0'))
        
        elif mnemonic in ['retl', 'revl', 'pncl']:
            operands.append(str(d.get('imm0', 0)))
        
        elif mnemonic == 'jump':
            operands.append(self._format_src_operand(d, 0, constants))
        
        elif mnemonic in ['callf', 'calld', 'callm']:
            operands.extend([
                d.get('src0_reg', 'r0'),
                d.get('src1_reg', 'r0'),
                str(d.get('imm0', 0))
            ])
        
        return operands
    def _format_src_operand(self, d: Dict, src_idx: int, constants: Dict) -> str:
        """Format source operand based on addressing mode."""
        src_mode = d.get('src_mode', 'reg')
        reg_key = f'src{src_idx}_reg'
        imm_key = f'imm{src_idx}'
        
        if src_mode == 'imm':
            return str(d.get(imm_key, 0))
        elif src_mode == 'code':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            if reg == 'r0' and imm != 0:
                if imm in constants:
                    d['code_ref_comment'] = f"code[{imm}] = {constants[imm]}"
                return f"code[{imm}]"
            return f"code[{reg}]" if imm == 0 else f"code[{reg}+{imm}]"
        elif src_mode == 'stack_abs':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            return f"stack[{reg}]" if imm == 0 else f"stack[{imm} + {reg}]"
        else:
            return d.get(reg_key, 'r0')
    def _format_dst_operand(self, d: Dict, dst_idx: int) -> str:
        """Format destination operand based on addressing mode."""
        dst_mode = d.get('dst_mode', 'reg')
        reg_key = f'dst{dst_idx}_reg'
        imm_key = f'imm{1 if dst_idx == 0 else 0}'
        
        if dst_mode == 'stack_abs':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            return f"stack[{reg}]" if imm == 0 else f"stack[{imm} + {reg}]"
        else:
            return d.get(reg_key, 'r0')
    def analyze_binary(self, data: bytes) -> Dict:
        """Analyze complete EraVM binary."""
        if len(data) % 32:
            raise ValueError("Binary size must be multiple of 32 bytes")
        
        const_start = self._find_constant_section(data)
        
        instructions = []
        for i in range(0, const_start, 8):
            if i + 8 <= len(data):
                instr_bytes = data[i:i+8]
                try:
                    decoded = self.decode_instruction(instr_bytes)
                except Exception:
                    mnemonic = '<padding>' if all(b == 0 for b in instr_bytes) else '<metadata>'
                    decoded = {'mnemonic': mnemonic}
                
                instructions.append({
                    'address': i,
                    'bytes': instr_bytes,
                    'decoded': decoded
                })
        
        constants = []
        for i in range(const_start, len(data), 32):
            if i + 32 <= len(data):
                cell_bytes = data[i:i+32]
                constants.append({
                    'word_number': i // 32,
                    'address': i,
                    'bytes': cell_bytes,
                    'value': "0x" + cell_bytes.hex()
                })
        
        return {
            'instructions': instructions,
            'constants': constants,
            'constant_section_start': const_start
        }
    def _find_constant_section(self, data: bytes) -> int:
        """Find where constant section begins by analyzing code references."""
        min_ref = float('inf')
        
        for i in range(0, len(data) - 7, 8):
            try:
                decoded = self.decode_instruction(data[i:i+8])
                asm_str = self.stringify_instruction(decoded)
                refs = [int(match) for match in self.code_ref_regex.findall(asm_str)]
                if refs:
                    min_ref = min(min_ref, min(refs))
            except Exception:
                continue
            
            if i % 32 == 24 and min_ref == (i + 8) // 32:
                return i + 8
        
        return len(data)
    def format_disassembly(self, address: int, instr_bytes: bytes, 
                          decoded: Dict, constants: Dict = None) -> str:
        """Format instruction for disassembly output."""
        hex_bytes = ' '.join(f'{b:02x}' for b in instr_bytes)
        asm_str = self.stringify_instruction(decoded, constants or {})
        return f"{address:08x}: {hex_bytes:<24} {asm_str}"
def main():
    try:
        with open("example_weth.hex", "r") as f:
            data = bytes.fromhex(f.read())
        
        decoder = EraVMDecoder()
        result = decoder.analyze_binary(data)
        
        const_lookup = {c['word_number']: c['value'] for c in result['constants']}
        
        for instr in result['instructions']:
            print(decoder.format_disassembly(
                instr['address'],
                instr['bytes'],
                instr['decoded'],
                const_lookup
            ))
        
        for const in result['constants']:
            print(f"{const['word_number']}:")
            print(f"\t.cell {const['value']}")
    
    except FileNotFoundError:
        print("Example file not found - decoder ready for use")
if __name__ == "__main__":
    main()      
                        
                
If we look at WETH compiled for EraVM, we can again see that the selectors are within the binary.
            
# name() 
python3 eravm_decompiler.py | grep "06fdde03" -A 3
    00000090: 00 00 01 1a 04 20 00 9c  sub.s!   code[282], r2, r4 # code[282] = 0x0000000000000000000000000000000000000000000000000000000006fdde03
    00000098: 00 00 00 d6 00 00 61 3d  jump.eq  214
    000000a0: 00 00 01 1b 04 20 00 9c  sub.s!   code[283], r2, r4 # code[283] = 0x00000000000000000000000000000000000000000000000000000000095ea7b3
    000000a8: 00 00 00 ee 00 00 61 3d  jump.eq  238
# transferFrom(address,address,uint256)
python3 eravm_decompiler.py | grep "23b872dd"  -A 3
    000003f8: 00 00 01 17 04 20 00 9c  sub.s!   code[279], r2, r4 # code[279] = 0x0000000000000000000000000000000000000000000000000000000023b872dd
    00000400: 00 00 01 73 00 00 61 3d  jump.eq  371
    00000408: 00 00 01 18 04 20 00 9c  sub.s!   code[280], r2, r4 # code[280] = 0x000000000000000000000000000000000000000000000000000000002e1a7d4d
    00000410: 00 00 01 85 00 00 61 3d  jump.eq  389
# Approval(address,address,uint256)
python3 eravm_decompiler.py | grep "0x8c5be1e5ebec7d5bd14f71427d1e84f3dd0314c0f7b2291e5b200ac8c7c3b925"  -A 3
    00000948: 00 00 01 2b 04 00 00 41  add      code[299], r0, r4 # code[299] = 0x8c5be1e5ebec7d5bd14f71427d1e84f3dd0314c0f7b2291e5b200ac8c7c3b925
    00000950: 00 00 00 02 05 00 00 29  add      r0, r0, r5
    00000958: 00 00 00 03 06 00 00 29  add      r0, r0, r6
    00000960: 04 1f 04 15 00 00 04 0f  call
# Transfer(address,address,uint256)
python3 eravm_decompiler.py | grep "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"  -A 3
    00001ef8: 00 00 01 2e 04 00 00 41  add      code[302], r0, r4 # code[302] = 0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef
    00001f00: 00 00 00 06 05 00 00 29  add      r0, r0, r5
    00001f08: 00 00 00 03 06 00 00 29  add      r0, r0, r6
    00001f10: 04 1f 04 15 00 00 04 0f  call
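In the output above, the function dispatch shows up as a `sub.s!` against a constant followed by a `jump.eq`. Assuming that shape holds more generally (I only checked it on this one contract), the selectors can be pulled out of the disassembly text with a small regex. The function name and the regex are made up for this sketch and depend on the exact output format of the script above.

```python
import re

# Matches e.g. "sub.s!   code[282], r2, r4 # code[282] = 0x00...06fdde03"
SUB_CONST = re.compile(r"sub\.s!\s+code\[\d+\].*=\s*0x([0-9a-f]{64})")


def selectors_from_disassembly(lines):
    found = []
    for current, following in zip(lines, lines[1:]):
        match = SUB_CONST.search(current)
        if not match or "jump.eq" not in following:
            continue
        value = match.group(1)
        # A selector is a 4-byte value, so the upper 28 bytes must be zero.
        if value[:56] == "0" * 56:
            found.append(value[56:])
    return found


listing = [
    "00000090: 00 00 01 1a 04 20 00 9c  sub.s!   code[282], r2, r4 # code[282] = 0x0000000000000000000000000000000000000000000000000000000006fdde03",
    "00000098: 00 00 00 d6 00 00 61 3d  jump.eq  214",
    "000000a0: 00 00 01 1b 04 20 00 9c  sub.s!   code[283], r2, r4 # code[283] = 0x00000000000000000000000000000000000000000000000000000000095ea7b3",
    "000000a8: 00 00 00 ee 00 00 61 3d  jump.eq  238",
]
print(selectors_from_disassembly(listing))
```

The log topics are loaded with `add code[N], r0, r4` instead, so they would need a separate rule.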
                
Great, we wrote a selector extraction algorithm in a few hours using a data-driven approach and benchmarked it to verify that it works.
If you liked this blog post, you might also like the following posts (not written by me):