
Static Analysis for EVM Contract Selectors

Published: May 18, 2025

Updated: August 12, 2025: I added a demo page for the model so that one can play with it here.

Updated: September 7, 2025: added a new section on how rollups with dual-execution VMs also have the ABI selectors in their bytecode. The log selectors section was also improved.

A few days ago Zellic released an updated version of their smart contract dataset. The new version, all-ethereum-contracts, is like the previous dataset in that it contains all smart contract deployments made historically up until recently (February 2025 in this case); the only difference is that it now gives us the raw bytecode instead of the original source code. Given this new dataset, let's use it to write a simple static analysis script that extracts ABI selectors from the bytecode.

ABI selectors

Readers are expected to already know about the Application Binary Interface and its selectors, but here is a high-level explanation. The ABI is the standard interface that users and smart contracts use to communicate with each other. Part of the ABI are the selectors, which are used to distinguish between the different functions and logs within a smart contract.
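
To make this concrete, here is a minimal sketch of how both kinds of selectors are derived, assuming pycryptodome is available for keccak-256 (the signatures used are the standard ERC-20 ones):

from Crypto.Hash import keccak

def keccak256(data: bytes) -> bytes:
    return keccak.new(digest_bits=256, data=data).digest()

# Function selector: the first 4 bytes of keccak256 of the canonical signature.
print(keccak256(b"transfer(address,uint256)")[:4].hex())
# -> a9059cbb

# Log (event) selector: the full 32-byte hash, emitted as topic0 of the log.
print(keccak256(b"Transfer(address,address,uint256)").hex())
# -> ddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef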

You can read more about function selectors here and event selectors here. The ABI spec of Solidity (which Vyper also uses) can be read here.

Existing solutions

There are multiple existing solutions already (in no specific order):

Some of them use static analysis and some use dynamic analysis to infer the selectors from bytecode. Some also try to infer the full ABI JSON, but in this post we focus only on retrieving the selectors. That said, most implementations simply use a pre-image dictionary to resolve the full ABI from the selector (the same technique can be used with our version).
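
As a small illustration of the pre-image dictionary idea (the dictionary contents and helper name here are made up for the example), resolving a recovered selector back to a human-readable signature is just a lookup:

# Hypothetical pre-image dictionary: 4-byte selector -> canonical signature.
SELECTOR_PREIMAGES = {
    "a9059cbb": "transfer(address,uint256)",
    "23b872dd": "transferFrom(address,address,uint256)",
    "06fdde03": "name()",
}

def resolve_selector(selector_hex: str) -> str | None:
    return SELECTOR_PREIMAGES.get(selector_hex.removeprefix("0x").lower())

print(resolve_selector("0x23b872dd"))  # transferFrom(address,address,uint256)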

Known compiler logic

The log selector logic works similarly for both Solidity and Vyper. The same is true for the function selectors, but the jump table where they are used differs a lot between the compilers. Vyper recently did a write-up on how their constant-time jump tables work. Solidity hasn't really published any write-ups as far as I know, but if you look inside the source code you can see some of the patterns mentioned.
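
For intuition, one entry of the classic Solidity-style linear dispatch looks roughly like the snippet below; this is a hand-written sketch of the shape, not exact compiler output:

# One entry of a linear selector dispatch: compare the selector extracted from
# calldata against a hard-coded constant and jump to the function body on match.
DISPATCH_ENTRY = [
    "DUP1",                   # duplicate the calldata selector on the stack
    "PUSH4 <func_selector>",  # the constant selector baked into the bytecode
    "EQ",                     # compare the two
    "PUSH2 <jumpdest>",       # offset of the function body
    "JUMPI",                  # conditional jump when they match
]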

Our approach

Instead of going by what the developers tell us, we will look at the bytecode itself. Since we know the ABI ahead of time, we can see where and how the selectors are placed within the bytecode, slide a window over the opcodes to capture the recurring patterns, and then build a script around the most common patterns and evaluate it.

Dataset

Sadly there isn't any good bytecode dataset for this that covers all major compiler versions for both Solc and Vyper, at least none that I know of that isn't outdated. Therefore I created one: it's built by sampling block intervals from the all-ethereum-contracts dataset and deduplicating based on the provided bytecode hash. The compiler version is then fetched from Etherscan based on the address.

We then ended up with the following

We then postprocess this with the verified-contract response from Etherscan to extract the ABI selectors from the returned ABI.
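
A hedged sketch of that postprocessing step is shown below; the endpoint and field names follow Etherscan's public getsourcecode API, the hashing assumes pycryptodome, and tuple argument types are not expanded:

import json
import requests
from Crypto.Hash import keccak

def keccak256(data: bytes) -> bytes:
    return keccak.new(digest_bits=256, data=data).digest()

def fetch_abi(address: str, api_key: str) -> list | None:
    # Query the verified-contract endpoint; the same response also carries the
    # "CompilerVersion" field used to label the dataset.
    resp = requests.get("https://api.etherscan.io/api", params={
        "module": "contract", "action": "getsourcecode",
        "address": address, "apikey": api_key,
    }).json()
    abi = resp["result"][0]["ABI"]
    return json.loads(abi) if abi.startswith("[") else None

def abi_to_selectors(abi: list) -> dict:
    # Build the {"functions": [...], "events": [...]} structure used below.
    selectors = {"functions": [], "events": []}
    for entry in abi:
        if entry["type"] not in ("function", "event"):
            continue
        signature = f"{entry['name']}({','.join(i['type'] for i in entry['inputs'])})"
        digest = keccak256(signature.encode())
        if entry["type"] == "function":
            selectors["functions"].append(digest[:4].hex())
        else:
            selectors["events"].append(digest.hex())
    return selectors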

Code for finding the patterns

from evm import get_opcodes_from_bytecode, PushOpcode, JUMP_DEST
from collections import defaultdict
from copy import deepcopy
from tqdm import tqdm
import random
import glob
import json
import os

WINDOW_SIZE = 16
MAX_PATTERNS = 10
FOLDER_PATH = os.environ.get("FOLDER_PATH")
assert FOLDER_PATH is not None

patterns = {
    "solc": defaultdict(int),
    "vyper": defaultdict(int),
}


def transform_opcodes_window(bytecode, opcodes, selectors, index):
    # Slice out a fixed-size opcode window and replace known selectors, jump
    # destinations and other push arguments with placeholder tokens.
    opcodes_window = opcodes[index : index + WINDOW_SIZE]
    min_index = float("inf")
    for window_index, op in enumerate(opcodes_window):
        if not isinstance(op, PushOpcode):
            continue
        op_args_int = int.from_bytes(bytes.fromhex(op.args), byteorder="big")
        if op.args in selectors["functions"]:
            opcodes_window[window_index] = "<func_selector>"
            min_index = min(window_index, min_index)
        elif op.args in selectors["events"]:
            opcodes_window[window_index] = "<log_selector>"
            min_index = min(window_index, min_index)
        elif op_args_int < len(bytecode) and bytecode[op_args_int] == JUMP_DEST:
            # The pushed value points at a JUMPDEST byte, so treat it as a jump target.
            opcodes_window[window_index] = "<jumpdest>"
        else:
            opcodes_window[window_index] = f"{op.name} <data>"
    return opcodes_window, min_index


def main():
    for file in tqdm(glob.glob(os.path.join(FOLDER_PATH, "**/*.json"))):
        with open(file, "r") as fh:
            data = json.load(fh)
            bytecode = data["bytecode"]
            # lstrip("0x") would also strip leading zero nibbles, so only drop the prefix.
            bytecode = bytes.fromhex(bytecode.removeprefix("0x"))
            selectors = data["selectors"]
            if selectors is None:
                continue
        compiler = data["compiler"]["kind"]
        opcodes = get_opcodes_from_bytecode(bytecode)
        for index, _ in enumerate(opcodes):
            opcodes_window, min_index = transform_opcodes_window(
                bytecode, opcodes, selectors, index
            )
            if min_index == float("inf"):
                continue

            opcodes_window_og = deepcopy(opcodes_window)
            while len(opcodes_window) > 2 and (
                "<func_selector>" in opcodes_window
                or "<log_selector>" in opcodes_window
            ):
                current_window = " ".join(list(map(str, list(opcodes_window))))
                patterns[compiler][current_window] += 1
                opcodes_window = opcodes_window[:-1]

            opcodes_window = opcodes_window_og
            while len(opcodes_window) > 2 and (
                "<func_selector>" in opcodes_window
                or "<log_selector>" in opcodes_window
            ):
                current_window = " ".join(list(map(str, list(opcodes_window))))
                patterns[compiler][current_window] += 1
                opcodes_window = opcodes_window[1:]

    compiler_patterns = {}
    for compiler in patterns:
        compiler_patterns[compiler] = []
        for pattern in sorted(
            list(patterns[compiler].keys()),
            key=lambda x: patterns[compiler][x],
            reverse=True,
        ):
            for v in compiler_patterns[compiler]:
                # If it's a subset, let's skip.
                if v in pattern or pattern in v:
                    break
            else:
                compiler_patterns[compiler].append(pattern)

            if len(compiler_patterns[compiler]) >= MAX_PATTERNS:
                break
    print(json.dumps(compiler_patterns, indent=4))


if __name__ == "__main__":
    main()

Then we get the following patterns:

solc

vyper

One obvious pattern we are missing is the case where the selector is 0x00000000, which the compiler usually optimizes into an ISZERO comparison. We also did not see cases where the dispatch is split with GT, another optimization used for larger contracts with many functions.
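
For completeness, hand-written sketches of those two shapes (placeholders, not exact compiler output) look roughly like this:

# Selector 0x00000000: the PUSH4/EQ pair is replaced by a single ISZERO.
ZERO_SELECTOR_ENTRY = [
    "DUP1",
    "ISZERO",
    "PUSH2 <jumpdest>",
    "JUMPI",
]

# Larger contracts: a GT (or LT) comparison on a pivot selector narrows the
# search before the usual EQ chain runs inside each half.
SPLIT_DISPATCH_ENTRY = [
    "DUP1",
    "PUSH4 <pivot_selector>",
    "GT",
    "PUSH2 <jumpdest>",
    "JUMPI",
]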

Benchmarking

Note 1: We did not test gigahorse in this comparison because of its long execution time; I'm happy to retry if there is a config I can tune to get a response quicker.

Note 2: This isn't an entirely fair evaluation, as some of these tools do more than just extract selectors and therefore carry additional complexity.

Note 3: This blog post was originally written with a focus on the function selectors. It has since been updated with more information on the log selectors, but this benchmark still covers only the function selectors.
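
For reference, here is a minimal sketch of how such scores can be computed from a predicted selector set and a ground-truth set derived from the verified ABI (the function name and edge-case handling are mine):

def score(predicted: set, truth: set) -> tuple[float, float, float]:
    # Precision: fraction of predicted selectors that are real.
    # Recall: fraction of real selectors that were found.
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, recall, precision

print(score({"a9059cbb", "deadbeef"}, {"a9059cbb", "06fdde03"}))
# (0.5, 0.5, 0.5)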

Rank Model F1-Score Recall Precision
🥇 1 Evmmole 0.9785 0.9588 0.9990
🥈 2 sevm 0.8980 0.8157 0.9989
🥉 3 Our naive pattern model 0.7986 0.6655 0.9983
4 whatsabi 0.7986 0.6655 0.9983
5 heimdall 0.7886 0.6514 0.9989

Obviously the dynamic analysis approaches beat the static analysis approaches. However, our naive implementation is still able to get a pretty good F1-score.

These relationships should, however, be learnable by a simple neural network, which should (hopefully) also improve on our naive approach.

import torch

WINDOW_SIZE = 5

class SelectorDetector(torch.nn.Module):
    def __init__(self, vocab_size, classes):
        super(SelectorDetector, self).__init__()
        self.head =  torch.nn.Sequential(
            torch.nn.Embedding(vocab_size, 128),
        )
        self.body = torch.nn.Sequential(
            torch.nn.Linear(128, 256),
            torch.nn.BatchNorm1d(WINDOW_SIZE),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.3),
            torch.nn.Linear(256, 128),
            torch.nn.Sigmoid(),
            torch.nn.Linear(128, classes + 1),
        )

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, torch.nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight)
            torch.nn.init.zeros_(module.bias)
        elif isinstance(module, torch.nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0, std=0.1)

    def forward(self, X):
        out = self.body(self.head(X))
        return out.mean(dim=1)
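
As a quick sanity check of the shapes involved, here is a hypothetical usage (the vocabulary size and class count are made up; presumably the extra +1 output is the "no selector" class):

# Hypothetical instantiation: each window position holds the token id of a
# placeholder-tokenized opcode, and the model scores the whole window.
model = SelectorDetector(vocab_size=300, classes=2)
model.eval()

window = torch.randint(0, 300, (1, WINDOW_SIZE))  # one tokenized opcode window
logits = model(window)                            # shape: (1, classes + 1)
print(logits.argmax(dim=-1))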

Now let's evaluate it and look at the results:

Rank Model F1-Score Recall Precision
🥇 1 Evmmole 0.9785 0.9588 0.9990
🥈 2 Our torch model 0.9283 0.9429 0.9142
🥉 3 sevm 0.8980 0.8157 0.9989
4 Our naive pattern model 0.7986 0.6655 0.9983
5 whatsabi 0.7986 0.6655 0.9983
6 heimdall 0.7886 0.6514 0.9989

Nice! We are now only behind a dynamic analysis solution, not bad.

What about log selectors?

There are similar patterns for logs, although they are far less regular than the ones for function selectors.

solc

vyper

There are a few things to note from the patterns above. The compiler might choose to optimize the push into a PUSH{N..32} when the topic hash has leading zeros. Depending on the compiler, it might also optimize it into a CODECOPY. There is also no guarantee that the
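
As a rough sketch of the shape being matched (placeholders, not exact compiler output), the event selector is pushed as a 32-byte constant and consumed as topic0 by the LOG instruction:

# Emission of a log with two indexed arguments plus the event hash as topic0.
LOG_EMIT_SHAPE = [
    "PUSH32 <log_selector>",  # topic0: keccak256 of the event signature
    "...",                    # indexed arguments and the memory offset/length
    "LOG3",                   # LOGn consumes the offset, length and n topics
]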

Aside: EVM rollups with dual execution models

There are rollups that support both the EVM and a non-EVM execution environment, and even allow the two to communicate; examples are Stylus from Arbitrum and EraVM from ZKsync. One question that came to mind is whether these non-EVM binaries have the same selector logic or handle it in a different way.

Arbitrum Stylus

Stylus is a WASM runtime that allows users to write smart contracts in traditional programming languages (C++, Rust, Go, etc.) instead of Solidity. Since it's WASM, all related tooling just works.

wasm-objdump
from wasm_tob import (
    decode_module,
    decode_bytecode,
    SEC_DATA,
    SEC_CODE,
    INSN_ENTER_BLOCK,
    INSN_LEAVE_BLOCK,
)

def pad_hex(val):
    if len(val) % 2 == 0:
        return val
    return val.replace("0x", "0x0")

def format_instruction(insn, data_sections):
    text = insn.op.mnemonic

    if not insn.imm:
        return text

    def format_isnt(text, x):
        # We are converting from int to uint because that is how the selectors are encoded
        if text == "i32.const":
            if x >= (1 << 255):  
                x -= (1 << 256)  
            return pad_hex(hex(x))
        if text == "i64.const":
            if x >= (1 << 255):  
                x -= (1 << 256)  
            return pad_hex(hex(x))
        return x
    
    args = [
        getattr(insn.op.imm_struct, x.name).to_string(format_isnt(text, getattr(insn.imm, x.name)))
        for index, x in enumerate(insn.op.imm_struct._meta.fields)
    ]
    base = text + ' ' + ', '.join(args)

    if text == "i64.load":
        load_section = int(args[1], 16)
        data = None
        for i in data_sections:
            if i[0] < load_section and load_section < i[1]:
                delta = load_section - i[0]
                data = i[2][delta:delta+8].hex()
        if data is None:
            return base
        return base + f"# data: {data}"
    else:
        return base 

def disas(raw):
    mod_iter = iter(decode_module(raw))
    header, header_data = next(mod_iter)
    data_sections = []
    
    for cur_sec, cur_sec_data in mod_iter:
        if cur_sec_data.id == SEC_DATA:
            for idx, entry in enumerate(cur_sec_data.payload.entries):
                offset = entry.offset[0].imm.value
                data = entry.data.tobytes()
                data_sections.append((
                    offset, offset + len(data), data
                ))

    mod_iter = iter(decode_module(raw))
    header, header_data = next(mod_iter)
    
    for cur_sec, cur_sec_data in mod_iter:
        if cur_sec_data.id == SEC_CODE:
            code_sec = cur_sec_data.payload
            for i, func_body in enumerate(code_sec.bodies):
                print('{x} sub_{id:04X} {x}'.format(x='=' * 35, id=i))
                indent = 0
                raw = func_body.code.tobytes()
                for cur_insn in decode_bytecode(raw):
                    if cur_insn.op.flags & INSN_LEAVE_BLOCK:
                        indent -= 1
                    print('  ' * indent + format_instruction(cur_insn, data_sections))
                    if cur_insn.op.flags & INSN_ENTER_BLOCK:
                        indent += 1

if __name__ == "__main__":
    with open('example_stylus_pencil.wasm', 'rb') as raw:
        raw = raw.read()
    disas(raw)

If we run the script above on an example ERC20 Stylus contract and just grep for some of the selectors, we can see that both the function selectors and the log selectors are in the binary.

grep for selectors
# name() 
python3 decompile.py | grep "0x06fdde03" -A 3
    i32.const '0x06fdde03'
    i32.ne
    br_if 18
    get_local 1

# transferFrom(address,address,uint256)
python3 decompile.py  | grep "0x23b872dd" -A 3  
    i32.const '0x23b872dd'
    i32.eq
    br_if 9
    get_local 2

# Approval(address,address,uint256)
# (because it is loaded in chunks, I only grep the last 8 bytes, but you can see it chained together)
python3 decompile.py  | grep "5b200ac8c7c3b925" -A 16  
    i64.load 0, 0x8648# data: 5b200ac8c7c3b925
    i64.store 3, 0
    get_local 1
    i32.const '0x10'
    i32.add
    i32.const '0x00'
    i64.load 0, 0x8640# data: dd0314c0f7b2291e
    i64.store 3, 0
    get_local 1
    i32.const '0x08'
    i32.add
    i32.const '0x00'
    i64.load 0, 0x8638# data: d14f71427d1e84f3
    i64.store 3, 0
    get_local 1
    i32.const '0x00'
    i64.load 0, 0x8630# data: 8c5be1e5ebec7d5b

# Transfer(address,address,uint256)
# (because it is loaded in chunks, I only grep the last 8 bytes, but you can see it chained together)
python3 decompile.py  | grep "28f55a4df523b3ef" -A 16
    i64.load 0, 0x8628# data: 28f55a4df523b3ef
    i64.store 3, 0
    get_local 1
    i32.const '0x10'
    i32.add
    i32.const '0x00'
    i64.load 0, 0x8620# data: 952ba7f163c4a116
    i64.store 3, 0
    get_local 1
    i32.const '0x08'
    i32.add
    i32.const '0x00'
    i64.load 0, 0x8618# data: 69c2b068fc378daa
    i64.store 3, 0
    get_local 1
    i32.const '0x00'
    i64.load 0, 0x8610# data: ddf252ad1be2c89b


ZKsync EraVM

This is a custom instruction set with its own compiler built on LLVM. The compiler luckily comes with a decompiler. With some help from Deepwiki, the same logic was ported over to a Python script.

eravm-objdump
import re
from typing import Dict, Tuple, List

class EraVMDecoder:
    def __init__(self):
        opcodes = [
            ('<invalid>', 0, 'direct'), ('nop', 1, 'nop'), ('add', 25, 'arith_comm'), 
            ('sub', 73, 'arith_ncomm'), ('mul', 169, 'arith_comm'), ('div', 217, 'arith_ncomm'),
            ('jump', 313, 'jump'), ('xor', 319, 'arith_comm'), ('and', 367, 'arith_comm'), 
            ('or', 415, 'arith_comm'), ('shl', 463, 'arith_ncomm'), ('shr', 559, 'arith_ncomm'),
            ('rol', 655, 'arith_ncomm'), ('ror', 751, 'arith_ncomm'), ('addp', 847, 'arith_ptr'),
            ('subp', 895, 'arith_ptr'), ('pack', 943, 'arith_ptr'), ('shrnk', 991, 'arith_ptr'),
            ('call', 1039, 'direct'), ('this', 1040, 'direct'), ('par', 1041, 'direct'),
            ('code', 1042, 'direct'), ('meta', 1043, 'direct'), ('ergs', 1044, 'direct'),
            ('sp', 1045, 'direct'), ('ldvl', 1046, 'direct'), ('stvl', 1047, 'direct'),
            ('stpub', 1048, 'direct'), ('inctx', 1049, 'direct'), ('lds', 1050, 'direct'),
            ('sts', 1051, 'direct'), ('callf', 1057, 'farcall'), ('calld', 1061, 'farcall'),
            ('callm', 1065, 'farcall'), ('ret', 1069, 'direct'), ('retl', 1070, 'direct'),
            ('rev', 1071, 'direct'), ('revl', 1072, 'direct'), ('pnc', 1073, 'direct'),
            ('pncl', 1074, 'direct'), ('ldm.h', 1075, 'heap'), ('stm.h', 1077, 'heap'),
            ('ldm.st', 1096, 'static'), ('stm.st', 1100, 'static')
        ]
        
        self.opcode_map = {op[1]: (op[0], op[2]) for op in opcodes}
        self.sorted_opcodes = sorted([op[1] for op in opcodes], reverse=True)
        
        self.src_modes = ['reg', 'sp_pop', 'sp_rel', 'stack_abs', 'imm', 'code']
        self.dst_modes = ['reg', 'sp_push', 'sp_rel', 'stack_abs']
        self.conditions = ['none', 'gt', 'lt', 'eq', 'ge', 'le', 'ne', 'gtlt']
        
        self.code_ref_regex = re.compile(r'code\[(?:r[0-9]+\+)?([0-9]+)\]')

    def decode_instruction(self, data: bytes) -> Dict:
        """Decode 8-byte EraVM instruction."""
        if len(data) != 8:
            raise ValueError("Instructions must be 8 bytes")
        
        ins = int.from_bytes(data, 'big')
        imm1 = (ins >> 48) & 0xFFFF
        imm0 = (ins >> 32) & 0xFFFF
        dst1 = (ins >> 28) & 0xF
        dst0 = (ins >> 24) & 0xF
        src1 = (ins >> 20) & 0xF
        src0 = (ins >> 16) & 0xF
        pred = (ins >> 13) & 0x7
        opcode = ins & 0x7FF
        
        base, src_mode, dst_mode, flags = self._analyze_opcode(opcode)
        name = self.opcode_map.get(base, ('<unknown>', 'direct'))[0]
        
        return {
            'mnemonic': name,
            'src0_reg': f'r{src0}', 'src1_reg': f'r{src1}',
            'dst0_reg': f'r{dst0}', 'dst1_reg': f'r{dst1}',
            'imm0': imm0, 'imm1': imm1,
            'predicate': self.conditions[pred] if pred < 8 else f'pred{pred}',
            'src_mode': self.src_modes[src_mode] if 0 <= src_mode < 6 else 'none',
            'dst_mode': self.dst_modes[dst_mode] if 0 <= dst_mode < 4 else 'none',
            'raw_opcode': opcode, 'base_opcode': base,
            **flags
        }

    def _analyze_opcode(self, opcode: int) -> Tuple[int, int, int, Dict]:
        """Analyze opcode to extract base instruction and operand modes."""
        base = next((op for op in self.sorted_opcodes if op <= opcode), 0)
        if base not in self.opcode_map:
            return 0, -1, -1, {}
        
        delta = opcode - base
        encoding = self.opcode_map[base][1]
        src_mode = dst_mode = -1
        flags = {}
        
        if delta > 0:
            if encoding == 'nop':
                dst_mode, src_mode = delta % 4, (delta // 4) % 6
            elif encoding == 'arith_comm':
                flags['set_flags'] = bool(delta % 2)
                dst_mode, src_mode = (delta // 2) % 4, (delta // 8) % 6
            elif encoding == 'arith_ncomm':
                flags.update({
                    'swap': bool(delta % 2),
                    'set_flags': bool((delta // 2) % 2)
                })
                dst_mode, src_mode = (delta // 4) % 4, (delta // 16) % 6
            elif encoding == 'arith_ptr':
                flags['swap'] = bool(delta % 2)
                dst_mode, src_mode = (delta // 2) % 4, (delta // 8) % 6
            elif encoding == 'jump':
                src_mode = delta % 6
            elif encoding == 'farcall':
                flags.update({
                    'is_shard': bool(delta % 2),
                    'is_static': bool((delta // 2) % 2)
                })
        
        return base, src_mode, dst_mode, flags

    def stringify_instruction(self, decoded: Dict, constants: Dict = None) -> str:
        """Convert decoded instruction to assembly string."""
        constants = constants or {}
        
        mnemonic = decoded.get('mnemonic', '<unknown>')
        if decoded.get('swap'): mnemonic += '.s'
        if decoded.get('set_flags'): mnemonic += '!'
        if decoded.get('predicate', 'none') != 'none':
            mnemonic += f".{decoded['predicate']}"
        
        operands = self._format_operands(decoded, constants)
        
        comments = []
        if 'code_ref_comment' in decoded:
            comments.append(decoded['code_ref_comment'])
        
        result = f"{mnemonic:<8} {', '.join(operands)}" if operands else mnemonic
        if comments:
            result += f" # {', '.join(comments)}"
        
        return result.strip()

    def _format_operands(self, d: Dict, constants: Dict) -> List[str]:
        """Format operands based on instruction type."""
        mnemonic = d.get('mnemonic', '').split('.')[0]
        operands = []
        
        if mnemonic in ['add', 'sub', 'mul', 'div', 'and', 'or', 'xor', 'shl', 'shr', 'rol', 'ror']:
            operands.append(self._format_src_operand(d, 0, constants))
            operands.append(d.get('src1_reg', 'r0'))
            operands.append(self._format_dst_operand(d, 0))
        
        elif mnemonic in ['this', 'par', 'code', 'meta', 'ergs', 'sp', 'ldvl', 'stvl']:
            operands.append(d.get('dst0_reg', 'r0'))
        
        elif mnemonic in ['retl', 'revl', 'pncl']:
            operands.append(str(d.get('imm0', 0)))
        
        elif mnemonic == 'jump':
            operands.append(self._format_src_operand(d, 0, constants))
        
        elif mnemonic in ['callf', 'calld', 'callm']:
            operands.extend([
                d.get('src0_reg', 'r0'),
                d.get('src1_reg', 'r0'),
                str(d.get('imm0', 0))
            ])
        
        return operands

    def _format_src_operand(self, d: Dict, src_idx: int, constants: Dict) -> str:
        """Format source operand based on addressing mode."""
        src_mode = d.get('src_mode', 'reg')
        reg_key = f'src{src_idx}_reg'
        imm_key = f'imm{src_idx}'
        
        if src_mode == 'imm':
            return str(d.get(imm_key, 0))
        elif src_mode == 'code':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            if reg == 'r0' and imm != 0:
                if imm in constants:
                    d['code_ref_comment'] = f"code[{imm}] = {constants[imm]}"
                return f"code[{imm}]"
            return f"code[{reg}]" if imm == 0 else f"code[{reg}+{imm}]"
        elif src_mode == 'stack_abs':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            return f"stack[{reg}]" if imm == 0 else f"stack[{imm} + {reg}]"
        else:
            return d.get(reg_key, 'r0')

    def _format_dst_operand(self, d: Dict, dst_idx: int) -> str:
        """Format destination operand based on addressing mode."""
        dst_mode = d.get('dst_mode', 'reg')
        reg_key = f'dst{dst_idx}_reg'
        imm_key = f'imm{1 if dst_idx == 0 else 0}'
        
        if dst_mode == 'stack_abs':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            return f"stack[{reg}]" if imm == 0 else f"stack[{imm} + {reg}]"
        else:
            return d.get(reg_key, 'r0')

    def analyze_binary(self, data: bytes) -> Dict:
        """Analyze complete EraVM binary."""
        if len(data) % 32:
            raise ValueError("Binary size must be multiple of 32 bytes")
        
        const_start = self._find_constant_section(data)
        
        instructions = []
        for i in range(0, const_start, 8):
            if i + 8 <= len(data):
                instr_bytes = data[i:i+8]
                try:
                    decoded = self.decode_instruction(instr_bytes)
                except:
                    mnemonic = '<padding>' if all(b == 0 for b in instr_bytes) else '<metadata>'
                    decoded = {'mnemonic': mnemonic}
                
                instructions.append({
                    'address': i,
                    'bytes': instr_bytes,
                    'decoded': decoded
                })
        
        constants = []
        for i in range(const_start, len(data), 32):
            if i + 32 <= len(data):
                cell_bytes = data[i:i+32]
                constants.append({
                    'word_number': i // 32,
                    'address': i,
                    'bytes': cell_bytes,
                    'value': "0x" + cell_bytes.hex()
                })
        
        return {
            'instructions': instructions,
            'constants': constants,
            'constant_section_start': const_start
        }

    def _find_constant_section(self, data: bytes) -> int:
        """Find where constant section begins by analyzing code references."""
        min_ref = float('inf')
        
        for i in range(0, len(data) - 7, 8):
            try:
                decoded = self.decode_instruction(data[i:i+8])
                asm_str = self.stringify_instruction(decoded)
                refs = [int(match) for match in self.code_ref_regex.findall(asm_str)]
                if refs:
                    min_ref = min(min_ref, min(refs))
            except:
                continue
            
            if i % 32 == 24 and min_ref == (i + 8) // 32:
                return i + 8
        
        return len(data)

    def format_disassembly(self, address: int, instr_bytes: bytes, 
                          decoded: Dict, constants: Dict = None) -> str:
        """Format instruction for disassembly output."""
        hex_bytes = ' '.join(f'{b:02x}' for b in instr_bytes)
        asm_str = self.stringify_instruction(decoded, constants or {})
        return f"{address:08x}: {hex_bytes:<24} {asm_str}"


def main():
    try:
        with open("example_weth.hex", "r") as f:
            data = bytes.fromhex(f.read())
        
        decoder = EraVMDecoder()
        result = decoder.analyze_binary(data)
        
        const_lookup = {c['word_number']: c['value'] for c in result['constants']}
        
        for instr in result['instructions']:
            print(decoder.format_disassembly(
                instr['address'],
                instr['bytes'],
                instr['decoded'],
                const_lookup
            ))
        
        for const in result['constants']:
            print(f"{const['word_number']}:")
            print(f"\t.cell {const['value']}")
    
    except FileNotFoundError:
        print("Example file not found - decoder ready for use")


if __name__ == "__main__":
    main()      

If we look at WETH compiled for EraVM, we can again see that the selectors are within the binary.

grep for selectors
# name() 
python3 eravm_decompiler.py | grep "06fdde03" -A 3
    00000090: 00 00 01 1a 04 20 00 9c  sub.s!   code[282], r2, r4 # code[282] = 0x0000000000000000000000000000000000000000000000000000000006fdde03
    00000098: 00 00 00 d6 00 00 61 3d  jump.eq  214
    000000a0: 00 00 01 1b 04 20 00 9c  sub.s!   code[283], r2, r4 # code[283] = 0x00000000000000000000000000000000000000000000000000000000095ea7b3
    000000a8: 00 00 00 ee 00 00 61 3d  jump.eq  238

# transferFrom(address,address,uint256)
python3 eravm_decompiler.py | grep "23b872dd"  -A 3
    000003f8: 00 00 01 17 04 20 00 9c  sub.s!   code[279], r2, r4 # code[279] = 0x0000000000000000000000000000000000000000000000000000000023b872dd
    00000400: 00 00 01 73 00 00 61 3d  jump.eq  371
    00000408: 00 00 01 18 04 20 00 9c  sub.s!   code[280], r2, r4 # code[280] = 0x000000000000000000000000000000000000000000000000000000002e1a7d4d
    00000410: 00 00 01 85 00 00 61 3d  jump.eq  389

# Approval(address,address,uint256)
python3 eravm_decompiler.py | grep "0x8c5be1e5ebec7d5bd14f71427d1e84f3dd0314c0f7b2291e5b200ac8c7c3b925"  -A 3
    00000948: 00 00 01 2b 04 00 00 41  add      code[299], r0, r4 # code[299] = 0x8c5be1e5ebec7d5bd14f71427d1e84f3dd0314c0f7b2291e5b200ac8c7c3b925
    00000950: 00 00 00 02 05 00 00 29  add      r0, r0, r5
    00000958: 00 00 00 03 06 00 00 29  add      r0, r0, r6
    00000960: 04 1f 04 15 00 00 04 0f  call

# Transfer(address,address,uint256)
python3 eravm_decompiler.py | grep "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"  -A 3
    00001ef8: 00 00 01 2e 04 00 00 41  add      code[302], r0, r4 # code[302] = 0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef
    00001f00: 00 00 00 06 05 00 00 29  add      r0, r0, r5
    00001f08: 00 00 00 03 06 00 00 29  add      r0, r0, r6
    00001f10: 04 1f 04 15 00 00 04 0f  call

Conclusion

Great, we wrote a selector extraction algorithm in a few hours using a data-driven approach and benchmarked it to verify that it works.

Reading list

If you liked this blog post, you might also like the following posts (not written by me):