A few days ago Zellic released an updated version of their smart contract dataset. Like the previous dataset, the new version, all-ethereum-contracts, contains all historical smart contract deployments up to a recent cutoff (February 2025 in this case). The only difference is that it now provides the raw bytecode instead of the original source code. Given this new dataset, let's use it to create a simple static analysis script that extracts ABI selectors from the bytecode.
You can read more about function selectors here and event selectors here. The ABI spec of Solidity (which Vyper also uses) can be read here.
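As a quick refresher, a function selector is the first 4 bytes of the Keccak-256 hash of the canonical signature, and an event selector (topic 0) is the full 32-byte hash. The sketch below is a minimal pure-Python Keccak-256, written for illustration only and not for speed. The helper names (`keccak256`, `function_selector`, `event_topic`) are my own and not part of any library. Note that `hashlib.sha3_256` uses a different padding byte and will not give the same result.

```python
MASK = (1 << 64) - 1
ROUND_CONSTANTS = [
    0x0000000000000001, 0x0000000000008082, 0x800000000000808A, 0x8000000080008000,
    0x000000000000808B, 0x0000000080000001, 0x8000000080008081, 0x8000000000008009,
    0x000000000000008A, 0x0000000000000088, 0x0000000080008009, 0x000000008000000A,
    0x000000008000808B, 0x800000000000008B, 0x8000000000008089, 0x8000000000008003,
    0x8000000000008002, 0x8000000000000080, 0x000000000000800A, 0x800000008000000A,
    0x8000000080008081, 0x8000000000008080, 0x0000000080000001, 0x8000000080008008,
]
# Rotation offsets indexed as ROTATIONS[x][y].
ROTATIONS = [
    [0, 36, 3, 41, 18],
    [1, 44, 10, 45, 2],
    [62, 6, 43, 15, 61],
    [28, 55, 25, 21, 56],
    [27, 20, 39, 8, 14],
]


def _rol(value, shift):
    return ((value << shift) | (value >> (64 - shift))) & MASK


def _keccak_f1600(state):
    for rc in ROUND_CONSTANTS:
        # Theta
        c = [state[x][0] ^ state[x][1] ^ state[x][2] ^ state[x][3] ^ state[x][4] for x in range(5)]
        d = [c[(x - 1) % 5] ^ _rol(c[(x + 1) % 5], 1) for x in range(5)]
        state = [[state[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # Rho and pi
        b = [[0] * 5 for _ in range(5)]
        for x in range(5):
            for y in range(5):
                b[y][(2 * x + 3 * y) % 5] = _rol(state[x][y], ROTATIONS[x][y])
        # Chi
        state = [
            [b[x][y] ^ (~b[(x + 1) % 5][y] & b[(x + 2) % 5][y]) for y in range(5)]
            for x in range(5)
        ]
        # Iota
        state[0][0] ^= rc
    return state


def keccak256(data: bytes) -> bytes:
    rate = 136  # bytes absorbed per block for a 256-bit digest
    padded = bytearray(data) + b"\x01"  # original Keccak padding, not the SHA-3 one
    padded += b"\x00" * (-len(padded) % rate)
    padded[-1] |= 0x80
    state = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), rate):
        block = padded[offset : offset + rate]
        for i in range(rate // 8):
            state[i % 5][i // 5] ^= int.from_bytes(block[8 * i : 8 * i + 8], "little")
        state = _keccak_f1600(state)
    return b"".join(state[i][0].to_bytes(8, "little") for i in range(4))


def function_selector(signature: str) -> str:
    return "0x" + keccak256(signature.encode()).hex()[:8]


def event_topic(signature: str) -> str:
    return "0x" + keccak256(signature.encode()).hex()


print(function_selector("name()"))
print(event_topic("Transfer(address,address,uint256)"))
```

In practice you would use an existing Keccak implementation from a library instead of this one. The point is only that the 4-byte and 32-byte values we search for later are plain hashes of the signature strings.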
There are multiple existing solutions already (in no specific order):
Some of them use static analysis and some use dynamic analysis to infer the selectors from the bytecode. Some also try to infer the full ABI JSON, but in this post we focus only on retrieving the selectors. That said, most implementations just use a pre-image dictionary to resolve the full ABI from the selector (the same technique can be used with our version).
The log selector logic works similarly for both Solidity and Vyper. The same is true for the function selectors, but the jump table where they are used is very different between the compilers. Vyper recently published a write-up on how their constant-time jump tables work. Solidity hasn't really published any write-ups as far as I know, but if you look inside the source code you can see that they mention some patterns.
Instead of looking at what the developers tell us, we will look at the bytecode. Since we know the ABI ahead of time, we can see where and how the selectors are placed within the bytecode. We then slide a window over the opcodes to find the recurring patterns around each selector, pick the most common ones, and build a script around them that we can evaluate.
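To make the sliding window idea concrete, here is a minimal sketch on a hand-written token list. The tokens and the `count_windows` helper are made up for illustration. The real script works on disassembled contracts and replaces known selectors with a placeholder first.

```python
from collections import Counter


def count_windows(tokens, window_size, marker="<func_selector>"):
    """Count every fixed-size window of tokens that contains the marker."""
    counts = Counter()
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start : start + window_size]
        if marker in window:
            counts[" ".join(window)] += 1
    return counts


# Three dispatcher entries in a row, with the selectors already replaced.
tokens = ["DUP1", "<func_selector>", "EQ", "PUSH2 <jumpdest>", "JUMPI"] * 3
counts = count_windows(tokens, window_size=4)
for pattern, count in counts.most_common():
    print(count, pattern)
```

Windows that never touch a selector are ignored, so the counts only reflect the code that surrounds the selectors.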
Sadly there isn't any good bytecode dataset for this that includes all major compiler versions for both Solc and Vyper, or at least I don't know of any that isn't outdated. Therefore I created one by sampling block intervals from the all-ethereum-contracts dataset and deduplicating on the provided bytecode hash. We then get the compiler version from Etherscan based on the address.
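A rough sketch of what such a sampling step could look like is shown below. The bucket size, the per-bucket limit, and the use of SHA-256 as the dedup key are assumptions made for this example (the dataset ships its own bytecode hash), and `sample_and_dedupe` is a hypothetical helper.

```python
import hashlib


def sample_and_dedupe(deployments, block_interval, per_interval=1):
    """Keep at most `per_interval` unique bytecodes per block interval.

    `deployments` is an iterable of (block_number, bytecode_hex) tuples.
    """
    seen_hashes = set()
    taken_per_bucket = {}
    sampled = []
    for block_number, bytecode_hex in sorted(deployments):
        digest = hashlib.sha256(bytes.fromhex(bytecode_hex)).hexdigest()
        if digest in seen_hashes:
            continue  # identical bytecode was already sampled
        bucket = block_number // block_interval
        if taken_per_bucket.get(bucket, 0) >= per_interval:
            continue  # this block interval is already covered
        seen_hashes.add(digest)
        taken_per_bucket[bucket] = taken_per_bucket.get(bucket, 0) + 1
        sampled.append((block_number, bytecode_hex))
    return sampled


deployments = [(1, "6000"), (2, "6000"), (3, "6001"), (150, "6002"), (160, "6003")]
print(sample_and_dedupe(deployments, block_interval=100))
```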
We then ended up with the following
We postprocess this with the verified contract response from Etherscan to extract the ABI selectors from the returned ABI.
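The first half of that postprocessing step, turning ABI JSON entries into canonical signature strings, could look roughly like the sketch below. The helper names and the example ABI are made up for illustration. Hashing each signature with Keccak-256 then gives the selector (first 4 bytes for functions, all 32 bytes for events).

```python
def canonical_type(param):
    """Return the canonical ABI type, expanding tuples into (t1,t2,...)."""
    typ = param["type"]
    if typ.startswith("tuple"):
        inner = ",".join(canonical_type(c) for c in param["components"])
        # Keep any array suffix such as [] or [3] after the expanded tuple.
        return f"({inner}){typ[len('tuple'):]}"
    return typ


def signatures_from_abi(abi):
    signatures = {"functions": [], "events": []}
    for entry in abi:
        kind = entry.get("type")
        if kind not in ("function", "event"):
            continue  # constructors, fallbacks, errors, ...
        if kind == "event" and entry.get("anonymous"):
            continue  # anonymous events do not emit a selector topic
        args = ",".join(canonical_type(p) for p in entry.get("inputs", []))
        key = "functions" if kind == "function" else "events"
        signatures[key].append(f"{entry['name']}({args})")
    return signatures


EXAMPLE_ABI = [
    {"type": "constructor", "inputs": []},
    {"type": "function", "name": "transfer",
     "inputs": [{"type": "address"}, {"type": "uint256"}]},
    {"type": "function", "name": "swap",
     "inputs": [{"type": "tuple[]",
                 "components": [{"type": "address"}, {"type": "uint256"}]}]},
    {"type": "event", "name": "Transfer",
     "inputs": [{"type": "address", "indexed": True},
                {"type": "address", "indexed": True},
                {"type": "uint256", "indexed": False}]},
]
print(signatures_from_abi(EXAMPLE_ABI))
```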
from evm import get_opcodes_from_bytecode, PushOpcode, JUMP_DEST
from collections import defaultdict
from copy import deepcopy
from tqdm import tqdm
import glob
import json
import os

WINDOW_SIZE = 16
MAX_PATTERNS = 10
FOLDER_PATH = os.environ.get("FOLDER_PATH")
assert FOLDER_PATH is not None

# Occurrence count of every window pattern, per compiler.
patterns = {
    "solc": defaultdict(int),
    "vyper": defaultdict(int),
}


def transform_opcodes_window(bytecode, opcodes, selectors, index):
    # The slice is a new list, so replacing entries does not touch `opcodes`.
    opcodes_window = opcodes[index : index + WINDOW_SIZE]
    # Position of the first selector inside the window (inf if there is none).
    min_index = float("inf")
    for window_index, op in enumerate(opcodes_window):
        if not isinstance(op, PushOpcode):
            continue
        op_args_int = int.from_bytes(bytes.fromhex(op.args), byteorder="big")
        if op.args in selectors["functions"]:
            opcodes_window[window_index] = "<func_selector>"
            min_index = min(window_index, min_index)
        elif op.args in selectors["events"]:
            opcodes_window[window_index] = "<log_selector>"
            min_index = min(window_index, min_index)
        elif op_args_int < len(bytecode) and bytecode[op_args_int] == JUMP_DEST:
            # The pushed value points at a JUMPDEST, so treat it as a jump target.
            opcodes_window[window_index] = f"{op.name} <jumpdest>"
        else:
            opcodes_window[window_index] = f"{op.name} <data>"
    return opcodes_window, min_index


def main():
    for path in tqdm(glob.glob(os.path.join(FOLDER_PATH, "**/*.json"))):
        with open(path, "r") as f:
            data = json.load(f)
        bytecode = data["bytecode"]
        # Drop the prefix explicitly: lstrip("0x") would also eat leading zero digits.
        if bytecode.startswith("0x"):
            bytecode = bytecode[2:]
        bytecode = bytes.fromhex(bytecode)
        selectors = data["selectors"]
        if selectors is None:
            continue
        compiler = data["compiler"]["kind"]
        opcodes = get_opcodes_from_bytecode(bytecode)
        for index, _ in enumerate(opcodes):
            opcodes_window, min_index = transform_opcodes_window(
                bytecode, opcodes, selectors, index
            )
            if min_index == float("inf"):
                continue
            opcodes_window_og = deepcopy(opcodes_window)
            # Count the window and every shorter version of it, trimming from the right.
            while len(opcodes_window) > 2 and (
                "<func_selector>" in opcodes_window
                or "<log_selector>" in opcodes_window
            ):
                current_window = " ".join(map(str, opcodes_window))
                patterns[compiler][current_window] += 1
                opcodes_window = opcodes_window[:-1]
            # Do the same again, this time trimming from the left.
            opcodes_window = opcodes_window_og
            while len(opcodes_window) > 2 and (
                "<func_selector>" in opcodes_window
                or "<log_selector>" in opcodes_window
            ):
                current_window = " ".join(map(str, opcodes_window))
                patterns[compiler][current_window] += 1
                opcodes_window = opcodes_window[1:]

    compiler_patterns = {}
    for compiler in patterns:
        compiler_patterns[compiler] = []
        # Walk the patterns from most to least common.
        for pattern in sorted(
            patterns[compiler],
            key=lambda x: patterns[compiler][x],
            reverse=True,
        ):
            for v in compiler_patterns[compiler]:
                # Skip patterns that contain, or are contained in, one we already kept.
                if v in pattern or pattern in v:
                    break
            else:
                compiler_patterns[compiler].append(pattern)
            if len(compiler_patterns[compiler]) >= MAX_PATTERNS:
                break
    print(json.dumps(compiler_patterns, indent=4))


if __name__ == "__main__":
    main()
Then we get out the following patterns:
EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
DUP1 <func_selector> EQ PUSH2 <jumpdest>
<func_selector> EQ PUSH2 <jumpdest>
EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ
DUP1 <func_selector> EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data>
PUSH1 <data> MSTORE PUSH1 <data> PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data> JUMPI
MSTORE PUSH1 <data> PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data> JUMPI
There is one obvious pattern that I can see we are missing, which is the case where the selector is 0x00000000, where the compiler will usually optimize the check into an ISZERO comparison. We also did not see cases where the function selectors are split with GT, which is another optimization for larger contracts with many functions.
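As an illustration of how one of the Solc patterns above can be turned into an extractor, here is a minimal sketch that scans raw bytecode for `PUSH4 <selector> EQ PUSH2 <dest> JUMPI`. It is not the script used for the benchmark, the helper names are my own, and it does not verify that the pushed destination is a real JUMPDEST. It would also miss selectors with leading zero bytes if the compiler emits them with a shorter PUSH, which I assume can happen.

```python
PUSH1, PUSH2, PUSH4, PUSH32 = 0x60, 0x61, 0x63, 0x7F
EQ, JUMPI = 0x14, 0x57


def disassemble(bytecode: bytes):
    """Yield (offset, opcode, push_data) while skipping over PUSH immediates."""
    pc = 0
    while pc < len(bytecode):
        op = bytecode[pc]
        size = op - PUSH1 + 1 if PUSH1 <= op <= PUSH32 else 0
        yield pc, op, bytecode[pc + 1 : pc + 1 + size]
        pc += 1 + size


def find_selectors(bytecode: bytes):
    ops = list(disassemble(bytecode))
    found = []
    for i in range(len(ops) - 3):
        window = [op for _, op, _ in ops[i : i + 4]]
        if window == [PUSH4, EQ, PUSH2, JUMPI]:
            found.append("0x" + ops[i][2].hex())
    return found


# DUP1 PUSH4 06fdde03 EQ PUSH2 0010 JUMPI DUP1 PUSH4 23b872dd EQ PUSH2 0020 JUMPI
sample = bytes.fromhex("806306fdde031461001057806323b872dd1461002057")
print(find_selectors(sample))
```

Disassembling first matters: matching on raw bytes would also hit `0x63` bytes that are really part of another instruction's PUSH data.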
Note 1: We did not test gigahorse in this comparison because of its long execution time. I'm happy to retry if there is a config I can tune to get a response quicker.
Note 2: This isn't an entirely fair evaluation, as some of these tools do more than just extract the selectors and therefore have additional complexity.
Note 3: This blog post was originally written with a focus on the function selectors. It has since been updated to include more info on the log selectors, but this benchmark still only covers the function selectors.
| Rank | Model | F1-Score | Recall | Precision |
|---|---|---|---|---|
| 🥇 1 | Evmmole | 0.9785 | 0.9588 | 0.9990 |
| 🥈 2 | sevm | 0.8980 | 0.8157 | 0.9989 |
| 🥉 3 | Our naive pattern model | 0.7986 | 0.6655 | 0.9983 |
| 4 | whatsabi | 0.7986 | 0.6655 | 0.9983 |
| 5 | heimdall | 0.7886 | 0.6514 | 0.9989 |
Obviously the dynamic analysis approaches beat the static analysis approaches. However, our naive implementation is still able to get a pretty good F1-score.
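For reference, the three columns relate to each other as in the sketch below, which scores one contract by comparing the predicted selector set against the set from the verified ABI. How the scores are aggregated over the whole dataset (per contract or over all selectors at once) is not shown here, so treat the `score` helper as an assumption about the general shape rather than the exact benchmark code.

```python
def score(predicted: set, expected: set):
    """Return (precision, recall, f1) for one contract's selector sets."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


expected = {"0x06fdde03", "0x23b872dd", "0x095ea7b3", "0x2e1a7d4d"}
predicted = {"0x06fdde03", "0x23b872dd", "0x095ea7b3"}
print(score(predicted, expected))
```

A tool that only reports selectors it is sure about gets a high precision but a lower recall, which is the shape most rows in the table have.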
These relationships should also be learnable by a simple neural network, which should then (hopefully) improve on our existing naive approach.
import torch

WINDOW_SIZE = 5


class SelectorDetector(torch.nn.Module):
    def __init__(self, vocab_size, classes):
        super(SelectorDetector, self).__init__()
        # Token ids -> 128-dimensional embeddings.
        self.head = torch.nn.Sequential(
            torch.nn.Embedding(vocab_size, 128),
        )
        self.body = torch.nn.Sequential(
            torch.nn.Linear(128, 256),
            # Input is (batch, WINDOW_SIZE, 256), so this normalizes per window position.
            torch.nn.BatchNorm1d(WINDOW_SIZE),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.3),
            torch.nn.Linear(256, 128),
            torch.nn.Sigmoid(),
            torch.nn.Linear(128, classes + 1),
        )
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, torch.nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight)
            torch.nn.init.zeros_(module.bias)
        elif isinstance(module, torch.nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0, std=0.1)

    def forward(self, X):
        # X holds token ids with shape (batch, WINDOW_SIZE).
        out = self.body(self.head(X))
        # Average over the window positions -> (batch, classes + 1).
        return out.mean(dim=1)
Now let's evaluate it and look at the results
| Rank | Model | F1-Score | Recall | Precision |
|---|---|---|---|---|
| 🥇 1 | Evmmole | 0.9785 | 0.9588 | 0.9990 |
| 🥈 2 | Our torch model | 0.9283 | 0.9429 | 0.9142 |
| 🥉 3 | sevm | 0.8980 | 0.8157 | 0.9989 |
| 4 | Our naive pattern model | 0.7986 | 0.6655 | 0.9983 |
| 5 | whatsabi | 0.7986 | 0.6655 | 0.9983 |
| 6 | heimdall | 0.7886 | 0.6514 | 0.9989 |
Nice! We are now only behind a dynamic analysis solution, not bad.
There are similar patterns for logs, although they are much less regular than the ones for function selectors.
JUMPI CALLER PUSH20 <data> AND PUSH32 <selector> CALLVALUE PUSH1 <data> MLOAD
PUSH2 <data> JUMPI CALLER PUSH20 <data> AND PUSH32 <selector> CALLVALUE PUSH1 <data>
CALLER PUSH20 <data> AND PUSH32 <selector> CALLVALUE PUSH1 <data> MLOAD DUP1
PUSH20 <data> AND PUSH32 <selector> CALLVALUE PUSH1 <data> MLOAD DUP1 DUP3
MLOAD DUP9 DUP2 MSTORE PUSH32 <selector> DUP7 ADDRESS SWAP3
DUP9 DUP2 MSTORE PUSH32 <selector> DUP7 ADDRESS SWAP3 LOG3
DUP2 MSTORE PUSH32 <selector> DUP7 ADDRESS SWAP3 LOG3 PUSH2 <data>
PUSH1 <data> MLOAD DUP9 DUP2 MSTORE PUSH32 <selector> DUP7 ADDRESS
CALLVALUE GT ISZERO PUSH2 <data> JUMPI PUSH32 <selector> CALLER CALLVALUE
GT ISZERO PUSH2 <data> JUMPI PUSH32 <selector> CALLER CALLVALUE PUSH1 <data>
PUSH1 <data> MLOAD CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH1 <data> MSTORE
MLOAD CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH1 <data> MSTORE PUSH1 <data>
POP SSTORE PUSH1 <data> MLOAD CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD
SSTORE PUSH1 <data> MLOAD CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH1 <data>
PUSH2 <data> MSTORE PUSH32 <selector> PUSH1 <data> PUSH2 <data> LOG1 STOP JUMPDEST
PUSH1 <data> CALLDATALOAD PUSH2 <data> MSTORE PUSH32 <selector> PUSH1 <data> PUSH2 <data> LOG1
JUMPDEST CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH2 <data> MSTORE PUSH2 <data>
PUSH2 <data> JUMP JUMPDEST CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH2 <data>
JUMP JUMPDEST CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD PUSH2 <data> MSTORE
PUSH2 <data> PUSH2 <data> JUMP JUMPDEST CALLER PUSH32 <selector> PUSH1 <data> CALLDATALOAD
There are a few things to note from the patterns above. The compiler might choose to optimize to a PUSH{N..32} when the hash of the topic has leading zeros. Depending on the compiler, it might also optimize it to be a CODECOPY. There is also no guarantee that the
There are rollups that support both the EVM and a non-EVM runtime, in addition to allowing the two to communicate. Examples are Stylus from Arbitrum and EraVM from ZKsync. One question that came to mind is whether these non-EVM binaries would have the same selector logic or handle it in a different way.
Stylus is a WASM runtime that allows users to write smart contracts in traditional programming languages (C++, Rust, Go, etc.) instead of Solidity. Since it's WASM, all related tooling just works.
wasm-objdump
from wasm_tob import (
    decode_module,
    SEC_DATA,
    SEC_CODE,
    decode_bytecode,
    INSN_ENTER_BLOCK,
    INSN_LEAVE_BLOCK,
)


def pad_hex(val):
    # Pad to an even number of hex digits, e.g. "0x6fdde03" -> "0x06fdde03".
    if len(val) % 2 == 0:
        return val
    return val.replace("0x", "0x0")


def format_instruction(insn, data_sections):
    text = insn.op.mnemonic
    if not insn.imm:
        return text

    def format_isnt(text, x):
        # WASM const immediates are signed. We reinterpret them as unsigned
        # because that is how the selectors are encoded.
        if text == "i32.const":
            return pad_hex(hex(x & 0xFFFFFFFF))
        if text == "i64.const":
            return pad_hex(hex(x & 0xFFFFFFFFFFFFFFFF))
        return x

    args = [
        getattr(insn.op.imm_struct, x.name).to_string(format_isnt(text, getattr(insn.imm, x.name)))
        for x in insn.op.imm_struct._meta.fields
    ]
    base = text + ' ' + ', '.join(args)
    if text == "i64.load":
        # Resolve the 8 bytes that a load from the static data section would read.
        load_section = int(args[1], 16)
        data = None
        for start, end, section_data in data_sections:
            if start <= load_section < end:
                delta = load_section - start
                data = section_data[delta:delta + 8].hex()
        if data is None:
            return base
        return base + f"# data: {data}"
    else:
        return base


def disas(raw):
    # First pass: collect (start, end, bytes) for every data segment.
    mod_iter = iter(decode_module(raw))
    header, header_data = next(mod_iter)
    data_sections = []
    for cur_sec, cur_sec_data in mod_iter:
        if cur_sec_data.id == SEC_DATA:
            for idx, entry in enumerate(cur_sec_data.payload.entries):
                offset = entry.offset[0].imm.value
                data = entry.data.tobytes()
                data_sections.append((
                    offset, offset + len(data), data
                ))

    # Second pass: print every function body with block indentation.
    mod_iter = iter(decode_module(raw))
    header, header_data = next(mod_iter)
    for cur_sec, cur_sec_data in mod_iter:
        if cur_sec_data.id == SEC_CODE:
            code_sec = cur_sec_data.payload
            for i, func_body in enumerate(code_sec.bodies):
                print('{x} sub_{id:04X} {x}'.format(x='=' * 35, id=i))
                indent = 0
                code = func_body.code.tobytes()
                for cur_insn in decode_bytecode(code):
                    if cur_insn.op.flags & INSN_LEAVE_BLOCK:
                        indent -= 1
                    print(' ' * indent + format_instruction(cur_insn, data_sections))
                    if cur_insn.op.flags & INSN_ENTER_BLOCK:
                        indent += 1


if __name__ == "__main__":
    with open('example_stylus_pencil.wasm', 'rb') as f:
        raw = f.read()
    disas(raw)
If we run the script above on an example ERC20 Stylus contract and grep for some of the selectors, we can see that both the function selectors and the log selectors are in the binary.
# name()
python3 decompile.py | grep "0x06fdde03" -A 3
i32.const '0x06fdde03'
i32.ne
br_if 18
get_local 1
# transferFrom(address,address,uint256)
python3 decompile.py | grep "0x23b872dd" -A 3
i32.const '0x23b872dd'
i32.eq
br_if 9
get_local 2
# Approval(address,address,uint256)
# (because it is loaded in chunks, I only grep the last 8 bytes, but you can see it chained together)
python3 decompile.py | grep "5b200ac8c7c3b925" -A 16
i64.load 0, 0x8648# data: 5b200ac8c7c3b925
i64.store 3, 0
get_local 1
i32.const '0x10'
i32.add
i32.const '0x00'
i64.load 0, 0x8640# data: dd0314c0f7b2291e
i64.store 3, 0
get_local 1
i32.const '0x08'
i32.add
i32.const '0x00'
i64.load 0, 0x8638# data: d14f71427d1e84f3
i64.store 3, 0
get_local 1
i32.const '0x00'
i64.load 0, 0x8630# data: 8c5be1e5ebec7d5b
# Transfer(address,address,uint256)
# (because it is loaded in chunks, I only grep the last 8 bytes, but you can see it chained together)
python3 decompile.py | grep "28f55a4df523b3ef" -A 16
i64.load 0, 0x8628# data: 28f55a4df523b3ef
i64.store 3, 0
get_local 1
i32.const '0x10'
i32.add
i32.const '0x00'
i64.load 0, 0x8620# data: 952ba7f163c4a116
i64.store 3, 0
get_local 1
i32.const '0x08'
i32.add
i32.const '0x00'
i64.load 0, 0x8618# data: 69c2b068fc378daa
i64.store 3, 0
get_local 1
i32.const '0x00'
i64.load 0, 0x8610# data: ddf252ad1be2c89b
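Since the log selectors are loaded in four 8-byte chunks, an extractor has to stitch them back together. A minimal sketch, using the offsets and chunks from the output above (`reassemble_topic` is a hypothetical helper, and it assumes the chunks sit next to each other in the data section):

```python
def reassemble_topic(loads):
    """Join 8-byte chunks, given as (data_offset, hex_chunk), into one 32-byte topic."""
    ordered = sorted(loads)
    offsets = [offset for offset, _ in ordered]
    # The four chunks should be contiguous: each one starts 8 bytes after the previous.
    if len(ordered) != 4 or any(b - a != 8 for a, b in zip(offsets, offsets[1:])):
        raise ValueError("expected four contiguous 8-byte chunks")
    return "0x" + "".join(chunk for _, chunk in ordered)


approval_loads = [
    (0x8648, "5b200ac8c7c3b925"),
    (0x8640, "dd0314c0f7b2291e"),
    (0x8638, "d14f71427d1e84f3"),
    (0x8630, "8c5be1e5ebec7d5b"),
]
print(reassemble_topic(approval_loads))
```

Sorting by offset recovers the original byte order even though the loads appear in reverse in the disassembly.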
EraVM is a custom instruction set and it has its own compiler built on LLVM. The compiler luckily comes with a decompiler. With some help from Deepwiki, the same logic was ported over to a Python script.
import re
from typing import Dict, Tuple, List


class EraVMDecoder:
    def __init__(self):
        opcodes = [
            ('<invalid>', 0, 'direct'), ('nop', 1, 'nop'), ('add', 25, 'arith_comm'),
            ('sub', 73, 'arith_ncomm'), ('mul', 169, 'arith_comm'), ('div', 217, 'arith_ncomm'),
            ('jump', 313, 'jump'), ('xor', 319, 'arith_comm'), ('and', 367, 'arith_comm'),
            ('or', 415, 'arith_comm'), ('shl', 463, 'arith_ncomm'), ('shr', 559, 'arith_ncomm'),
            ('rol', 655, 'arith_ncomm'), ('ror', 751, 'arith_ncomm'), ('addp', 847, 'arith_ptr'),
            ('subp', 895, 'arith_ptr'), ('pack', 943, 'arith_ptr'), ('shrnk', 991, 'arith_ptr'),
            ('call', 1039, 'direct'), ('this', 1040, 'direct'), ('par', 1041, 'direct'),
            ('code', 1042, 'direct'), ('meta', 1043, 'direct'), ('ergs', 1044, 'direct'),
            ('sp', 1045, 'direct'), ('ldvl', 1046, 'direct'), ('stvl', 1047, 'direct'),
            ('stpub', 1048, 'direct'), ('inctx', 1049, 'direct'), ('lds', 1050, 'direct'),
            ('sts', 1051, 'direct'), ('callf', 1057, 'farcall'), ('calld', 1061, 'farcall'),
            ('callm', 1065, 'farcall'), ('ret', 1069, 'direct'), ('retl', 1070, 'direct'),
            ('rev', 1071, 'direct'), ('revl', 1072, 'direct'), ('pnc', 1073, 'direct'),
            ('pncl', 1074, 'direct'), ('ldm.h', 1075, 'heap'), ('stm.h', 1077, 'heap'),
            ('ldm.st', 1096, 'static'), ('stm.st', 1100, 'static')
        ]
        self.opcode_map = {op[1]: (op[0], op[2]) for op in opcodes}
        self.sorted_opcodes = sorted([op[1] for op in opcodes], reverse=True)
        self.src_modes = ['reg', 'sp_pop', 'sp_rel', 'stack_abs', 'imm', 'code']
        self.dst_modes = ['reg', 'sp_push', 'sp_rel', 'stack_abs']
        self.conditions = ['none', 'gt', 'lt', 'eq', 'ge', 'le', 'ne', 'gtlt']
        self.code_ref_regex = re.compile(r'code\[(?:r[0-9]+\+)?([0-9]+)\]')

    def decode_instruction(self, data: bytes) -> Dict:
        """Decode 8-byte EraVM instruction."""
        if len(data) != 8:
            raise ValueError("Instructions must be 8 bytes")
        ins = int.from_bytes(data, 'big')
        imm1 = (ins >> 48) & 0xFFFF
        imm0 = (ins >> 32) & 0xFFFF
        dst1 = (ins >> 28) & 0xF
        dst0 = (ins >> 24) & 0xF
        src1 = (ins >> 20) & 0xF
        src0 = (ins >> 16) & 0xF
        pred = (ins >> 13) & 0x7
        opcode = ins & 0x7FF
        base, src_mode, dst_mode, flags = self._analyze_opcode(opcode)
        name = self.opcode_map.get(base, ('<unknown>', 'direct'))[0]
        return {
            'mnemonic': name,
            'src0_reg': f'r{src0}', 'src1_reg': f'r{src1}',
            'dst0_reg': f'r{dst0}', 'dst1_reg': f'r{dst1}',
            'imm0': imm0, 'imm1': imm1,
            'predicate': self.conditions[pred] if pred < 8 else f'pred{pred}',
            'src_mode': self.src_modes[src_mode] if 0 <= src_mode < 6 else 'none',
            'dst_mode': self.dst_modes[dst_mode] if 0 <= dst_mode < 4 else 'none',
            'raw_opcode': opcode, 'base_opcode': base,
            **flags
        }

    def _analyze_opcode(self, opcode: int) -> Tuple[int, int, int, Dict]:
        """Analyze opcode to extract base instruction and operand modes."""
        base = next((op for op in self.sorted_opcodes if op <= opcode), 0)
        if base not in self.opcode_map:
            return 0, -1, -1, {}
        delta = opcode - base
        encoding = self.opcode_map[base][1]
        src_mode = dst_mode = -1
        flags = {}
        if delta > 0:
            if encoding == 'nop':
                dst_mode, src_mode = delta % 4, (delta // 4) % 6
            elif encoding == 'arith_comm':
                flags['set_flags'] = bool(delta % 2)
                dst_mode, src_mode = (delta // 2) % 4, (delta // 8) % 6
            elif encoding == 'arith_ncomm':
                flags.update({
                    'swap': bool(delta % 2),
                    'set_flags': bool((delta // 2) % 2)
                })
                dst_mode, src_mode = (delta // 4) % 4, (delta // 16) % 6
            elif encoding == 'arith_ptr':
                flags['swap'] = bool(delta % 2)
                dst_mode, src_mode = (delta // 2) % 4, (delta // 8) % 6
            elif encoding == 'jump':
                src_mode = delta % 6
            elif encoding == 'farcall':
                flags.update({
                    'is_shard': bool(delta % 2),
                    'is_static': bool((delta // 2) % 2)
                })
        return base, src_mode, dst_mode, flags

    def stringify_instruction(self, decoded: Dict, constants: Dict = None) -> str:
        """Convert decoded instruction to assembly string."""
        constants = constants or {}
        mnemonic = decoded.get('mnemonic', '<unknown>')
        if decoded.get('swap'): mnemonic += '.s'
        if decoded.get('set_flags'): mnemonic += '!'
        if decoded.get('predicate', 'none') != 'none':
            mnemonic += f".{decoded['predicate']}"
        operands = self._format_operands(decoded, constants)
        comments = []
        if 'code_ref_comment' in decoded:
            comments.append(decoded['code_ref_comment'])
        result = f"{mnemonic:<8} {', '.join(operands)}" if operands else mnemonic
        if comments:
            result += f" # {', '.join(comments)}"
        return result.strip()

    def _format_operands(self, d: Dict, constants: Dict) -> List[str]:
        """Format operands based on instruction type."""
        mnemonic = d.get('mnemonic', '').split('.')[0]
        operands = []
        if mnemonic in ['add', 'sub', 'mul', 'div', 'and', 'or', 'xor', 'shl', 'shr', 'rol', 'ror']:
            operands.append(self._format_src_operand(d, 0, constants))
            operands.append(d.get('src1_reg', 'r0'))
            operands.append(self._format_dst_operand(d, 0))
        elif mnemonic in ['this', 'par', 'code', 'meta', 'ergs', 'sp', 'ldvl', 'stvl']:
            operands.append(d.get('dst0_reg', 'r0'))
        elif mnemonic in ['retl', 'revl', 'pncl']:
            operands.append(str(d.get('imm0', 0)))
        elif mnemonic == 'jump':
            operands.append(self._format_src_operand(d, 0, constants))
        elif mnemonic in ['callf', 'calld', 'callm']:
            operands.extend([
                d.get('src0_reg', 'r0'),
                d.get('src1_reg', 'r0'),
                str(d.get('imm0', 0))
            ])
        return operands

    def _format_src_operand(self, d: Dict, src_idx: int, constants: Dict) -> str:
        """Format source operand based on addressing mode."""
        src_mode = d.get('src_mode', 'reg')
        reg_key = f'src{src_idx}_reg'
        imm_key = f'imm{src_idx}'
        if src_mode == 'imm':
            return str(d.get(imm_key, 0))
        elif src_mode == 'code':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            if reg == 'r0' and imm != 0:
                if imm in constants:
                    d['code_ref_comment'] = f"code[{imm}] = {constants[imm]}"
                return f"code[{imm}]"
            return f"code[{reg}]" if imm == 0 else f"code[{reg}+{imm}]"
        elif src_mode == 'stack_abs':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            return f"stack[{reg}]" if imm == 0 else f"stack[{imm} + {reg}]"
        else:
            return d.get(reg_key, 'r0')

    def _format_dst_operand(self, d: Dict, dst_idx: int) -> str:
        """Format destination operand based on addressing mode."""
        dst_mode = d.get('dst_mode', 'reg')
        reg_key = f'dst{dst_idx}_reg'
        imm_key = f'imm{1 if dst_idx == 0 else 0}'
        if dst_mode == 'stack_abs':
            imm, reg = d.get(imm_key, 0), d.get(reg_key, 'r0')
            return f"stack[{reg}]" if imm == 0 else f"stack[{imm} + {reg}]"
        else:
            return d.get(reg_key, 'r0')

    def analyze_binary(self, data: bytes) -> Dict:
        """Analyze complete EraVM binary."""
        if len(data) % 32:
            raise ValueError("Binary size must be multiple of 32 bytes")
        const_start = self._find_constant_section(data)
        instructions = []
        for i in range(0, const_start, 8):
            if i + 8 <= len(data):
                instr_bytes = data[i:i+8]
                try:
                    decoded = self.decode_instruction(instr_bytes)
                except:
                    mnemonic = '<padding>' if all(b == 0 for b in instr_bytes) else '<metadata>'
                    decoded = {'mnemonic': mnemonic}
                instructions.append({
                    'address': i,
                    'bytes': instr_bytes,
                    'decoded': decoded
                })
        constants = []
        for i in range(const_start, len(data), 32):
            if i + 32 <= len(data):
                cell_bytes = data[i:i+32]
                constants.append({
                    'word_number': i // 32,
                    'address': i,
                    'bytes': cell_bytes,
                    'value': "0x" + cell_bytes.hex()
                })
        return {
            'instructions': instructions,
            'constants': constants,
            'constant_section_start': const_start
        }

    def _find_constant_section(self, data: bytes) -> int:
        """Find where constant section begins by analyzing code references."""
        min_ref = float('inf')
        for i in range(0, len(data) - 7, 8):
            try:
                decoded = self.decode_instruction(data[i:i+8])
                asm_str = self.stringify_instruction(decoded)
                refs = [int(match) for match in self.code_ref_regex.findall(asm_str)]
                if refs:
                    min_ref = min(min_ref, min(refs))
            except:
                continue
            if i % 32 == 24 and min_ref == (i + 8) // 32:
                return i + 8
        return len(data)

    def format_disassembly(self, address: int, instr_bytes: bytes,
                           decoded: Dict, constants: Dict = None) -> str:
        """Format instruction for disassembly output."""
        hex_bytes = ' '.join(f'{b:02x}' for b in instr_bytes)
        asm_str = self.stringify_instruction(decoded, constants or {})
        return f"{address:08x}: {hex_bytes:<24} {asm_str}"


def main():
    try:
        with open("example_weth.hex", "r") as f:
            data = bytes.fromhex(f.read())
        decoder = EraVMDecoder()
        result = decoder.analyze_binary(data)
        const_lookup = {c['word_number']: c['value'] for c in result['constants']}
        for instr in result['instructions']:
            print(decoder.format_disassembly(
                instr['address'],
                instr['bytes'],
                instr['decoded'],
                const_lookup
            ))
        for const in result['constants']:
            print(f"{const['word_number']}:")
            print(f"\t.cell {const['value']}")
    except FileNotFoundError:
        print("Example file not found - decoder ready for use")


if __name__ == "__main__":
    main()
If we look at the WETH compiled for EraVM, we can again see that the selectors are within the binary.
# name()
python3 eravm_decompiler.py | grep "06fdde03" -A 3
00000090: 00 00 01 1a 04 20 00 9c sub.s! code[282], r2, r4 # code[282] = 0x0000000000000000000000000000000000000000000000000000000006fdde03
00000098: 00 00 00 d6 00 00 61 3d jump.eq 214
000000a0: 00 00 01 1b 04 20 00 9c sub.s! code[283], r2, r4 # code[283] = 0x00000000000000000000000000000000000000000000000000000000095ea7b3
000000a8: 00 00 00 ee 00 00 61 3d jump.eq 238
# transferFrom(address,address,uint256)
python3 eravm_decompiler.py | grep "23b872dd" -A 3
000003f8: 00 00 01 17 04 20 00 9c sub.s! code[279], r2, r4 # code[279] = 0x0000000000000000000000000000000000000000000000000000000023b872dd
00000400: 00 00 01 73 00 00 61 3d jump.eq 371
00000408: 00 00 01 18 04 20 00 9c sub.s! code[280], r2, r4 # code[280] = 0x000000000000000000000000000000000000000000000000000000002e1a7d4d
00000410: 00 00 01 85 00 00 61 3d jump.eq 389
# Approval(address,address,uint256)
python3 eravm_decompiler.py | grep "0x8c5be1e5ebec7d5bd14f71427d1e84f3dd0314c0f7b2291e5b200ac8c7c3b925" -A 3
00000948: 00 00 01 2b 04 00 00 41 add code[299], r0, r4 # code[299] = 0x8c5be1e5ebec7d5bd14f71427d1e84f3dd0314c0f7b2291e5b200ac8c7c3b925
00000950: 00 00 00 02 05 00 00 29 add r0, r0, r5
00000958: 00 00 00 03 06 00 00 29 add r0, r0, r6
00000960: 04 1f 04 15 00 00 04 0f call
# Transfer(address,address,uint256)
python3 eravm_decompiler.py | grep "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef" -A 3
00001ef8: 00 00 01 2e 04 00 00 41 add code[302], r0, r4 # code[302] = 0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef
00001f00: 00 00 00 06 05 00 00 29 add r0, r0, r5
00001f08: 00 00 00 03 06 00 00 29 add r0, r0, r6
00001f10: 04 1f 04 15 00 00 04 0f call
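Because EraVM keeps these values in 32-byte constant cells, a rough first filter is to look at the shape of each cell. The sketch below is only a heuristic I am assuming for illustration: a cell with 28 leading zero bytes is treated as a function selector candidate, and a cell where every 8-byte quarter is non-zero is treated as a topic candidate. Small constants and masks would be false positives, so a real extractor should also check that the cell is referenced by a `sub.s!` followed by a `jump.eq`, like in the output above.

```python
def classify_cell(value_hex: str) -> str:
    """Classify a 32-byte constant cell by its shape (heuristic only)."""
    digits = value_hex[2:] if value_hex.startswith("0x") else value_hex
    raw = bytes.fromhex(digits).rjust(32, b"\x00")
    if raw[:28] == bytes(28) and raw[28:] != bytes(4):
        return "function_selector_candidate"
    quarters = [raw[i : i + 8] for i in range(0, 32, 8)]
    if all(quarter != bytes(8) for quarter in quarters):
        return "event_topic_candidate"
    return "other"


cells = [
    "0x0000000000000000000000000000000000000000000000000000000006fdde03",
    "0x8c5be1e5ebec7d5bd14f71427d1e84f3dd0314c0f7b2291e5b200ac8c7c3b925",
    "0x0000000000000000000000000000000000000000000000000000000000000000",
]
for cell in cells:
    print(classify_cell(cell), cell)
```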
Great, we wrote a selector extraction algorithm in a few hours using a data-driven approach and benchmarked it to verify that it works.
If you liked this blog post, you might also like the following posts (not written by me):