A few days ago Zellic released an updated version of their smart contract dataset. The new version, all-ethereum-contracts, contains, like the previous dataset, all smart contract deployments made historically up until recently (February 2025 in this case); the difference is that it now gives us the raw bytecode instead of the original source code. Given this new dataset, let's use it to create a simple static analysis script that extracts ABI selectors from the bytecode.
You can read more about function selectors here and event selectors here. The ABI spec of Solidity (which Vyper also uses) can be read here.
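As a quick refresher: a function selector is the first four bytes of the keccak256 hash of the canonical function signature, and an event selector (topic0) is the full 32-byte hash of the event signature. A minimal sketch, assuming pycryptodome is available for keccak256 (any keccak-256 implementation works the same way):

```python
# Minimal sketch of selector derivation (assumes pycryptodome for keccak256).
from Crypto.Hash import keccak

def keccak256(data: bytes) -> bytes:
    return keccak.new(digest_bits=256, data=data).digest()

# Function selector: first 4 bytes of the hash of the canonical signature.
print(keccak256(b"transfer(address,uint256)")[:4].hex())
# -> a9059cbb

# Event selector (topic0): the full 32-byte hash of the event signature.
print(keccak256(b"Transfer(address,address,uint256)").hex())
# -> ddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef
```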
There are multiple existing solutions already (in no specific order):

- evmole
- sevm
- whatsabi
- heimdall
- gigahorse
Some of them use static analysis and some use dynamic analysis to infer the selectors from the bytecode. Some also try to infer the full ABI JSON, but in this post we focus only on retrieving the selectors. That said, most implementations just use a pre-image dictionary to resolve the full ABI from the selector (the same technique can be used with our version).
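For illustration, such a pre-image lookup can be as simple as the sketch below; the dump file format is an assumption (think a local dump of the 4byte directory):

```python
# Hypothetical pre-image dictionary lookup: map a 4-byte selector back to the
# known signatures that hash to it (e.g. a local dump of the 4byte directory).
import json

def load_preimages(path: str) -> dict:
    with open(path) as f:
        return json.load(f)  # e.g. {"a9059cbb": ["transfer(address,uint256)"]}

def resolve(selector_hex: str, preimages: dict) -> list:
    return preimages.get(selector_hex.lower().removeprefix("0x"), [])
```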
The log selector logic works similarly for both Solidity and Vyper. The same is true for the function selectors, but the jump table where they are used is very different between the compilers. Vyper recently did a write-up on how their constant-time jump tables work. Solidity hasn't really done any write-ups as far as I know, but if you look inside the source code you can see they mention some patterns.
Instead of relying on what the developers tell us, we will look at the bytecode. Since we know the ABI ahead of time, we can see where and how the selectors are placed within the bytecode, slide a window over the opcodes to surface the recurring patterns, and then build a script around the most common patterns and evaluate it.
Sadly there isn't any good bytecode dataset for this which includes all major compiler versions for both Solc and Vyper, at least none I know of that isn't outdated. Therefore I created one: it's built by sampling block intervals from the all-ethereum-contracts dataset and deduping on the provided bytecode hash. We then get the compiler version from Etherscan based on the address.
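Roughly, the sampling and dedup step looks like the sketch below; the row schema and the interval size are assumptions, not the exact pipeline:

```python
# Sketch of the dataset construction: sample deployments at block intervals
# and dedupe on the bytecode hash the dataset already provides.
def sample_contracts(deployments, block_interval=100_000):
    seen_hashes = set()
    sampled = []
    for row in deployments:  # rows from all-ethereum-contracts (assumed schema)
        if row["block_number"] % block_interval != 0:
            continue
        if row["bytecode_hash"] in seen_hashes:
            continue
        seen_hashes.add(row["bytecode_hash"])
        sampled.append(row)
    return sampled
```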
We then ended up with the following
We postprocess this with the verified contract response from Etherscan to extract the ABI selectors from the returned ABI.
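Deriving the selector sets from an Etherscan-style ABI JSON is straightforward; here is a sketch reusing the keccak256 helper from earlier (tuple/struct parameters are ignored for brevity):

```python
# Sketch: compute the function and event selector sets from an
# Etherscan-style ABI JSON, reusing the keccak256 helper from earlier.
def abi_to_selectors(abi: list) -> dict:
    functions, events = set(), set()
    for entry in abi:
        if entry["type"] not in ("function", "event"):
            continue
        # Canonical signature, e.g. "transfer(address,uint256)".
        inputs = ",".join(arg["type"] for arg in entry.get("inputs", []))
        digest = keccak256(f"{entry['name']}({inputs})".encode())
        if entry["type"] == "function":
            functions.add(digest[:4].hex())  # 4-byte function selector
        else:
            events.add(digest.hex())  # 32-byte topic0
    return {"functions": sorted(functions), "events": sorted(events)}
```

With the dataset in place, the pattern-mining script looks like this: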
from evm import get_opcodes_from_bytecode, PushOpcode, JUMP_DEST
from collections import defaultdict
from copy import deepcopy
from tqdm import tqdm
import random
import glob
import json
import os
WINDOW_SIZE = 16
MAX_PATTERNS = 10
FOLDER_PATH = os.environ.get("FOLDER_PATH")
assert FOLDER_PATH is not None
patterns = {
"solc": defaultdict(int),
"vyper": defaultdict(int),
}
def transform_opcodes_window(bytecode, opcodes, selectors, index):
opcodes_window = opcodes[index : index + WINDOW_SIZE]
min_index = float("inf")
    # Walk the window and abstract PUSH arguments into tokens: known function
    # and event selectors, jump targets, and plain data.
    for i, op in enumerate(opcodes_window):
        if not isinstance(op, PushOpcode):
            continue
        op_args_int = int.from_bytes(bytes.fromhex(op.args), byteorder="big")
        if op.args in selectors["functions"]:
            opcodes_window[i] = "<func_selector>"
            min_index = min(i, min_index)
        elif op.args in selectors["events"]:
            opcodes_window[i] = "<log_selector>"
            min_index = min(i, min_index)
        elif op_args_int < len(bytecode) and bytecode[op_args_int] == JUMP_DEST:
            # The pushed value points at a JUMPDEST, so treat it as a jump target.
            opcodes_window[i] = f"{op.name} <jumpdest>"
        else:
            opcodes_window[i] = f"{op.name} <data>"
return opcodes_window, min_index
def main():
    for path in tqdm(glob.glob(os.path.join(FOLDER_PATH, "**/*.json"), recursive=True)):
        with open(path, "r") as f:
            data = json.load(f)
        # removeprefix, not lstrip("0x"): lstrip would also eat leading zero bytes.
        bytecode = bytes.fromhex(data["bytecode"].removeprefix("0x"))
        selectors = data["selectors"]
        if selectors is None:
            continue
        compiler = data["compiler"]["kind"]
        opcodes = get_opcodes_from_bytecode(bytecode)
for index, _ in enumerate(opcodes):
opcodes_window, min_index = transform_opcodes_window(
bytecode, opcodes, selectors, index
)
if min_index == float("inf"):
continue
            # Count every truncation of the window (shrinking from the right,
            # then from the left) that still contains a selector token.
            opcodes_window_og = deepcopy(opcodes_window)
while len(opcodes_window) > 2 and (
"<func_selector>" in opcodes_window
or "<log_selector>" in opcodes_window
):
current_window = " ".join(list(map(str, list(opcodes_window))))
patterns[compiler][current_window] += 1
opcodes_window = opcodes_window[:-1]
opcodes_window = opcodes_window_og
while len(opcodes_window) > 2 and (
"<func_selector>" in opcodes_window
or "<log_selector>" in opcodes_window
):
current_window = " ".join(list(map(str, list(opcodes_window))))
patterns[compiler][current_window] += 1
opcodes_window = opcodes_window[1:]
    # Rank patterns by frequency, skipping any pattern that is a substring or
    # superstring of one we already kept.
    compiler_patterns = {}
for compiler in patterns:
compiler_patterns[compiler] = []
for pattern in sorted(
list(patterns[compiler].keys()),
key=lambda x: patterns[compiler][x],
reverse=True,
):
for v in compiler_patterns[compiler]:
# If it's a subset, let's skip.
if v in pattern or pattern in v:
break
else:
compiler_patterns[compiler].append(pattern)
if len(compiler_patterns[compiler]) > MAX_PATTERNS:
break
print(json.dumps(compiler_patterns, indent=4))
if __name__ == "__main__":
main()
Running this gives us the following patterns:
For solc:

EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
DUP1 <func_selector> EQ PUSH2 <jumpdest>
<func_selector> EQ PUSH2 <jumpdest>
EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ
EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector>
DUP1 <func_selector> EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>
<func_selector> EQ PUSH2 <jumpdest> JUMPI DUP1 <func_selector> EQ PUSH2 <jumpdest>

For vyper:

PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data>
PUSH1 <data> MSTORE PUSH1 <data> PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data> JUMPI
MSTORE PUSH1 <data> PUSH1 <data> RETURN JUMPDEST <func_selector> DUP2 XOR PUSH2 <data> JUMPI
There is one obvious pattern I see we are missing: the case when the selector is `0x00000000`, where the compiler will usually optimize the check into an `ISZERO` comparison.
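To turn the mined patterns into an actual extractor, we can abstract a window of unseen bytecode the same way we did during mining, treat every PUSH4 argument as a candidate selector, and keep the candidates when the abstracted window matches a mined pattern. A minimal sketch reusing the helpers from the mining script (the benchmarked implementation may differ in details):

```python
# Minimal sketch of the naive extractor: abstract a window exactly like the
# mining step, but treat every PUSH4 as a candidate selector, then keep the
# candidates when the abstracted window matches one of the mined patterns.
def extract_selectors(bytecode, opcodes, mined_patterns):
    found = set()
    for i in range(len(opcodes)):
        window = list(opcodes[i : i + WINDOW_SIZE])
        candidates = []
        for j, op in enumerate(window):
            if not isinstance(op, PushOpcode):
                continue
            value = int.from_bytes(bytes.fromhex(op.args), byteorder="big")
            if op.name == "PUSH4":
                window[j] = "<func_selector>"
                candidates.append(op.args)
            elif value < len(bytecode) and bytecode[value] == JUMP_DEST:
                window[j] = f"{op.name} <jumpdest>"
            else:
                window[j] = f"{op.name} <data>"
        if not candidates:
            continue
        serialized = " ".join(map(str, window))
        if any(pattern in serialized for pattern in mined_patterns):
            found.update(candidates)
    return found
```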
Note 1: We did not test gigahorse in this comparison because of its long execution time; happy to retry if there is a config I can tune to get a response quicker.
Note 2: This isn't an entirely fair evaluation, as some of these tools do more than just extract selectors and therefore have additional complexity.
| Rank | Model | F1-Score | Recall | Precision |
|---|---|---|---|---|
| 🥇 1 | evmole | 0.9785 | 0.9588 | 0.9990 |
| 🥈 2 | sevm | 0.8980 | 0.8157 | 0.9989 |
| 🥉 3 | Our naive pattern model | 0.7986 | 0.6655 | 0.9983 |
| 4 | whatsabi | 0.7986 | 0.6655 | 0.9983 |
| 5 | heimdall | 0.7886 | 0.6514 | 0.9989 |
Obviously the dynamic analysis approaches beat the static analysis approaches. However, our naive implementation still achieves a pretty good F1-score. These relationships should also be learnable by a simple neural network, which should then (hopefully) improve on our existing naive approach.
import torch

WINDOW_SIZE = 5

class SelectorDetector(torch.nn.Module):
def __init__(self, vocab_size, classes, simple):
super(SelectorDetector, self).__init__()
self.simple = simple
        # Embed each opcode token into a 128-dim vector.
        self.head = torch.nn.Sequential(
            torch.nn.Embedding(vocab_size, 128),
        )
        # Per-position MLP producing (classes + 1) logits per window position.
        self.body = torch.nn.Sequential(
            torch.nn.Linear(128, 256),
            torch.nn.BatchNorm1d(WINDOW_SIZE),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.3),
            torch.nn.Linear(256, 128),
            torch.nn.Sigmoid(),
            torch.nn.Linear(128, classes + 1),
        )
self.apply(self._init_weights)
def _init_weights(self, module):
if isinstance(module, torch.nn.Linear):
torch.nn.init.xavier_uniform_(module.weight)
torch.nn.init.zeros_(module.bias)
elif isinstance(module, torch.nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0, std=0.1)
    def forward(self, X):
        # X: (batch, WINDOW_SIZE) token ids; average the per-position logits
        # over the window into a single prediction.
        out = self.body(self.head(X))
        return out.mean(dim=1)
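The vocabulary construction and training loop aren't shown here, but as a rough usage sketch (the vocabulary and the class layout below are assumptions for illustration):

```python
# Hypothetical usage: tokenize an abstracted opcode window and classify it.
vocab = {"<pad>": 0, "DUP1": 1, "PUSH4 <data>": 2, "EQ": 3,
         "PUSH2 <jumpdest>": 4, "JUMPI": 5}

model = SelectorDetector(vocab_size=len(vocab), classes=2, simple=True)
model.eval()  # put BatchNorm/Dropout into inference mode

window = ["DUP1", "PUSH4 <data>", "EQ", "PUSH2 <jumpdest>", "JUMPI"]
X = torch.tensor([[vocab[tok] for tok in window]])  # shape: (1, WINDOW_SIZE)
with torch.no_grad():
    logits = model(X)  # shape: (1, classes + 1)
print(logits.argmax(dim=1))  # predicted class for the window
```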
Now let's evaluate it and look at the results:
| Rank | Model | F1-Score | Recall | Precision |
|---|---|---|---|---|
| 🥇 1 | evmole | 0.9785 | 0.9588 | 0.9990 |
| 🥈 2 | Our torch model | 0.9283 | 0.9429 | 0.9142 |
| 🥉 3 | sevm | 0.8980 | 0.8157 | 0.9989 |
| 4 | Our naive pattern model | 0.7986 | 0.6655 | 0.9983 |
| 5 | whatsabi | 0.7986 | 0.6655 | 0.9983 |
| 6 | heimdall | 0.7886 | 0.6514 | 0.9989 |
Nice! We are now only behind a dynamic analysis solution, not bad.
You can use the same technique to find the event logs too. Unlike with jump tables though, the compiler might place the `PUSH32` opcode carrying the topic0 far away from the `LOG` opcode that uses it, so a different technique is advised. For instance, since the topic0 is a hash, we can instead just check that there is a certain amount of randomness in large `PUSH32` values. Sometimes the compiler will also optimize the value into the .data section, so beware.
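One simple way to implement that randomness check is to look at the byte diversity of the pushed value, since keccak256 outputs have near-uniform bytes. A sketch (the threshold is an assumption):

```python
# Sketch of the randomness heuristic: a 32-byte keccak256 hash has high byte
# diversity, while constants, bitmasks, and addresses usually don't.
def looks_like_topic0(push32_args_hex: str) -> bool:
    value = bytes.fromhex(push32_args_hex)
    if len(value) != 32:
        return False
    return len(set(value)) >= 24  # threshold chosen for illustration, tune on data
```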
Great: we wrote a selector extraction algorithm in a few hours using a data-driven approach, and benchmarked it to verify that it works.
If you liked this blog post, you might also like the following posts (not written by me):