Huaijin Wang (The Ohio State University), Zhiqiang Lin (The Ohio State University)
Binary Code Similarity Analysis (BCSA) plays a vital role in many security tasks, including malware analysis, vulnerability detection, and software supply chain security. While numerous BCSA techniques have been proposed over the past decade, few leverage the semantics of register and memory values for comparison, despite promising initial results. Existing value-based approaches often focus narrowly on values that remain invariant across compilation settings, thereby overlooking a broader spectrum of semantically rich information. In this paper, we identify three core challenges limiting the effectiveness of value-based BCSA: (1) unscalable value extraction that fails to cover diverse value-producing behaviors, (2) insufficient noise filtering that allows semantically irrelevant artifacts (e.g., global addresses) to dominate, and (3) inefficient comparison that makes value-based matching expensive and brittle. To make value-based BCSA practical at scale, we propose VSIM, a novel framework that systematically captures values computed from register and memory operations, filters out semantically irrelevant values, and normalizes and propagates the remaining values to enable robust and scalable similarity analysis. Extensive evaluation shows that VSIM consistently outperforms state-of-the-art BCSA systems in accuracy, robustness, and scalability, and that it generalizes across architectures and toolchains on diverse real-world datasets.