Workshop on Binary Analysis Research (BAR) 2025 Program
Friday, 28 February
-
Jack W. Davidson, Professor of Computer Science in the School of Engineering and Applied Science, University of Virginia
For the past twenty years, our research has been driven by the need to analyze, understand, and transform software without access to source code. Through a series of research programs, including DARPA’s Self-Regenerative Systems (SRS) program, AFOSR’s Enterprise Health: Self-Regenerative Incorruptible Enterprise program, IARPA’s Securely Taking on New Executable Software of Uncertain Provenance (STONESOUP) program, DARPA’s Cyber Grand Challenge (CGC), and DARPA’s Cyber Fault-tolerant Attack Recovery (CFAR) program, we have developed novel techniques to analyze and transform binaries. This talk will retrospectively examine these efforts and our key contributions in binary analysis and rewriting, from early vulnerability discovery techniques to advanced automated program transformations. We will also discuss current binary analysis research areas, speculate on where binary analysis research is heading, and explain why it continues to be an important, well-funded, and impactful research area.
Speaker's Biography: Jack W. Davidson is a Professor of Computer Science in the School of Engineering and Applied Science at the University of Virginia. Professor Davidson is a Fellow of the ACM and a Life Fellow of the IEEE. He served as an Associate Editor of ACM’s Transactions on Programming Languages and Systems for six years, and as an Associate Editor of ACM’s Transactions on Architecture and Code Optimization for eight years. He served as Chair of ACM’s Special Interest Group on Programming Languages (SIGPLAN) from 2005 to 2007. He currently serves on the ACM Executive Council and chairs ACM’s Digital Library Board, which oversees the operation and development of ACM’s Digital Library.
-
Malware analysis relies on tools that undergo continuous improvement and refinement. One such tool is Ghidra, released as open source in 2019, which has seen 39 public releases and 13,000 commits as of October 2024. In this paper, we examine the impact of these updates on code similarity analysis for the same set of input files. Additionally, we measure how the underlying version of Ghidra affects simple metrics such as analysis time, error counts, and the number of functions identified. Our case studies reveal that Ghidra’s effectiveness varies depending on the specific file analyzed, highlighting the importance of context in evaluating tool performance.
We do not yet have an answer to the question posed in the title of this paper. In general, Ghidra has certainly improved in the years since its release: developers have fixed countless bugs, added substantial new features, and added support for several new program formats. However, we observe that “better” is highly nuanced. We encourage the community to approach version upgrades with caution, as the latest release may not always provide superior results for every use case. By fostering a nuanced understanding of Ghidra’s advancements, we aim to contribute to more informed decision-making regarding tool adoption and usage in malware analysis and other binary analysis domains.
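The per-version metrics above can be gathered with a short headless script. Below is a minimal, illustrative Ghidra post-analysis script in Python (Jython); the project directory, project name, and binary name in the invocation are placeholders.

# count_functions.py -- minimal Ghidra post-analysis script (Jython); illustrative only.
# Example invocation (all paths are placeholders):
#   analyzeHeadless /tmp/ghidra_proj DemoProject -import sample.bin -postScript count_functions.py
# In a headless run, Ghidra injects `currentProgram` into the script's namespace.
fm = currentProgram.getFunctionManager()
print("Program:   %s" % currentProgram.getName())
print("Language:  %s" % currentProgram.getLanguageID())
print("Functions: %d" % fm.getFunctionCount())

Rerunning the same script under different Ghidra releases on the same binary gives a direct view of how the identified-function count drifts across versions.
-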
Jack Royer (CentraleSupélec), Frédéric Tronel (CentraleSupélec, Inria, CNRS, University of Rennes), Yaëlle Vinçont (University of Rennes, Inria, CNRS, IRISA)
Reverse engineering of software is used to analyze the behavior of malicious programs, find vulnerabilities in software, or design interoperability solutions. Although this activity relies on dedicated software toolboxes, it remains largely manual. To facilitate these tasks, many tools provide analysts with an interface to visualize the Control Flow Graph (CFG) of a function. Properly laying out the CFG is therefore extremely important for facilitating manual reverse engineering. However, CFGs are often laid out with general-purpose algorithms rather than domain-specific ones, which leads to subpar graph layouts. In this paper, we provide a comprehensive survey of the state of the art in CFG layout techniques. We propose a modified layout algorithm that showcases the patterns analysts are looking for. Finally, we compare the layouts offered by popular binary analysis frameworks with our own.
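As a concrete point of reference, the kind of general-purpose layered layout the paper critiques can be produced with off-the-shelf tooling. The sketch below, which assumes networkx and pygraphviz are installed, lays out a toy CFG with Graphviz's hierarchical "dot" engine; the block names are invented.

# Layered ("dot") layout of a toy CFG -- the general-purpose style of
# algorithm this paper argues is subpar for reverse engineering.
import networkx as nx

cfg = nx.DiGraph()
# An if/else diamond followed by a loop back-edge (block names invented).
cfg.add_edges_from([
    ("entry", "cmp"),
    ("cmp", "then"), ("cmp", "else"),
    ("then", "join"), ("else", "join"),
    ("join", "cmp"),   # loop back-edge
    ("join", "exit"),
])

# graphviz_layout delegates to Graphviz's Sugiyama-style layered algorithm.
pos = nx.nx_agraph.graphviz_layout(cfg, prog="dot")
for block, (x, y) in sorted(pos.items()):
    print("%-5s at (%.0f, %.0f)" % (block, x, y))

A domain-specific layout would instead privilege the patterns analysts expect, such as keeping loop bodies vertically aligned and fall-through edges straight.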
-
Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder (Auburn University)
Tokenization is fundamental to assembly code analysis, affecting intrinsic characteristics such as vocabulary size and semantic coverage, as well as extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction, a critical problem in binary code analysis.
To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.
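As an illustration of the kind of customization under study, the sketch below trains a small BPE tokenizer over raw assembly lines with the Hugging Face tokenizers library; the whitespace pre-tokenization rule, the tiny corpus, and the vocabulary size are illustrative choices rather than the paper's configuration.

# Train a small BPE tokenizer on assembly text (illustrative settings only).
# Requires: pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

asm_corpus = [
    "push rbp",
    "mov rbp, rsp",
    "mov eax, dword ptr [rbp - 0x4]",
    "add eax, 0x1",
    "pop rbp",
    "ret",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# A simple pre-tokenization rule: split on whitespace and punctuation
# before BPE merges are learned.
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=256, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(asm_corpus, trainer)

print(tokenizer.encode("mov eax, dword ptr [rbp - 0x8]").tokens)

Swapping the pre-tokenizer or vocabulary size and re-measuring both token counts and downstream accuracy is precisely the intrinsic-versus-extrinsic comparison the study performs.
-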
Dairo de Ruck, Jef Jacobs, Jorn Lapon, Vincent Naessens (DistriNet, KU Leuven)
Debugging is a fundamental testing technique that directly interacts with the functionality and current state of a running program. It lets the analyst step through a program while inspecting registers and memory as part of the program state. When debugging, variables and parameters are assigned concrete values, so only a specific program path is explored. This makes software testing time-consuming while also requiring substantial expertise. Symbolic debugging, on the other hand, can explore multiple paths by replacing concrete input values with symbolic ones and choosing which paths to explore.
angr is a dynamic symbolic execution (DSE) platform that can be programmed to symbolically execute a binary program with selected, possibly symbolic inputs. The binary is lifted to an intermediate, architecture-independent representation in preparation for symbolic execution. This paper presents dAngr, a tool built on angr that enables the user to debug binaries through GDB-like commands, enhanced with symbolic execution and binary analysis capabilities. These commands also abstract away the details of the angr framework and of symbolic execution. The power of dAngr is demonstrated on multiple examples, including capture-the-flag challenges of varying complexity.
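For context, the raw angr workflow that dAngr wraps behind GDB-like commands looks roughly like the sketch below; the binary path, the symbolic-argument size, and the success/failure strings are placeholders.

# A minimal raw angr session of the kind dAngr abstracts away
# (binary path, argument size, and output strings are placeholders).
# Requires: pip install angr
import angr
import claripy

proj = angr.Project("./challenge.bin", auto_load_libs=False)

# Replace a concrete command-line argument with 16 symbolic bytes.
arg = claripy.BVS("arg", 8 * 16)
state = proj.factory.entry_state(args=["./challenge.bin", arg])

simgr = proj.factory.simulation_manager(state)
# Explore until some path prints the success message, pruning failures.
simgr.explore(
    find=lambda s: b"correct" in s.posix.dumps(1),
    avoid=lambda s: b"wrong" in s.posix.dumps(1),
)

if simgr.found:
    print("satisfying input:", simgr.found[0].solver.eval(arg, cast_to=bytes))

dAngr's contribution is to expose this kind of session interactively, so the analyst can step, inspect registers and memory, and introduce symbolic values without writing the boilerplate above.
-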
Caleb Stewart, Rhonda Gaede, Jeffrey Kulick (University of Alabama in Huntsville)
We present DRAGON, a graph neural network (GNN) that predicts data types for decompiled variables along with a confidence estimate for each prediction. While we only train DRAGON on x64 binaries compiled without optimization, we show that DRAGON generalizes well to all combinations of the x64, x86, ARM64, and ARM architectures compiled across optimization levels O0-O3. We compare DRAGON with two state-of-the-art approaches for binary type inference and demonstrate that DRAGON exhibits a competitive or superior level of accuracy for simple type prediction while also providing useful confidence estimates. We show that the learned confidence estimates produced by DRAGON strongly correlate with accuracy, such that higher confidence predictions generally correspond with a higher level of accuracy than lower confidence predictions.
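The reported confidence-accuracy relationship can be checked with a simple calibration-style binning. The sketch below is generic: the confidence and correctness arrays are synthetic stand-ins for a model's real outputs, not DRAGON's.

# Bin predictions by confidence and measure accuracy per bin (generic sketch;
# the synthetic arrays stand in for a model's real predictions).
import numpy as np

rng = np.random.default_rng(0)
confidence = rng.uniform(0.0, 1.0, size=10_000)
# Synthetic stand-in: correctness is more likely at higher confidence.
correct = rng.uniform(0.0, 1.0, size=10_000) < confidence

bins = np.linspace(0.0, 1.0, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence >= lo) & (confidence < hi)
    if mask.any():
        print("conf [%.1f, %.1f): accuracy %.2f (n=%d)"
              % (lo, hi, correct[mask].mean(), mask.sum()))

A well-behaved confidence estimate yields per-bin accuracy that rises with confidence, which is the property claimed for DRAGON's predictions.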
-
Heng Yin, Professor, Department of Computer Science and Engineering, University of California, Riverside
Deep learning, particularly Transformer-based models, has recently gained traction in binary analysis, showing promising outcomes. Despite numerous studies customizing these models for specific applications, the impact of such modifications on performance remains largely unexamined. Our study critically evaluates four custom Transformer models (jTrans, PalmTree, StateFormer, Trex) across various applications, revealing that except for the Masked Language Model (MLM) task, additional pre-training tasks do not significantly enhance learning. Surprisingly, the original BERT model often outperforms these adaptations, indicating that complex modifications and new pre-training tasks may be superfluous. Our findings advocate for focusing on fine-tuning rather than architectural or task-related alterations to improve model performance in binary analysis.
Speaker's Biography: Dr. Heng Yin is a Professor in the Department of Computer Science and Engineering at the University of California, Riverside. He obtained his PhD from the College of William and Mary in 2009. His research interests lie in computer security, with an emphasis on binary code analysis. His publications appear in top technical conferences and journals, such as IEEE S&P, ACM CCS, USENIX Security, NDSS, ISSTA, ICSE, TSE, and TDSC. His research is sponsored by the National Science Foundation (NSF), the Defense Advanced Research Projects Agency (DARPA), the Air Force Office of Scientific Research (AFOSR), and the Office of Naval Research (ONR). In 2011, he received the prestigious NSF CAREER Award. He has also received the Google Security and Privacy Research Award, an Amazon Research Award, a DSN Distinguished Paper Award, and a RAID Best Paper Award.
-
Cryptographic function detection in binaries is a crucial task in software reverse engineering (SRE), with significant implications for secure communications, regulatory compliance, and malware analysis. While traditional approaches based on cryptographic signatures are common, they are challenging to maintain and often prone to false negatives in the case of custom implementations or false positives when short signatures are used. Alternatively, techniques based on statistical analysis of mnemonics in disassembled code have emerged, positing that cryptographic functions tend to involve a high frequency of arithmetic and logic operations. However, these methods have predominantly been formulated as heuristics, with thresholds that may not always be optimal or universally applicable.
In this paper, we present Mnemocrypt, a machine learning-based tool for detecting cryptographic functions in x86 executables, which we release as an IDA Pro plugin. Using a random forest classifier, Mnemocrypt leverages both structural and content-related metrics of functions at varying levels of granularity to make its predictions. The primary design goal of Mnemocrypt is to minimize false positives, as misleading results could lead analysts down incorrect investigative paths, undermining the efficacy of reverse engineering efforts. Trained on a diverse dataset of cryptographic libraries compiled with different optimization levels, Mnemocrypt achieves robust detection capabilities without relying on predefined signatures or computationally expensive data flow graph analysis, ensuring high efficiency.
Our evaluation, conducted on 231 Portable Executable x86 Windows malware samples from different families, demonstrates that Mnemocrypt, when configured with a high confidence threshold, significantly outperforms existing solutions in terms of false positives. The few false positives produced by Mnemocrypt involved only compression functions or complex data-processing routines, further underscoring the tool’s precision in distinguishing algorithms whose instruction mix resembles cryptographic processing. Finally, with a median execution time of six seconds, Mnemocrypt provides the reverse engineering community with a practical and efficient solution for identifying cryptographic functions, paving the way for further studies to improve this type of model.
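The statistical intuition behind mnemonic-based detection can be sketched in a few lines; the feature set (a normalized histogram of instruction mnemonics), the toy training data, and the labels below are invented stand-ins, not Mnemocrypt's actual features or dataset.

# Toy mnemonic-frequency classifier (NOT Mnemocrypt's real feature set):
# a random forest over normalized mnemonic histograms.
# Requires: pip install scikit-learn
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

MNEMONICS = ["mov", "xor", "rol", "shl", "add", "cmp", "call", "jmp"]

def features(mnemonics):
    # Normalized frequency of each tracked mnemonic within one function.
    counts = Counter(mnemonics)
    total = float(len(mnemonics))
    return [counts[m] / total for m in MNEMONICS]

# Invented examples: crypto-like code is heavy on xor/rol/shl.
train = [
    (["xor", "rol", "xor", "shl", "add", "xor"], 1),
    (["rol", "xor", "shl", "xor", "rol", "add"], 1),
    (["mov", "cmp", "call", "mov", "jmp", "mov"], 0),
    (["mov", "mov", "call", "cmp", "jmp", "call"], 0),
]
X = [features(m) for m, _ in train]
y = [label for _, label in train]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

candidate = ["xor", "shl", "rol", "xor", "mov", "xor"]
# predict_proba supports the high confidence threshold used to curb false positives.
print("P(crypto) = %.2f" % clf.predict_proba([features(candidate)])[0][1])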
-
Rachael Little, Dongpeng Xu (University of New Hampshire)
Software obfuscation is a form of code protection designed to hide the inner workings of a program from reverse engineering and analysis. Mixed Boolean-Arithmetic (MBA) is one popular form that obscures simple arithmetic expressions by transforming them into more complex equations involving both boolean and arithmetic operations. Most prior work has focused on developing strong MBA at the source code or expression level; however, whether these expressions remain resilient against compiler optimizations is still unknown. In this work, we carefully inspect the strength of MBA obfuscation after various compiler optimizations. We embed MBA expressions from several popular datasets into C programs and examine how they appear post-compilation using the GCC, Clang, and MSVC compilers. Surprisingly, we discover a notable trend of reduction in MBA size and complexity after compiler optimization. We report our findings and discuss how MBA expressions are affected by compiler optimizations.
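For intuition, a classic MBA rewrite (not necessarily one drawn from the paper's datasets) replaces x + y with an equivalent mix of boolean and arithmetic operators. The sketch below verifies the identity exhaustively over 8-bit operands.

# Classic MBA identity: x + y == (x ^ y) + 2*(x & y)  (mod 2^n).
def mba_add(x, y, bits=8):
    mask = (1 << bits) - 1
    return ((x ^ y) + 2 * (x & y)) & mask

for x in range(256):
    for y in range(256):
        assert mba_add(x, y) == (x + y) & 0xFF
print("identity holds for all 8-bit operand pairs")

When such an expression is embedded in a C program and compiled with optimizations enabled, the compiler may simplify it back toward a plain addition, which is the reduction in MBA size and complexity the paper observes.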
-
Caleb Helbling, Graham Leach-Krouse, Sam Lasser, Greg Sullivan (Draper)
This paper introduces cozy, a tool for analyzing and visualizing differences between two versions of a software binary. The primary use case for cozy is validating “micropatches”: small binary or assembly-level patches inserted into existing compiled binaries. To perform this task, cozy leverages the Python-based angr symbolic execution framework. Our tool analyzes the output of symbolic execution to find end states for the pre- and post-patched binaries that are compatible (reachable from the same input). The tool then compares compatible states for observable differences in registers, memory, and side effects. To aid in usability, cozy comes with a web-based visual interface for viewing comparison results. This interface provides a rich set of operations for pruning, filtering, and exploring different types of program data.
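To make the comparison concrete, the sketch below shows the underlying pattern in raw angr: execute both binaries to completion, then diff observable state. cozy's actual API wraps and extends this (notably its compatible-state matching); the binary paths, input, and the naive zip pairing here are placeholders.

# Simplified illustration of differential execution with raw angr
# (cozy's real API differs; paths and inputs are placeholders).
import angr

def end_states(path, argv):
    proj = angr.Project(path, auto_load_libs=False)
    simgr = proj.factory.simulation_manager(proj.factory.entry_state(args=argv))
    simgr.run()  # execute all paths to completion
    return simgr.deadended

pre  = end_states("./service_prepatch.bin",  ["./service", "AAAA"])
post = end_states("./service_postpatch.bin", ["./service", "AAAA"])

# zip is a crude stand-in for cozy's compatible-state matching; a real diff
# would pair states by the inputs that reach them and also compare memory.
for s_pre, s_post in zip(pre, post):
    if s_pre.posix.dumps(1) != s_post.posix.dumps(1):
        print("stdout diverges")
    ret_pre  = s_pre.solver.eval(s_pre.regs.rax)
    ret_post = s_post.solver.eval(s_post.regs.rax)
    if ret_pre != ret_post:
        print("return value diverges: %#x vs %#x" % (ret_pre, ret_post))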
-
Andrew Fasano, Zachary Estrada, Luke Craig, Ben Levy, Jordan McLeod, Jacques Becker, Elysia Witham, Cole DiLorenzo, Caden Kline, Ali Bobi (MIT Lincoln Laboratory), Dinko Dermendzhiev (Georgia Institute of Technology), Tim Leek (MIT Lincoln Laboratory), William Robertson (Northeastern University)
Firmware rehosting enables firmware execution and dynamic analysis. Prior rehosting work has taken a “one-size-fits-all” approach, where expert knowledge is baked into a tool and then applied to all input firmware. Penguin takes a new, target-centric approach, building a whole-system rehosting environment tailored to the specific firmware being analyzed. A rehosting environment is specified by a configuration file that represents a series of transformations applied to the emulation environment. The initial rehosting configuration is derived automatically by analyzing the filesystem of an extracted firmware image, providing target-specific values such as directories, pseudofiles, and NVRAM keys. This approach allows Penguin to rehost systems from a wide variety of vendors. In tests on 13,649 embedded Linux firmware images from 69 vendors and 8 architectures, Penguin was able to build working rehosting environments for 75% more firmware than the prior state of the art. We implement a configuration minimizer that finds the required transformations and show that most firmware images require only a small number of transformations, with variation across vendors.
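To make the configuration-as-transformations idea concrete, the sketch below shows what a derived rehosting configuration might look like, expressed as a Python dictionary; this structure is purely illustrative and does not follow Penguin's actual configuration schema.

# Purely illustrative rehosting configuration (NOT Penguin's actual schema):
# each entry is one transformation applied to the emulation environment,
# with values that would normally be derived from the extracted filesystem.
rehost_config = {
    "arch": "armel",
    "transforms": [
        {"type": "mkdir",      "path": "/var/run"},
        {"type": "pseudofile", "path": "/proc/cpuinfo",
         "contents": "model name : ARMv7"},
        {"type": "nvram_key",  "key": "lan_ipaddr", "value": "192.168.1.1"},
        {"type": "symlink",    "path": "/etc/TZ", "target": "/tmp/TZ"},
    ],
}

# A minimizer reruns the rehosted system while dropping transforms,
# keeping only those the firmware actually needs to boot.
for t in rehost_config["transforms"]:
    print("apply %-10s -> %s" % (t["type"], t.get("path", t.get("key"))))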
-
Sima Arasteh (University of Southern California), Pegah Jandaghi, Nicolaas Weideman (University of Southern California/Information Sciences Institute), Dennis Perepech, Mukund Raghothaman (University of Southern California), Christophe Hauser (Dartmouth College), Luis Garcia (University of Utah Kahlert School of Computing)
The software compilation process tends to obscure the original design of a system, making it difficult both to identify individual components and to discern their purpose simply by examining the resulting binary code. Although decompilation techniques attempt to recover higher-level source code from the machine code in question, they cannot fully restore the semantics of the original functions. Furthermore, binaries are often stripped of metadata, which makes it challenging to reverse engineer complex binary software.
In this paper, we show how a combination of binary decomposition techniques, decompilation passes, and LLM-powered function summarization can be used to build an economical engine that identifies modules in stripped binaries and associates them with high-level natural language descriptions. We instantiated this technique with three underlying open-source LLMs (CodeQwen, DeepSeek-Coder, and CodeStral) and measured its effectiveness in identifying modules in robotics firmware. This experimental evaluation involved 467 modules from four devices from the ArduPilot software suite and showed that CodeStral, the best-performing backend LLM, achieves an average F1-score of 0.68 with an online running time of just a few seconds.