Binary Analysis Research (BAR) Workshop 2022

Note: All times are in PDT (UTC-7) and all sessions are held in Rousseau Room.

Proceedings Frontmatter Opens a new window

Hide all

Sunday, 24 April

09:00 - 09:15

Welcome and Introductory Remarks

09:15 - 10:20

Keynote

All things Binary

Dr. Sergey Bratus, DARPA PI and Research Associate Professor at Dartmouth College

10:20 - 10:50

Break

Boardroom

10:50 - 11:40

Session: Dynamic Analysis on Binaries

DITTANY: Strength-Based Dynamic Information Flow Analysis Tool for x86 Binaries

Walid J. Ghandour, Clémentine Maurice (CNRS, CRIStAL)

Abstract

Paper

Slides

Dynamic dependence analysis monitors information flow between instructions in a program at runtime. Strength-based dynamic dependence analysis quantifies the strength of each dependence chain by a measure computed based on the values induced at the source and target of the chain. To the best of our knowledge, there is currently no tool available that implements strength-based dynamic information flow analysis for x86.

This paper presents DITTANY, tool support for strength-based dynamic dependence analysis and experimental evidence of its effectiveness on the x86 platform. It involves two main components: 1) a Pin-based profiler that identifies dynamic dependences in a binary executable and records the associated values induced at their sources and targets, and 2) an analysis tool that computes the strengths of the identified dependences using information theoretic and statistical metrics applied on their associated values. We also study the relation between dynamic dependences and measurable information flow, and the usage of zero strength flows to enhance performance.

DITTANY is a building block that can be used in different contexts. We show its usage in data value and indirect branch predictions. Future work will use it in countermeasures against transient execution attacks and in the context of approximate computing.
FitM: Binary-Only Coverage-GuidedFuzzing for Stateful Network Protocols

Dominik Maier, Otto Bittner, Marc Munier, Julian Beier (TU Berlin)

Abstract

Paper

Slides

Common network protocol fuzzers use complex grammars for fuzzing clients and servers with a (semi-)correct input for the server. In contrast, feedback-guided fuzzers learn their way through the target and discover valid input on their own. However, their random mutations frequently destroy all stateful progress when they clobber necessary early communication packets. Deeper into the communication, it gets increasingly unlikely for a coverage-guided fuzzer like AFL++ to explore later stages in client-server communications. Even combinations of both approaches require considerable manual effort for seed and grammar generation, even though sound input sources for servers already exist: their respective clients. In this paper, we present FitM, the Fuzzer in the Middle, a coverage-guided fuzzer for complex client-server interactions. To overcome issues of the State-of-the-Art, FitM emulates the network layer between client and host, fuzzing both server and client at the same time. Once FitM reaches a new step in a protocol, it uses CRIU’s userspace snapshots to checkpoint client and server to continue fuzzing this step in the protocol directly. The combination of domain knowledge gathered from the proper peer, with coverage guided snapshot fuzzing, allows FitM to explore the target extensively. At the same time, FitM reruns earlier snapshots in a probabilistic manner, effectively fuzzing the state space. We show that FitM can reach greater depth than previous tools by comparing found basic blocks, the number of client-server interactions, and execution speed. Based on AFL++’s qemuafl, FitM is an effective and low-effort binary only fuzzer for network protocols, that uncovered overflows in the GNU Inetutils FTP client with minimum effort.

11:40 - 13:20

Lunch

Lawn

13:20 - 14:20

Keynote

30 Years into Scientific Binary Decompilation: What We Have Achieved and What We Need to Do Next

Dr. Ruoyu (Fish) Wang, Assistant Professor at Arizona State University

14:20 - 14:30

Break

Boardroom

14:30 - 15:20

Session: Learning on Binaries

Detecting Obfuscated Function Clones in Binaries using Machine Learning

Michael Pucher (University of Vienna), Christian Kudera (SBA Research), Georg Merzdovnik (SBA Research)

Abstract

Paper

Slides

The complexity and functionality of malware is ever-increasing. Obfuscation is used to hide the malicious intent from virus scanners and increase the time it takes to reverse engineer the binary. One way to minimize this effort is function clone detection. Detecting whether a function is already known, or similar to an existing function, can reduce analysis effort. Outside of malware, the same function clone detection mechanism can be used to find vulnerable versions of functions in binaries, making it a powerful technique.

This work introduces a slim approach for the identification of obfuscated function clones, called OFCI, building on recent advances in machine learning based function clone detection. To tackle the issue of obfuscation, OFCI analyzes the effect of known function calls on function similarity. Furthermore, we investigate function similarity classification on code obfuscated through virtualization by applying function clone detection on execution traces. While not working adequately, it nevertheless provides insight into potential future directions.

Using the ALBERT transformer OFCI can achieve an 83% model size reduction in comparison to state-of-the-art approaches, while only causing an average 7% decrease in the ROC-AUC scores of function pair similarity classification. However, the reduction in model size comes at the cost of precision for function clone search. We discuss the reasons for this as well as other pitfalls of building function similarity detection tooling.
Beyond the C: Retargetable Decompilation using Neural Machine Translation

Iman Hosseini, Brendan Dolan-Gavitt (NYU)

Abstract

Paper

Slides

The problem of reversing the compilation process, decompilation, is an important tool in reverse engineering of computer software. Recently, researchers have proposed using techniques from neural machine translation to automate the process in decompilation. Although such techniques hold the promise of targeting a wider range of source and assembly languages, to date they have primarily targeted C code. In this paper we argue that existing neural decompilers have achieved higher accuracy at the cost of requiring language-specific domain knowledge such as tokenizers and parsers to build an abstract syntax tree (AST) for the source language, which increases the overhead of supporting new languages. We explore a different tradeoff that, to the extent possible, treats the assembly and source languages as plain text, and show that this allows us to build a decompiler that is easily retargetable to new languages. We evaluate our prototype decompiler, Beyond The C (BTC), on Go, Fortran, OCaml, and C, and examine the impact of parameters such as tokenization and training data selection on the quality of decompilation, finding that it achieves comparable decompilation results to prior work in neural decompilation with significantly less domain knowledge. We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.

15:20 - 15:50

Break

Boardroom

15:50 - 16:15

Session: Open Problems in Binary Analysis

The Inconvenient Truths of Ground Truth for Binary Analysis

Jim Alves-Foss, Varsha Venugopal (University of Idaho)

Abstract

Paper

Slides

The effectiveness of binary analysis tools and techniques is often measured with respect to how well they map to a ground truth. We have found that not all ground truths are created equal. This paper challenges the binary analysis community to take a long look at the concept of ground truth, to ensure that we are in agreement with definition(s) of ground truth, so that we can be confident in the evaluation of tools and techniques. This becomes even more important as we move to trained machine learning models, which are only as useful as the validity of the ground truth in the training.

16:15 - 16:30

Best Paper Award and Closing Remarks