Fuzzing Workshop 2022
Note: All times are in PDT (UTC-7) and all sessions are held in the Aviary Ballroom.
Sunday, 24 April
Do you fuzz your own program, or do you fuzz someone else's program? The answer to this question has vast consequences for how you view fuzzing. Fuzzing someone else's program is the typical adversarial "security tester" perspective, where you want your fuzzer to be as automatic and versatile as possible. Fuzzing your own code, however, is closer to a traditional tester's perspective, where you may assume some knowledge about the program and its context, but may also want to _exploit_ this knowledge - say, to direct the fuzzer to critical locations.
In this talk, I detail these differences in perspectives and assumptions, and highlight their consequences for fuzzer design and research. I also highlight cultural differences in the research communities, and what happens if you submit a paper to the wrong community. I close with an outlook into our newest frameworks, set to reconcile these perspectives by giving users unprecedented control over fuzzing, yet staying fully automatic if need be.
Andreas Zeller is faculty at the CISPA Helmholtz Center for Information Security and professor for Software Engineering at Saarland University, both in Saarbrücken, Germany. His research on automated debugging, mining software archives, specification mining, and security testing has won several awards for its impact in academia and industry. Zeller is an ACM Fellow, an IFIP Fellow, an ERC Advanced Grant Awardee, and holds an ACM SIGSOFT Outstanding Research Award.
Andrea Fioraldi (EURECOM), Alessandro Mantovani (EURECOM), Dominik Maier (TU Berlin), Davide Balzarotti (EURECOM)
AFL is one of the most used and extended fuzzing projects, adopted by industry and academic researchers alike. While the community agrees on AFL’s effectiveness at discovering new vulnerabilities and on its outstanding usability, many of its internal design choices remain untested to date. Security practitioners often clone the project “as-is” and use it as a starting point to develop new techniques, usually taking everything under the hood for granted. Instead, we believe that a careful analysis of the different parameters could help modern fuzzers to improve their performance and explain how each choice can affect the outcome of security testing, either negatively or positively.
The goal of this paper is to provide a comprehensive understanding of the internal mechanisms of AFL by performing experiments and comparing different metrics used to evaluate fuzzers. This will prove the efficacy of some patterns and clarify which aspects are instead outdated. To achieve this, we set up nine unique experiments that we carried out on the popular Fuzzbench platform. Each test focuses on a different aspect of AFL, ranging from its mutation approach to the feedback encoding scheme and the scheduling methodologies.
Our preliminary findings show that each design choice affects different factors of AFL. While some of these are positively correlated with the number of detected bugs or the target coverage, other features are related to usability and reliability. Most importantly, the outcome of our experiments will indicate which parts of AFL we should preserve in modern fuzzers.
Fuzzing is an effective software testing method that discovers bugs by feeding target applications with (usually a massive amount of) automatically generated inputs. Many state-of-the-art fuzzers use branch coverage as a feedback metric to guide the fuzzing process. The fuzzer retains inputs for further mutation only if branch coverage is increased. However, branch coverage only provides a shallow sampling of program behaviours and hence may discard inputs that might be interesting to mutate. This work aims at taking advantage of the large body of research on finer-grained code coverage metrics (such as mutation coverage) and at using these metrics as better proxies to select interesting inputs for mutation. We propose to make coverage-based fuzzers support most fine-grained coverage metrics out of the box (i.e., without changing fuzzer internals). We achieve this by making the test objectives defined by these metrics (such as mutants to kill) explicit as new branches in the target program. Fuzzing such a modified target is then equivalent to fuzzing the original target, but the fuzzer will also retain inputs covering the additional metrics objectives for mutation. We propose a preliminary evaluation of this novel idea using two state-of-the-art fuzzers, namely AFL++ (3.14c) and QSYM with AFL (2.52b), on the four standard LAVA-M benchmarks. Significantly positive results are obtained on one benchmark and marginally negative ones on the three others. We discuss directions towards a strong and complete evaluation of the proposed approach and call for early feedback from the fuzzing community.
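The core trick of making a metric objective explicit as a branch can be illustrated with a minimal, self-contained sketch (all names hypothetical, and a weak-mutation objective stands in for the metrics discussed above): the "mutant to kill" becomes an extra branch that any coverage-guided fuzzer will notice and retain inputs for, with no change to the fuzzer itself.

```python
# A toy weak-mutation objective made explicit as a new branch.
# The (hypothetical) mutant replaces `a + b` with `a - b`.

def original(a, b):
    return a + b

def instrumented(a, b, covered):
    # The "mutant to kill" encoded as a branch: it is taken exactly
    # when this input distinguishes the original from the mutant,
    # i.e. when the mutant is weakly killed. A coverage-guided fuzzer
    # retains any input taking this branch, out of the box.
    if (a + b) != (a - b):
        covered.add("kill_mutant_plus_to_minus")
    return a + b

covered = set()
instrumented(3, 0, covered)   # b == 0: mutant survives, branch not taken
instrumented(3, 2, covered)   # b != 0: branch taken, objective covered
```

Because the instrumented program is behaviourally identical to the original, fuzzing it is equivalent to fuzzing the original target, plus the extra retained inputs.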
Most fuzzing efforts, very understandably, focus on fuzzing the program in which bugs are to be found. However, in this paper we propose that fuzzing programs “near” the System Under Test (SUT) can in fact improve the effectiveness of fuzzing, even if it means less time is spent fuzzing the actual target system. In particular, we claim that fault detection and code coverage can be improved by splitting fuzzing resources between the SUT and mutants of the SUT. Spending half of a fuzzing budget fuzzing mutants, and then using the seeds generated to fuzz the SUT can allow a fuzzer to explore more behaviors than spending the entire fuzzing budget on the SUT. The approach works because fuzzing most mutants is “almost” fuzzing the SUT, but may change behavior in ways that allow a fuzzer to reach deeper program behaviors. Our preliminary results show that fuzzing mutants is trivial to implement, and provides clear, statistically significant, benefits in terms of fault detection for a non-trivial benchmark program; these benefits are robust to a variety of detailed choices as to how to make use of mutants in fuzzing. The proposed approach has two additional important advantages: first, it is fuzzer-agnostic, applicable to any corpus-based fuzzer without requiring modification of the fuzzer; second, the fuzzing of mutants, in addition to aiding fuzzing the SUT, also gives developers insight into the mutation score of a fuzzing harness, which may help guide improvements to a project’s fuzzing approach.
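The split-budget workflow described above can be sketched with a toy, fuzzer-agnostic driver (all names and the tiny SUT/mutant pair are hypothetical, and the "fuzzer" is a few-line stand-in for any corpus-based one): half the budget goes to a mutant of the SUT, and the corpus it produces seeds the second half spent on the SUT itself.

```python
# Toy sketch of split-budget fuzzing of mutants (hypothetical names).
import random

random.seed(1)  # make the toy run deterministic

def toy_fuzz(target, seeds, budget):
    """Minimal corpus-based fuzzer: keep any input that covers something new."""
    corpus, seen = [bytes(s) for s in seeds], set()
    for _ in range(budget):
        inp = bytearray(random.choice(corpus)) or bytearray(1)
        inp[random.randrange(len(inp))] = random.randrange(256)
        cov = target(bytes(inp))
        if cov - seen:          # new coverage: retain this input
            seen |= cov
            corpus.append(bytes(inp))
    return corpus, seen

def sut(data):                  # toy System Under Test
    cov = {"entry"}
    if data[:1] == b"A" and b"!" in data:
        cov.add("deep")
    return cov

def mutant(data):               # mutant of the SUT: first check negated,
    cov = {"entry"}             # so the "deep" behaviour is easier to reach
    if data[:1] != b"A" and b"!" in data:
        cov.add("deep")
    return cov

# Phase 1: half the budget on the mutant; Phase 2: the other half on the
# SUT, seeded with the corpus the mutant produced.
mutant_corpus, _ = toy_fuzz(mutant, [b"seed"], 1000)
final_corpus, coverage = toy_fuzz(sut, mutant_corpus, 1000)
```

Note the fuzzer is never modified: only the target binary and the seed corpus change between the two phases, which is what makes the approach applicable to any corpus-based fuzzer.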
Shoham Shitrit (University of Rochester) and Sreepathi Pai (University of Rochester)
Formal semantics for instruction sets can be used to validate implementations through formal verification. However, testing is often the only feasible method when checking an artifact such as a hardware processor, a simulator, or a compiler. In this work, we construct a pipeline that can be used to automatically generate a test suite for an instruction set from its executable semantics. Our method mutates the formal semantics, expressed as a C program, to introduce bugs in the semantics. Using a bounded model checker, we then check the mutated semantics against the original for equivalence. Since the mutated and original semantics are usually not equivalent, this yields counterexamples which can be used to construct a test suite. By combining a mutation testing engine with a bounded model checker, we obtain a fully automatic method for constructing test suites for a given formal semantics. We intend to instantiate this on a formal semantics of a portion of NVIDIA’s PTX instruction set for GPUs that we have developed. We will compare against our existing method of testing that uses stratified random sampling and evaluate effectiveness, cost, and feasibility.
Fuzzing is a highly effective technique that finds security vulnerabilities, stability bugs and correctness issues in a fully automated way. Over the last decade, it has rapidly evolved from being an experimental tool used by security teams to becoming a critical component of the software development life cycle and part of NIST’s standards for software verification. This talk will give insights into this journey of fuzzing innovation, from a dumb, blackbox testing technique to a smart, generational whitebox one, augmented with effective memory instrumentation. It will also shed light on the recent efforts to standardize fuzzer benchmarking and scaling research efforts in the community.
Abhishek Arya is a Principal Engineer and head of the Google Open Source Security Team. His team has been a key contributor to various security engineering efforts inside the Open Source Security Foundation (OpenSSF). This includes the Fuzzing Tools (Fuzz-Introspector), Supply Chain Security Framework (SLSA, Sigstore), Security Risk Measurement Platform (Scorecards, AllStar), Vulnerability Management Solution (OSV) and Package Analysis project. Prior to this, he was a founding member of the Google Chrome Security Team and built OSS-Fuzz, a highly scaled and automated fuzzing infrastructure that fuzzes all of Google and Open Source. His team also maintains FuzzBench, a free fuzzer benchmarking service that helps the community rigorously evaluate fuzzing research and make it easier to adopt.
While many real-world programs are shipped with configurations to enable/disable functionalities, fuzzers have mostly been applied to test single configurations of these programs. In this work, we first conduct an empirical study to understand how program configurations affect fuzzing performance. We find that limiting a campaign to a single configuration can result in failing to cover a significant amount of code. We also observe that different program configurations contribute differing amounts of code coverage, challenging the idea that each one can be efficiently fuzzed individually. Motivated by these two observations we propose ConfigFuzz, which can fuzz configurations along with normal inputs. ConfigFuzz transforms the target program to encode its program options within part of the fuzzable input, so existing fuzzers’ mutation operators can be reused to fuzz program configurations. We instantiate ConfigFuzz on three common, configurable fuzzing targets, and integrate their executions in FuzzBench. In our preliminary evaluation, ConfigFuzz nearly always outperforms the baseline fuzzing of a single configuration, and in one target also outperforms the fuzzing of a sequence of sampled configurations. However, we find that sometimes fuzzing a sequence of sampled configurations, with shared seeds, improves on ConfigFuzz. We propose hypotheses and plan to use data visualization to further understand the behavior of ConfigFuzz, and refine it, in the full evaluation.
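The transformation described above, encoding program options within part of the fuzzable input, can be sketched as follows (a minimal illustration with a hypothetical option set and encoding; the actual ConfigFuzz encoding may differ): the leading byte is decoded into configuration flags, and the remainder is passed through as the ordinary input, so a fuzzer's byte-level mutations simultaneously explore configurations and inputs.

```python
# Hypothetical sketch: first input byte = bit-set of program options,
# remaining bytes = the normal fuzzable input.

OPTIONS = ["--verbose", "--strict", "--utf8"]  # illustrative option set

def decode(data):
    """Split a fuzzable input into (decoded options, payload)."""
    if not data:
        return [], b""
    flag_byte, payload = data[0], data[1:]
    opts = [opt for i, opt in enumerate(OPTIONS) if flag_byte & (1 << i)]
    return opts, payload

# 0x05 = 0b101 sets bits 0 and 2 -> --verbose and --utf8
opts, payload = decode(b"\x05hello")
```

Because the options live inside the input bytes, an off-the-shelf mutator that flips the flag byte effectively switches the program configuration between executions.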
Shisong Qin (Tsinghua University), Fan Hu (State Key Laboratory of Mathematical Engineering and Advanced Computing), Bodong Zhao (Tsinghua University), Tingting Yin (Tsinghua University), Chao Zhang (Tsinghua University)
As the essential component responsible for communication, network services are security-critical, and it is vital to find vulnerabilities in them. Fuzzing is currently one of the most popular software vulnerability discovery techniques, widely adopted due to its high efficiency and low false positives. However, existing coverage-guided fuzzers mainly aim at stateless local applications, leaving stateful network services underexplored. Recently, some fuzzers targeting network services have been proposed but have certain limitations, e.g., insufficient or inaccurate state representation and low testing efficiency.
In this paper, we propose a new fuzzing solution NSFuzz for stateful network services. Specifically, we studied typical implementations of network service programs and figured out how they represent states and interact with clients, and accordingly propose (1) a program variable-based state representation scheme and (2) an efficient interaction synchronization mechanism to improve efficiency. We have implemented a prototype of NSFuzz, which uses static analysis to identify network event loops and extract state variables, then achieves fast I/O synchronization and efficient state-aware fuzzing via lightweight compile-time instrumentation. The preliminary evaluation results show that, compared with state-of-the-art network service fuzzers AFLNET and STATEAFL, our solution NSFuzz could infer a more accurate state model during fuzzing and improve the testing throughput by up to 50x and the coverage by up to 20%.
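The kind of service structure a variable-based state representation exploits can be shown with a toy event loop (hypothetical protocol and names; real targets would be C servers identified by NSFuzz's static analysis): the enum-like `state` variable updated inside the message loop is exactly what such a scheme would extract and feed back to the fuzzer as state coverage.

```python
# Toy stateful service loop (hypothetical protocol). The `state`
# variable is the program-variable-based state representation: its
# sequence of values over a message exchange serves as fuzzing feedback.

def serve(messages):
    state, states_seen = "INIT", []
    for msg in messages:                       # the network event loop
        if state == "INIT" and msg.startswith("HELLO"):
            state = "GREETED"
        elif state == "GREETED" and msg.startswith("AUTH"):
            state = "AUTHED"
        elif state == "AUTHED" and msg.startswith("QUIT"):
            state = "CLOSED"
        states_seen.append(state)              # observed state trace
    return states_seen
```

A state trace like `INIT -> GREETED -> AUTHED -> CLOSED` distinguishes message sequences that plain branch coverage may conflate, which is why tracking the variable's values yields a more accurate state model.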
Coverage-guided greybox fuzzers rely on feedback derived from control-flow coverage to explore a target program and uncover bugs. This is despite control-flow feedback offering only a coarse-grained approximation of program behavior. Data flow intuitively characterizes program behavior more accurately. Despite this advantage, fuzzers driven by data-flow coverage have received comparatively little attention, appearing mainly when heavyweight program analyses (e.g., taint analysis, symbolic execution) are used. Unfortunately, these more accurate analyses incur a high run-time penalty, impeding fuzzer throughput. Lightweight data-flow alternatives to control-flow fuzzing remain unexplored.
We present DATAFLOW, a greybox fuzzer driven by lightweight data-flow profiling. Whereas control-flow edges represent the order of operations in a program, data-flow edges capture the dependencies between operations that produce data values and the operations that consume them: indeed, there may be no control dependence between those operations. As such, data-flow coverage captures behaviors not visible as control flow and intuitively discovers more or different bugs. Moreover, we establish a framework for reasoning about data-flow coverage, allowing the computational cost of exploration to be balanced with precision.
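The difference between the two coverage notions can be made concrete with a hand-instrumented toy program (hypothetical names; DATAFLOW itself instruments real binaries): after two seeds have covered every control-flow edge, a third input adds no new edge, yet it exercises a new def-use pair, so only a data-flow-guided fuzzer would retain it.

```python
# Toy program, hand-instrumented for both control-flow edges and
# def-use pairs. `x_def` records which definition of x is live.

def run(a, b):
    """Return (control-flow edges, def-use pairs) exercised by one input."""
    edges, dus = set(), set()
    x, x_def = 0, "x@entry"                    # def of x at entry
    edges.add(("entry", "then1" if a else "skip1"))
    if a:
        x, x_def = 1, "x@then1"                # redefinition of x
    edges.add(("merge", "then2" if b else "skip2"))
    if b:
        dus.add((x_def, "x@use"))              # def-use pair at the use of x
    return edges, dus

e1, d1 = run(True, False)
e2, d2 = run(False, True)
e3, d3 = run(True, True)
# The third input adds no control-flow edge beyond the first two seeds,
# but it covers a def-use pair neither of them did.
assert e3 <= e1 | e2 and not d3 <= d1 | d2
```

This is the sense in which data-flow coverage captures behaviors "not visible as control flow": the pair (definition in the first branch, use in the second) only exists when both branches are taken together.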
We perform a preliminary evaluation of DATAFLOW, comparing fuzzers driven by control flow, taint analysis (both approximate and exact), and data flow. Our initial results suggest that, so far, pure control-flow coverage remains the best coverage metric for uncovering bugs in most targets we fuzzed (72% of them). However, data-flow coverage does show promise in targets where control flow is decoupled from semantics (e.g., parsers). Further evaluation and analysis on a wider range of targets are required.