Tobias Holl (Ruhr University Bochum), Leon Weiß (Ruhr University Bochum), Kevin Borgolte (Ruhr University Bochum)
Current best-practice recommendations for fuzzing research rely on standardized evaluation frameworks. While many aspects of these frameworks have been scrutinized, coverage collection and evaluation remain blind spots. We present two examples of discrepancies in coverage reported by FuzzBench, highlighting a gap in our understanding of the evaluation frameworks we depend on. In this work, we close this gap for FuzzBench by systematically examining its coverage analysis. We find issues in every stage of the analysis pipeline that can influence the reported coverage.
We propose a thorough experimental design to assess both the impact of each individual flaw and their combined influence on evaluation results. We will contextualize our results within the existing literature and discuss the implications for trust in previous fuzzing evaluations. With this work, we will strengthen confidence in the reliability of standardized fuzzing benchmarks and inform future research on improving their evaluation.
We will open-source our code after completing all experiments.