Peiwei Hu (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China), Ruigang Liang (Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China), Kai Chen (Institute of Information Engineering, Chinese Academy of Sciences, China)

Reverse engineering is essential in malware analysis, vulnerability discovery, and related tasks. Decompilers assist reverse engineers by lifting assembly to a high-level programming language, which greatly improves binary comprehension. However, decompiler output suffers from problems such as meaningless variable names, redundant variables, and a lack of comments describing the purpose of the code. Previous studies have shown promising performance in refining decompiler output by training models on huge datasets containing various decompiler outputs. However, even datasets that take a long time to construct cover only a limited set of real-world binaries, and performance degrades severely when facing binary migration.

In this paper, we present DeGPT, an end-to-end framework that optimizes decompiler output to improve its readability and simplicity and thereby helps reverse engineers understand binaries better. Large language models (LLMs) can mitigate the performance degradation thanks to the capability that comes from their large model size and training sets containing rich multi-modal data. However, this potential is difficult to unlock through one-shot use. We therefore propose a three-role mechanism, consisting of a referee (R_ref), an advisor (R_adv), and an operator (R_ope), to adapt the LLM to our optimization tasks. Specifically, R_ref provides the optimization scheme for the target decompiler output, R_adv gives the rectification measures based on that scheme, and R_ope inspects whether the optimization changes the original function semantics and reaches the final verdict on whether to accept the optimizations. We evaluate DeGPT on datasets containing decompiler outputs of various software, such as practical command-line tools, malware, a library for audio processing, and implementations of algorithms. The experimental results show that even on the output of the current top-level decompiler (Ghidra), DeGPT achieves a 24.4% reduction in the cognitive burden of understanding the decompiler output and generates comments, 62.9% of which provide practical semantics that help reverse engineers understand the binaries. Our user surveys also show that the optimizations significantly simplify the code and add helpful semantic information (variable names and comments), facilitating a quick and accurate understanding of the binary.
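The control flow of the three-role mechanism can be illustrated with a minimal Python sketch. This is not DeGPT's implementation: the `LLM` callable interface, the prompt strings, the `three_role_optimize` function, and the toy semantics check are all hypothetical names introduced for illustration. The sketch also assumes that the advisor returns rewritten code and that the operator falls back to the original output on rejection. A stub stands in for the model so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: (role instruction, payload) -> model reply.
LLM = Callable[[str, str], str]
# Hypothetical checker: (original, candidate) -> True if semantics look unchanged.
SemanticsCheck = Callable[[str, str], bool]


@dataclass
class Result:
    code: str       # code handed back to the reverse engineer
    accepted: bool  # operator's final verdict
    scheme: str     # optimization scheme proposed by the referee


def three_role_optimize(decompiled: str, llm: LLM,
                        semantics_preserved: SemanticsCheck) -> Result:
    # R_ref: propose an optimization scheme for the decompiler output.
    scheme = llm("referee: propose optimizations", decompiled)
    # R_adv: turn the scheme into concrete rectifications (assumed here
    # to be returned as rewritten code).
    candidate = llm("advisor: apply scheme: " + scheme, decompiled)
    # R_ope: check semantics and decide whether to accept the change.
    accepted = semantics_preserved(decompiled, candidate)
    return Result(candidate if accepted else decompiled, accepted, scheme)


def stub_llm(instruction: str, payload: str) -> str:
    # Stand-in for a real model: fixed scheme, one hard-coded rename.
    if instruction.startswith("referee"):
        return "rename variables"
    return payload.replace("iVar1", "length")


def same_return_count(original: str, candidate: str) -> bool:
    # Toy placeholder check, far weaker than a real semantic comparison.
    return original.count("return") == candidate.count("return")


SOURCE = "int f(char *p) { int iVar1 = strlen(p); return iVar1; }"
result = three_role_optimize(SOURCE, stub_llm, same_return_count)
rejected = three_role_optimize(SOURCE, stub_llm, lambda a, b: False)
```

Keeping the operator's check as a separate callable means the acceptance decision does not rely on the same model that produced the rewrite, which is the point of having a distinct role inspect the semantics.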
