Beyond Raw Bytes: Towards Large Malware Language Models

Luke Kurlandski (Rochester Institute of Technology), Harel Berger (Ariel University), Yin Pan (Rochester Institute of Technology), Matthew Wright (Rochester Institute of Technology)

Malware poses an increasing threat to critical computing infrastructure, driving demand for more advanced detection and analysis methods. Although raw-binary malware classifiers show promise, they are limited in their capabilities and struggle with the challenges of modeling long sequences. Meanwhile, the rise of large language models (LLMs) in natural language processing showcases the power of massive, self-supervised models trained on heterogeneous datasets, offering flexible representations for numerous downstream tasks. The success of these models is rooted in the size and quality of their training data, the expressiveness and scalability of their neural architecture, and their ability to learn from unlabeled data in a self-supervised manner.

In this work, we take the first steps toward developing large malware language models (LMLMs), the malware analog to LLMs. We tackle the core aspects of this objective, namely questions about data, models, pretraining, and finetuning. By pretraining a malware classification model with language modeling objectives, we improve downstream performance on diverse practical malware classification tasks by 1.1% on average and by up to 28.6%, indicating that these models could succeed raw-binary malware classifiers.
