Luke Kurlandski (Rochester Institute of Technology, Rochester New York USA), Harel Berger (Ariel University, Israel), Yin Pan (Rochester Institute of Technology, Rochester New York USA), Matthew Wright (Rochester Institute of Technology, Rochester New York USA)

Malware poses an increasing threat to critical computing infrastructure, driving demand for more advanced detection and analysis methods. Although raw-binary malware classifiers show promise, they are limited in their capabilities and struggle with the challenges of modeling long sequences. Meanwhile, the rise of large language models (LLMs) in natural language processing showcases the power of massive, self-supervised models trained on heterogeneous datasets, offering flexible representations for numerous downstream tasks. The success behind these models is rooted in the size and quality of their training data, the expressiveness and scalability of their neural architecture, and their ability to learn from unlabeled data in a self-supervised manner.

In this work, we take the first steps toward developing large malware language models (LMLMs), the malware analog to LLMs. We tackle the core aspects of this objective, namely, questions about data, models, pretraining, and finetuning. By pretraining a malware classification model with language modeling objectives, we were able to improve downstream performance on diverse practical malware classification tasks on average by 1.1% and up to 28.6%, indicating that these models could serve to succeed raw-binary malware classifiers.

View More Papers

From Underground to Mainstream Marketplaces: Measuring AI-Enabled NSFW Deepfakes...

Mohamed Moustafa Dawoud (University of California, Santa Cruz), Alejandro Cuevas (Princeton University), Ram Sundara Raman (University of California, Santa Cruz)

Read More

Select-Then-Compute: Encrypted Label Selection and Analytics over Distributed Datasets...

Nirajan Koirala (University of Notre Dame), Seunghun Paik (Hanyang University), Sam Martin (University of Notre Dame), Helena Berens (University of Notre Dame), Tasha Januszewicz (University of Notre Dame), Jonathan Takeshita (Old Dominion University), Jae Hong Seo (Hanyang University), Taeho Jung (University of Notre Dame)

Read More

The Heat is On: Understanding and Mitigating Vulnerabilities of...

Sri Hrushikesh Varma Bhupathiraju (University of Florida), Shaoyuan Xie (University of California, Irvine), Michael Clifford (Toyota InfoTech Labs), Qi Alfred Chen (University of California, Irvine), Takeshi Sugawara (The University of Electro-Communications), Sara Rampazzi (University of Florida)

Read More