Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder (Auburn University)

Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction—a critical problem in binary code analysis.
To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoderdecoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.

View More Papers

Understanding MPU Usage in Microcontroller-based Systems in the Wild

Wei Zhou, Zhouqi Jiang (School of Cyber Science and Engineering, Huazhong University of Science and Technology), Le Guan (School of Computing, University of Georgia)

Read More

Security Advice on Content Filtering and Circumvention for Parents...

Ran Elgedawy (The University of Tennessee, Knoxville), John Sadik (The University of Tennessee, Knoxville), Anuj Gautam (The University of Tennessee, Knoxville), Trinity Bissahoyo (The University of Tennessee, Knoxville), Christopher Childress (The University of Tennessee, Knoxville), Jacob Leonard (The University of Tennessee, Knoxville), Clay Shubert (The University of Tennessee, Knoxville), Scott Ruoti (The University of Tennessee,…

Read More

QMSan: Efficiently Detecting Uninitialized Memory Errors During Fuzzing

Matteo Marini (Sapienza University of Rome), Daniele Cono D'Elia (Sapienza University of Rome), Mathias Payer (EPFL), Leonardo Querzoni (Sapienza University of Rome)

Read More

A Formal Approach to Multi-Layered Privileges for Enclaves

Ganxiang Yang (Shanghai Jiao Tong University), Chenyang Liu (Shanghai Jiao Tong University), Zhen Huang (Shanghai Jiao Tong University), Guoxing Chen (Shanghai Jiao Tong University), Hongfei Fu (Shanghai Jiao Tong University), Yuanyuan Zhang (Shanghai Jiao Tong University), Haojin Zhu (Shanghai Jiao Tong University)

Read More