Osama Al Haddad (Macquarie University, Sydney, Australia), Muhammad Ikram (Macquarie University, Sydney, Australia), Young Choon Lee (Macquarie University, Sydney, Australia), Muhammad Ejaz Ahmed (Data61 CSIRO, Sydney, Australia)

As Security Operations Center (SOC) teams face challenges analyzing disparate threat feeds with varying amounts of information, Large Language Models (LLMs) are a promising technology that can scale vulnerability prioritization efforts. However, LLMs can only generate accurate responses if the data on which they are trained is of high quality. Recent literature suggests that a small and near-constant number of compromised training samples can affect the performance of LLMs of varying sizes. To investigate this possible phenomenon in a SOC environment, we evaluated LLM and Prompting Technique (PT) combinations to prioritize software vulnerabilities, using the Cybersecurity and Infrastructure Security Agency's Stakeholder-Specific Vulnerability Categorization (SSVC) framework. OpenAI ChatGPT 4o-mini, Anthropic Claude 3 Haiku, and Google Gemini Flash 1.5, across 12 PTs, were instructed to analyse 384 real-world vulnerability samples over three trials, and to return values for the four SSVC decision points (SDP). These vulnerabilities were classed as pre- or post-Knowledge Cutoff Date (KCD): pre-KCD vulnerabilities fall within the cutoff dates of all the investigated LLMs, and post-KCD vulnerabilities fall beyond the cutoff dates of all the investigated LLMs. For each trial, F1-scores were calculated for each LLM-PT-SDP-KCD combination. A harmonic mean was then calculated across the three trials to yield a single performance score for each LLM-PT-SDP-KCD combination. We found that LLMs tended to perform better on post-KCD vulnerabilities than on pre-KCD vulnerabilities, with Gemini Flash 1.5 the strongest performer overall in conjunction with the Chain of Thought and Few Shot PTs, particularly for the Exploitation SDP.
To explain this observation, we posit that the revisions made over the vulnerability prioritization life cycle amount to a type of data compromise in the training dataset: LLMs are hindered by older and interim reports and classifications of vulnerabilities, which impairs their ability to provide accurate software vulnerability classifications. In conclusion, we call for greater transparency in LLM training datasets for vulnerability prioritization tasks, as well as further exploration of methods to generate LLM training datasets optimized for vulnerability prioritization. Code, prompt templates and data are available here.
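The scoring procedure described in the abstract (an F1-score per trial, then a harmonic mean across the three trials) can be illustrated with a minimal sketch. This is not the authors' released code. The abstract does not state how F1 is averaged over the values of a decision point, so macro-averaging is assumed here, and the function names `macro_f1` and `combined_score` and the sample labels are hypothetical.

```python
from statistics import harmonic_mean


def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over all labels seen in truth or predictions.

    Assumption: macro-averaging; the abstract only says "F1-scores".
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def combined_score(y_true, trial_predictions):
    """Harmonic mean of the per-trial F1-scores for one combination."""
    per_trial = [macro_f1(y_true, preds) for preds in trial_predictions]
    # statistics.harmonic_mean returns 0 if any trial scored 0.
    return harmonic_mean(per_trial)


if __name__ == "__main__":
    # Hypothetical ground truth and three trials for one decision point.
    truth = ["active", "poc", "none", "active"]
    trials = [
        ["active", "poc", "none", "active"],
        ["active", "none", "none", "active"],
        ["poc", "poc", "none", "active"],
    ]
    print(combined_score(truth, trials))
```

The harmonic mean penalizes a single weak trial more heavily than an arithmetic mean would, so a combination only scores well if it is consistent across all three trials.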
