Shang Wang (University of Technology Sydney, Australia), Tianqing Zhu (City University of Macau, Macau SAR, China), Dayong Ye (City University of Macau, Macau SAR, China), Hua Ma (Data61, CSIRO, Australia), Bo Liu (University of Technology Sydney, Australia), Ming Ding (Data61, CSIRO, Australia), Shengfang Zhai (National University of Singapore, Singapore), Yansong Gao (School of Cyber Science and Engineering, Southeast University, China)

In modern Data-as-a-Service (DaaS) ecosystems, data curators such as data brokerage companies aggregate high-quality data from many contributors and monetize it for deep learning model providers. However, malicious curators can sell valuable data without informing its original contributors, which infringes on contributors' interests and violates the law. Intrusive watermarking is a state-of-the-art (SOTA) technique for protecting data copyright: it detects whether a suspect model carries a predefined pattern. However, existing approaches face numerous limitations: they struggle under low watermark injection rates (≤ 1.0%), degrade model performance, raise false positives, and are not robust against watermark cleansing.

This work proposes an innovative intrusive watermarking approach, dubbed DIP (Data Intelligence Probabilistic Watermarking), to support dataset ownership verification while addressing the limitations above. It applies a distribution-aware sample selection algorithm, embeds probabilistic associations between watermarked samples and multiple outputs, and adopts a two-fold verification framework that leverages both inference results and their distribution as watermark signals. Extensive experiments on 4 image and 5 text datasets demonstrate that DIP maintains the model's performance and achieves an average watermark success rate of 89.4% at a 1% injection budget. We further validate that DIP is orthogonal to various watermarked data designs and can seamlessly integrate their strengths. Moreover, DIP proves effective across diverse modalities (image and text) and tasks (regression), with strong performance on generation tasks in large language models. DIP exhibits robustness against various adversarial environments, including 3 based on data augmentation, 3 on data cleansing, 4 on robust training, and 3 on collusion-based watermark removal, where existing SOTA methods fail. The source code is released at https://github.com/SixLab6/DIP.
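The abstract's core idea of probabilistic associations and two-fold verification can be illustrated with a minimal sketch. This is not the paper's implementation (see the linked repository for that); the target distribution, thresholds, and function names below are hypothetical, and KL divergence stands in for whatever distribution test DIP actually uses. The sketch draws watermark labels for selected samples from a predefined distribution over multiple classes, then verifies a suspect model by checking both per-sample label agreement and the closeness of the predicted label distribution to the target.

```python
import math
import random
from collections import Counter

# Hypothetical target distribution over several output labels for
# watermarked samples (illustrative values, not from the paper).
TARGET = {0: 0.5, 1: 0.3, 2: 0.2}

def assign_watermark_labels(sample_ids, target=TARGET, seed=0):
    """Draw a watermark label for each selected sample from the target distribution."""
    rng = random.Random(seed)
    labels, probs = zip(*target.items())
    return {s: rng.choices(labels, weights=probs)[0] for s in sample_ids}

def kl_divergence(observed_counts, target):
    """KL(observed || target) over the watermark label set."""
    total = sum(observed_counts.values())
    kl = 0.0
    for label, p_t in target.items():
        p_o = observed_counts.get(label, 0) / total
        if p_o > 0:
            kl += p_o * math.log(p_o / p_t)
    return kl

def verify(predictions, expected, target=TARGET,
           match_thresh=0.8, kl_thresh=0.05):
    """Two-fold check: per-sample inference results AND their distribution."""
    matches = sum(predictions[s] == expected[s] for s in expected) / len(expected)
    observed = Counter(predictions.values())
    return matches >= match_thresh and kl_divergence(observed, target) <= kl_thresh
```

A model trained on the watermarked data should reproduce both the per-sample labels and the overall label distribution, so it passes both checks; a clean model that, say, predicts a single label for all probe samples fails the match-rate check even though it may partially overlap the target distribution.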
