Advertising and content blocking are an important part of improving the privacy, performance and overall pleasantness of the web. If you're reading this, you almost certainly have a content blocking tool installed. Popular content blocking tools rely on crowdsourced filter lists, and while they're demonstrably useful, they also suffer from several shortcomings: (i) they're easily circumvented, (ii) they break websites (and so must be overly conservative) and (iii) they rely on large numbers of users, and so do not “scale” to parts of the web with fewer users. This last shortcoming is particularly significant because people visiting non-English, non-global-language parts of the web often face higher data costs, and have lower incomes to pay for internet access.
In this talk I will present three research projects from Brave, and how we plan to use them to improve content blocking for all web users. Brave is building a best-of-breed content blocker, both in terms of depth (i.e. blocking types of harmful behaviors other tools miss) and breadth (i.e. providing high-quality blocking for users under-served by existing tools).
The research projects discussed in this talk improve advertising and content blocking in three ways. First, I'll present work on identifying privacy-harming scripts independent of the code unit they're delivered in. This approach allows us to measure how often advertisers evade existing blockers (by changing URLs, mixing malicious and benign code, etc.), and to build countermeasures. Second, I'll describe a machine-learning tool for predicting whether a content blocker “breaks” a website, in the subjective evaluation of a browser user. This tool will allow Brave to block aggressively without breaking sites. Third, I'll discuss a method to programmatically generate filter lists for under-served web regions using a novel image classifier and a Brave-developed system of deep browser instrumentation called PageGraph.
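To make the URL-rewriting evasion concrete, here is a minimal sketch of why URL-based filter rules are easy to circumvent. The rules and URLs below are hypothetical toy examples, not Brave's actual lists; real blockers use richer rule syntax, but the failure mode is the same: matching on where code comes from, rather than what it does.

```python
# Toy substring-based filter list (hypothetical rules for illustration).
AD_FILTERS = ["doubleclick.net", "/ads.js", "tracker"]

def is_blocked(url: str) -> bool:
    """Return True if the URL matches any filter rule."""
    return any(rule in url for rule in AD_FILTERS)

# A script served from a known ad URL is blocked...
print(is_blocked("https://example.com/ads.js"))        # True
# ...but the identical code served from a renamed path slips through.
print(is_blocked("https://example.com/assets/a1.js"))  # False
```

This is why the talk's first project matches on the behavior of the script itself rather than on its delivery URL.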