Earlier this month, DCMS launched the Online Safety Data Initiative (OSDI) - a 15-month project to test methodologies for improving access to higher-quality data. That data supports the development of technology that identifies and removes harmful and illegal content from the internet. In this blog post, we outline the key challenges to developing this technology, and how they have informed our priorities.
The OSDI aims to support the development of new online safety solutions by improving access to the online harms data they require. The initiative will do this by designing and prototyping new approaches that ensure trusted parties can access the data they need to build these solutions.
Harm data - the media and metadata that relates to harmful and illegal online content and behaviour - is the single most valuable resource in safety tech. All existing moderation models are based on this data, from URL sharing initiatives to hash-matching and other automated tooling. It’s also the most vital dependency for the development and testing of AI models to address particularly challenging and emergent forms of harm.
We’ve spent much of the past fortnight talking to safety tech companies, communications service providers (CSPs), civil society and academia about the data challenges they’ve faced while trying to make the internet safer and more trustworthy.
The 2 challenges that have been most frequently cited are scarcity of access to data and inconsistency in its quality.
Lack of data access
In most cases, lack of access to harm data is the primary barrier to growth and innovation for safety tech. This data is sensitive by nature and sometimes contains personal information, which necessitates a greater level of protection. Getting the security and privacy principles right, so that this data can be accessed safely, is paramount to the success of this project.
We’re in the process of mapping the current data landscape for a range of online harms - research that we’re keen to share with interested parties after its completion in March 2021. We’re already starting to see that online harms data is owned by a range of parties, but that collective efforts to share that data are easier and more developed for some forms of harm than for others.
Lack of quality data
The challenge of inconsistent data quality stems from the distributed nature of online harms and the range of individual approaches to platform moderation taken by CSPs.
Take extremism (as distinct from terrorism). What constitutes ‘extremism’ varies internationally. Even in the UK there’s no statutory definition. CSPs therefore have to make their own decisions about the extreme content that is uploaded to their platforms, often going beyond what a government could legally enforce in respect of the removal of harmful extremist material.
As a result, there is huge variance in how online harms are classified and how data is labelled and stored. These inconsistencies make it difficult to use this data to train new safety tech or test existing systems.
The team leading the Online Safety Data Initiative has significant experience in developing practical solutions to the legal, ethical and technical challenges of data science for online harms, which we will be bringing to bear as this project progresses.
But this initiative will not succeed without listening to and collaborating with a wide and diverse range of stakeholders across the global community who are working to find solutions to online harms. Crucially, it also depends on building on the great work this community has already done and is currently developing.
To do this effectively and to maintain the trust of the community we are representing, our work will be guided by the following 3 principles:
- Transparency in thought, decision making and action: we are committed to working in the open and, where appropriate, making our work and anything we develop available to those working to further develop safety technologies. We will couple this with independent oversight of our aims and activities at every stage of our work
- Diversity of thought and approach: we aim to explore a range of practical solutions for a variety of challenges faced by the safety tech community, which will require a whole-community approach to idea generation and the development of innovative solutions
- Privacy and security: we are holding ourselves to the highest standards in protecting public and proprietary data and in ensuring the security of anything we develop and of the work as a whole. We’ve begun a work programme to identify the foundational security and privacy standards and measures we need to deliver this work, which we’ll detail in a separate blog post. Crucially, we won’t be ingesting any data until those standards and principles are agreed
We are under no illusions about the size of the challenge we’re attempting to address, but we’re excited about the opportunity to test some novel approaches that could deliver measurable improvements in online safety technologies. If you have a view to offer, please get in touch with me or any other member of the team - we will make the time to talk to all interested parties.