The rise of prominent AI models such as ChatGPT and Stable Diffusion has brought the scale of commercial web scraping to the forefront of attention for content creators and researchers. Billions of web pages and images are used to train these models without content creators’ knowledge, sparking extensive criticism and even lawsuits against AI firms. Amid these debates, researchers and legal experts have proposed licensing as a potential way to mitigate content creators’ concerns and promote more responsible data reuse. However, it remains unclear which specific licensing terms will be effective and what sociotechnical environments are necessary to facilitate the use of licensing at scale.
On October 15, CLTC co-hosted a virtual workshop at the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) titled “Can Licensing Mitigate the Negative Implications of Commercial Web Scraping?” The workshop's co-organizers included Hanlin Li, assistant professor at the University of Texas at Austin and former CLTC postdoctoral scholar, and Nick Merrill, director of the Daylight Lab at CLTC.
The workshop featured lightning talks from researchers, content creators, and legal experts addressing questions about web scraping, data licensing, and Creative Commons:
- Bart De Witte and Sreekanth Mukku, Hippo AI Foundation: Regenerative AI in Healthcare: A Framework to Establish Digital Sovereignty through Free Data Flows
- Kat Walsh, Creative Commons: Generative AI and Creative Commons
- Michael Clemens, University of Utah: Data Scraping with Sound Judgment
- Yiwei Wu, UT Austin: A review of licensing discussion in NeurIPS dataset papers
- Scott Cambo, Responsible AI Collaborative, and Jesse Josua Benjamin, Lancaster University: Analyzing the Language of RAIL Clauses
The lightning talks were followed by a roundtable discussion led by Kyle Lo and Luca Soldaini of the Allen Institute for AI on “ImpACT, RAIL, and Beyond: Building an ecosystem of data licenses.” Participants discussed the limitations of licensing in governing online content and data, as well as opportunities for researchers and practitioners to build norms and standards that facilitate responsible data collection and sharing.