Princeton University Library workshop explores copyright issues in computational research using text and data mining

What legal issues do researchers face when using text and data mining in computational work? On March 26, Dave Hansen, Executive Director of Authors Alliance, delivered a talk at Princeton University on the existing law and policy around copyright and data mining. He highlighted ways for researchers who incorporate text data mining in their workflows to navigate the limitations and considerations involved in mining copyrighted works.

Hosted by Princeton University Library (PUL) and held in the Julis Romo Rabinowitz building, the workshop was an outgrowth of the Authors Alliance’s "Text and Data Mining: Demonstrating Fair Use" project. Launched in 2023 with support from the Mellon Foundation, this initiative aims to dismantle legal barriers hindering researchers from exercising their fair use rights, particularly concerning text data mining (TDM) research within the framework of existing regulatory exemptions.

Dave Hansen during his talk on copyright and legal issues held at Princeton University’s Julis Romo Rabinowitz building. Photo credit: Brandon Johnson.

“We want to be able to extract more from text than just what’s on the surface,” Hansen said, noting the importance of analyzing text to uncover language patterns across temporal and cultural boundaries. 

Jennifer Grayburn, Assistant Director of Research Data and Open Scholarship, invited Hansen to speak on copyright, a topic PUL regularly engages with. “Copyright is increasingly complicated for researchers engaging with data (and for us as a library providing it) given that data are often licensed or taken from the web using web scraping,” she said. 

Grayburn added, “From another perspective, PUL has openly accessible data of its own and we need to decide how we want to encourage or restrict people from using materials in our digital library and repositories for computational research or training AI models.”

Sourcing Matters

Hansen’s talk highlighted the Google Books project, which, by its 15th anniversary in 2019, had scanned more than 40 million unique titles. The project and the litigation that followed it established an important legal precedent showing that copyright law can support mass digitization and mass computational analysis of text.

TDM projects benefit from the precedent established by Google, as well as its corpus of digitized books, but legal complications due to digital locks and contractual restrictions can still hinder how those bodies of data can be used.

“Sourcing matters,” Hansen noted. “AI has changed the discussion of large corpora. Problems arise, for example, when data is mined from a body of books that was taken from less than legitimate sites.”

Hansen’s talk also covered the nuances of claims under the fair use doctrine, and the degree to which projects can be considered transformative of original, copyrighted work.

“Google’s NGram viewer allows users to see temporal changes in language,” Hansen explained, giving the example of a shift from treating the “United States” as a collective noun before the Civil War to treating it as a singular noun afterward.

Hansen said, “The court found Google’s use to be a highly transformative, new purpose. It’s appropriate to scan entire books given the purpose. No one is using these techniques as a substitute for buying the books.”

Still, aspects of content management, like digital rights management, can be applied on top of already copyrighted works, adding even more nuance to an already dense conversation. 

Authors Alliance, however, is working to smooth out some of the road bumps researchers and institutions are facing in the copyright landscape. Among the open questions the organization seeks to address are how to navigate licenses, how copyright and TDM apply to other forms of media (music, visual works, video games, streaming services), and whether fair use applies to bodies of source material that weren’t legal when they were created.

“I am a lawyer — I am not your lawyer,” Hansen joked at the start of his presentation. “But Authors Alliance does offer consultations for partner members if you need them.”

Grayburn added, “We’ll be prepared to support and answer questions from our researchers in this area as computational research and TDM become more common and lawsuits challenging the use of TDM work their way through the courts.”

Authors Alliance is a nonprofit organization that supports authors “who want to serve the public good by sharing their creations broadly.” The organization also works “to help authors understand and enjoy their rights and promote policies that make knowledge and culture available and discoverable.”

Princeton researchers with copyright questions can consult the Copyright at Princeton guide or reach out to copyright@princeton.edu.

Please review the Library’s Data and Computation workshops to learn more about upcoming programming on web scraping and other vendor tools related to TDM.

Princeton users can view a recording of Hansen's talk on Media Central.

Published on April 16, 2024

Written by Brandon Johnson, Communications Strategist

Media Contact: Stephanie Oster, Publicity Manager