New Study Reveals: OpenAI Models May Have "Memorized" Protected Content

A recent study indicates that OpenAI's models may have "memorized" portions of copyrighted material, adding weight to longstanding claims from writers, developers, and copyright owners. These parties have alleged that OpenAI used their creations—including books, code repositories, and essays—to train its AI systems without authorization.

OpenAI has frequently relied on the fair use doctrine, asserting that utilizing publicly accessible information for training purposes is covered under this exemption. Nevertheless, opponents contend that U.S. copyright legislation does not provide specific allowances for training datasets, leading to legal disputes surrounding this approach.

Approach for Recognizing "Memorized" Material

The research paper, co-authored by scholars from the University of Washington, the University of Copenhagen, and Stanford University, introduces a new technique for detecting training data "memorized" by AI models such as OpenAI's. The approach centers on "high-surprisal" words—words that are statistically unlikely to appear in a given context. For example, in the sentence "Jack and I remained completely motionless while the radar buzzed softly," "radar" is a high-surprisal word compared with more predictable alternatives like "motor" or "television."
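To make the idea of "surprisal" concrete, here is a minimal sketch. The probabilities below are invented for the example sentence, not taken from the study; surprisal is simply the negative log-probability of a word in its context, so rarer words score higher.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in bits: -log2 of the word's probability in context."""
    return -math.log2(p)

# Hypothetical next-word probabilities after "...while the ___ buzzed softly".
# "radar" is far less likely than everyday alternatives, so it is high-surprisal.
context_probs = {"motor": 0.30, "television": 0.20, "radar": 0.02}

for word, p in sorted(context_probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:10s} p={p:.2f}  surprisal={surprisal(p):.2f} bits")
```

A word like "radar" at probability 0.02 carries about 5.6 bits of surprisal, versus roughly 1.7 bits for "motor," which is what makes it a useful probe word.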

By removing high-surprisal words from passages of novels and New York Times articles, then asking models to predict the hidden words, the researchers found that OpenAI's GPT-4 showed signs of having memorized portions of widely read books and articles. In particular, GPT-4 appeared to have memorized passages from BookMIA, a dataset of copyrighted e-books, while showing a lower rate of memorization for New York Times content.
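The masking probe described above can be sketched in a few lines. This is an illustrative toy, not the study's code: `toy_predict` stands in for a real LLM call, and the pass/fail criterion (exact recovery of the masked word) is a simplification of the paper's method.

```python
def mask_word(passage: str, word: str) -> str:
    """Replace the target high-surprisal word with a [MASK] placeholder."""
    return passage.replace(word, "[MASK]", 1)

def memorization_hit(predict, passage: str, surprisal_word: str) -> bool:
    """True if the model recovers the masked rare word exactly --
    one signal that the passage may have been seen during training."""
    guess = predict(mask_word(passage, surprisal_word))
    return guess.strip().lower() == surprisal_word.lower()

def hit_rate(predict, examples) -> float:
    """Fraction of (passage, word) pairs where the masked word is recovered."""
    hits = sum(memorization_hit(predict, p, w) for p, w in examples)
    return hits / len(examples)

# Stand-in for querying the model under test; a real probe would send the
# masked passage to an LLM and take its top completion for the blank.
def toy_predict(masked_passage: str) -> str:
    return "radar" if "motionless" in masked_passage else "television"

examples = [
    ("Jack and I remained completely motionless while the radar buzzed softly.", "radar"),
    ("She switched off the television before bed.", "television"),
]
print(hit_rate(toy_predict, examples))
```

A model that consistently fills in improbable words from copyrighted passages is doing more than generalizing, which is why the researchers treat a high hit rate as evidence of memorization.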

Consequences for AI Openness and Responsibility

Abhilasha Ravichander, a PhD candidate at the University of Washington and co-author of the study, stressed the importance of greater openness in AI training methods. She highlighted that for AI systems to be reliable, they should undergo thorough audits and verification processes, particularly concerning their training datasets.

While OpenAI supports more lenient rules on the use of copyrighted material in model development, it has also put in place content licensing agreements and opt-out options for copyright owners. At the same time, OpenAI has lobbied governments worldwide to establish explicit fair-use rules tailored to AI training.

Featured image credit: macrovector via Freepik

