OpenAI is trying to negotiate a reduction in the number of files it would have to produce in a copyright case. Files belonging to former chief scientist Ilya Sutskever are among those in dispute. The Authors Guild’s lawsuit centers on claims that OpenAI trained an AI model on books without permission.
Lawyers for OpenAI argue that the latest demands, involving the company’s co-founder Ilya Sutskever and seven other current and former employees, are too large and numerous. The company, which is facing a high-profile copyright lawsuit, is seeking negotiations to reduce the number of documents that must be reviewed and disclosed.
In a letter to a judge filed Wednesday in New York federal court, OpenAI attorney Carolyn M. Homer said the files requested by the Authors Guild from eight additional people “contain more than 886,000 documents.” Homer said they would total several hundred gigabytes of data.
These eight “custodians” (people believed to hold evidence relevant to pretrial discovery) include former chief scientist and co-founder Sutskever, who left the company in May, and Jan Leike, a researcher who joined rival company Anthropic.
The lawsuit focuses on claims that OpenAI’s models were trained on books without the authors’ permission.
Homer also named OpenAI technical staff members Chelsea Voss, Shantanu Jain, and Jong Wook Kim, pre-training data lead Qiming Yuan, and former employees Andrew Mayne and Cullen O’Keefe as the other custodians in the dispute.
OpenAI has already agreed to produce documents from 24 custodians, but it opposes the proposed search parameters and the request to produce files from the eight new custodians, out of concern that this would significantly increase the resources required for the search.
According to OpenAI’s lawyers, the company’s search criteria for its 24 existing custodians would require examining “over 460,000 documents” totaling 359 gigabytes. Homer said that under the search terms proposed by the Authors Guild, OpenAI would have to review more than 1 million documents.
Under the search criteria proposed for the eight disputed custodians, the file size would exceed 375 gigabytes, more than the size of the files from the 24 custodians that the parties had already agreed to, Homer said.
The lawyer also said that, based on the proposed search terms, OpenAI estimated a 71% overlap between the documents of the eight disputed custodians and those of the 24 existing custodians.
OpenAI’s lawyers said that, given the “substantial number of hits” and concerns about high duplication rates, the company will continue trying to reach an agreement with the plaintiffs over the files of Sutskever and the other disputed custodians.
The dispute marks the latest development in an ongoing class action lawsuit filed against OpenAI by the Authors Guild, which supports authors. Unsealed documents reviewed by Business Insider this year show that the ChatGPT maker deleted two datasets, “books1” and “books2,” that were used to train an older AI model called GPT-3.
OpenAI also faces several other lawsuits over copyright infringement, including one filed against the company by the New York Times.
Lawyers for the Authors Guild said in a filing that the deleted “books1” and “books2” datasets could have included “more than 100,000 published books.”
OpenAI and the Authors Guild did not immediately respond to Business Insider’s requests for comment.