These companies have an insatiable appetite for online data to train models and present content in an easy-to-understand format. In mid-2023, social media veteran and IPO novice Reddit turned off the spigot and began charging some companies for access to its data.
Reddit has a growing repository of 19 years of user comments, making it an attractive resource for AI companies. The platform recently reported its first quarterly profit as a publicly traded company, thanks to data licensing deals it signed with OpenAI and Google over the past year.
Reddit CEO and co-founder Steve Huffman said the company had to stop giving away valuable data to the world’s biggest companies for free.
“This is an arms race,” he said at the Wall Street Journal’s Tech Live conference in October. “But we’re in talks with just about every company, so we’ll see where these things land.”
Reddit’s vast trove of data is organized by topic, relies on a voting system rather than an algorithm to rank the quality of content, and comes from people who tend to be candid, all of which works to the advantage of AI companies.
For the first nine months of 2024, the Reddit revenue category that includes licensing rose to $81.6 million from $12.3 million in the year-ago period.
Although data licensing revenue remains small compared to Reddit’s core advertising revenue, the rapid growth of this new category reveals a potentially lucrative business line with relatively high margins.
Diversifying away from its reliance on advertising while also tapping into AI-related markets has made Reddit attractive to investors looking for new exposure to the latest technology boom. Reddit stock has more than doubled in the past three months.
The source of Reddit’s newfound wealth is the burgeoning market for data useful for AI. Reddit’s willingness to sell its data to AI companies is notable because there is a limit to the amount of data that AI companies can gather for free or buy. Some executives and researchers say the industry’s need for high-quality text could outstrip supply within two years, slowing the development of AI.
AI companies need data to enable their apps to return accurate results to users’ prompts and search queries, and to respond in the conversational tone such apps are known for. Reddit’s text-centric platform and growing corpus of online human interactions fit this requirement.
“This is like manna from above,” said Ari Morcos, CEO of DatologyAI, a startup that curates data for AI training. “All you have to do is package it and hire salespeople.”
However, it’s unclear exactly how much financial value Reddit’s data licensing deals with AI companies will have in the future. Reddit did not disclose the terms of the agreements or how long they will last.
Selling data to AI companies doesn’t make sense for all social media platforms. Some, like Facebook’s parent company Meta and Elon Musk’s X, have their own AI models, while others primarily involve private conversations or discussions about specific topics. Reddit doesn’t build AI models like those created by its data-licensing customers, and it functions differently from most other services.
Reddit on Monday began testing an AI-powered search tool for its content using AI models from OpenAI and Google, a spokesperson said.
For example, Reddit users can upvote or downvote each other’s posts and comments, and they can earn so-called karma points by posting content that proves popular with other users. Votes and karma can serve as signals that help AI models distinguish likely high-quality content from low-quality content, said Jaime Sevilla, director of Epoch AI, an AI-focused research institute. In contrast, most other social platforms simply indicate how popular a post or comment is by its number of likes, or how popular a user is by the number of followers.
Another feature of Reddit is that most of its users are anonymous. People tend to be more honest and candid online when they don’t have to worry about embarrassing themselves or offending people they know by what they post, said Richard Lachman, an associate professor of digital media at Toronto Metropolitan University. The more authentic the content, the more useful it will be in training the AI, he said.
The same logic applies to Reddit’s diverse corpus, Lachman added. The platform is divided into more than 100,000 “subreddits” dedicated to all kinds of topics, from sports and religion to politics and animals. Many other social platforms, by contrast, cater to narrower groups of people, such as Discord for video game enthusiasts and Strava for fitness fans.
“Reddit is like a 24-hour buffet,” Lachman said.
As of October, Reddit.com was the fifth most-visited website in the United States, according to analytics firm SimilarWeb. More than 5.3 billion pieces of content were posted on Reddit in the first half of this year, a 20.5% increase compared to the second half of 2023, according to the company’s latest transparency report. However, this amount also includes content that Reddit does not provide to its data licensing partners, such as private messages and chats.
Morcos, who worked at Meta Platforms and Google’s DeepMind division before founding DatologyAI last year, said the amount of data Reddit sells to customers is most likely far less than the amount of data generated by other large social platforms. At the end of September, Reddit had 97 million daily users, while Snapchat had 443 million daily users.
Meanwhile, some news publishers, including the New York Times, have chosen to fight. They have sued OpenAI and its backer Microsoft, alleging that their content was used without permission to train artificial intelligence tools and generate answers for users. OpenAI has said the lawsuit is without merit.
News Corp., owner of the Wall Street Journal, has a content licensing partnership with OpenAI.
Email Sarah E. Needleman at Sarah.Needleman@wsj.com.