Reddit, the popular social media platform known for its topic-specific forums, holds a treasure trove of user-generated content that A.I. companies can use to train large language models. But the platform doesn’t take kindly to having its data used without permission. In a lawsuit filed yesterday (June 4), Reddit accused A.I. company Anthropic of scraping its site’s content without authorization. Describing Anthropic as a company that “bills itself as the white knight of the A.I. industry,” Reddit’s court filings argued that the startup is “anything but.”
Reddit’s archives, which span two decades of online discussions, make the site an especially valuable resource for human-generated text. This type of content is increasingly sought after by tech companies as their data pools—necessary for training A.I. models—begin to dwindle.
“Reddit’s vast corpus of public content has enormous utility, including as a potential source of inputs for training emerging large language A.I. technologies, like Anthropic’s Claude offering, and assisting A.I. technologies in generating answers to user queries,” said Reddit in the suit.
Co-founded in 2005 by Steve Huffman and his college roommate Alexis Ohanian, Reddit has more than 100 million daily active users who use the platform’s subreddits to ask questions, provide tips and share perspectives on various subjects. The company went public last year and currently has a market capitalization of $21.8 billion.