In a quiet move, OpenAI has introduced GPTBot, a web crawler that collects content from public websites to train its large language models. As word of the bot spread, many website owners and content creators began discussing how to keep GPTBot away from their content.

The Mechanics Behind GPTBot

Alongside the GPTBot support page, OpenAI published instructions for blocking the bot from scraping content. By adding a directive to a website’s robots.txt file, owners can tell the crawler not to access their pages. Yet, given how widely the web is already crawled by other entities, it remains uncertain whether blocking GPTBot alone will keep content out of large language model training data.
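
OpenAI’s documentation names the crawler’s user agent as “GPTBot,” so a site-wide block amounts to a short addition to robots.txt, roughly as follows:

    User-agent: GPTBot
    Disallow: /

Owners who want to permit partial access can instead pair Allow and Disallow rules for specific directories under the same user-agent entry.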

An OpenAI representative noted, “We periodically collect public data from the internet which may be used to improve the capabilities, accuracy, and safety of future models.” The representative further explained that OpenAI’s guidelines detail how to prohibit the collection bot from accessing content and that web pages are screened to exclude sources behind paywalls, sources known to accumulate personally identifiable information, or those with content that breaches OpenAI’s policies.

Content Creators Respond

Several digital publications, including The Verge, have already made the robots.txt change to keep OpenAI’s crawler away from their content. Writers and editors such as Casey Newton and Neil Clarke have also voiced concerns or announced blocks; Clarke, who edits the sci-fi magazine Clarkesworld, said on X (formerly Twitter) that he intends to block GPTBot.
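
For readers curious whether a particular site has put the block in place, its live robots.txt can be checked directly. The following is a minimal sketch using Python’s standard-library robots.txt parser; the helper name gptbot_allowed and the example URL are illustrative, not any publisher’s actual tooling:

    # Check whether a site's robots.txt currently permits GPTBot.
    from urllib.robotparser import RobotFileParser

    def gptbot_allowed(site: str) -> bool:
        """Return True if the site's robots.txt lets GPTBot fetch its root page."""
        base = site.rstrip("/")
        parser = RobotFileParser()
        parser.set_url(f"{base}/robots.txt")
        parser.read()  # fetches and parses the live robots.txt
        return parser.can_fetch("GPTBot", f"{base}/")

    if __name__ == "__main__":
        # The Verge is one of the sites reported to have added the block.
        print(gptbot_allowed("https://www.theverge.com"))

Because robots.txt is a voluntary convention, a result here only shows what a site has requested, not what any given crawler actually does.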

OpenAI’s Collaborative Moves

Simultaneously, OpenAI announced a $395,000 grant in partnership with New York University’s Arthur L. Carter Journalism Institute. Led by former Reuters editor-in-chief Stephen Adler, the program aims to help students use AI ethically in journalism. Tom Rubin, OpenAI’s chief of intellectual property and content, expressed enthusiasm for the new Ethics and Journalism Initiative, pointing to the challenges journalists face, particularly those created by the application of AI. The announcement, however, did not address public web scraping or the debates surrounding it.

Understanding Web Data Collection

The questions surrounding control over open internet content persist. Large language models and other AI systems have long been trained on vast collections of public data. Established datasets such as Google’s Colossal Clean Crawled Corpus (C4) and the nonprofit Common Crawl play pivotal roles in that process. Content already captured in such collections may be irreversibly baked into the training data of systems like OpenAI’s ChatGPT.

Legal Tangles Around Web Scraping

Recent rulings by the U.S. Ninth Circuit Court of Appeals affirmed that scraping publicly accessible data is legal, placing it outside the scope of the Computer Fraud and Abuse Act. Nonetheless, the practice has drawn criticism, especially in the context of AI training. In 2022, OpenAI faced legal challenges, including allegations of copyright infringement and potential privacy law violations. These controversies point to an evolving landscape in which the ethics and legality of data scraping face increasing scrutiny.

Potential Paths Forward

These discussions underscore the need for clearer boundaries and mutual understanding between AI companies and content creators. Some argue that an “opt-in” approach, in which permission is sought before scraping, would be more appropriate; others stress the importance of crediting and compensating creators for their work. As the debate continues, the balance between technological advancement and content rights remains a central question.