OpenAI's GPTBot: What Website Owners Need to Know
In a quiet move, OpenAI has introduced GPTBot, a web crawler that collects website content to train its large language models. As word of the bot spread, many website owners and content creators began discussing how to keep GPTBot away from their content.
When OpenAI added its GPTBot support page, it also published instructions for blocking the bot. By adding a rule to a website’s robots.txt file, owners can tell GPTBot not to crawl their content. Yet because the web is vast and is crawled frequently by many other entities, it remains uncertain whether blocking GPTBot alone will keep content out of large language model training data.
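As a rough illustration of the robots.txt approach described above, the sketch below holds a two-line rule that disallows the GPTBot user agent for the whole site, then uses Python’s standard-library robots.txt parser to check how a compliant crawler would read it. The helper name is_allowed and the example.com URL are hypothetical and appear here only for illustration; site owners should confirm the exact rule against OpenAI’s current documentation.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: a single group that disallows the GPTBot
# user agent everywhere on the site. Other crawlers are not mentioned,
# so they remain allowed by default.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a rule-abiding crawler with this user agent may fetch url.

    Hypothetical helper for illustration; it only interprets the rules and
    does not contact any website.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/some-post"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Other crawler allowed:", is_allowed(ROBOTS_TXT, "ExampleBot", page))
```

A check like this only shows how a well-behaved crawler should interpret the file. robots.txt is a request rather than an enforcement mechanism, and it does nothing about copies of a page already held in existing datasets.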
An OpenAI representative noted, “We periodically collect public data from the internet which may be used to improve the capabilities, accuracy, and safety of future models.” The representative added that OpenAI’s guidelines explain how to block the crawler, and that web pages are filtered to exclude sources behind paywalls, sources known to gather personally identifiable information, and content that violates OpenAI’s policies.
Several outlets, including The Verge, have already updated their robots.txt files to keep OpenAI’s crawler from collecting their content. Writers and editors such as Casey Newton and Neil Clarke have also voiced concerns or announced measures to stop GPTBot. Clarke, the editor of sci-fi magazine Clarkesworld, said on X (formerly Twitter) that he intends to block the bot.
At the same time, OpenAI announced a $395,000 grant in partnership with New York University’s Arthur L. Carter Journalism Institute. Led by former Reuters editor-in-chief Stephen Adler, the partnership aims to teach students how to use AI ethically in journalism. Tom Rubin, OpenAI’s chief of intellectual property and content, expressed enthusiasm for the new Ethics and Journalism Initiative and pointed to the challenges journalists face, particularly those arising from the use of AI. The announcement, however, did not address public web scraping or the debates around it.
Questions about who controls content on the open internet persist. Large language models and other AI systems have long relied on vast collections of public data for training. Established datasets such as Google’s Colossal Clean Crawled Corpus (C4) and the archives of the nonprofit Common Crawl play pivotal roles in that process. Content already captured in such collections may be irreversibly integrated into the training data behind products like OpenAI’s ChatGPT.
Recent rulings by the U.S. Ninth Circuit Court of Appeals held that scraping publicly accessible data is legal and does not violate the Computer Fraud and Abuse Act. Nonetheless, the practice has drawn criticism, especially in the context of AI training. In 2022, OpenAI faced legal challenges, including allegations of copyright infringement and potential privacy law violations. These controversies point to an evolving landscape in which the ethics and legality of data scraping face increasing scrutiny.
The debate underscores the need for clear boundaries and mutual understanding between AI companies and content creators. Some suggest that an “opt-in” approach, in which permission is sought before scraping, would be more suitable, while others stress the importance of recognizing and compensating creators for their content. As the discussion continues, the balance between technological advancement and content rights remains a central concern.