OpenAI's Controversial Move: GPTBot's Unwelcome Web Scraping and Its Implications
OpenAI has made a move that many website owners are unhappy about.
OpenAI is currently collecting information from the internet using GPTBot, a web crawler that gathers data from websites in order to train AI models transparently and responsibly, according to OpenAI’s statement.
On OpenAI’s website, instructions are provided on how to block GPTBot’s access to a site: add a User-agent: GPTBot line followed by a Disallow: / line to the robots.txt file. However, not many people are pleased with this. Why?
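For site owners who want to confirm that their robots.txt really does turn GPTBot away, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the example.com URL are made up for illustration, and the rules shown block GPTBot from the whole site while leaving other crawlers unaffected. Keep in mind that robots.txt is only a request: it works to the extent that the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# The two lines that ask GPTBot to stay away from the entire site.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the rules locally
    instead of downloading them from a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/my-article"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

Run against your own robots.txt content, the check should report that GPTBot is not allowed while search crawlers you still want, such as Googlebot, remain allowed.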
As a website owner, you have to actively block GPTBot’s access to your site; you are never asked to opt in to having your website included in their data collection program for training their AI. Essentially, by default, you are permitting GPTBot to access your website. Wait, when did I grant permission? Who authorized OpenAI to grant themselves that right?
It’s akin to a thief announcing to the entire village, “Hey, everyone, I’m going to visit each house to search around, and if anyone doesn’t want that, lock your doors.”
As per the announcement, OpenAI is using data sources from Wikipedia, government court records, books, and emails. Okay, agreed, those are public data sources accessible to anyone. But now, they are scanning the entire internet, and the story changes entirely.
You might ask: Why has no one complained about Google scanning these websites all this time?
Google is entirely different; Google scans your website to bring it to people, meaning to bring traffic to your website. This is a symbiotic relationship, a win-win. When you create a website, you have specific purposes, such as placing ads to make money or selling products. By directing users to your website, Google acknowledges it as your content, benefiting you.
OpenAI is not like that; in fact, it is quite the opposite. OpenAI does not acknowledge the websites it has used to build its models. For example, after doing his research, Professor A posts on his website: “The Earth is square.” OpenAI's model will absorb that statement but only say that the Earth is square, instead of attributing it: “According to Professor A’s website, the Earth is square.”
Some will argue: once it’s posted on the web, it’s public, so GPTBot can view it.
Not quite. Each individual or organization creates a website for specific purposes. People publish it for USERS, not robots. OpenAI’s robot doesn’t bring benefits like purchases or brand recognition. So, no one is happy about incurring additional server costs so that GPTBot can take content they painstakingly created. No one has the right to help themselves to someone else’s data without permission, and that includes OpenAI.
You create a website and write content for your users, your potential customers. You want users on your website; you do not want some company using your website as data to train its AI model. That model is their business product, and you have no obligation to contribute to it.