AI companies scrape websites against their owners' wishes

Websites that don't want to be indexed by Google and other search engines have long been able to use robots.txt files, which tell crawlers which pages they should stay away from. There's no law requiring crawlers to comply, but Google, Yahoo, Bing, and others have generally followed the convention.

Since OpenAI released ChatGPT and started the AI gold rush, robots.txt has also been used to ask AI companies not to collect all the content on a website to train their large language models. But AI companies don't appear to share the moral compass of search engine developers. Reuters reports that many companies simply choose to ignore the files and the wishes of website owners.
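For readers unfamiliar with the mechanism, here is a minimal sketch of what such a file looks like and how a well-behaved crawler might consult it, using Python's standard urllib.robotparser module. GPTBot is the user-agent name OpenAI publishes for its crawler; the site, the paths, the "SomeSearchBot" name and the is_allowed helper are hypothetical and only serve the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site: it asks GPTBot to stay
# away from everything and asks all other crawlers to skip /private/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com and these paths are placeholders, not real pages.
    for agent in ("GPTBot", "SomeSearchBot"):
        for path in ("/articles/1", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(agent, url))
```

The check is purely voluntary: nothing on the server enforces it, and a crawler that never looks at the file, or ignores the answer, can fetch the page anyway. That is why compliance comes down to the crawler operator's choice.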

This revelation follows a letter from TollBit, a company that brokers content licensing agreements between website publishers and AI developers. Wired previously accused Perplexity of ignoring robots.txt files on its own site and on other Condé Nast sites. According to Business Insider, OpenAI and Anthropic also ignore the files, despite previously saying they respect them.

Perplexity CEO Aravind Srinivas told Fast Company that the company's own crawlers do not ignore robots.txt files, but that it buys material from other companies that do. When the reporter asked whether the company would now ask its partner to start respecting the files, he replied, “It is complicated.”