Website publishers have recently leveled serious allegations against the AI startup Anthropic, accusing it of aggressive data scraping practices.
This automated process extracts data from websites without explicit permission from content owners, potentially violating those sites' terms of service and creating lasting repercussions for both publishers and AI companies.
Data scraping, while not necessarily illegal, becomes contentious when it infringes on the rights of content creators. Because scraping allows AI companies to train their models on potentially sensitive or proprietary content, publishers are increasingly wary.
Reaction and Actions from Freelancer.com
Freelancer.com, a prominent platform for freelancers and employers, has been particularly vocal in these accusations against Anthropic.
CEO Matt Barrie described the startup's data scraping activities as staggering, claiming that within four hours the website recorded 3.5 million visits from a crawler linked to Anthropic. Barrie stated that this volume is "probably about five times the volume of the number two".
Due to these disruptive activities, Freelancer.com has blocked Anthropic's crawler entirely. Barrie criticized the company for disrespecting internet protocols, describing the data scraping as “egregious.”
For the websites involved, such activity not only degrades performance but also cuts into revenue, as heavy automated crawler traffic can overload servers and slow the site for legitimate visitors.
iFixit: Not Polite Internet Behaviour
iFixit, a repair community and resource website, also alleged that Anthropic ignored the site's "do not crawl" directives specified in its robots.txt file.
Kyle Wiens, CEO of iFixit, reported that Anthropic's crawler hit the company's servers a million times in a single day, a figure that underscores the scale and disruptiveness of the scraping.
Robots.txt is a file that tells crawlers which pages of a site they may access; ignoring it raises compliance concerns and broader industry worries about whether established web protocols will be honored.
Although compliance with robots.txt is voluntary, it remains the primary mechanism for governing web crawlers, and disregarding it points to a troubling trend in the data scraping practices of some AI firms, Anthropic among them. A short illustration of how the mechanism works follows.
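As a minimal sketch (the crawler name "ExampleAICrawler" and the example.com URLs are hypothetical, not taken from either company's actual files), a site can disallow a specific crawler with two lines in robots.txt, and a well-behaved crawler can check those rules before fetching anything, for instance with Python's standard urllib.robotparser module:

    # Hypothetical robots.txt published by a site wishing to block one crawler:
    #   User-agent: ExampleAICrawler
    #   Disallow: /

    import urllib.robotparser

    # Download and parse the site's robots.txt file.
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url("https://www.example.com/robots.txt")
    parser.read()

    # Ask whether the hypothetical crawler may fetch a given page.
    # Nothing in the protocol enforces this; honoring the answer is voluntary.
    allowed = parser.can_fetch("ExampleAICrawler", "https://www.example.com/guides/")
    print("Crawling permitted:", allowed)

The check is purely advisory: a crawler that skips it, as publishers allege Anthropic's did, faces no technical barrier, which is why the dispute centers on norms and terms of service rather than on any enforcement mechanism built into the web itself.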