An AI crawler is a computer program that collects data from websites to train large language models. Due to the increased use of AI search and the need to collect training data, a number of new web scrapers have appeared on the internet, including Bytespider, PerplexityBot, ClaudeBot, and GPTBot.
Until 2022, the web was crawled mainly by traditional search engine bots such as GoogleBot, AppleBot, and BingBot, which follow decades-old principles of ethical content scraping and scheduling.
Aggressive AI bots, by contrast, not only violate content guidelines but also slow down website performance, add overhead, and pose security threats. Many websites and content portals have implemented anti-scraping and bot-restriction technologies to combat this. According to Cloudflare, a leading content delivery network provider, nearly 40% of the top internet domains, which receive 80% of AI bot visits, have moved to block AI crawlers.
Nasscom, India’s apex technology industry body, said these crawlers are particularly damaging to news publishers, whose content is used without attribution. “Unless using copyrighted data to train an AI model qualifies as fair use, that use is infringing,” Raj Shekhar, head of responsible AI at Nasscom, told ET. “The legal dispute between ANI Media and OpenAI is a wake-up call for AI developers to be mindful of IP (intellectual property) laws when collecting training data. Intellectual property experts should be consulted to ensure compliant data practices and avoid potential liability.”
“Scraping introduces significant overhead and impacts website performance. These bots interact with a site intensively, trying to collect every piece of content, and that slows performance down.”
According to Cloudflare’s analysis of the top 10,000 internet domains, three AI bots accounted for the highest share of websites visited: Bytespider (40.40%), operated by ByteDance, TikTok’s Chinese parent company; GPTBot (35.46%), operated by OpenAI; and ClaudeBot (11.17%), operated by Anthropic. Although these AI bots follow the rules, Cloudflare says the overwhelming majority of its customers choose to block them. Meanwhile, CCBot, developed by Common Crawl, scrapes the web to create open-source datasets that anyone can use.
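Blocking of this kind usually keys off the bot’s self-reported User-Agent header. As a minimal sketch of the idea (the blocklist and helper name here are illustrative, not any vendor’s actual rule set):

```python
# Illustrative list of AI-crawler User-Agent substrings a site might block
BLOCKED_AGENTS = ("bytespider", "gptbot", "claudebot", "ccbot")

def should_block(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a blocked AI crawler."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BLOCKED_AGENTS)

print(should_block("Mozilla/5.0 (compatible; GPTBot/1.1)"))    # True
print(should_block("Mozilla/5.0 (compatible; Googlebot/2.1)")) # False
```

In practice this check runs at the CDN or web-server layer, and it only stops bots that identify themselves honestly; a crawler that spoofs a browser User-Agent slips past it.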
Features of AI crawlers
Unlike traditional crawlers, AI crawlers target high-quality text, images, and videos that can enrich a training dataset. AI-powered crawlers are more intelligent than traditional search engine crawlers that “just crawl, collect data, and stop there,” Akamai’s Koh said. “Their intelligence is used not only to select data, but also to classify and prioritize it, because even after crawling, indexing, and scraping all the data, you still have to decide what the data will be used for,” he said.
Traditionally, web scraper bots follow the robots.txt protocol for guidance on what may be crawled and indexed. Traditional search engine bots like GoogleBot and BingBot comply with it and stay away from protected content. However, AI bots have been found to violate robots.txt directives in multiple instances. “Google and Bing follow predictable and transparent indexing schedules, so a website won’t be overwhelmed. For example, Google makes it clear how often a particular domain will be indexed, so companies can anticipate and manage potential performance impacts,” said Koh. “With newer, more aggressive crawlers, such as those driven by AI, the situation becomes less predictable. These crawlers do not necessarily operate on a fixed schedule, and their scraping activity can be far more intensive.”
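The robots.txt mechanism described above is purely advisory: a site publishes per-agent rules, and well-behaved crawlers check them before fetching a page. A minimal sketch using Python’s standard urllib.robotparser (the rules shown are an example policy, not any real site’s file):

```python
from urllib.robotparser import RobotFileParser

# Example policy: disallow two AI crawlers, allow everyone else
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# A compliant crawler calls can_fetch() before requesting a page
print(parser.can_fetch("GPTBot", "/articles/story"))     # False
print(parser.can_fetch("Googlebot", "/articles/story"))  # True
```

Nothing enforces these rules; as the article notes, some AI bots simply ignore them, which is why site operators fall back on server-side blocking.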
Koh warned of a third category of crawlers that are malicious in nature and use scraped data to commit fraud. According to Akamai’s State of the Internet study, more than 40% of all internet traffic comes from bots, and about 65% of that comes from malicious bots.
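Taken together, those two Akamai figures imply that malicious bots alone account for roughly a quarter of all internet traffic, as a quick back-of-the-envelope check shows:

```python
bot_share = 0.40        # share of all traffic that comes from bots (Akamai)
malicious_share = 0.65  # share of bot traffic that is malicious (Akamai)

# Malicious bots as a fraction of *all* internet traffic
print(f"{bot_share * malicious_share:.0%}")  # 26%
```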
You cannot block everything
However, experts say eliminating AI crawlers outright is not the answer, because websites still need to be discovered. If AI search becomes the new way of searching, websites must appear in those results to be found and to attract customers, they said. “Enterprises will be concerned about whether they are blocking crawling and bot activity that generates legitimate revenue, or allowing too much malicious activity to occur on their websites. It’s a very delicate balance, and they need to understand it,” Koh said.