ByteDance looks like it’s eager to make up for lost time when it comes to scraping the web for data needed to train its generative AI models.
The China-based parent company of video app TikTok released its own web crawler or scraper bot, dubbed Bytespider, sometime in April, according to research from Kasada, a company that specializes in bot management for companies with online data. The existence of the bot was also confirmed by Dark Visitors, which monitors scraper bots.
ByteDance’s bot has quickly become one of the most, if not the single most, aggressive scrapers on the internet, the research shows. It’s scraping data at a rate that’s many multiples of that of other major companies, such as Google, Meta, Amazon, OpenAI, and Anthropic, which use their own scraper bots to help create and improve their large language or multimodal models, known as LLMs or LMMs.
Sam Crowther, the CEO of Kasada, said that since Bytespider showed up, it has been scraping data at about 25 times the rate of GPTBot, which scrapes data for OpenAI’s ChatGPT platform and underlying models. Bytespider has been scraping at 3,000 times the rate of ClaudeBot, from Anthropic, which operates the Claude platform.
As the months have gone by, Bytespider has become even more aggressive, according to Kasada. Data shows huge spikes in scraping activity from Bytespider over each of the last six weeks.
Representatives of TikTok and ByteDance did not respond to emails seeking comment.
ByteDance’s aggressive scraping comes despite the possibility of TikTok being banned in the U.S. in the coming months. President Joe Biden has signed legislation that requires ByteDance to sell TikTok, due to national security concerns, or shut it down.
The Bytespider bot, much like those of OpenAI and Anthropic, does not respect robots.txt, the research shows. Robots.txt is a plain-text file that publishers can place at the root of a website that, while not legally binding in any way, is supposed to signal to scraper bots that they cannot take that website’s data.
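To make the robots.txt mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might consult a site's rules before fetching a page. The rules shown, the `example.com` URL, and the `is_allowed` helper are hypothetical and chosen only for illustration; they are not taken from Kasada's research or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: block Bytespider, allow everyone else.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    for bot in ("Bytespider", "GPTBot"):
        print(bot, is_allowed(ROBOTS_TXT, bot, page))
```

The check runs entirely on the crawler's side. Nothing in the file stops a bot that never performs it, which is why the signal depends on the crawler choosing to comply.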
Web scraping goes back decades, mainly by search engines to gather links to web pages. But the rise of generative AI tools has added a new dimension and made the practice a prime source of lawsuits and controversy. People and organizations whose work has been scraped argue their copyright is being infringed in the process. All of the models that underlie generative AI tools were trained on massive amounts of online data,…
Read the full original article at Fortune.