CCBot

CCBot is the web crawler operated by Common Crawl, a non-profit organisation that has been building open, freely available snapshots of the web since 2008. The Common Crawl dataset, which runs to petabytes of crawled web text, is one of the most widely used sources of pre-training data for large language models and has been incorporated into the training of models such as GPT-3, LLaMA, Mistral, and many others.

Webmasters can identify CCBot visits by its user-agent string and can block it via robots.txt, as shown in the example below. Because the Common Crawl dataset is publicly available and used by many different AI organisations, blocking CCBot may reduce a site’s representation in future open-source and academic LLMs, though it will not necessarily affect the training datasets of commercial model providers who license data separately. The ethical and legal questions around using web-crawled content for commercial model training have made CCBot a frequently discussed topic in AI and publishing circles since 2022.
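As a minimal sketch, a robots.txt entry that excludes CCBot from an entire site looks like the following; "CCBot" is the user-agent token Common Crawl documents for its crawler, and the directives are standard robots.txt syntax:

    User-agent: CCBot
    Disallow: /

In server logs, requests from the crawler typically carry a user-agent string beginning with "CCBot/" followed by a version number, which is how individual visits can be identified. A site that only wants to exclude part of its content can replace "/" with a narrower path prefix.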