
- Cloudflare has caught Perplexity scraping websites that explicitly block AI crawlers.
- Perplexity's AI crawlers concealed their identity and even used undisclosed IP addresses.
- The AI startup was caught doing so across tens of thousands of domains, making millions of requests per day.
Cloudflare has caught Perplexity red-handed: the startup has been scraping websites that do not want to be crawled by AI bots. Typically, AI answer engines like Perplexity or ChatGPT crawl websites across the internet and extract text, images, and other content to generate answers, often without obtaining permission.
Cloudflare has now published its research, claiming that Perplexity uses dubious tactics to circumvent restrictions by concealing its identity to scrape websites, despite websites explicitly opting out.
Cloudflare CEO Matthew Prince blasted Perplexity on X, writing, “Some supposedly ‘reputable’ AI companies act more like North Korean hackers. Time to name, shame, and hard block them.”
Unpermitted scraping, of course, hurts site traffic, which is why some websites have started using the ‘robots.txt’ file to curb AI’s free lunch. This file tells crawlers which pages a site allows them to crawl and which it doesn’t. But according to Cloudflare’s report, Perplexity appears to be ignoring robots.txt directives altogether.
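To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler would consult robots.txt, using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the example URL are illustrative assumptions, not taken from Cloudflare's report; only the two crawler names come from it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that opts out of Perplexity's declared crawlers
# while leaving the site open to everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative Chrome-on-macOS user agent string.
CHROME_ON_MAC = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

print(is_allowed("PerplexityBot", "https://example.com/article"))    # False
print(is_allowed("Perplexity-User", "https://example.com/article"))  # False
print(is_allowed(CHROME_ON_MAC, "https://example.com/article"))      # True
```

The file is purely advisory: nothing forces a crawler to run this check, and a crawler that presents a generic browser user agent is not matched by rules written for named bots.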
How Perplexity Pulled Off the Grand Theft Data
Cloudflare published the report after it received several complaints from customers who claimed that Perplexity still had access to their websites’ content, even though they had set restrictions in their robots.txt files and created Web Application Firewall (WAF) rules to prevent AI bots from scraping data.
In response to the complaints, Cloudflare created test domains with similar restrictions to observe Perplexity’s behavior. They found that Perplexity initially attempts to access sites using its regular crawlers, i.e., “PerplexityBot” or “Perplexity-User.” However, if the AI encounters restrictions, it switches its user agent, the identifier that tells a website what kind of browser and device is being used.
In Perplexity’s case, it masked itself as a Chrome browser on macOS. Moreover, Perplexity used “rotating” IP addresses that do not appear on the company’s published list of IP addresses used by its bots. Cloudflare’s report also mentions that Perplexity switches between autonomous system numbers (ASNs), the unique identifiers assigned to large networks on the internet.
Cloudflare mentions in its post, “This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.”
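The user-agent and IP checks described above can be sketched roughly as follows. This is not Cloudflare's detection method, which the company says relies on machine learning and network signals. The `classify` function and the IP range are hypothetical; the range is a reserved documentation block, not one of Perplexity's real addresses.

```python
import ipaddress

# Hypothetical stand-in for a crawler operator's published IP list
# (192.0.2.0/24 is reserved for documentation, not a real Perplexity range).
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

# Declared crawler names mentioned in Cloudflare's report.
DECLARED_AGENTS = ("PerplexityBot", "Perplexity-User")


def classify(user_agent: str, source_ip: str) -> str:
    """Label a request using only its user agent and source IP."""
    ip = ipaddress.ip_address(source_ip)
    declared = any(name in user_agent for name in DECLARED_AGENTS)
    listed = any(ip in network for network in PUBLISHED_RANGES)
    if declared and listed:
        return "declared crawler"
    if declared:
        return "declared agent from unlisted IP"
    if listed:
        return "listed IP with undeclared agent"
    return "indistinguishable from a regular visitor"


print(classify("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "192.0.2.10"))
# declared crawler
print(classify("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0.0.0", "198.51.100.7"))
# indistinguishable from a regular visitor
```

The second call shows why the reported tactic is hard to stop with simple rules: a browser-like user agent arriving from an unlisted address looks like any ordinary visitor to a filter based on these two signals alone.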
Not Perplexity’s First Rodeo
Perplexity was caught doing the same thing in June last year, ignoring paywalls and robots.txt files on websites. Back then, the company’s CEO, Aravind Srinivas, blamed it all on third-party crawlers the company was relying on. This time, according to Cloudflare’s report, the blame falls squarely on Perplexity itself.
In a statement to The Verge, Perplexity spokesperson Jesse Dwyer calls Cloudflare’s report a “publicity stunt.” He further adds that “there are a lot of misunderstandings in the blog post.” However, we are still waiting to hear more from Perplexity. Meanwhile, Cloudflare has delisted Perplexity as a verified bot and is rolling out new ways to block Perplexity from crawling websites.
It is also worth pointing out that Apple has been interested in buying Perplexity and was reportedly in early talks. However, following this report, the Cupertino giant may now reconsider its interest.