Crawlers Versus Content Cache
This article explains why bots and crawlers pose a substantial challenge to maintaining an effective content cache.
How the content cache works
In an ideal situation, the content cache would store all possible HTTP responses of a website for a very long duration. This would make it 100% effective in taking load off the application servers. But because cache memory is finite and content keeps changing, this is only possible for very small and very static websites, which are exactly the kind of website that benefits the least from a cache in the first place.
With a big, dynamic website, exactly the kind whose performance depends on a content cache, the cache will only ever be able to handle part of the visitor traffic. Even if its cache memory is sized generously, there’s going to be constant churn as cache objects go stale and need to be refreshed. Dynamic content, for example from visitor search queries, adds to the challenge because of the combinatorial explosion in possible response variations.
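To get a feel for the scale, here’s a rough back-of-the-envelope sketch in Python. All the numbers are made-up assumptions for the sake of illustration, not measurements from a real site.

    # Rough estimate of how many distinct cacheable responses a faceted
    # search page can produce. All numbers are illustrative assumptions.
    search_terms = 1_000   # distinct popular search terms
    category_filters = 20  # category facets a visitor can combine with a term
    sort_orders = 4        # sort options (relevance, price, date, ...)
    result_pages = 10      # paginated result pages per query

    variations = search_terms * category_filters * sort_orders * result_pages
    print(f"{variations:,} possible response variations")  # 800,000

Even with these modest assumptions, no realistically sized cache memory can hold all of these variations at once.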
Since the content cache frees up space for new responses using a Least-Recently-Used (LRU) algorithm, its memory reflects the current “hot content”: the content requested most often at any given time, driven by factors like links from the home page or from today’s email newsletter.
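Here’s a minimal Python sketch of that LRU idea (an illustration, not the code our content cache actually runs): every cache hit moves an entry to the “recently used” end, and when space runs out, the entry that hasn’t been requested for the longest time gets evicted.

    from collections import OrderedDict

    class LRUCache:
        """Toy LRU cache: evicts the least recently used entry when full."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()  # cache key -> cached response

        def get(self, key):
            if key not in self.entries:
                return None                # cache miss
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]       # cache hit

        def put(self, key, response):
            if key in self.entries:
                self.entries.move_to_end(key)
            self.entries[key] = response
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # drop least recently used

As long as visitors keep requesting the same hot content, those entries keep getting bumped back to the “recently used” end and are never evicted.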
How crawlers spoil the whole plan
Crawlers aren’t much interested in hot content. They got their name because they follow any and every link they can find (or even invent themselves), regardless of how old the linked content is. This has two effects, both detrimental.
First, they tend to request content that is not currently in the cache. In the case of search results, which are short-lived, the probability of them ever being cached is low to begin with. This means that the content cache will have to pass nearly every request from crawlers back to the application servers. This can add significant load to your application boxes. In the worst case, load spikes caused by a wave of bot requests can even render your freistilbox cluster temporarily unresponsive.
The second effect is even more insidious: the responses to those crawler requests can evict other content from cache memory to make room for themselves. And since the probability that a response just sent to a crawler will be requested again in the near future is fairly low, cache efficiency drops significantly.
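To make that drop tangible, here’s a tiny simulation that reuses the toy LRUCache sketched above. The cache size, URLs and traffic mix are invented assumptions for illustration, not measurements from our platform.

    import random

    def hit_ratio(cache, requests):
        """Replay a request stream against a cache and return its hit ratio."""
        hits = 0
        for url in requests:
            if cache.get(url) is not None:
                hits += 1
            else:
                cache.put(url, "response")  # miss: fetched from the app servers
        return hits / len(requests)

    random.seed(42)
    hot_pages = [f"/article/{n}" for n in range(100)]

    # Regular visitors mostly request the same 100 hot pages ...
    visitors = [random.choice(hot_pages) for _ in range(10_000)]
    # ... while a crawler walks 10,000 distinct archive URLs exactly once each.
    crawler = [f"/archive/{n}" for n in range(10_000)]

    mixed = visitors + crawler
    random.shuffle(mixed)

    print("visitors only:      ", hit_ratio(LRUCache(150), visitors))
    print("visitors + crawler: ", hit_ratio(LRUCache(150), mixed))

With visitor traffic alone, nearly every request is a cache hit. Once the crawler traffic is mixed in, every crawler request misses and hot pages regularly get evicted to make room, so the overall hit ratio drops sharply.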
How about robots.txt?
If you’ve identified an abusive crawler in your logs, it’s worth trying to block it via robots.txt. But don’t get your hopes up too much. With traditional search engine crawlers, there were grounds to trust that they’d follow these rules. But now that everyone and their dog is building AI applications, that implicit contract has gone out the window. Some crawlers ignore any guardrails and flood websites with requests as if there’s no tomorrow. That’s why we’re seeing more and more websites display a “Checking if you’re human” page before they let you through to the actual content.
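For illustration, such a robots.txt entry might look like this (“ExampleBot” is a placeholder; replace it with the user agent string you found in your logs):

    # Ask the crawler identifying itself as "ExampleBot" to stay away from
    # the search pages that cause the most cache misses.
    # Using "Disallow: /" instead would ask it to skip the site entirely.
    User-agent: ExampleBot
    Disallow: /search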
If a bot or crawler doesn’t follow the rules you’ve set in robots.txt, we recommend using the freistilbox ACL feature to stop it right at the edge of our hosting platform.