ModularWebCrawler Functionality

Handling Redirection (3xx Status Codes)

Site owners often redirect their users from one page to another. This happens behind the scenes, and more often than not the user is not even aware of having been redirected.

Technically, a redirect can be set up in several ways. One of the most common is the .htaccess file in the root of your web server (if you use the Apache web server).

How should a web crawler behave when it finds a redirected page?

How should MWC handle redirects?

One of the main principles MWC follows is simplicity. To achieve it, MWC takes an opinionated approach: following redirects is not configurable, it is always on. But because MWC also strives to be speedy, redirects won’t be followed automagically behind the scenes.
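To illustrate what "not following redirects behind the scenes" can look like, here is a minimal sketch using the JDK's own java.net.http.HttpClient. The source does not say which HTTP client MWC uses, so this is an assumption for illustration only; the class RedirectDetectionSketch and its method names are hypothetical and not part of MWC.

```java
import java.net.URI;
import java.net.http.HttpClient;

// Hypothetical helper, not part of MWC: shows how a crawler can see 3xx
// responses itself instead of letting the HTTP client follow them silently.
final class RedirectDetectionSketch {

    // Build a client that never follows redirects on its own,
    // so the 3xx response is returned to the caller.
    static HttpClient newNonFollowingClient() {
        return HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NEVER)
                .build();
    }

    // True for the status codes that redirect to a Location header.
    static boolean isRedirect(int statusCode) {
        switch (statusCode) {
            case 301: case 302: case 303: case 307: case 308:
                return true;
            default:
                // For example, 304 Not Modified is a 3xx code but not a redirect.
                return false;
        }
    }

    // Resolve a Location header value (absolute or relative)
    // against the URL that was originally fetched.
    static URI resolveTarget(URI fetched, String locationHeader) {
        return fetched.resolve(locationHeader).normalize();
    }
}
```

With a client built this way, the crawler receives the 3xx response itself. It can then decide what to do with the resolved target, rather than having the HTTP layer fetch it silently.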

Any redirect target is treated as a regular URL that has just been added to the crawler. It is added to the frontierDB, so if the crawler later finds that URL again, it will recognize it and won’t attempt to recrawl it. It is also passed through all of the user-defined filters, so it might be dropped along the way.

How does MWC achieve URL redirection?

When MWC fetches a page and gets a 3xx HTTP redirect status code, it does the following:

  • Sends the redirect target URL through the user-defined filters, as it might fail one of them
  • Adds the redirect target URL to the frontierDB (so it will be recognized if it is found on another page) while keeping the depth of the original URL, because conceptually the user regards the original URL and the redirect target as being at the same depth
  • MWC still allows the user to handle the redirection differently by overriding the handleRedirection(…) method
  • This approach avoids redirect loops: if at any stage a redirect points to a URL that was already visited, it is dropped when passing through the filters
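
The steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not MWC's actual code: the class RedirectHandlingSketch, the CrawlUrl type, the schedule and next methods, and the signature given to handleRedirection are all hypothetical. An in-memory set stands in for the frontierDB, and the already-visited check is modeled as part of scheduling.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch of the redirect flow described above; not MWC's real classes.
class RedirectHandlingSketch {

    // A URL paired with its crawl depth (illustrative type).
    static final class CrawlUrl {
        final String url;
        final int depth;

        CrawlUrl(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    private final List<Predicate<String>> filters;          // stand-in for the user-defined filters
    private final Set<String> seen = new HashSet<>();       // stand-in for the frontierDB
    private final Queue<CrawlUrl> queue = new ArrayDeque<>();

    RedirectHandlingSketch(List<Predicate<String>> filters) {
        this.filters = filters;
    }

    // Returns true if the URL passed every filter, was not seen before, and was queued.
    boolean schedule(String url, int depth) {
        for (Predicate<String> filter : filters) {
            if (!filter.test(url)) {
                return false; // dropped by a user-defined filter
            }
        }
        if (!seen.add(url)) {
            return false; // already known, which is what breaks redirect loops
        }
        queue.add(new CrawlUrl(url, depth));
        return true;
    }

    // Overridable hook: by default the redirect target is treated like any
    // other URL and inherits the depth of the URL that redirected to it.
    protected boolean handleRedirection(CrawlUrl original, String targetUrl) {
        return schedule(targetUrl, original.depth);
    }

    // Next URL to crawl, or null when the queue is empty.
    CrawlUrl next() {
        return queue.poll();
    }
}
```

In this sketch, a chain such as A redirecting to B and B redirecting back to A stops at the second hop, because A is already in the seen set when B's redirect is scheduled.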