ModularWebCrawler Functionality

Handling Web Pages With Broken Certificates

In recent years more and more sites use SSL certificates to encrypt the communication with their visitors; the certificate is what lets their URLs use the “https” prefix instead of plain “http”.

Browsers notice the encryption status and add a lock or some other icon to hint to users that they are on a safe web page, so they can enter their credit card details and buy whatever they want, as the communication is encrypted and nobody can listen in on the line.

If a site has a certificate but something is wrong with it (maybe it has expired, or maybe it doesn’t match the site’s address exactly, for example it doesn’t cover the www prefix), most browsers will alert you, and some of them (Chrome, for example) will block you from accessing the site by default (Advanced -> I trust…).

How should a web crawler handle these pages? Should it access them like normal pages? Should it block access to them, like Chrome’s default behaviour?

ModularWebCrawler’s approach is simple: crawl as many links as possible while exposing the user to minimal risk. Most web crawlers leave the decision to the user, but I believe in an opinionated approach rather than a blank one. Of course, anyone who wants to can configure the specifics using the CrawlerConfig object, but whoever keeps the default behaviour gets the following configuration out of the box:

  • Regular crawling will include web pages whose SSL certificate is broken, as the risk is minimal to non-existent in a web crawler (home page, bookmarks, default search engine, toolbar: none of these exist in a web crawler).
  • Authenticated crawling, meaning crawling places where MWC submits a username/password, will block pages with a broken SSL certificate and not crawl them, so as not to leak the credentials to a MITM attack.

These simple rules are the default behaviour, but they can easily be tweaked to suit the user’s requirements.
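
To make the two default rules concrete, here is a minimal sketch in Python. It is not MWC's actual code: `CertPolicy`, its field names and `should_fetch` are hypothetical names I made up for illustration, standing in for whatever the CrawlerConfig object really exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertPolicy:
    """Hypothetical stand-in for the certificate-related crawler settings."""
    allow_broken_cert_anonymous: bool = True       # regular crawling
    allow_broken_cert_authenticated: bool = False  # crawling with credentials


def should_fetch(cert_is_valid: bool, sends_credentials: bool,
                 policy: CertPolicy = CertPolicy()) -> bool:
    """Decide whether a page may be fetched, given the state of its certificate."""
    if cert_is_valid:
        return True  # nothing wrong with the certificate, always crawl
    if sends_credentials:
        # Broken certificate plus a username/password: possible MITM, so block by default.
        return policy.allow_broken_cert_authenticated
    # Broken certificate but nothing secret is sent: crawl by default.
    return policy.allow_broken_cert_anonymous
```

With the defaults, `should_fetch(False, False)` lets the page through while `should_fetch(False, True)` blocks it; passing a different `CertPolicy` is the "tweak" mentioned above.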

P.S. Crawling sites with a broken SSL certificate is about as risky as crawling sites with no SSL certificate at all (http://…); there is no substantial difference between the two.


Optimizing Page Fetching

Fetching (downloading) data from an online site is expensive and very time consuming. It doesn’t take much storage or CPU/GPU; the main resource consumed is the network.

There is a high overhead just for making the actual connection to the remote site (from DNS resolution to the handshake and maintaining the connection), and then there is the actual download of the bytes to your computer.

A web crawler must fetch lots of pages, but there are ways to optimize its work.

Optimization #1 – Prefetching

The first thing to do is to determine the content type of the online page: is it text based or is it binary data?
Text pages are the bread and butter of web crawlers, as they serve two main purposes. The first and obvious one is to return them to the user, who can parse them or do whatever he wants with them, as that is the purpose of the crawl in most cases (crawl the web in order to find X, Y or Z). The second purpose of these text pages (they can be text/json/csv/xml/html or any other text format) is for the web crawler itself to parse them and collect all of the URLs it can find in them, so it can continue its crawl with new seeds.

Binary data like mp3, avi, mpeg or any other file format you can’t read in Notepad is another matter. There is no simple way to parse these files for links (yes, there are ways, but they are CPU intensive and not worth the crawler’s time and energy), so these files are worthless to the web crawler itself. They might, however, be exactly what the user is searching for (for example, a user might start a web crawl for all the mp3 files on a specific site).

The downside of fetching these files for the user is that they can be quite big. It is not uncommon to find picture and sound files all over the web weighing several megabytes, or other files weighing hundreds of megabytes. A file like this can take so much bandwidth that it almost halts all other activity of the web crawler for several minutes, so optimization is in order here.

The most basic optimization (so basic that one might not even call it an optimization) is a configuration flag which disables the downloading of any binary file when the user doesn’t need them.

MWC implements this feature by running a regular expression over every link to try to extract a file extension. If one is found, it is compared to a list of binary file extensions, and if there is a match, the file won’t be downloaded. Boom! A huge optimization from a single flag.
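
A rough sketch of such an extension check, in Python. Everything here is an assumption made for illustration: the regular expression, the (very partial) extension list and the names `looks_binary` and `should_download` are mine, not MWC's.

```python
import re
from urllib.parse import urlsplit

# Illustrative list only; the real list of binary extensions is not given here.
BINARY_EXTENSIONS = {"mp3", "avi", "mpeg", "jpg", "png", "gif", "zip", "exe"}

# Captures the extension at the end of the URL path, e.g. "mp3" in "/music/song.mp3".
EXTENSION_RE = re.compile(r"\.([A-Za-z0-9]{1,5})$")


def looks_binary(url: str) -> bool:
    """Guess from the URL alone whether it points at a binary file."""
    path = urlsplit(url).path  # ignore the query string and fragment
    match = EXTENSION_RE.search(path)
    return bool(match) and match.group(1).lower() in BINARY_EXTENSIONS


def should_download(url: str, download_binaries: bool) -> bool:
    """Apply the single configuration flag: skip binary-looking links when it is off."""
    return download_binaries or not looks_binary(url)
```

Note that a link such as `http://example.com/download?file=song.mp3` has no extension in its path, so this check lets it through even though it serves a binary file.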

This is not a flawless check. Most binary files won’t be downloaded, but a very small number of text files whose URL happens to look like that of a binary file will be skipped (it doesn’t happen much, but it might), and binary files whose URL doesn’t match the regular expression will slip through the filter and be downloaded, wasting bandwidth. There is not much to do about the latter, but after any download a small and speedy procedure reads the bytes of the downloaded object and gives a much more accurate estimate of the file type (binary or text). Only the text-based files are then parsed for links, which saves parsing CPU.
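
To give a feel for what "reading the bytes" can mean, here is a deliberately naive heuristic in Python. It is a stand-in of my own, not the detection MWC performs: the function name, the 1024-byte sample and the 30% threshold are all arbitrary assumptions, and a real content detector is far more thorough.

```python
# Bytes that are plausible in text: common control characters (bell, backspace,
# tab, LF, FF, CR, ESC), printable ASCII, and high bytes used by UTF-8 or Latin-1.
TEXT_BYTES = bytes({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x7F)) | set(range(0x80, 0x100)))


def looks_like_text(data: bytes, sample_size: int = 1024) -> bool:
    """Crude text-versus-binary guess based on the first bytes of a download."""
    sample = data[:sample_size]
    if not sample:
        return True  # empty body: nothing to misparse, treat it as text
    if b"\x00" in sample:
        return False  # NUL bytes practically never occur in text files
    # Delete every "texty" byte; whatever is left is an unusual control byte.
    leftovers = sample.translate(None, TEXT_BYTES)
    return len(leftovers) / len(sample) < 0.30
```

Only downloads for which this returns `True` would be handed to the link parser.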

Optimization #2 – Head Fetching

The second optimization applies when the user does choose to download binary files but limits them to a maximum size. For example, the user might choose to download only jpg files with a maximum size of 70 KB, which can save lots of bandwidth. The complicated job then falls on the web crawler: implementing the user’s demands in a smart way.

MWC starts by denying downloads of all binary file extensions except the requested one, which in this example is jpg. Then, before downloading a jpg file, it makes an HTTP HEAD request, a trimmed-down request which returns only the file’s metadata (the headers). It parses the headers, reads the Content-Length header, and downloads the file only if it is within the requested maximum size (70 KB in our example).
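
The HEAD-then-decide step could look roughly like this in Python, using only the standard library. The function names are hypothetical, and two behaviours are my own assumptions because they are not specified above: the size limit is treated as inclusive, and a file whose Content-Length header is missing or malformed is skipped unless `allow_unknown` is set.

```python
import urllib.request
from typing import Mapping, Optional


def content_length(headers: Mapping[str, str]) -> Optional[int]:
    """Return the Content-Length value as an int, or None if missing or malformed."""
    for name, value in headers.items():
        if name.lower() == "content-length":  # header names are case-insensitive
            try:
                length = int(value.strip())
            except ValueError:
                return None
            return length if length >= 0 else None
    return None


def within_size_limit(headers: Mapping[str, str], max_bytes: int,
                      allow_unknown: bool = False) -> bool:
    """Decide from the HEAD response headers whether the full GET is worth making."""
    length = content_length(headers)
    if length is None:
        return allow_unknown  # assumption: unknown size is skipped by default
    return length <= max_bytes


def head_headers(url: str, timeout: float = 10.0) -> dict:
    """Issue an HTTP HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

A crawler would call something like `within_size_limit(head_headers(url), 70 * 1024)` and issue the GET only when it returns `True`.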

This is a good optimization in most cases, but it is not without its costs. A HEAD request is cheaper than a full GET request, but all of the connection overhead is the same as for the full request; only the actual download of the file is spared.

Using the HEAD request is still a must, as from time to time the crawler encounters huge files around the web, and downloading one of them costs as much bandwidth as tens of thousands of HEAD requests.

Summary (TL;DR)

Optimizing crawling can be done by cutting binary file downloads:

  • A flag to eliminate any binary file download
  • A filter to download only specific types of binary files
  • Content checking of downloaded files (Tika) to determine much more reliably what type of file was downloaded, which removes the need to parse binary files
  • A configuration property to limit the maximum size of a binary file to download, enforced using an HTTP HEAD request

Handling Redirection (3xx Status Codes)

Site owners often choose to redirect their users from one page to another. This happens behind the scenes, and more often than not the user is not even aware of being redirected.

Technically, redirection can be done in several ways; one of the most commonly used is the .htaccess file in the root of your web server (if you use the Apache web server).

How should a web crawler behave when it finds a redirected page?

How should MWC handle redirects?

One of the main concepts MWC follows is simplicity. To achieve it, MWC uses an opinionated approach, so following redirects is not configurable: it is always on. But as MWC also strives to be speedy, it won’t be done automagically behind the scenes.

Any redirect target is treated as a regular URL that has just been added to the crawler. It is added to the frontierDB, so if the crawler finds that URL again later it will recognize it and won’t attempt to recrawl it, and it is passed through all of the user-defined filters, so it might be dropped along the way.

How does MWC achieve URL redirection?

When MWC fetches a page and gets a 3xx HTTP redirect status code, it does the following:

  • Sends the redirect target URL through the user-defined filters, as it might fail one of them
  • Adds the redirect target URL to the frontierDB (so it will be recognized if it is found on another page) while keeping the depth of the original URL, because conceptually the user regards the original URL and the redirect target as being at the same depth
  • MWC still allows the user to handle the redirection differently by overriding the handleRedirection(…) method
  • This way of working avoids redirect loops: if at any stage a redirect points to a URL which was already visited, it will be dropped when passing through the filters
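
The flow above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the frontierDB is modelled as a plain set of seen URLs plus a queue of (url, depth) pairs, the already-visited check is written as an explicit first step rather than as one of the filters, and `handle_redirect` is a hypothetical name, not MWC's handleRedirection(…) method.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Set, Tuple

UrlFilter = Callable[[str], bool]  # returns True when the URL may be crawled


def handle_redirect(target_url: str, source_depth: int, seen: Set[str],
                    queue: Deque[Tuple[str, int]],
                    filters: Iterable[UrlFilter]) -> bool:
    """Queue a redirect target like any newly found URL; return True if it was queued."""
    if target_url in seen:
        return False  # already known: this is what breaks redirect loops
    if not all(url_filter(target_url) for url_filter in filters):
        return False  # dropped by one of the user-defined filters
    seen.add(target_url)
    # The target inherits the depth of the URL that redirected to it.
    queue.append((target_url, source_depth))
    return True
```

Because the target is only queued, not fetched on the spot, the redirect costs nothing extra at this point and is crawled later like any other URL.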