Fetching (downloading) data from an online site is an expensive and very time-consuming act. It doesn't take much storage or CPU/GPU; the main resource consumed is network transport.
There is high overhead just in making the connection to the remote site (from DNS resolution through the handshake to maintaining the connection itself), and then there is the actual download of the bytes to your computer.
A web crawler must do lots of page fetching, but there are ways to optimize its work.
Optimization #1 – Prefetching
The first thing to do is determine the content type of the online page: is it text-based or binary data?
Text pages are the bread and butter of web crawlers, as they serve two main purposes. The first and obvious one is to return them to the user so he can parse them or do whatever he wants with them, since that is the point of the crawl in most cases (crawl the web in order to find X, Y or Z). The second purpose of these text pages (they can be text/json/csv/xml/html or any other text format) is for the web crawler itself to parse them and extract all of the URLs it can find, so it can continue its crawl with new seeds for further crawling.
Binary data like mp3, avi, mpeg or any other non-notepad-readable file format is another matter. There is no simple way to parse these files for links (yes, there are ways, but they are CPU intensive and not worth the crawler's time and energy), so these files are worthless to the web crawler itself. They might, however, be exactly what the user is searching for (for example, a user might start a web crawl for all mp3 files on a specific site). The downside of fetching these files for the user is that they can be quite big to download: it is not uncommon to find pictures and sound files all over the web weighing several megabytes, or other files weighing hundreds of megabytes. A file like that can take up so much bandwidth as to almost halt all other activity of the web crawler for several minutes, so optimization is in order here.
The most basic optimization (so basic one might not even call it an optimization) is a configuration flag that disables the downloading of any binary file when the user doesn't need them.
The way MWC implements this feature is to run a regular expression match on every link to try and decipher a file extension. If one is found, it is compared to a list of binary file extensions, and if there is a match the file won't be downloaded. Boom! A huge optimization from this single flag.
This is not a flawless check. Most binary files won't be downloaded, but a very small number of text files which happen to have the URL structure of a binary file will be skipped (it doesn't happen much, but it might), and binary files whose URLs don't match the pattern will slip through the filter and be downloaded, wasting bandwidth; there is not much to do about that. After the download, a small and speedy procedure reads the bytes of the downloaded object and gives a much more accurate determination of the file type (binary or text), and only text-based files are then parsed for links, saving parsing CPU.
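As a rough sketch of what such an extension filter could look like (the class name, method names and extension list here are hypothetical illustrations, not MWC's actual code):

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BinaryFilter {
    // Hypothetical extension list; a real crawler would make this configurable.
    private static final Set<String> BINARY_EXTENSIONS =
            Set.of("mp3", "avi", "mpeg", "jpg", "png", "zip", "exe");

    // Matches a trailing ".ext" in a URL, tolerating a query string or fragment.
    private static final Pattern EXTENSION_PATTERN =
            Pattern.compile("\\.([A-Za-z0-9]{1,5})(?:[?#].*)?$");

    // True when the URL *looks like* it points at a binary file.
    // As noted above, this is a heuristic: URLs without a recognizable
    // extension slip through, and the post-download byte check catches those.
    public static boolean looksBinary(String url) {
        Matcher m = EXTENSION_PATTERN.matcher(url);
        return m.find() && BINARY_EXTENSIONS.contains(m.group(1).toLowerCase());
    }
}
```

With the flag enabled, the crawler would simply skip any URL for which `looksBinary` returns true.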
Optimization #2 – Head Fetching
The second optimization applies when the user does choose to download binary files, but limits them to a maximum size. For example, the user might choose to download only jpg files of at most 70kb, which can save him lots of bandwidth. The complicated job of implementing the user's demands in a smart way then falls on the web crawler.
MWC will start by denying downloads of all binary file extensions except the requested type, jpg in this example. Then, before downloading each jpg file, it will make a HEAD HTTP request: a trimmed-down HTTP request which fetches only the file's metadata, the headers. It will parse the headers, read the content-length header, and download the file only if it is smaller than the requested maximum size (70kb in our example).
This is a good optimization in most cases, but not without its flaws: doing a HEAD request is cheaper than doing a full GET request, but all of the connection overhead is still paid exactly as in a full request; only the actual download of the file body is spared.
Still, the HEAD request is a must, as from time to time the crawler encounters huge files around the web, and the bandwidth cost of downloading one of them is worth tens of thousands of HEAD requests.
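A minimal sketch of this size check with Java's built-in `HttpClient` (the class and method names are illustrative, not MWC's actual code). One policy decision is made explicit here as an assumption: a response with no content-length header is treated as "too big" and skipped.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.OptionalLong;

public class HeadCheck {
    // Pure decision logic: download only when the server-reported size fits
    // the limit. A missing Content-Length is treated as "too big" (assumption).
    public static boolean shouldDownload(OptionalLong contentLength, long maxBytes) {
        return contentLength.isPresent() && contentLength.getAsLong() <= maxBytes;
    }

    // Issues the trimmed-down HEAD request (headers only, no body download)
    // and applies the decision logic to the Content-Length header.
    public static boolean withinLimit(HttpClient client, String url, long maxBytes)
            throws Exception {
        HttpRequest head = HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> resp = client.send(head, HttpResponse.BodyHandlers.discarding());
        return shouldDownload(resp.headers().firstValueAsLong("Content-Length"), maxBytes);
    }
}
```

Only when `withinLimit` returns true would the crawler follow up with the full GET request.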
Optimizing crawling can be done by cutting binary file downloads:
- Flag to eliminate any binary file download
- Filter to download only specific types of binary files
- Content checking of downloaded files (e.g. with Tika) to determine for sure what type of file was downloaded – cutting out the need to parse binary files for links
- Configuration property limiting the maximum size of a binary file to download, enforced with a HEAD HTTP request
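To illustrate the content-checking point: Java's standard library ships a small magic-bytes sniffer, `URLConnection.guessContentTypeFromStream`, which reads the leading bytes of a stream and guesses a content type. Tika does the same job with far broader format coverage. The class below is a hypothetical sketch using the stdlib sniffer, not MWC's code:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URLConnection;

public class ContentSniffer {
    // Sniffs the leading bytes of a downloaded object and returns a guessed
    // content type, or null when the format is not recognized.
    public static String sniff(byte[] downloaded) {
        try {
            return URLConnection.guessContentTypeFromStream(
                    new ByteArrayInputStream(downloaded)); // supports mark/reset
        } catch (IOException e) {
            return null; // cannot happen for an in-memory stream
        }
    }

    // Only files recognized as text-like formats are worth parsing for links.
    public static boolean isParseableText(byte[] downloaded) {
        String type = sniff(downloaded);
        return type != null
                && (type.startsWith("text/") || type.equals("application/xml"));
    }
}
```

A crawler would run `isParseableText` on each downloaded object and hand only the text-like ones to the link parser, skipping the CPU-heavy and pointless attempt to extract URLs from binary data.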