
Handling Web Pages With Broken Certificates

In recent years more and more sites have been using SSL certificates to encrypt communication with their visitors; the certificate is what earns their URL the "S" after the "http" prefix.

Browsers notice the encryption status and add a lock or a similar icon to hint to users that they are on a secure page, so they can safely enter their credit card details and buy whatever they want: the communication is encrypted and nobody is listening on the line.

If a site has a certificate but something is wrong with it (maybe it has expired, or maybe it doesn't match the site correctly, e.g. it doesn't cover the www prefix), then most browsers will alert you, and some of them (Chrome, for example) will block access to the site by default (advanced -> I trust…).

How should a web crawler handle these pages? Should it access them like normal pages? Should it block access to them, like Chrome's default behaviour?

ModularWebCrawler's approach is simple: crawl as many links as possible while exposing the user to minimal risk. Most web crawlers leave the decision to the user, but I believe in an opinionated approach rather than a blank one. Of course, anyone who wants to can configure the specifics using the CrawlerConfig object, but whoever keeps the default behaviour gets the following configuration out of the box:

  • Regular crawling will include web pages whose SSL certificate is broken, as the risk is minimal to non-existent in a web crawler (home page, bookmarks, default search engine, toolbar: none of these exist in a web crawler).
  • Authenticated crawling (crawls where MWC supplies a username/password) will block pages with broken SSL and not crawl them, so as not to leak the credentials to a MITM attack.

These simple rules are the default behaviour, but they can easily be tweaked to suit the user's requirements.
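
As a rough illustration, assuming hypothetical CrawlerConfig setter names (the post only mentions the CrawlerConfig object, not its exact API), the two defaults might be expressed like this:

```java
// Hypothetical sketch only: these setter names are illustrative, not MWC's documented API.
CrawlerConfig regularConfig = new CrawlerConfig();
regularConfig.setAllowBrokenSslCertificates(true);        // default for regular crawling

CrawlerConfig authenticatedConfig = new CrawlerConfig();
authenticatedConfig.setCredentials("user", "secret");     // illustrative credentials setter
authenticatedConfig.setAllowBrokenSslCertificates(false); // default when credentials are involved
```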

P.S. Crawling sites with a broken SSL certificate is about as risky as crawling sites with no SSL certificate at all (http://…); there is no substantial difference between the two.


MWC Milestone 0.3 has been Released

Today is a good day :-) because today I released Modular Web Crawler v0.3!

This milestone is a huge and very significant one: it takes MWC from a proof of concept to a working web crawler. Yes, you can now easily start a web crawl and it will work as it should. Advanced features are still missing, but a crawl can be initiated and the expected results will be returned to the user.

Not only basic crawling but also some optimizations are implemented in this release, mainly around downloading binary data.

What are the main new things introduced in this release?

  • Redirect support
  • Binary file handling
  • HEAD fetches
  • Fetch-by-size limiting
  • Many, many bugs crushed
  • Much better SSL handling
  • Upgraded code compatibility to Java 8
  • EditorConfig compatibility
  • Singleton services
  • Lombok usage for data boilerplate and logging
  • A .gitignore file

Special care was given to the unit tests in this release:

  • Tabulated results
  • Added an internal web server to host testing pages
  • Made all tests pass on every build
  • Lots of work on organizing the tests

And a lot of attention was given to the project itself.

For the full changelog, you are invited to browse the closed bugs grouped under the v0.3 milestone.


Optimizing Page Fetching

Fetching (downloading) data from an online site is an expensive and very time-consuming act. It doesn't take much storage or CPU/GPU; the main resource consumed is the network.

There is a high overhead just in making the connection to the remote site (from DNS resolution to the handshake and maintaining the connection), and then there is the actual download of the bytes to your computer.

A web crawler must do lots of page fetches, but there are ways to optimize its work.

Optimization #1 – Prefetching

The first thing to do is to determine the content type of the online page: is it text based or is it binary data?
Text pages are the bread and butter of web crawlers, as they serve two main purposes. The first and obvious one is to return them to the user so he can parse them or do whatever else he wants with them, as that is the point of the crawl in most cases (crawl the web in order to find X, Y or Z). The second purpose of these text pages (they can be text/json/csv/xml/html or any other text format) is for the web crawler itself to parse them and collect all of the URLs it can find, so it can continue its crawl with new seeds.

Binary data like mp3, avi, mpeg or any other non-notepad-readable file format is another matter. There is no simple way to parse these files for links (yes, there are ways, but they are CPU intensive and not worth the crawler's time and energy), so these files are worthless to the web crawler itself. They might, however, be exactly what the user is searching for (a user might start a web crawl for all mp3 files on a specific site, for example). The downside to fetching these files for the user is that they can be quite big to download: it is not uncommon to find pictures and sound files all over the web weighing several megabytes, or other files weighing hundreds of megabytes. A file like this can take up so much bandwidth as to almost halt all other activity of the web crawler for several minutes, so an optimization is in order here.

The most basic optimization (so basic one might not even call it an optimization) is to have a configuration flag which disables the downloading of any binary file if the user doesn't need them.

The way MWC implements this feature is to run a regular expression over every link to try to extract a file extension; if one is found, it is compared to a list of binary file extensions, and if there is a match the file won't be downloaded. Boom! A huge optimization from this single flag!

This is not a flawless check. Most binary files won't be downloaded, but a very small number of text files which happen to contain the URL structure of a binary file will be skipped (it doesn't happen much, but it might), and binary files whose URLs don't match the pattern will still be downloaded, wasting bandwidth; there is not much to do about that. After downloading, a small and speedy procedure reads the bytes of the downloaded object and gives a much more accurate estimation of the file type (binary or text), and only text based files are parsed for links, saving us parsing CPU.
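
A rough sketch of such an extension check; the class name, regular expression and extension list here are illustrative, not MWC's actual implementation:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the extension check described above.
public class BinaryExtensionFilter {

    // Captures a file extension at the end of the URL path, ignoring any query string.
    private static final Pattern EXTENSION_PATTERN =
            Pattern.compile("\\.([a-zA-Z0-9]{1,5})(?:\\?.*)?$");

    // Example list only; a real list would be much longer.
    private static final Set<String> BINARY_EXTENSIONS = new HashSet<>(Arrays.asList(
            "jpg", "jpeg", "png", "gif", "mp3", "avi", "mpeg", "zip", "pdf", "exe"));

    /** Returns true when the URL looks like it points to a binary file and should be skipped. */
    public boolean looksBinary(String url) {
        Matcher matcher = EXTENSION_PATTERN.matcher(url);
        return matcher.find() && BINARY_EXTENSIONS.contains(matcher.group(1).toLowerCase());
    }
}
```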

Optimization #2 – Head Fetching

The second optimization kicks in when the user does choose to download binary files, but limits the downloaded files to a maximum size. For example, the user might choose to download only jpg files of at most 70kb, which can save him lots of bandwidth. The complicated job of implementing the user's demands in a smart way then falls on the web crawler.

MWC will start by denying downloads of all binary file extensions except the requested one, jpg in this example. Then, before downloading a jpg file, it will make an HTTP HEAD request, a trimmed-down request which fetches only the file's metadata (the headers). It will parse the headers, read the Content-Length header, and only if the file is smaller than the requested maximum size (70kb in our example) will it download the file.
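
A minimal sketch of such a HEAD-based size check using the JDK's HttpURLConnection (the method name and the error-handling policy are assumptions, not MWC's actual code):

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative sketch of a HEAD-based size check.
public class HeadSizeCheck {

    /** Returns true if the resource's reported Content-Length is within maxBytes. */
    public static boolean isSmallEnough(String url, long maxBytes) {
        try {
            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setRequestMethod("HEAD");   // fetch only the headers, not the body
            connection.connect();
            long contentLength = connection.getContentLengthLong();
            connection.disconnect();
            // A missing Content-Length header is reported as -1; this sketch skips such files.
            return contentLength >= 0 && contentLength <= maxBytes;
        } catch (Exception e) {
            return false;   // on any error, err on the side of not downloading
        }
    }
}
```

With something like this in place, the 70kb jpg example above becomes a call such as isSmallEnough(url, 70 * 1024) before the actual GET.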

This is a good optimization in most cases, but it is not without its costs: a HEAD request is cheaper than a full GET request, but all of the connection overhead is still paid exactly as in the full request; only the actual downloading of the file body is spared.

Still, using HEAD requests is a must: from time to time the crawler encounters huge files around the web, and the bandwidth cost of downloading one of them is worth tens of thousands of HEAD requests.

Summary (TL;DR)

Crawling can be optimized by cutting down binary file downloads:

  • A flag to eliminate all binary file downloads
  • A filter to download only specific types of binary files
  • Content checking of downloaded files (using Tika) to determine for sure what type of file was downloaded, thus cutting down the need to parse binary files for links (see the sketch below)
  • A configuration property to limit the maximum size of binary files to download, implemented using an HTTP HEAD request
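
For the content check mentioned in the third bullet, a minimal sketch using Apache Tika might look like this (the surrounding class and the "textual" heuristic are my assumptions; only the use of Tika is stated above):

```java
import org.apache.tika.Tika;

// Illustrative sketch of content-based type detection with Apache Tika.
class ContentTypeCheck {
    private static final Tika TIKA = new Tika();

    /** Returns true when the downloaded bytes look like text and are worth parsing for links. */
    static boolean isTextual(byte[] downloadedBytes) {
        String mimeType = TIKA.detect(downloadedBytes);   // e.g. "text/html" or "image/jpeg"
        return mimeType.startsWith("text/")
                || mimeType.contains("xml")
                || mimeType.contains("json");
    }
}
```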

Handling Redirection (3xx Status Codes)

Site owners often choose to redirect their users from one page to another. This happens behind the scenes, and more often than not the user is not even aware he was redirected.

Technically, redirection can be done in one of several ways; one of the most common is the .htaccess file in the root of your web server (if you use the Apache web server).

How should a web crawler behave when it finds a redirected page?

How should MWC handle redirects?

One of the main concepts MWC follows is simplicity. In order to achieve simplicity MWC takes an opinionated approach: following redirects is not configurable, it is always on. But as MWC also strives to be speedy, it is not done automagically behind the scenes.

Any redirect target URL is treated as a regular URL newly added to the crawler: it is added to the frontierDB (so if the crawler finds that URL later on, it will recognize it and won't attempt to recrawl it), and it is passed through all of the user-defined filters, so it might be dropped along the way.

How does MWC handle URL redirection?

When MWC fetches a page and gets a 3xx HTTP redirect status code, it does the following (a short sketch follows the list):

  • Sends the redirect target URL through the user-defined filters, as it might fail one of them
  • Adds the redirect target URL to the frontierDB (so it will be recognized if it is found on another page) while keeping the depth of the original URL, because conceptually the user regards the original URL and its redirect target as having the same depth
  • MWC still lets the user handle the redirection differently by overriding the handleRedirection(…) method
  • This way of working avoids redirect loops: if at any stage a redirect points to a URL which was already visited, it will be dropped when passing through the filters
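
As a rough illustration of those steps (the FrontierDb interface, the filter predicate and the handler class here are all assumptions, not MWC's actual code):

```java
import java.util.function.Predicate;

// Illustrative sketch only: the types and names are stand-ins for MWC's components.
class RedirectHandler {

    interface FrontierDb {
        void add(String url, int depth);
    }

    private final Predicate<String> filters;   // combined user-defined filters
    private final FrontierDb frontierDb;

    RedirectHandler(Predicate<String> filters, FrontierDb frontierDb) {
        this.filters = filters;
        this.frontierDb = frontierDb;
    }

    void handleRedirect(String redirectedUrl, int originalDepth) {
        // 1. The redirect target must pass the user-defined filters; an already-visited
        //    URL fails here, which is what breaks redirect loops.
        if (!filters.test(redirectedUrl)) {
            return;
        }
        // 2. Store it with the depth of the original URL, so it is recognized later
        //    and not re-crawled.
        frontierDb.add(redirectedUrl, originalDepth);
    }
}
```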

MWC Milestone 0.2 has been Reached

MWC has reached a new milestone: v0.2.

All milestones are important and this one is no different. MWC is now in a working state, though not a perfect one, as the world wide web has so many edge cases; but MWC can be started easily, its internal code structure makes sense, and the most basic crawling features are implemented.

All of that being said, there is still a lot of work to be done before MWC is in a presentable state.

Basic features still need polishing, exception handling needs upgrading, more 3rd party implementations of the main components should be added, and of course there is the matter of advanced features (proxy support, authentication support, proper politeness support and many more).

Anyway, this is a start, and a good one at that, so I am pleased.


ModularWebCrawler – What, Why, and How 3

How Does ModularWebCrawler Work?

As MWC strives for simplicity, the flow of the code is written to be simple and understandable.

MWC starts with a seed, i.e. a first URL (or list of URLs), and a configuration object; the code is well documented, so the configuration object should be self explanatory.

The difference between crawls is defined by the configuration object, which contains a set of properties/flags (mostly boolean) defining the specific crawl the user wants to initiate.

In the configuration object the user can configure the politeness of the crawl and the option to crawl sitemaps, or even specify his general attitude towards crawling (does he prefer saving storage space or speed?), which MWC translates internally into different decisions around speed, storage, accuracy and so on. The configuration object also contains the limitations the user wants to put on the crawl, like the depth of the crawl, the domains the crawler should be limited to, or the maximum number of pages to crawl.
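
A hypothetical sketch of what configuring such limits might look like; the setter names are invented for illustration, as only the existence of the configuration object is described here:

```java
// Hypothetical setter names, illustrating the kinds of limits described above.
CrawlerConfig config = new CrawlerConfig();
config.setMaxCrawlDepth(3);                                        // stop following links past depth 3
config.setMaxPagesToCrawl(5000);                                   // stop after 5000 crawled pages
config.setAllowedDomains(java.util.Arrays.asList("example.com"));  // limit the crawl to these domains
config.setCrawlSitemaps(true);                                     // also read sitemap files
config.setPreferSpeedOverStorage(true);                            // general attitude: speed over storage
```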

MWC will then run the crawl, starting with the initial seed(s).
First, each seed passes through a list of filters, any of which might drop it if it doesn't meet the user's requirements. One example is the domain limiter filter: if the user chooses to limit the crawl to specific domains, this filter runs for every seed, and if the seed does not belong to one of the user-defined domains it is dropped at this stage, and no further processing is needed for that seed.

Assuming the seed passed all filters and is still eligible for crawling, the fetcher fetches the seed URL. If the fetch fails for any reason, a matching exception is thrown and caught by an empty handler (or one with a simple log line); these handlers can optionally be overridden by the user if he needs different behaviour, and MWC then continues to fetch the next seed.

The parser then comes into play and parses the fetched web page (assuming it is not a binary file, which is handled differently), extracting all links from the page, running each one of them through the filters and putting them into the frontierDB to be fetched by an available thread. The page is then returned to the user, who can do whatever work he needs on crawled pages.

The thing is that web crawling, when sliced into its atomic parts, should be really simple, so I wanted MWC to reflect that simple logic.

Logical steps done in MWC

  • Run the seed through the filters
  • Put the seed into the frontierDB
  • Pop a URL from the frontierDB
  • Fetch the web page
  • Parse the web page, extracting all links
  • Return the page to the user (and store its link in the crawledDB)
  • Send all parsed links through the filters (back to the first step in this list; a simplified sketch of this loop follows)
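
The same loop written as a simplified sketch; every type here is a stand-in I invented for the corresponding MWC component, not its real class:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Simplified illustration of the crawl flow above (single-threaded, crawledDB bookkeeping omitted).
class CrawlFlowSketch {

    interface Filters     { boolean accept(String url); }
    interface Fetcher     { String fetch(String url); }              // returns the page content
    interface Parser      { List<String> extractLinks(String page); }
    interface PageHandler { void onPageCrawled(String url, String page); }

    private final Filters filters;
    private final Fetcher fetcher;
    private final Parser parser;
    private final PageHandler handler;
    private final Deque<String> frontierDb = new ArrayDeque<>();     // stands in for the frontierDB

    CrawlFlowSketch(Filters filters, Fetcher fetcher, Parser parser, PageHandler handler) {
        this.filters = filters;
        this.fetcher = fetcher;
        this.parser = parser;
        this.handler = handler;
    }

    void crawl(List<String> seeds) {
        seeds.stream().filter(filters::accept).forEach(frontierDb::add);      // steps 1-2
        while (!frontierDb.isEmpty()) {
            String url = frontierDb.poll();                                   // step 3: pop a URL
            String page = fetcher.fetch(url);                                 // step 4: fetch the page
            List<String> links = parser.extractLinks(page);                   // step 5: extract all links
            handler.onPageCrawled(url, page);                                 // step 6: return the page to the user
            links.stream().filter(filters::accept).forEach(frontierDb::add);  // step 7: back through the filters
        }
    }
}
```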

ModularWebCrawler – What, Why, and How 2

Why ModularWebCrawler?

Searching around, one can find many Java web crawlers, but sometimes quantity is not the answer. That was the case for me: although many Java web crawlers exist, none satisfied my needs.

  • Many Java web crawlers are seriously outdated
  • Many have specific needs in mind and are not good generic web crawlers
  • The big Java web crawlers are too big for a small, contained web crawl (they use several machines for a crawl, or require big frameworks to run) and are total overkill for the job
  • Some are way too complicated to start running, lacking a decent interface for a simple user
  • And most of the crawlers use in-house components for the various crawling tasks, which means reinventing the wheel: there are many solid, active libraries covering each part of crawling, so there is no point in implementing every part of the crawl yourself.

My goal is to overcome all of the above by creating a Java web crawler which is very simple to run and understand, is active and relevant, uses 3rd party components for the various parts of the crawling, and is configurable to do any type of crawl the user has in mind.

I am not even sure any more why I needed a web crawler; it was only a small crawl. But when I tried the different active web crawlers, I got frustrated: I have more than 10 years of Java development behind me and still found it hard to start a simple crawl. Even when I managed to find a decent Java web crawler, I was frustrated by parts of it, like the internal storage and the robots.txt handling, which were custom made and therefore had many bugs in them, bugs which had been solved long ago by dedicated teams maintaining projects devoted to these atomic goals (storage, robots.txt).

I joined the crawler-commons team, which focuses on solving some of these crawling functionalities, like robots.txt handling and sitemap parsing, and found no reason to reimplement these in any crawler when an implementation is already maintained by a highly skilled team.

So I began the task of separating the concerns of any web crawler and then searching for libraries solving those concerns. I defined APIs for the functionality I needed from each component and implemented them using the different 3rd party libraries. That done, I wrote a set of tests to compare the different implementations of each component in terms of accuracy and performance. I then defined the flow of a web crawl (filter the URL seed, fetch, parse, store the links, then return the page to the user) and made sure the flow is obvious from the code, while the implementation details stay hidden in the different libraries used by MWC; thus any bug found should be reported to the respective library, making both MWC and those libraries happy.


ModularWebCrawler – What, Why, and How 1

What is ModularWebCrawler?

Modular Web Crawler is a simple Java web crawler; its purpose is to crawl the web according to the user's requirements.

Web crawling is a procedure in which an initial web page (the seed) is fetched and then parsed, allowing the extraction of all links from it. Those links are then fetched in turn, giving the user many, many more pages to fetch. The list of web pages to fetch grows exponentially, as every web page holds many links, which in turn hold many links, thus allowing the user to crawl a portion of the web.

The crawl is basically infinite (as the www is almost infinite 🙂), so the user should limit it according to his needs. A popular example of a crawling limitation is restricting the crawl to specific domains, so the crawler crawls those domains until no more links from them are left uncrawled. Another popular limitation is capping the crawl at a specific number of pages: if the user wishes to crawl only 5000 pages, the crawler stops its work when it reaches that quota.

The purpose of any web crawl is to fetch the crawled pages and return them to the user. Any web crawler will fetch a page and then parse it in order to extract its links for further crawling, but before moving on to the next page it returns the crawled page to the user, who does whatever he needs with it. A pure crawler won't store the fetched pages anywhere, as that is not its job: the pure purpose of a web crawler is only to crawl the web according to the user's requirements and limitations, and to return the fetched pages to the user.

ModularWebCrawler

Modular Web Crawler is no different in these aspects from any other web crawler: it fetches, parses, then returns the pages to the user. The difference between MWC and other Java web crawlers is in how it is built and in its defined goals.

Defined Goals

Simplicity

Simple – MWC aims to be simple in every way:

  • Simple to run – A user can pull MWC into his project via Maven, then start crawling the web within a minute using a mere 4 lines of code (2 of which are the class declaration!); see the sketch after this list.
  • Simple to understand – MWC is designed to be understood by any user, novice or professional, through a simple class structure and by removing architectural complexity by design.
  • Simple to dive into the code – Any user can dive straight into the MWC code and understand any part of it without much fuss. The code is designed to be simple and a lot of thought went into making it understandable: the classes are well defined, the hierarchy is well established, and the flow of the code is obvious, so the user should understand the flow of the crawler at a glance.
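
The exact startup API isn't shown in this post, so the following is only a hypothetical illustration of what those 4 lines might look like (the ModularWebCrawler class name, its constructor and the crawl method are assumptions):

```java
// Hypothetical illustration of a minimal start; class and method names are assumptions.
public class MyFirstCrawl {
    public static void main(String[] args) {
        CrawlerConfig config = new CrawlerConfig();                   // default, opinionated configuration
        new ModularWebCrawler(config).crawl("https://example.com");   // give it a seed and go
    }
}
```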

Modularity

Modular – MWC strives to be modular in any helpful way: any functionality which can be "outsourced" to a 3rd party library, and gains leverage from that, is implemented as an interface with a 3rd party implementation behind it, so it can easily be backed by different 3rd party libraries, and implementations can easily be swapped by the user to match his specific needs. That being said, MWC ships with a recommended default implementation for each component.
The price of this modularity is a bigger MWC, as each 3rd party component might drag in many megabytes of code and dependencies (although I personally am not worried about the size of MWC, as it is not such an issue nowadays).
Another price I pay for modularity is the need to test each implementation, which is more work, although a smart set of unit tests can solve most of this problem.

An example of modularity is the MWC parser, whose purpose is to parse a page fetched from the web. Instead of writing my own parser, I wrote an interface describing what I need from a parser, such as extracting the links from a page, and then implemented that interface using several page parser libraries like JSoup and HtmlCleaner, each one providing the same functionality declared in the interface. Additional parsers can easily be integrated into MWC, and a suite of tests comparing the parsing of a page across the different implementations was written, so I can compare the performance and accuracy of the parsing.

But if a user isn't satisfied with the integrated parsers, he can implement the same interface himself and gain his specific functionality, such as parsing badly written links which the other parsers don't catch; a sketch of this idea follows.
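
A sketch of what such a parser interface and a JSoup-backed implementation could look like (the interface name and method signature are my own illustration; only the use of JSoup and HtmlCleaner as implementations is stated above):

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Illustrative parser interface; the name and signature are assumptions.
interface PageParser {
    List<String> extractLinks(String html, String baseUrl);
}

// One possible implementation backed by JSoup.
class JsoupPageParser implements PageParser {
    @Override
    public List<String> extractLinks(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);      // baseUrl resolves relative links
        List<String> links = new ArrayList<>();
        for (Element anchor : doc.select("a[href]")) {
            links.add(anchor.attr("abs:href"));         // absolute URL of each anchor
        }
        return links;
    }
}
```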

Speed

Speed is essential to any web crawler, as the crawling process is very time consuming and the goal of most crawls is to get the requested results as fast as possible; MWC is no different. A lot of thought and testing was invested in MWC so that it is as speedy as possible straight out of the box, and where some functionality may be essential for some users but is time consuming, I made it optional and defaulted the crawler to the option which makes the most sense (usually speed was preferred).

Dedicated tests measure the performance of each component in order to understand (and compare, if the need arises) how much time each component takes.

In several cases there is a tradeoff between memory, storage and speed; in those cases there is a configuration option letting the user choose which aspect should take priority, so he has an easy way to sacrifice storage for speed, for example.
