
ModularWebCrawler – What, Why, and How

What is ModularWebCrawler?

Modular Web Crawler is a simple Java web crawler. Its purpose is to crawl the web according to the user’s requirements.

Web crawling is a procedure in which an initial web page (the seed) is fetched and parsed so that all of its links can be extracted. Those links are then fetched in turn, yielding many more pages to fetch. Since every page holds many links, which in turn hold many links, the list of pages to fetch grows exponentially, and the user ends up crawling a portion of the web.
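As a minimal sketch of that loop (fetching and link extraction are stubbed out here, since they depend on the HTTP client and parser in use, and the names are illustrative rather than MWC’s real methods):

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    // Minimal sketch of the crawl loop described above; fetch() and
    // extractLinks() are placeholders, not MWC's real methods.
    final class CrawlLoop {
        static void crawl(String seed) {
            Queue<String> frontier = new ArrayDeque<>(List.of(seed));
            Set<String> seen = new HashSet<>(frontier);
            while (!frontier.isEmpty()) {
                String url = frontier.poll();
                String html = fetch(url);                 // fetch the page
                for (String link : extractLinks(html)) {  // parse out its links
                    if (seen.add(link)) {                 // skip already-seen URLs
                        frontier.add(link);               // the frontier grows fast here
                    }
                }
            }
        }

        static String fetch(String url) { return ""; }                       // stub
        static List<String> extractLinks(String html) { return List.of(); }  // stub
    }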

The crawl is basically infinite (as the www is almost infinite 🙂 ), so the user should limit it according to his needs. A popular limitation is restricting the crawl to specific domains, so the crawler crawls those domains until no uncrawled links from them remain. Another popular limitation is capping the number of crawled pages: if the user wishes to crawl only 5000 pages, the crawler stops its work once it reaches the required quota.
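A minimal sketch of those two limits, written as a predicate the crawl loop could consult before fetching a URL (the class and its fields are illustrative, not MWC’s real configuration API):

    import java.net.URI;
    import java.util.Set;
    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.function.Predicate;

    // Illustrative sketch, not MWC's real API: admit a URL only if its host
    // is one of the allowed domains and the page quota is not yet exhausted.
    final class CrawlLimits implements Predicate<URI> {
        private final Set<String> allowedDomains;
        private final int maxPages;
        private final AtomicInteger fetched = new AtomicInteger();

        CrawlLimits(Set<String> allowedDomains, int maxPages) {
            this.allowedDomains = allowedDomains;
            this.maxPages = maxPages;
        }

        @Override
        public boolean test(URI url) {
            if (!allowedDomains.contains(url.getHost())) {
                return false;                              // outside the allowed domains
            }
            return fetched.incrementAndGet() <= maxPages;  // enforce the page quota
        }
    }

For example, new CrawlLimits(Set.of("example.com"), 5000) would keep the crawl inside example.com and stop admitting URLs after 5000 pages.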

The purpose of any web crawl is to fetch the crawled pages and return them to the user. Any web crawler therefore fetches a page, then parses it in order to extract its links for further crawling, but before moving on to the next page it returns the fetched page to the user, who does whatever he needs with it. A pure crawler won’t store the fetched pages in any storage, as that is not its purpose; the pure purpose of a web crawler is only to crawl the web according to the user’s requirements and limitations, and return the fetched pages to the user.

ModularWebCrawler

Modular Web Crawler is no different from any other web crawler in those aspects: it fetches and parses pages, then returns them to the user. The difference between MWC and other Java web crawlers lies in how it is built and in its defined goals.

Defined Goals

Simplicity

Simple – MWC aims to be simple in every way:

  • Simple to run – A user can add MWC to his project as a Maven dependency, then start crawling the web within a minute using a mere 4 lines of code (2 of which are the declaration of the class!) – see the sketch after this list.
  • Simple to understand – MWC is designed to be understood by any user, novice or professional, through a simple class structure and by removing any architectural complexity by design.
  • Simple to dive into the code – Any user should be able to dive straight into the MWC code and understand any part of it without much fuss. A lot of thought went into making the code understandable: the classes are well defined, the hierarchy is well established, and the flow of the code is obvious, so the user should grasp the flow of the crawler at a glance.
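A hypothetical sketch of those few lines; the class and method names here are assumptions for illustration, not necessarily MWC’s actual API:

    // Hypothetical names, for illustration only; the real MWC API may differ.
    ModularWebCrawler crawler = new ModularWebCrawler();        // declare the crawler
    crawler.addSeed("https://example.com");                     // seed the crawl
    crawler.setMaxPages(5000);                                  // limit the crawl
    crawler.start(page -> System.out.println(page));            // handle fetched pages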

Modularity

Modular – MWC strives to be modular in every helpful way: any functionality that can be “outsourced” to a 3rd party library with a gain in leverage is implemented as an interface plus a 3rd party implementation, so it can easily be re-implemented with different 3rd party libraries, and implementations are easily swappable by the user to match his specific needs. That being said, a recommended implementation of each component is set as MWC’s default.
The price of this modularity is a bigger MWC, as each 3rd party component may drag with it many megabytes of code and dependencies (although I am personally not worried about the size of MWC, as size is not much of an issue nowadays).
Another price of modularity is the need to test each implementation, which is more work, although a smart set of unit tests can solve most of this problem.

An example of this modularity is the MWC parser, whose purpose is to parse a page fetched from the web. Instead of writing my own parser, I wrote an interface capturing what I need from a parser, such as extracting the links from a page, then implemented that interface using several page parser libraries like JSoup and HtmlCleaner, each implementing the same functionality declared in the interface. Additional parsers can easily be integrated into MWC, and a suite of tests compares the parsing of a page across the different implementations, so I can compare their performance and accuracy.
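A sketch of what that interface and one implementation might look like (the interface name and method signature are my assumptions; the JSoup calls are real JSoup API):

    import java.util.List;
    import java.util.stream.Collectors;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // The interface name and method are assumptions for illustration;
    // MWC's actual parser interface may differ.
    interface PageParser {
        List<String> extractLinks(String html, String baseUrl);
    }

    // One possible implementation backed by JSoup.
    final class JsoupPageParser implements PageParser {
        @Override
        public List<String> extractLinks(String html, String baseUrl) {
            Document doc = Jsoup.parse(html, baseUrl);   // parse with the base URL
            return doc.select("a[href]").stream()        // all anchors with an href
                      .map(a -> a.attr("abs:href"))      // resolve to absolute URLs
                      .filter(href -> !href.isEmpty())   // drop unresolvable links
                      .collect(Collectors.toList());
        }
    }

An HtmlCleaner-backed class would implement the same interface, letting the crawler swap parsers without touching the rest of the code.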

But if a user isn’t satisfied with the integrated parsers, he can implement the same interface with his own methods, gaining his specific requirements and functionality, such as parsing badly written links that the other parsers don’t catch.

Speed

Speed is an essential part of any web crawler: the crawling process is very time consuming, and the goal of most crawls is to get the requested results as fast as possible. MWC is no different, so a lot of thought and testing was invested in making MWC as fast as possible straight out of the box. When some functionality is essential for some users but time consuming, I made that functionality optional and defaulted the crawler to the option that makes the most sense (usually speed was preferred).

Dedicated tests measure the performance of each component, in order to understand (and compare, when the need arises) how much time each component takes.

In several cases a tradeoff can be made between memory, storage, and speed. In those cases MWC offers a configuration option letting the user choose which aspect takes priority, giving him an easy way to, for example, sacrifice storage for speed.
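A sketch of what such a switch might look like (the enum and setter are illustrative assumptions, not MWC’s real configuration API):

    // Illustrative only; MWC's real configuration API may differ.
    enum FrontierStore {
        IN_MEMORY,   // fastest, but the URL frontier competes for heap
        ON_DISK      // slower, but frees memory for very large crawls
    }

    final class CrawlerConfig {
        private FrontierStore frontierStore = FrontierStore.IN_MEMORY; // default favors speed

        void setFrontierStore(FrontierStore store) { this.frontierStore = store; }
        FrontierStore frontierStore() { return frontierStore; }
    }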
