ModularWebCrawler – What, Why, and How 2

Why ModularWebCrawler?

Searching around, one can find many Java web crawlers, but sometimes quantity is not the answer, this was the case for me, although many Java web crawlers exist, none satisfied my needs.

Many Java web crawlers are seriously outdated
Many have specific needs in mind and are not good generic web crawlers
The big Java web crawlers are too big for a small contained web crawl (as they use several machines for a crawl, or use big frameworks to run) and are a total overkill for the job
Some are way too complicated to start running, not having a decent interface for a simple user
And most of the crawlers use in house components for the various tasks of the crawling which is to invent the wheel all over again as there are many solid and active libraries which do any part of the crawling so there is no use trying to implement every part of the crawl.

My goal is to overcome all of the above, creating a Java web crawler which will be very simple to run and understand, will be active and relevant, will use 3rd party components to do the various parts of the crawling and be configurable to do any type of crawl the user has in mind.

I am not sure why I needed a web crawler, it was a small crawl which I needed, but when I tried the different active web crawlers, I got frustrated, I have more than 10 years of Java development behind me and still I found it hard to start a simple crawl, and even when I managed to find a decent Java web crawler, I was frustrated from different parts of it like the internal storage and the robots.txt handling which were custom made, thus had many bugs in them, bugs which were solved long ago by dedicated teams which maintained projects solving these atomic goals (storage, robots.txt).

I joined the crawler-commons team, which focuses on solving some of the crawling functionalities like robots.txt handling and sitemap parsing, and found no reason to implement these in any crawler when this implementation is maintained by a highly skilled team.

So I began the task of separating the concerns of any web crawler, then searching for libraries solving those concerns. Then I defined APIs of the functionality I needed from each component and began implementing it using the different 3rd party implementations. That being done I wrote a set of tests to compare the different implementation of each component in terms of accuracy and performance. I then defined the flow of a web crawl (filter the url seed, fetch, parse, store the links then return the page to the user), and made sure the flow is obvious from the code, while the implementation is hidden in the different libraries used in MWC, thus any bug found should be reported to the respective libraries making MWC and the other libraries happy.