
MWC Milestone 0.2 has been Reached

MWC has reached a new milestone: v0.2.

All milestones are important and this one is no different. MWC is now in a working state, though not a perfect one, as there are so many edge cases on the world wide web; but MWC can be easily started, its internal code structure makes sense, and the most basic crawling features are implemented.

All of the above being said, there is still much work to be done until MWC is in a presentable state.

Basic features still need polishing, exception handling needs upgrading, more 3rd party implementations of the main components should be added, and of course there is the matter of advanced features (proxy support, authentication support, proper politeness support, and many more).

Anyway, this is a start, and a good one at that, so I am pleased.


ModularWebCrawler – What, Why, and How 3

How Does ModularWebCrawler Work?

As MWC strives for simplicity, the flow of the code is written to be simple and understandable.

MWC starts with a seed – a first URL (or list of URLs) – and a configuration object. The code is well documented, so the configuration object should be self-explanatory.

The difference between crawls is defined by the configuration object, which contains a set of properties/flags (mostly boolean) defining the specific crawl the user wants to initiate.

In the configuration object the user can configure the politeness of the crawl, enable crawling of sitemaps, or even specify a general attitude towards crawling – whether storage space or speed is preferred – which MWC translates internally into different decisions regarding speed, storage, accuracy, and so on. The configuration object also contains the limitations the user wants to put on the crawl, such as the depth of the crawl, the domains the crawler should be limited to, or the maximum number of pages to crawl.
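For illustration, a crawl setup along these lines might look like the sketch below. The CrawlConfig class here is a made-up stand-in and its field names are assumptions; they are not MWC's actual API.

```java
import java.util.List;

// Illustrative stand-in for a crawl configuration object.
// Field names are hypothetical, not MWC's actual API.
class CrawlConfig {
    long politenessDelayMillis = 200;              // delay between requests to the same host
    boolean crawlSitemaps = false;                 // whether to also crawl sitemap files
    boolean preferSpeedOverStorage = true;         // general attitude: speed vs. storage
    int maxDepth = 3;                              // limitation: crawl depth
    int maxPages = 5000;                           // limitation: page quota
    List<String> allowedDomains = List.of("example.com"); // limitation: allowed domains
}

public class ConfigSketch {
    public static void main(String[] args) {
        CrawlConfig config = new CrawlConfig();
        config.politenessDelayMillis = 500;        // be more polite
        config.crawlSitemaps = true;               // opt in to sitemap crawling
        config.maxDepth = 2;                       // keep the crawl shallow
        System.out.println("Would crawl " + config.allowedDomains
                + ", at most " + config.maxPages + " pages");
    }
}
```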

MWC will then run the crawl, starting with the initial seed(s).
First, the seed passes through a list of filters, any of which may drop it if it does not meet the user's requirements. An example of a filter is the domain limiter: if the user chooses to limit the crawl to specific domains, this filter runs for every seed, and any seed that does not belong to one of the user-defined domains is dropped at this stage, so no further processing is needed for that seed.

Assuming the seed passes all filters and is still eligible for crawling, the fetcher fetches the seed URL. If the fetch fails for any reason, a matching exception is thrown and caught with an empty implementation (or a simple log line); these implementations can optionally be overridden by the user if different behavior is needed. MWC then continues to fetch the next seed.
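As a sketch of what such an override could look like (the interface and method names below are assumptions, not MWC's actual API):

```java
// Hypothetical sketch of pluggable fetch-error handling.
// The interface and method names are illustrative, not MWC's actual API.
interface FetchErrorHandler {
    void onFetchFailed(String url, Exception cause);
}

// Replaces the default empty/log-only behavior with something the user needs,
// for example collecting failed URLs for a later retry pass.
class LoggingFetchErrorHandler implements FetchErrorHandler {
    @Override
    public void onFetchFailed(String url, Exception cause) {
        System.err.println("Failed to fetch " + url + ": " + cause.getMessage());
    }
}
```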

The parser then comes into play and parses the fetched web page (assuming it is not a binary file, which is handled differently), extracting all links from the page, running each of them through the filters, and putting them into the frontier DB to be fetched by an available thread. The page is then returned to the user, who does whatever work is required on crawled pages.

The thing is, web crawling, when sliced into its atomic parts, should be really simple, so I wanted MWC to reflect that simple logic.

Logical steps done in MWC

  • Run the seed through the filters
  • Put the seed into the frontierDB
  • Pop a URL from the frontierDB
  • Fetch the web page
  • Parse the web page, extracting all links
  • Return the page to the user (and store the link in the crawledDB)
  • Send all parsed links to the filter list (the first step in this list)
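A minimal sketch of that loop is shown below. The filter, fetcher, parser, and handler types are stand-ins invented for this sketch; they are not MWC's actual classes, and the in-memory queue and set merely stand in for the frontier and crawled DBs.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Minimal sketch of the crawl loop described above; all types are stand-ins.
public class CrawlLoopSketch {

    interface Filter { boolean accept(String url); }
    interface Fetcher { String fetch(String url) throws Exception; }
    interface Parser { List<String> extractLinks(String html); }
    interface PageHandler { void onPage(String url, String html); }

    public static void crawl(List<String> seeds, List<Filter> filters,
                             Fetcher fetcher, Parser parser, PageHandler handler) {
        Queue<String> frontier = new ArrayDeque<>(); // stand-in for the frontier DB
        Set<String> crawled = new HashSet<>();       // stand-in for the crawled DB

        // Run the seeds through the filters and put the survivors into the frontier.
        for (String seed : seeds) {
            if (filters.stream().allMatch(f -> f.accept(seed))) {
                frontier.add(seed);
            }
        }

        // Pop a URL, fetch, parse, return the page to the user, then send the new
        // links back through the filters.
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!crawled.add(url)) {
                continue;                            // already crawled
            }
            String html;
            try {
                html = fetcher.fetch(url);
            } catch (Exception e) {
                continue;                            // default behavior: skip (or just log) and move on
            }
            handler.onPage(url, html);               // return the page to the user
            for (String link : parser.extractLinks(html)) {
                if (filters.stream().allMatch(f -> f.accept(link))) {
                    frontier.add(link);
                }
            }
        }
    }
}
```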

ModularWebCrawler – What, Why, and How 2

Why ModularWebCrawler?

Searching around, one can find many Java web crawlers, but sometimes quantity is not the answer. That was the case for me: although many Java web crawlers exist, none satisfied my needs.

  • Many Java web crawlers are seriously outdated
  • Many have specific needs in mind and are not good generic web crawlers
  • The big Java web crawlers are too big for a small contained web crawl (as they use several machines for a crawl, or use big frameworks to run) and are a total overkill for the job
  • Some are way too complicated to start running, not having a decent interface for a simple user
  • And most of the crawlers use in-house components for the various crawling tasks, which means reinventing the wheel all over again; there are many solid and active libraries covering every part of the crawling, so there is no point trying to implement each part from scratch.

My goal is to overcome all of the above, creating a Java web crawler which will be very simple to run and understand, will be active and relevant, will use 3rd party components to do the various parts of the crawling and be configurable to do any type of crawl the user has in mind.

I am not sure why I needed a web crawler; it was only a small crawl that I needed. But when I tried the different active web crawlers, I got frustrated: I have more than 10 years of Java development behind me, and still I found it hard to start a simple crawl. Even when I managed to find a decent Java web crawler, I was frustrated by parts of it, like the internal storage and the robots.txt handling, which were custom made and thus had many bugs in them, bugs which were solved long ago by the dedicated teams maintaining projects devoted to these atomic goals (storage, robots.txt).

I joined the crawler-commons team, which focuses on some of the common crawling functionalities like robots.txt handling and sitemap parsing, and I found no reason to reimplement these in any crawler when the implementation is maintained by a highly skilled team.

So I began the task of separating the concerns of any web crawler and then searching for libraries that solve those concerns. I defined APIs for the functionality I needed from each component and began implementing them using the different 3rd party libraries. That done, I wrote a set of tests to compare the different implementations of each component in terms of accuracy and performance. I then defined the flow of a web crawl (filter the URL seed, fetch, parse, store the links, then return the page to the user) and made sure the flow is obvious from the code, while the implementation details stay hidden in the different libraries used by MWC; thus any bug found should be reported to the respective library, making MWC and the other libraries happy.


ModularWebCrawler – What, Why, and How 1

What is ModularWebCrawler?

ModularWebCrawler is a simple Java web crawler; its purpose is to crawl the web according to the user's requirements.

Web crawling is a procedure in which an initial web page (the seed) is fetched and then parsed, allowing extraction of all links from it. Those links are fetched in turn, giving the user many, many more pages to fetch, which means that the list of web pages to fetch grows exponentially, as every web page holds many links, which in turn hold many links, thus allowing the user to crawl a portion of the web.

The crawling is basically infinite (as the www is almost infinite 🙂 ), so the user should limit the web crawl according to his needs. A popular example of a crawling limitation is limiting the crawl to specific domains, so the crawler will crawl those domains until no uncrawled links from those domains are left. Another popular limitation is limiting the crawl to a specific number of crawled pages, so if the user wishes to crawl only 5000 pages, the crawler will stop its work when it reaches the required quota.
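As a rough sketch of how such limitations translate into code (the class and method names here are invented for illustration, not MWC's actual API), both boil down to simple checks applied to candidate URLs and to the running page count:

```java
import java.net.URI;
import java.util.Set;

// Rough sketch of two popular crawl limitations: a domain whitelist and a page quota.
// Names are illustrative only, not MWC's actual API.
public class CrawlLimits {
    private final Set<String> allowedDomains;
    private final int maxPages;
    private int pagesCrawled = 0;

    public CrawlLimits(Set<String> allowedDomains, int maxPages) {
        this.allowedDomains = allowedDomains;
        this.maxPages = maxPages;
    }

    /** Domain limitation: only URLs whose host is in the allowed set get crawled. */
    public boolean domainAllowed(URI url) {
        return allowedDomains.contains(url.getHost());
    }

    /** Page quota: the crawl stops once the requested number of pages has been fetched. */
    public boolean quotaReached() {
        return pagesCrawled >= maxPages;
    }

    /** Called after each successfully crawled page. */
    public void pageCrawled() {
        pagesCrawled++;
    }
}
```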

The purpose of any web crawl is fetching the crawled pages and returning them to the user, so any web crawler will fetch a page and then parse it in order to extract its links for further crawling, but before moving on to the next page it will return the crawled page to the user, who does whatever he needs with the fetched page. A pure crawler won't store the fetched pages in any storage, as that is not its purpose; the pure purpose of a web crawler is only to crawl the web according to the user's requirements and limitations and return the fetched pages to the user.

ModularWebCrawler

ModularWebCrawler is no different in those aspects from any other web crawler: it fetches, parses, and then returns the pages to the user. The difference between MWC and other Java web crawlers is in how it is built and in its defined goals.

Defined Goals

Simplicity

Simple – MWC aims to be simple in all ways:

  • Simple to run – A user can pull MWC into his project via Maven and start crawling the web within a minute using a mere 4 lines of code (2 of which are the declaration of the class!); see the sketch after this list.
  • Simple to understand – MWC is designed to be understood by any user, novice or professional, by having a simple class structure and by removing any complexity from the architecture by design.
  • Simple to dive into the code – Any user should be able to dive straight into the MWC code and understand any part of it without much fuss. The code is designed to be simple, and a lot of thought went into making it understandable: the classes are well defined, the hierarchy is well established, and the flow of the code is obvious, so the user should understand the flow of the crawler at a glance.
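The sketch below illustrates only the shape of such a minimal startup; the WebCrawler type is a hypothetical stand-in (defined inline so the example compiles), not MWC's actual class.

```java
import java.util.function.Consumer;

// Illustrative only: the shape of the minimal startup MWC aims for.
public class QuickStart {

    // Hypothetical stand-in so the sketch is self-contained; the real crawler does the actual work.
    static class WebCrawler {
        void crawl(String seed, Consumer<String> pageHandler) {
            pageHandler.accept(seed); // a real crawler would fetch, parse, and keep crawling here
        }
    }

    public static void main(String[] args) {
        WebCrawler crawler = new WebCrawler();                     // 1: create the crawler
        crawler.crawl("https://example.com", System.out::println); // 2: start the crawl with a seed
    }
}
```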

Modularity

Modular – MWC strives to be modular in any helpful way: any functionality that can be “outsourced” to a 3rd party library, gaining leverage by doing so, is implemented as an interface plus a 3rd party implementation, so it can easily be backed by different 3rd party implementations, and implementations are easily changeable by the user to match his specific needs. That being said, a recommended implementation for each component is set as the default in MWC.
The price of this modularity is a bigger MWC, as each 3rd party component might drag with it many megabytes of code and dependencies (although I am personally not worried about the size of MWC, as it is not such an issue nowadays).
Another price paid for modularity is the need to test each implementation, which is more work, although a smart set of unit tests solves most of this problem.

An example of modularity is the MWC parser. Its purpose is to parse a page fetched from the web, so instead of writing my own parser, I wrote an interface describing my needs from a parser, such as extracting the links from a page, and then implemented that interface using several page-parsing libraries like JSoup and HtmlCleaner, each implementing the same functionality declared in the interface. Additional parsers can easily be integrated into MWC, and a suite of tests comparing the parsing of a page across the different implementations was written, so I can compare the performance and accuracy of the parsing.

But if a user isn’t satisfied with the integrated parsers, he can implement the same interface using his own methods, gaining his specific requirements and functionality, like parsing badly written links which the other parsers don’t catch.
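A rough sketch of that idea follows; the PageParser interface name is an assumption rather than MWC's actual API, and jsoup is used only as an example backend (an HtmlCleaner-based or user-written class could implement the same interface).

```java
import java.util.List;
import java.util.stream.Collectors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical parser abstraction; the interface name is not MWC's actual API.
interface PageParser {
    /** Extract all absolute link URLs from the given HTML, resolved against the page's base URL. */
    List<String> extractLinks(String html, String baseUrl);
}

// One possible backend built on jsoup; other libraries or a user's own code
// could implement the same interface and be swapped in.
class JsoupPageParser implements PageParser {
    @Override
    public List<String> extractLinks(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        return doc.select("a[href]").stream()
                .map(a -> a.absUrl("href"))        // resolve relative links against the base URL
                .filter(href -> !href.isEmpty())
                .collect(Collectors.toList());
    }
}
```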

Speed

Speed is an essential part of any web crawler: the crawling process is very time consuming, and the goal of most crawls is to get the requested results as fast as possible. MWC is no different, so a lot of thought and testing was invested in making MWC as speedy as possible straight out of the box. When some functionality might be essential for some users but is time consuming, I made that functionality optional and defaulted the crawler to the option that makes the most sense (usually speed was preferred).

Specific tests measure the performance of each component in order to understand (and compare, if the need arises) how much time each component takes.

In several cases a tradeoff can be made between memory, storage, and speed; in those cases I will have a configuration option letting the user choose which aspect takes priority, so he has an easy way to sacrifice storage for speed, for example.
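As a sketch of what such a priority switch might look like (the enum and variable names are invented for illustration, not MWC's actual configuration API):

```java
// Illustrative sketch of a speed/storage/memory priority switch.
// The enum and names are hypothetical, not MWC's actual configuration API.
enum CrawlPriority { SPEED, STORAGE, MEMORY }

public class TradeoffSketch {
    public static void main(String[] args) {
        CrawlPriority priority = CrawlPriority.SPEED;
        // Internally, such a flag could decide, for example, whether fetched pages are
        // buffered in memory (faster) or written straight to disk (smaller footprint).
        boolean keepPagesInMemory = (priority == CrawlPriority.SPEED);
        System.out.println("Keep pages in memory: " + keepPagesInMemory);
    }
}
```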