How ModularWebCrawler Works
MWC strives for simplicity, so the flow of the code is written to be straightforward and understandable.
MWC starts with a seed – a first URL (or list of URLs) – and a configuration object. The code is well documented, so the configuration object should be self-explanatory.
The difference between crawls is defined by the configuration object, which contains a set of properties/flags (mostly boolean) describing the specific crawl the user wants to initiate.
In the configuration object the user can configure the politeness of the crawl, enable crawling of sitemaps, or even specify a general attitude towards crawling – whether they prefer to save storage space or to maximize speed – which MWC translates internally into different speed/storage/accuracy trade-offs. The configuration object also contains the limitations the user wants to put on the crawl, such as the crawl depth, the domains the crawler should be limited to, or the maximum number of pages to crawl.
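As a rough sketch, a configuration object along these lines could capture the knobs described above. The class, field, and method names here are illustrative assumptions, not MWC's actual API:

```java
import java.util.List;

// Hypothetical sketch of a crawl configuration object (not MWC's real class).
class CrawlConfig {
    private long politenessDelayMillis = 200;        // politeness: delay between requests
    private boolean crawlSitemaps = false;           // optionally crawl sitemaps too
    private boolean preferSpeedOverStorage = true;   // general attitude: speed vs. storage
    private int maxDepth = 3;                        // limitation: crawl depth
    private int maxPages = 10_000;                   // limitation: maximum pages to crawl
    private List<String> allowedDomains = List.of(); // limitation: empty means unrestricted

    public CrawlConfig setPolitenessDelayMillis(long ms) { politenessDelayMillis = ms; return this; }
    public CrawlConfig setCrawlSitemaps(boolean b) { crawlSitemaps = b; return this; }
    public CrawlConfig setPreferSpeedOverStorage(boolean b) { preferSpeedOverStorage = b; return this; }
    public CrawlConfig setMaxDepth(int d) { maxDepth = d; return this; }
    public CrawlConfig setMaxPages(int p) { maxPages = p; return this; }
    public CrawlConfig setAllowedDomains(List<String> domains) { allowedDomains = domains; return this; }

    public long getPolitenessDelayMillis() { return politenessDelayMillis; }
    public int getMaxDepth() { return maxDepth; }
    public int getMaxPages() { return maxPages; }
    public List<String> getAllowedDomains() { return allowedDomains; }
}
```

The chained setters let a user describe a whole crawl in one expression while every knob keeps a sensible default.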
MWC then runs the crawl, starting with the initial seed(s).
First, each seed passes through a list of filters, any of which may drop it if it does not meet the user's requirements. An example is the domain-limiter filter: if the user chooses to limit the crawl to specific domains, this filter runs for every seed, and if the seed does not belong to one of the user-defined domains it is dropped at this stage and no further processing is needed for it.
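To illustrate, such a filter could be sketched as a small interface plus a domain-limiter implementation. The interface name and signature are assumptions for this sketch, not MWC's real types:

```java
import java.net.URI;
import java.util.Set;

// Hypothetical filter contract: a URL survives the filter chain only if
// every filter accepts it.
interface UrlFilter {
    boolean accept(String url);
}

// Domain-limiter filter: drops any URL whose host is not in (or under)
// one of the user-defined domains.
class DomainFilter implements UrlFilter {
    private final Set<String> allowedDomains;

    DomainFilter(Set<String> allowedDomains) {
        this.allowedDomains = allowedDomains;
    }

    @Override
    public boolean accept(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && allowedDomains.stream()
                    .anyMatch(d -> host.equals(d) || host.endsWith("." + d));
        } catch (IllegalArgumentException e) {
            return false; // unparsable URLs are dropped as well
        }
    }
}
```

Because each filter is a tiny independent predicate, dropping a URL early means no further processing (fetching, parsing) is spent on it.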
Assuming the seed passed all filters and is still eligible for crawling, the fetcher fetches the seed URL. If the fetch fails for any reason, a matching exception is thrown and caught with an empty implementation (or a simple log line); these handlers can optionally be overridden by the user who needs different behavior. MWC then continues to fetch the next seed.
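A minimal sketch of such a fetcher, assuming Java's standard `java.net.http.HttpClient` rather than whatever client MWC actually uses. The overridable `onFetchError` handler mirrors the default log-and-continue behavior described above:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical fetcher sketch; class and method names are illustrative.
class Fetcher {
    private final HttpClient client = HttpClient.newHttpClient();

    // Default failure handler is just a log line; users who need different
    // behavior can override this method.
    protected void onFetchError(String url, Exception e) {
        System.err.println("Fetch failed for " + url + ": " + e.getMessage());
    }

    // Returns the page body, or null on failure so the caller simply
    // moves on to the next seed.
    public String fetch(String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            onFetchError(url, e);
            return null;
        } catch (IOException | IllegalArgumentException e) {
            onFetchError(url, e);
            return null;
        }
    }
}
```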
The parser then comes into play and parses the fetched web page (assuming it is not a binary file, which is handled differently), extracting all links from the page, running each of them through the filters, and putting them into the frontier DB to be fetched by an available thread. The page is then returned to the user, who can do the required work on crawled pages.
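At its simplest, link extraction could be sketched with a regular expression over `href` attributes. MWC may well use a proper HTML parser; this is only an illustration of the parsing step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative link extractor: pulls the value out of every href="..."
// or href='...' attribute in the page.
class LinkParser {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the captured href value
        }
        return links;
    }
}
```

Each extracted link would then go through the same filter chain as the seeds before being added to the frontier DB.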
The thing is that web crawling, when sliced into its atomic parts, should be really simple, so I wanted MWC to reflect that simple logic.
Logical steps done in MWC
- Run seed through filters
- Put seed into the frontierDB
- Pop a URL from frontierDB
- Fetch web page
- Parse web page, extracting all links
- Return page to the user (and store link in crawledDB)
- Send all parsed links to the filter list (back to the first step in this list)
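The steps above can be sketched as a single-threaded loop. MWC's real loop is multi-threaded and backed by a frontier DB; the in-memory deque, set, and injected fetch/parse function here are stand-ins to keep the sketch self-contained:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative single-threaded version of the crawl loop.
class CrawlLoop {
    public static Set<String> crawl(List<String> seeds,
                                    Predicate<String> filters,
                                    Function<String, List<String>> fetchAndParse,
                                    int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // stands in for frontierDB
        Set<String> crawled = new HashSet<>();       // stands in for crawledDB

        for (String seed : seeds) {
            if (filters.test(seed)) {                // 1: run seed through filters
                frontier.add(seed);                  // 2: put seed into the frontier
            }
        }
        while (!frontier.isEmpty() && crawled.size() < maxPages) {
            String url = frontier.poll();            // 3: pop a URL
            if (crawled.contains(url)) continue;     // skip already-crawled pages
            List<String> links = fetchAndParse.apply(url); // 4-5: fetch + parse
            crawled.add(url);                        // 6: record the crawled page
            for (String link : links) {              // 7: links go back to the filters
                if (filters.test(link) && !crawled.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return crawled;
    }
}
```

Injecting the fetch/parse step as a function makes the loop easy to exercise against a canned link graph instead of the live web.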