In recent years, more and more sites have adopted SSL certificates to encrypt the communication with their visitors. The certificate is what lets their URL carry the “S” after the “http” prefix.
Browsers notice the encryption status and show a lock or a similar icon to tell users that the connection is secure, so they can enter their credit card details and buy whatever they want, since the traffic is encrypted and nobody is listening on the line.
If a site has a certificate but something is wrong with it, for example it has expired or it doesn’t match the address being visited (it doesn’t cover the www prefix, say), most browsers will warn you, and some of them (Chrome, for example) will block you from accessing the site by default until you click through (advanced -> I trust…).
How should a web crawler handle these pages? Should it access them like normal pages, or should it block access to them, as Chrome does by default?
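To make the two options concrete, here is a minimal sketch in Python, using only the standard `ssl` module. It is an illustration, not MWC's code: the helper names `strict_context` and `lenient_context` are made up for this example. The strict context behaves like the browser that blocks a broken certificate, and the lenient one accepts it.

```python
import ssl


def strict_context() -> ssl.SSLContext:
    # The default context verifies the certificate chain and the host name,
    # so an expired or mismatched certificate makes the handshake fail.
    return ssl.create_default_context()


def lenient_context() -> ssl.SSLContext:
    # Accepts broken certificates. The traffic is still encrypted,
    # but the identity of the server is no longer checked.
    context = ssl.create_default_context()
    # check_hostname must be switched off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context
```

Either context can then be handed to the fetching code, for example `urllib.request.urlopen(url, context=lenient_context())`.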
ModularWebCrawler’s approach is simple: crawl as many links as possible while exposing the user to minimal risk. Most web crawlers leave the decision to the user, but I believe in an opinionated approach rather than a blank one. Of course, you can configure the specifics using the CrawlerConfig object, but if you keep the default behaviour you get the following configuration out of the box:
- Regular crawling includes web pages whose SSL certificate is broken, as the risk is minimal to non-existent in a web crawler (home page, bookmarks, default search engine, toolbar: none of these exist in a web crawler).
- Authenticated crawling, meaning crawling places where MWC submits a username and password, blocks pages with broken SSL and does not crawl them, so that the credentials are not leaked to a MITM attack.
These simple rules are the default behaviour, but they can easily be tweaked to suit the user’s requirements.
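The two default rules can be sketched in a few lines of Python. This is only an illustration of the decision logic described above: `CrawlPolicy`, its field names and `should_crawl` are hypothetical and are not the actual CrawlerConfig API.

```python
from dataclasses import dataclass


@dataclass
class CrawlPolicy:
    # Hypothetical stand-in for the relevant settings, not MWC's real fields.
    # Defaults mirror the two rules: lenient for regular crawling,
    # strict whenever credentials are being sent.
    allow_broken_ssl: bool = True
    allow_broken_ssl_when_authenticated: bool = False


def should_crawl(policy: CrawlPolicy, certificate_ok: bool, sends_credentials: bool) -> bool:
    # A page with a valid certificate is always crawled.
    if certificate_ok:
        return True
    # Broken certificate: the answer depends on whether credentials are at stake.
    if sends_credentials:
        return policy.allow_broken_ssl_when_authenticated
    return policy.allow_broken_ssl
```

With the defaults, `should_crawl(CrawlPolicy(), certificate_ok=False, sends_credentials=True)` is `False`, while the same page without credentials is crawled. A stricter user could pass `CrawlPolicy(allow_broken_ssl=False)` to skip every page with a broken certificate.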
P.S. Crawling sites with broken SSL is about as risky as crawling sites with no SSL certificate at all (http://…); there is no substantial difference between the two.