Completed coding a recursive crawler. It was fun and a lot of hard work, some meditation, and lots of Google, but I finally did it. My friend Abhijeet asked me to make a recursive crawler, and I was wondering how I could do that. So I came up with the idea of keeping two lists:
1. processed list (all crawled URLs are stored here)
2. unprocessed list (all newly found URLs are stored here)
Now, if a newly found URL already exists in either of these lists, we skip it and move on (see the sketch after this paragraph). Happy crawling guys.....:)
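Here is a minimal sketch of that two-list loop, just to show the idea. This is illustrative, not the actual repo code, and it assumes Python with the `requests` and `beautifulsoup4` packages; all names are mine.

```python
# Minimal sketch of the processed/unprocessed two-list crawl loop.
# Assumption: Python with requests and beautifulsoup4 installed.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url):
    domain = urlparse(start_url).netloc
    processed = set()          # all crawled URLs end up here
    unprocessed = [start_url]  # new URLs wait here

    while unprocessed:
        url = unprocessed.pop(0)
        if url in processed:
            continue
        processed.add(url)

        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # a failed request should not break the crawl

        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            new_url = urljoin(url, link["href"])
            # stay on the given domain only
            if urlparse(new_url).netloc != domain:
                continue
            # skip URLs that are already in either list
            if new_url in processed or new_url in unprocessed:
                continue
            unprocessed.append(new_url)

    return processed
```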
This program does the following things (a sketch of the parse-and-store step follows this list)
- stores the crawled data in MongoDB
- parses the HTML for the page title, meta description, and meta keywords
- error handling keeps a failed page request from breaking the crawl
- it does not follow any domain other than the given one
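The parse-and-store step could look something like this. Again, this is only a rough sketch under my own assumptions (pymongo, a local MongoDB instance, and made-up database/collection names), not the code from the repo.

```python
# Sketch of parsing title/meta tags and saving them to MongoDB.
# Assumptions: pymongo installed, MongoDB running locally,
# database "crawler" and collection "pages" are illustrative names.
from bs4 import BeautifulSoup
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
pages = client["crawler"]["pages"]


def save_page(url, html):
    soup = BeautifulSoup(html, "html.parser")

    # page title
    title = soup.title.string.strip() if soup.title and soup.title.string else ""

    # meta description and meta keywords
    description_tag = soup.find("meta", attrs={"name": "description"})
    keywords_tag = soup.find("meta", attrs={"name": "keywords"})

    pages.insert_one({
        "url": url,
        "title": title,
        "meta_description": description_tag.get("content", "") if description_tag else "",
        "meta_keywords": keywords_tag.get("content", "") if keywords_tag else "",
    })
```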
Here is the link: https://github.com/vishvendrasingh/crawler.git