Tuesday, April 14, 2009

Nutch 1.0

Nutch has come a long way and has just released its 1.0 version. Here are some things I found along the way that might be useful to others as well.


Q: My Nutch crawler only seems to find a limited set of links and is not indexing all the documents from the website.

A:
i) Make sure the depth is large enough to crawl through all the pages, e.g.
bin/nutch crawl urldir -dir crawl-dir -depth 20 -threads 10 -topN 50

ii) Make sure the default page size limit of 65536 bytes is raised high enough to read whole pages. This is controlled by the http.content.limit property.

iii) Most important is the db.max.outlinks.per.page property. Make sure it is large enough (or -1) to cover your content. The default is 100. A sample nutch-site.xml snippet for both properties is shown after this list.
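As a rough sketch, both properties can be overridden in conf/nutch-site.xml; the values below are only illustrative, so tune them for your own content:

<property>
  <name>http.content.limit</name>
  <!-- -1 removes the truncation limit on fetched content -->
  <value>-1</value>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- -1 keeps every outlink found on a page -->
  <value>-1</value>
</property>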



Q: How do I control which sites are crawled?

A: You can control which sites are crawled in crawl-urlfilter.txt if you are using the crawl command, which is a single command that performs all the steps. If you do not use the crawl command, you can set the db.ignore.external.links property to "true" and only the sites within the seed list will be followed.
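For example, to restrict a crawl-command crawl to one site, the relevant lines in conf/crawl-urlfilter.txt look roughly like this (mysite.com is just a placeholder):

# accept anything under mysite.com
+^http://([a-z0-9]*\.)*mysite.com/
# skip everything else
-.

If you take the db.ignore.external.links route instead, it is a single property in nutch-site.xml set to true.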

Q: How do I crawl only a given set of sites? Should I be using crawl-urlfilter?

A: crawl-urlfilter.txt is only applied by the crawl command and contains the regular expressions that control which websites are fetched. A better way to do this is domain-urlfilter.txt, where you can specify which domains, TLDs or otherwise, to allow. It is just a list of domains, one per line, like
net
xxx.org
foo.bar.cn
in
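For this file to be read, the urlfilter-domain plugin has to be enabled through the plugin.includes property in nutch-site.xml. A sketch of what that could look like (the value below is only illustrative; keep whatever plugins you already have and just add urlfilter-domain):

<property>
  <name>plugin.includes</name>
  <!-- illustrative value only: your existing plugin list, plus urlfilter-domain -->
  <value>protocol-http|urlfilter-(regex|domain)|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>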

One thing to note here: if you include "in" as a domain name and one of the "in" sites, like http://www.foo.in/, redirects you to www.bar.com/in, then you will end up getting this .com site as well, because the URL filter runs before the fetch and does not know about the redirect.

If you don't want this behavior, you can set http.redirect.max to 0 (zero).
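In nutch-site.xml that would look something like the snippet below; with 0 the fetcher does not follow redirects immediately but records the targets for later, so the URL filters get a chance to reject them:

<property>
  <name>http.redirect.max</name>
  <!-- 0 = don't follow redirects during the fetch itself -->
  <value>0</value>
</property>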


Q: How do I exclude some file types/extensions which I don't want to crawl?

A: Use suffix-urlfilter.txt to list the file extensions which you don't want. This file is read by the urlfilter-suffix plugin, which has to be enabled in plugin.includes just like the domain filter above.
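A minimal sketch of entries in conf/suffix-urlfilter.txt (one suffix per line; the default file shipped with Nutch is the best starting point, so extend that rather than writing one from scratch):

# suffixes listed here are excluded from the crawl
.gif
.jpg
.png
.zip
.exe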
