Wednesday, April 29, 2009

Deleting documents from the Nutch index

Here's what you can use to delete documents from the Nutch index.

bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/indexes/part-00000 -queries qu.txt -output ou.txt

where
- crawl/indexes/part-00000 contains the indexes

- qu.txt is the file with a list of queries in it, one example being
"site:abc.xxx.com"

- ou.txt is the output file listing the document URLs that will be deleted

If you just want to see which documents will be affected without actually deleting them, then also use the -dryrun flag.

bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/indexes/part-00000 -dryrun -queries qu.txt -output ou.txt
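For reference, a minimal sketch of what qu.txt could look like, assuming one query per line (these host names are made up):

site:abc.xxx.com
site:def.xxx.com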

US Energy Map


NPR has published an interactive map of the US electric grid, along with some other details like power plants, solar and wind.

I have been thinking about doing something similar for a long time, but this is a great start.

The map includes:


  1. US Electric Grid with existing and proposed lines

  2. Power sources (including Coal, Nuclear, Gas, Hydro, Oil, Biomass, Wind) and their usage in various states.

  3. Power plants (including Coal, Nuclear, Gas, Hydro, Oil, Biomass, Solar, Wind) and their details

  4. Solar power capacity and solar power transmission lines

  5. Wind speed chart and proposed wind power transmission lines by and after 2030

http://www.npr.org/news/graphics/2009/apr/electric-grid/


Thursday, April 23, 2009

Apache Tomcat Native library

Error Message
"The Apache Tomcat Native library which allows optimal performance in production environments was not found"

There are many threads on the internet, but most of them fail to resolve this error. What you need are a couple of packages in order to get this working.

1) apr
2) tomcat-native

If you use yum, then you can just do

> sudo yum install apr tomcat-native

and this will do all the work; otherwise you might need to find the right RPMs/DLLs or build these from source, as sketched below.
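If you do build the native library from source, the usual steps look roughly like this (the version number, tarball layout and paths here are assumptions, so adjust for your download and distro):

tar xzf tomcat-native-1.1.16-src.tar.gz
cd tomcat-native-1.1.16-src/jni/native
./configure --with-apr=/usr/bin/apr-1-config --with-java-home=$JAVA_HOME
make
sudo make install

By default the library lands somewhere like /usr/local/apr/lib, so point Tomcat at it, for example by adding -Djava.library.path=/usr/local/apr/lib to CATALINA_OPTS (again, adjust for your setup).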

Thursday, April 16, 2009

What is a ‘green job’ anyway?

Van Jones, social activist and advisor to President Obama, says a green job is "a family-supporting, career-track job that directly contributes to preserving or enhancing environmental quality."

http://www.renewableenergyjobs.com/what_is_a_green_job/

Tuesday, April 14, 2009

Nutch 1.0

Nutch has come a long way and just released their 1.0 version. Here are some things which I found and might be useful to others as well.


Q: My Nutch crawler only seems to find a limited set of links and is not indexing all the documents from the website.

A:
i) Make sure the depth is large enough to crawl through all the pages:
bin/nutch crawl urldir -dir crawl-dir -depth 20 -threads 10 -topN 50

ii) Make sure the default page size limit of 65536 bytes is increased to an appropriate value so the whole page is read. This is defined with the http.content.limit property.

iii) Most important is the db.max.outlinks.per.page property. Make sure it's large enough (or -1) to cover your content. The default is 100. Both properties can be set in conf/nutch-site.xml, as sketched after this list.
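A minimal nutch-site.xml sketch for those two properties; the -1 values (meaning no limit) are just one choice and an assumption on my part, so tune them for your crawl:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>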



Q: How do I control which sites are crawled?

A: You can control which sites are crawled in crawl-urlfilter.txt if you are using the crawl command, which is a single command to perform all the steps. If you do not use the crawl command, then you can set the db.ignore.external.links property to "true" and it'll use only the sites within the seed list.
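Again as a nutch-site.xml sketch of the property just mentioned:

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>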

Q: How do I crawl only a given set of sites? Should I be using crawl-urlfilter?

A: crawl-urlfilter.txt is used only while crawling and contains the regular expressions to control the websites. A better way to do this is using domain-urlfilter.txt, where you can specify which domains, TLDs or otherwise, to use. It's just a list of domains like
net
xxx.org
foo.bar.cn
in

One thing to note here: if you include "in" as a domain name and one of the "in" sites like http://www.foo.in/ redirects you to www.bar.com/in, then you'll end up getting this .com site as well, because the urlfilter runs before the fetch and doesn't know about the redirect.

If you don't want this behavior, you can set http.redirect.max to 0 (zero).
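As a nutch-site.xml sketch:

<property>
  <name>http.redirect.max</name>
  <value>0</value>
</property>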


Q: How do I exclude some file types/extensions which I don't want to crawl?

A: Use suffix-urlfilter.txt to list the file extensions which you don't want to crawl.
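A sketch of what that file could contain, one suffix per line (these particular extensions are just examples; the stock conf/suffix-urlfilter.txt shipped with Nutch is a good template, and as far as I know the urlfilter-suffix plugin also has to be enabled in plugin.includes for this file to be picked up):

.gif
.jpg
.png
.zip
.exe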