Wednesday, April 29, 2009

Deleting documents from the Nutch index

Here's the command you can use to delete documents from a Nutch index:

bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/indexes/part-00000 -queries qu.txt -output ou.txt

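where

- crawl/indexes/part-00000 is the directory that contains the index

- qu.txt is a file with a list of queries in it; documents matching a query are the ones deleted. One example query:
"site:abc.xxx.com"

- ou.txt is the output file, which lists the URLs of the documents that are (or, in a dry run, would be) deleted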

If you just want to see which documents would be affected without actually deleting them, add the -dryrun flag:

bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/indexes/part-00000 -dryrun -queries qu.txt -output ou.txt
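
The two commands above fit naturally into a preview-then-delete workflow. Here is a minimal sketch of the preview step, assuming you run it from the Nutch install directory, that the query file holds one query per line, and that the quotation marks around the example query are not part of the query itself. The site name is just the placeholder used above.

```shell
#!/bin/sh
# Sketch: preview a prune with -dryrun before deleting anything.
# Assumptions: run from the Nutch install directory; the index lives in
# crawl/indexes/part-00000; one query per line in the query file.
set -e

INDEX_DIR="crawl/indexes/part-00000"

# Write the query file (placeholder site from the example above).
printf '%s\n' 'site:abc.xxx.com' > qu.txt

if [ -x bin/nutch ]; then
    # Dry run: list the affected URLs in ou.txt without deleting them.
    bin/nutch org.apache.nutch.tools.PruneIndexTool "$INDEX_DIR" \
        -dryrun -queries qu.txt -output ou.txt
    echo "Review ou.txt, then rerun without -dryrun to delete."
else
    echo "bin/nutch not found; run this from the Nutch install directory." >&2
fi
```

Once ou.txt lists only the documents you expect, run the same command without -dryrun to actually delete them.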
