Here's what you can use to delete documents from the Nutch index.
bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/indexes/part-00000 -queries qu.txt -output ou.txt
where
- crawl/indexes/part-00000 contains the indexes
- qu.txt is the file containing the list of queries, for example
"site:abc.xxx.com"
- ou.txt is the output file listing the URLs of the documents that will be deleted
If you just want to see which documents will be affected without actually deleting them, add the -dryrun flag.
bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/indexes/part-00000 -dryrun -queries qu.txt -output ou.txt
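Putting the pieces together, a dry-run session might look roughly like the sketch below. It assumes you run it from the Nutch install directory, that the query file holds one query per line (the format is not confirmed here), and it reuses the file names from the example above. The check for bin/nutch is only there so the script fails gracefully elsewhere.

```shell
#!/bin/sh
# Write the query file. One query per line is an assumption, not confirmed.
printf '%s\n' 'site:abc.xxx.com' > qu.txt

if [ -x bin/nutch ]; then
    # Preview only: with -dryrun nothing is deleted from the index.
    bin/nutch org.apache.nutch.tools.PruneIndexTool crawl/indexes/part-00000 \
        -dryrun -queries qu.txt -output ou.txt
    # Review the URLs that would be removed before running without -dryrun.
    if [ -f ou.txt ]; then
        cat ou.txt
    fi
else
    echo "bin/nutch not found; run this from the Nutch install directory" >&2
fi
```

Once the list in ou.txt looks right, run the same command without -dryrun to actually delete the documents.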