Monday, June 21, 2010

Text mining resources.

Resources for semantic search, entity extraction, classification, and other NLP-oriented approaches.
This is just a reference list of the various approaches I came across while working on my semantic mining projects. Some of these are full-fledged software platforms, while others are APIs or small, specific algorithms.
1. Yahoo Term Extractor APIs
Yahoo has an API which you can use for term extraction. It does an OK job, but there seems to be no new development on it. http://developer.yahoo.com/search/content/V1/termExtraction.html
There are projects on GitHub which wrap this API and can be used from within Rails or other languages, although using the API directly is very straightforward.
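To give a rough idea of what a term extractor returns, here is a toy frequency-based sketch in Python. It is not Yahoo's algorithm or its API; the `extract_terms` function and the tiny stopword list are made up for illustration, and a real extractor does far more (phrase detection, part-of-speech tagging, and so on).

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real extractors use much larger ones.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with", "it", "this", "that"}

def extract_terms(text, top_n=5):
    """Return up to top_n of the most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

if __name__ == "__main__":
    sample = "Text mining uses entity extraction and classification. Text mining is fun."
    print(extract_terms(sample, top_n=3))
```

Counting words like this ignores multi-word terms such as "entity extraction", which is one of the reasons the dedicated tools listed here are worth a look.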
2. GitHub projects
GitHub has many projects for term extraction, classification, etc.

a. Term-Extractor http://github.com/DRMacIver/term-extractor

b. Bayes_motel for multi-variate classification http://github.com/mperham/bayes_motel



3. RubyForge projects
These projects do classification, stemming, etc.

a. Classifier http://rubyforge.org/projects/classifier/

b. Stemmer
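
As a rough illustration of the kind of classification these libraries do, here is a minimal naive Bayes text classifier sketched in Python. It is not the code of the Classifier project; the `NaiveBayes` class, its method names, and the add-one smoothing are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy multinomial naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocab = set()                       # every word seen during training

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def train(self, category, text):
        self.doc_counts[category] += 1
        for word in self.tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        """Return the most likely category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        words = self.tokenize(text)
        best, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            # Work in log space so many small probabilities do not underflow.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = category, score
        return best

if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("spam", "buy cheap pills now")
    nb.train("ham", "meeting notes for the project")
    print(nb.classify("cheap pills"))
```

A stemmer would typically run inside `tokenize`, so that words like "meeting" and "meetings" are counted as the same feature.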

4. WEKA (a collection of machine learning algorithms) http://www.cs.waikato.ac.nz/ml/weka/
There’s also a JRuby wrapper on GitHub: http://github.com/bmaland/Eureka

5. WordNet http://wordnet.princeton.edu/

6. GATE (NLP tools) http://gate.ac.uk/

7. LingPipe. A not-so-open-source set of linguistic analysis libraries.

8. topia.termextract http://pypi.python.org/pypi/topia.termextract/

9. Open Calais. This does a lot more than entity extraction. It also does classification.

10. KEA (Keyphrase Extraction Algorithm)

11. Maui Indexer (a Google Code project)
12. Other references
a. http://alias-i.com/lingpipe/web/competition.html
b. http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html
c. Ruby related NLP: http://web.media.mit.edu/~dustin/rubyai.html