I open-sourced my Java KBtextmaster project
KBtextmaster reads a variety of document formats (Word, PowerPoint, PDF, OpenOffice.org, AbiWord) and performs categorization, summarization, part-of-speech tagging, and document clustering, as well as indexing and search using Lucene.
You can get it here. It is released under the GPL, with alternative licenses available if the GPL does not work for your project.