Mark Watson’s Artificial Intelligence Books and Blog

Using Lucene with JRuby

Mark Watson
Jun 17, 2007

I use the Ruby Ferret indexing and search library a lot. Ferret is a port (some Ruby, mostly C) of Lucene. I have recently been getting into JRuby. A few days ago, I discovered that it was reasonably easy to run a simple Rails web application on the Java application server JBoss with JRuby (this took me an hour - next time will be easy). Today, I spent a short while getting Lucene and JRuby working together:

require "java"
require "lib/lucene-core-2.1.0.jar"

# Thin wrapper around a Lucene index stored on disk at index_path.
class Lucene
  def initialize(an_index_path = "data/")
    @index_path = an_index_path
  end
  # Add or replace documents; creates the index on first use.
  def add_documents id_text_pair_array # e.g., [[1,"test1"],[2,'test2']]
    index_available = org.apache.lucene.index.IndexReader.index_exists(@index_path)
    index_writer = org.apache.lucene.index.IndexWriter.new(
          @index_path,
          org.apache.lucene.analysis.standard.StandardAnalyzer.new,
          !index_available)
    id_text_pair_array.each {|id_text_pair|
      term_to_delete = org.apache.lucene.index.Term.new("id", id_text_pair[0].to_s) # if it exists
      a_document = org.apache.lucene.document.Document.new
      a_document.add(org.apache.lucene.document.Field.new('text', id_text_pair[1],
                       org.apache.lucene.document.Field::Store::YES,
                       org.apache.lucene.document.Field::Index::TOKENIZED))
      # index the id as one untokenized term so the Term used for update/delete matches it exactly
      a_document.add(org.apache.lucene.document.Field.new('id', id_text_pair[0].to_s,
                       org.apache.lucene.document.Field::Store::YES,
                       org.apache.lucene.document.Field::Index::UN_TOKENIZED))
      index_writer.updateDocument(term_to_delete, a_document) # delete any old docs with same id
    }
    index_writer.close
  end
  # Search the 'text' field; returns an array of [score, id, text] triples.
  def search(query)
    parse_query = org.apache.lucene.queryParser.QueryParser.new(
         'text',
         org.apache.lucene.analysis.standard.StandardAnalyzer.new)
    query = parse_query.parse(query)
    engine = org.apache.lucene.search.IndexSearcher.new(@index_path)
    hits = engine.search(query).iterator
    results = []
    while (hits.hasNext && hit = hits.next)
      id = hit.getDocument.getField("id").stringValue.to_i
      text = hit.getDocument.getField("text").stringValue
      results << [hit.getScore, id, text]
    end
    engine.close
    results
  end
  # Remove every document whose 'id' field matches one of the given ids.
  def delete_documents id_array # e.g., [1,5,88]
    index_available = org.apache.lucene.index.IndexReader.index_exists(@index_path)
    index_writer = org.apache.lucene.index.IndexWriter.new(
          @index_path,
          org.apache.lucene.analysis.standard.StandardAnalyzer.new,
          !index_available)
    id_array.each {|id|
      index_writer.deleteDocuments(org.apache.lucene.index.Term.new("id", id.to_s))
    }
    index_writer.close
  end
end
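
The require at the top of the class loads lib/lucene-core-2.1.0.jar by relative path, so a misplaced JAR only surfaces as a load error. Here is a minimal sketch of a guard that could run before that require. The helper name lucene_jar_path and its parameters are hypothetical, not part of JRuby or Lucene:

```ruby
# Hypothetical helper (not part of JRuby or Lucene): return the path to the
# Lucene JAR under base_dir, or raise a descriptive error if it is absent.
def lucene_jar_path(base_dir = "lib", jar_name = "lucene-core-2.1.0.jar")
  path = File.join(base_dir, jar_name)
  unless File.file?(path)
    raise LoadError, "Lucene JAR not found at #{path}"
  end
  path
end

# Possible usage under JRuby, mirroring the original require
# (assumes the JAR really is in lib/):
#   require "java"
#   require lucene_jar_path
```

This only checks that the file exists; it does not verify that the JAR is the right Lucene version.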

This code assumes that the Java Lucene JAR file lucene-core-2.1.0.jar is in the subdirectory lib. A short test program is:

require "lucene"
require 'pp'

ls = Lucene.new
ls.add_documents([[1,"test one two"],[2,'testing 1 2 3'], [3,'this is a longer test string']])
ls.delete_documents([1])  # optional: test document delete from index
pp ls.search("test")
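
The search method returns an array of [score, id, text] triples, which pp prints raw. Here is a minimal sketch of turning those triples into ranked lines. The helper format_hits is hypothetical, and the sample scores are made up for illustration rather than real Lucene output:

```ruby
# Hypothetical helper: sort [score, id, text] triples by descending score
# and render one line per hit.
def format_hits(results)
  results
    .sort_by { |score, _id, _text| -score }
    .map { |score, id, text| format("%.3f  #%d  %s", score, id, text) }
end

# Made-up scores for illustration only; real values come from Lucene.
sample = [[0.25, 3, "this is a longer test string"], [0.5, 1, "test one two"]]
puts format_hits(sample)
```

In the test program, the last line could then become something like puts format_hits(ls.search("test")).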

I had some hesitations about JRuby: I was concerned that it would lack the lightweight feel of hacking in native Ruby. No worries though: JRuby is easy and quick to work with.
