Wednesday, November 11, 2009

MongoDB has good support for indexing and search, including prefix matching for AJAX completion lists

I have been spoiled by great support for indexing and search in relational databases (e.g., Sphinx, native search in PostgreSQL and MySQL, etc.)

I was pleased to discover, after a little bit of hacking this morning, how easy it is to do indexing and search using the MongoDB document-centered database. I have two common use cases for search, and MongoDB seems to handle both of them fairly well.

My approach does require combining search results for multiple search terms in application code, but that is OK. Assuming the use of MongoRecord, here is a code snippet:
class Recipe < MongoRecord::Base
  collection_name :recipes
  fields :name, :directions, :words
  def to_s
    "recipe: #{name} directions: #{directions[0..20]}..."
  end
  def Recipe.make collection, name, directions
    collection.insert({:_id => Mongo::ObjectID.new, :name => name,
                       :directions => directions,
                       :words => (name + ' ' + directions).split.uniq})
  end
end

host = 'localhost'
port = Mongo::Connection::DEFAULT_PORT
MongoRecord::Base.connection = Mongo::Connection.new(host,port).db('mongorecord-test')

db = MongoRecord::Base.connection

coll = db.collection('recipes')
coll.remove({})

coll.create_index(:words, Mongo::ASCENDING)

Recipe.make coll, 'Rice Soup', 'Cook the rice, then add extra water to thin it out.'
Recipe.make coll, 'Cheese and Rice Crackers', 'Slice the cheese and layer on top of crackers.'

puts "\nSimple find"
puts Recipe.find_by_name(:name => 'Rice Soup').to_s

puts "\nFind recipes by case-insensitive prefix match in array of words /^water/i"
Recipe.find(:all, :conditions => {:words => /^water/i}).each { |row| puts row.to_s }
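According to the MongoDB documentation, a rooted (prefix) regular expression match like /^water/ can use an index, just as a relational database match of the form LIKE 'water%' does. Note that the documentation limits this to case-sensitive matches: with the i flag, as in /^water/i above, the query still works but cannot use the index efficiently. Storing the words in lowercase and lowercasing each search term keeps prefix queries case-insensitive in effect and index-friendly.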
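To make the "combine results for multiple search terms in application code" step concrete, here is a minimal in-memory sketch in plain Ruby. It does not talk to MongoDB at all: the RECIPES array and the index_words, prefix_match, and search_all_terms helpers are hypothetical names that I made up for illustration. Each term is matched as a rooted prefix against lowercased words, and the per-term results are intersected so that only recipes matching every term survive.

```ruby
# Hypothetical in-memory stand-in for the recipes collection.
RECIPES = [
  { :name => 'Rice Soup',
    :directions => 'Cook the rice, then add extra water to thin it out.' },
  { :name => 'Cheese and Rice Crackers',
    :directions => 'Slice the cheese and layer on top of crackers.' }
]

# Build the word list for a recipe: lowercase, drop punctuation, dedupe.
def index_words(recipe)
  text = recipe[:name] + ' ' + recipe[:directions]
  text.downcase.scan(/[a-z0-9]+/).uniq
end

# Names of recipes having at least one word that starts with the prefix.
def prefix_match(recipes, prefix)
  pattern = /\A#{Regexp.escape(prefix.downcase)}/
  hits = recipes.select { |r| index_words(r).any? { |w| w =~ pattern } }
  hits.map { |r| r[:name] }
end

# Combine per-term results in application code: a recipe must match
# every term in the query (set intersection of the per-term results).
def search_all_terms(recipes, query)
  per_term = query.split.map { |term| prefix_match(recipes, term) }
  per_term.inject { |a, b| a & b } || []
end

puts search_all_terms(RECIPES, 'ric wat').inspect  # ["Rice Soup"]
puts search_all_terms(RECIPES, 'chee').inspect     # ["Cheese and Rice Crackers"]
```

Intersection gives "all terms must match" semantics; swapping `a & b` for `a | b` would give "any term matches" instead, which may suit an AJAX completion list better.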

I am still in a learning mode with MongoDB, so I would appreciate any comments on improving this approach.



Monday, June 08, 2009

Ruby client for search and spelling correction using Bing's APIs

I noticed that Microsoft allows free use of their search and spelling correction APIs. I just played with the APIs for a few minutes. Here is a Ruby code snippet that I just wrote:
API_KEY = ENV['BING_API_KEY']

require 'rubygems' # needed for Ruby 1.8.x
require 'cgi'      # provides CGI.escape, used below
require 'simple_http'
require 'json'

def search query
  uri = "http://api.search.live.net/json.aspx?AppId=#{API_KEY}&Market=en-US&Query=#{CGI.escape(query)}&Sources=web+spell&Web.Count=4"
  JSON.parse(SimpleHttp.get(uri))["SearchResponse"]["Web"]["Results"]
end

def correct_spelling text
  uri = "http://api.search.live.net/json.aspx?AppId=#{API_KEY}&Market=en-US&Query=#{CGI.escape(text)}&Sources=web+spell&Web.Count=1"
  JSON.parse(SimpleHttp.get(uri))["SearchResponse"]["Spell"]["Results"][0]["Value"]
end
You need a free Bing API key - notice that I set the key value in my environment. If you get a key, then try:
search "semantic web java ruby lisp"
correct_spelling "semaantic web jaava ruby lisp"
The first method does spelling correction before search.
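Here is a slightly more defensive sketch of the same two steps that uses only the Ruby standard library and makes no network call. The endpoint, parameter names, and the "SearchResponse"/"Spell"/"Results"/"Value" path come from the snippet above; the bing_uri and spelling_from helper names, the 'KEY' placeholder, and the sample JSON strings are hypothetical. I am also assuming that a response for correctly spelled input simply has no "Spell" section, in which case the original text is returned instead of raising an error.

```ruby
require 'uri'
require 'json'

# Endpoint taken from the snippet above.
BING_ENDPOINT = 'http://api.search.live.net/json.aspx'

# Build the request URI; URI.encode_www_form escapes every value,
# so spaces and '&' inside the query are safe.
def bing_uri(app_id, query, count)
  params = [['AppId', app_id], ['Market', 'en-US'], ['Query', query],
            ['Sources', 'web spell'], ['Web.Count', count]]
  BING_ENDPOINT + '?' + URI.encode_www_form(params)
end

# Return the spelling suggestion from a JSON response body, or the
# original text when there is no "Spell" section (an assumption about
# how the service responds to correctly spelled input).
def spelling_from(response_json, original)
  spell = JSON.parse(response_json).fetch('SearchResponse', {})['Spell']
  results = spell && spell['Results']
  (results && results[0] && results[0]['Value']) || original
end

# Hypothetical sample payloads shaped like the responses read above.
with_fix = '{"SearchResponse":{"Spell":{"Results":[{"Value":"semantic web"}]}}}'
no_fix   = '{"SearchResponse":{"Web":{"Results":[]}}}'

puts bing_uri('KEY', 'semaantic web', 1)
puts spelling_from(with_fix, 'semaantic web')
puts spelling_from(no_fix, 'semantic web')
```

Separating URI construction and response handling from the HTTP call makes both pieces easy to check without an API key.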



Tuesday, January 02, 2007

Nutch: a platform is born

I have used Nutch for two contracting jobs and Lucene for many jobs. Until today, I have viewed Nutch simply as:

Today, however, I started looking more closely at the underlying Hadoop architecture (modeled on Google's distributed file system and MapReduce library) and at both the available plugins and the plugin architecture. New opinion: Nutch is a platform for building more complex web applications and knowledge management applications.



Thursday, December 21, 2006

Correction: Google SOAP Search APIs

Google has stopped issuing new API keys to developers and is adding no new resources to support the search APIs, but: "you can continue to execute queries, and we have no plans to turn off the service in the future".

Better news!



Tuesday, December 19, 2006

Google search API: rest in peace

I have been using Google's search API (limited to 1000 queries a day) for research and demos since 2002. It was a cool service, and I am sorry to see it go.

I also use Yahoo's free search API and sometimes run Nutch, which supports the OpenSearch API.

My favorite use of Google's API was a "who/where" natural language processing question answering demo that I used to run on knowledgebooks.com.



Saturday, November 25, 2006

AJAX tools for multiple development platforms

I feel like I have come full circle (almost) in AJAX development: I started out a few years ago adding some simple AJAX enabled forms to a JSP based application. When first starting out, it took hours to get something working. Then a year ago, I discovered how simple it is to use AJAX in Rails: fine, except that Ruby does not have high enough performance for some applications (unless large parts are written in C - Ferret, for example).

I have spent many evenings playing around with various release versions of the Google Web Toolkit (GWT) and it is very compelling, especially if you already are used to developing GUI applications in Java - the only new wrinkle worth mentioning is getting used to handling events asynchronously. The problem with GWT is that it really does tie you to the Java platform. I spend most of my time developing AI applications, but that said, who does not want basic knowledge and competence at building web applications?

I use Common Lisp, Java, and Ruby for development, so for the occasional AJAX tasks that I have, I have settled on the well respected Dojo JavaScript toolkit because it plays very well with both Lisp and Java JSP based web applications. Dojo is also easy to use with Rails, if you want an alternative to Rails' great AJAX support.

By using Dojo, I basically have to deal with just one learning curve no matter what platform I am developing on. Here is a simple example for a JSP page (assuming that this would be more interesting to most people than a Lisp example):

Add this to your <head> section on a top level JSP page:

<script type="text/javascript">
  var djConfig = { isDebug: true };
</script>
<script type="text/javascript" src="dojo.js"></script>
<script type="text/javascript">
  dojo.require("dojo.io.*");
  dojo.require("dojo.event.*");
</script>
Then try putting this somewhere on your JSP page. Note: this assumes that you have another JSP page, ajax.jsp, that gets the form values and returns content to be placed into the DIV element with ID=ajxplaydiv. Anything that ajax.jsp writes using out.print() gets inserted asynchronously into the "ajxplaydiv" DIV element:

<h3>test AJAX Form:</h3>
<form id="myForm">
  <input type="text" name="input_test_form_text" />
  <input type="button" name="button1"
         value="Try AJAX form" id="ajaxButton" />
</form>

<div id="ajxplaydiv">
  initial context for AJAX play div
</div>

<script type="text/javascript">

  var buttonObj = document.getElementById("ajaxButton");
  dojo.event.connect(buttonObj, "onclick",
                     this, "my_onclick_button");

  function my_onclick_button() {
    var ajaxargs = {
      url: "ajax.jsp",
      load: function(type, data, evt){
        dojo.byId('ajxplaydiv').innerHTML = data;
      },
      error: function(type, data, evt){
        alert("Error occurred!");
      },
      mimetype: "text/plain",
      formNode: document.getElementById("myForm")
    };
    dojo.io.bind(ajaxargs);
  }
</script>
The call to dojo.io.bind sends a request containing the form data to the ajax.jsp page, and when the response comes back, one of the JavaScript callbacks "load" or "error" defined in the ajaxargs object in my_onclick_button() is called. This is just one example of processing form data and adding HTML below the form without refreshing the page, but it is a common use case for AJAX. This example assumes that I have both the Dojo "src" directory and dojo.js in my top level web resources directory. The great thing about Dojo is that it encapsulates all asynchronous event handling, offers a good supply of visual components (that map to HTML elements), and is simple to use no matter what programming language you are using for a web application.
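Since the client side is the same on every platform, here is a rough sketch of what the server side might look like in Ruby instead of JSP. This is not the ajax.jsp page from the example: AJAX_HANDLER is a hypothetical name, the handler is written as a Rack-style callable (request environment in, [status, headers, body] out), and I am assuming that the form values arrive as a query string. The only thing it shares with the example is the form field name input_test_form_text. Because the page assigns the response to innerHTML, the sketch HTML-escapes the submitted text before echoing it.

```ruby
require 'uri'
require 'erb'

# Hypothetical Ruby counterpart to ajax.jsp: whatever goes into the
# body is what the "load" callback receives as its data argument.
AJAX_HANDLER = lambda do |env|
  # Decode "a=1&b=2" style form data into a hash of field => value.
  params = Hash[URI.decode_www_form(env['QUERY_STRING'].to_s)]
  text = params['input_test_form_text'].to_s
  # Escape the user input because the page inserts it via innerHTML.
  body = "You typed: #{ERB::Util.html_escape(text)}"
  [200, { 'Content-Type' => 'text/plain' }, [body]]
end

# Call the handler directly with a sample environment (no web server).
status, headers, body =
  AJAX_HANDLER.call('QUERY_STRING' => 'input_test_form_text=rice+%26+soup')
puts status
puts body.join   # You typed: rice &amp; soup
```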



Sunday, April 30, 2006

Information: organization vs. overload

I will get to the topic in the title, but first: This month's issue of the Communications of the ACM has a great series of articles on exploratory search, with lots of good ideas on organizing sets of search results rather than single documents, clustering vs. categorization, etc. Another good read: the May issue of Wired magazine has good coverage of vblogs and online video - the sort of grass-roots publishing that I like :-)

There is so much information to absorb and use for any type of knowledge worker that it takes time and effort to stay up to date with what we need to do our jobs. Much of my work involves writing custom software (usually layered on open source) for information management in specific industries/applications (large scale search, document categorization and repository maintenance, AI style data mining, agent technology to assist by bringing important things to the user's attention, etc.), but I find it ironic that I cannot seem to set aside the time to write much custom code for my own information needs (take care of customers first!). And, so far, it always seems to take custom code to solve specific information management problems. From what I have seen, there is not yet any silver bullet.

I have some ideas for exactly what tools I want for my own work flow and how I might "productize" them, but for now I have an ad hoc system using Subversion repositories, local directories organized by topic and augmented with local search, and del.icio.us to organize bookmarks for material on the web. If I can set aside the time, I would like to integrate more of what I use in my own work flow.

What about search? Well, search is not information management. If/when semantic web technologies become more widely used, then software agents will be able to treat the web as an information source and be able to do research either without human intervention, or at least be valued assistants. CEOs of companies have well trained staff to filter and organize information - what will the effects be on society and the economy in the future when most people will have free or inexpensive software agents that can compete with well trained human staff? A nice thought but there will always be selective advantages to better information management systems.


