Wednesday, November 11, 2009
MongoDB has good support for indexing and search, including prefix matching for AJAX completion lists
I have been spoiled by great support for indexing and search in relational databases (e.g., Sphinx, native search in PostgreSQL and MySQL, etc.).
I was pleased to discover, after a little bit of hacking this morning, how easy it is to do indexing and search using the MongoDB document-centered database. I have two common use cases for search, and MongoDB seems to handle both of them fairly well:
- Search for words inside of text fields
- Efficient word prefix search to support AJAX "suggest" style lists
require 'rubygems' # needed for Ruby 1.8.x
require 'mongo'
require 'mongo_record'
class Recipe < MongoRecord::Base
  collection_name :recipes
  fields :name, :directions, :words
  def to_s
    "recipe: #{name} directions: #{directions[0..20]}..."
  end
  # Store every unique word of the name and directions in the :words array
  def Recipe.make collection, name, directions
    collection.insert({:_id => Mongo::ObjectID.new, :name => name,
                       :directions => directions,
                       :words => (name + ' ' + directions).split.uniq})
  end
end
host = 'localhost'
port = Mongo::Connection::DEFAULT_PORT
MongoRecord::Base.connection = Mongo::Connection.new(host,port).db('mongorecord-test')
db = MongoRecord::Base.connection
coll = db.collection('recipes')
coll.remove({})
# Index the array of words so that word lookups do not scan every document
coll.create_index(:words, Mongo::ASCENDING)
Recipe.make coll, 'Rice Soup', 'Cook the rice, then add extra water to thin it out.'
Recipe.make coll, 'Cheese and Rice Crackers', 'Slice the cheese and layer on top of crackers.'
puts "\nSimple find"
puts Recipe.find_by_name(:name => 'Rice Soup').to_s
puts "\nFind recipe by regular expression (ignoring case) in array of words /^water/i"
Recipe.find(:all, :conditions => {:words => /^water/i}).each { |row| puts row.to_s }
According to the MongoDB documentation, a rooted regular expression match like /^water/ will use an index, just as a relational database match of the form LIKE 'water%' does. The documentation limits this to case-sensitive matches: with the i flag, as in /^water/i above, the query still returns the right rows but cannot use the index efficiently, so for large collections it is better to store lowercased words and drop the flag.
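Here is a minimal sketch, in plain Ruby with no MongoDB connection, of why a rooted prefix match can be answered from a sorted index. The names index_words and prefix_matches are hypothetical, and the sorted array only stands in for the database index; it is not how MongoDB is implemented. Lowercasing and stripping punctuation at index time also means the lookup itself can stay case-sensitive:

```ruby
# Hypothetical helper: lowercase the text and keep only unique alphanumeric
# words, so "rice," and "Rice" both become "rice".
def index_words(text)
  text.downcase.scan(/[a-z0-9]+/).uniq
end

# Binary-search a sorted word list for the first word >= prefix, then read
# forward while words still start with the prefix. A sorted index can serve
# a rooted match like /^wa/ the same way, without scanning every word.
def prefix_matches(sorted_words, prefix)
  start = sorted_words.bsearch_index { |word| word >= prefix }
  return [] if start.nil?
  sorted_words[start..-1].take_while { |word| word.start_with?(prefix) }
end

words = index_words('Rice Soup Cook the rice, then add extra water to thin it out.').sort
puts prefix_matches(words, 'wa').inspect # prints ["water"]
puts prefix_matches(words, 'th').inspect # prints ["the", "then", "thin"]
```

The same normalization could be applied to the :words array before inserting a recipe, so that an AJAX "suggest" query only needs a lowercased prefix.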
I am still in learning mode with MongoDB, so I would appreciate any comments on improving this approach.
Monday, June 08, 2009
Ruby client for search and spelling correction using Bing's APIs
I noticed that Microsoft allows free use of their search and spelling correction APIs. I just played with the APIs for a few minutes. Here is a Ruby code snippet that I just wrote:
require 'rubygems' # needed for Ruby 1.8.x
require 'cgi' # for CGI.escape
require 'simple_http'
require 'json'
API_KEY = ENV['BING_API_KEY']
# Return the list of web results for a query
def search query
  uri = "http://api.search.live.net/json.aspx?AppId=#{API_KEY}&Market=en-US&Query=#{CGI.escape(query)}&Sources=web+spell&Web.Count=4"
  JSON.parse(SimpleHttp.get(uri))["SearchResponse"]["Web"]["Results"]
end
# Return the first suggested spelling correction for the text
def correct_spelling text
  uri = "http://api.search.live.net/json.aspx?AppId=#{API_KEY}&Market=en-US&Query=#{CGI.escape(text)}&Sources=web+spell&Web.Count=1"
  JSON.parse(SimpleHttp.get(uri))["SearchResponse"]["Spell"]["Results"][0]["Value"]
end
You need a free Bing API key - notice that I set the key value in my environment. If you get a key, then try:
search "semantic web java ruby lisp"
correct_spelling "semaantic web jaava ruby lisp"
The first method does spelling correction before search.
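The two methods above build the same request by hand and assume that the response always contains a "Spell" section. As a hedged sketch using only the standard library, the query string can be built from a hash and the correction read defensively. The names build_bing_uri and spelling_from are hypothetical, and the sample JSON is made up to mirror the keys that the code above reads; it is not captured output from the service:

```ruby
require 'json'
require 'uri'

BING_ENDPOINT = 'http://api.search.live.net/json.aspx'

# Hypothetical helper: let URI.encode_www_form handle the escaping.
def build_bing_uri(app_id, query, web_count)
  params = {
    'AppId' => app_id,
    'Market' => 'en-US',
    'Query' => query,
    'Sources' => 'web spell', # encoded as web+spell
    'Web.Count' => web_count.to_s
  }
  "#{BING_ENDPOINT}?#{URI.encode_www_form(params)}"
end

# Hypothetical helper: return the first suggested spelling, or the fallback
# text when the response carries no "Spell" results.
def spelling_from(json_text, fallback)
  results = JSON.parse(json_text).dig('SearchResponse', 'Spell', 'Results')
  results && results.first ? results.first['Value'] : fallback
end

# Made-up sample response, shaped like the keys used by correct_spelling.
sample = '{"SearchResponse":{"Spell":{"Results":[{"Value":"semantic web"}]}}}'
puts build_bing_uri('KEY', 'semaantic web', 1)
puts spelling_from(sample, 'semaantic web')
puts spelling_from('{"SearchResponse":{}}', 'ruby lisp')
```

Returning the original text as the fallback means a caller can always pass the result straight on to a search.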
Tuesday, January 02, 2007
Nutch: a platform is born
I have used Nutch for two contracting jobs and Lucene for many jobs. Until today, I have viewed Nutch simply as:
- Quick to configure for target websites to spider and to administer spidering
- Trivial to run search web application
- Web service provider (OpenSearch API)
Thursday, December 21, 2006
Correction: Google SOAP Search APIs
Google has stopped issuing new API keys to developers and is adding no new resources to support the search APIs, but: "you can continue to execute queries, and we have no plans to turn off the service in the future".
Better news!
Labels: search
Tuesday, December 19, 2006
Google search API: rest in peace
I have been using Google's search API (limited to 1000 queries a day) for research and demos since 2002. It was a cool service, and I am sorry to see it go.
I also use Yahoo's free search API and sometimes run Nutch, which supports the OpenSearch API.
My favorite use of Google's API was a "who/where" natural language processing question answering demo that I used to run on knowledgebooks.com.
Labels: search
Saturday, November 25, 2006
AJAX tools for multiple development platforms
I feel like I have come full circle (almost) in AJAX development: I started out a few years ago adding some simple AJAX-enabled forms to a JSP-based application. When first starting out, it took hours to get something working. Then a year ago, I discovered how simple it is to use AJAX in Rails: fine, except that Ruby does not have high enough performance for some applications (unless large parts are written in C - Ferret, for example).
I have spent many evenings playing around with various release versions of the Google Web Toolkit (GWT) and it is very compelling, especially if you are already used to developing GUI applications in Java - the only new wrinkle worth mentioning is getting used to handling events asynchronously. The problem with GWT is that it really does tie you to the Java platform. I spend most of my time developing AI applications, but that said, who does not want basic knowledge and competence at building web applications?
I use Common Lisp, Java, and Ruby for development, so for the occasional AJAX tasks that I have, I have settled on the well-respected Dojo JavaScript toolkit because it plays very well with both Lisp and Java JSP-based web applications. Dojo is also easy to use with Rails, if you want an alternative to Rails' great AJAX support.
By using Dojo, I basically have to deal with just one learning curve no matter what platform I am developing on. Here is a simple example for a JSP page (assuming that this would be more interesting to most people than a Lisp example):
Add this to your <head> section on a top level JSP page:
<script type="text/javascript">
  var djConfig = { isDebug: true };
</script>
<script type="text/javascript" src="dojo.js"></script>
<script type="text/javascript">
  dojo.require("dojo.io.*");
  dojo.require("dojo.event.*");
</script>
Then try putting this somewhere on your JSP page (note: this assumes that you have another JSP page, ajax.jsp, that gets the form values and returns content to be placed into the DIV element with ID=ajxplaydiv. Anything that ajax.jsp writes using out.print() gets inserted asynchronously into the "ajxplaydiv" DIV element):
<h3>test AJAX Form:</h3>
<form id="myForm">
  <input type="text" name="input_test_form_text" />
  <input type="button" name="button1"
         value="Try AJAX form" id="ajaxButton" />
</form>
<div id="ajxplaydiv">
  initial context for AJAX play div
</div>
<script type="text/javascript">
  var buttonObj = document.getElementById("ajaxButton");
  dojo.event.connect(buttonObj, "onclick",
                     this, "my_onclick_button");
  function my_onclick_button() {
    var ajaxargs = {
      url: "ajax.jsp",
      load: function(type, data, evt){
        dojo.byId('ajxplaydiv').innerHTML = data;
      },
      error: function(type, data, evt){
        alert("Error occurred!");
      },
      mimetype: "text/plain",
      formNode: document.getElementById("myForm")
    };
    dojo.io.bind(ajaxargs);
  }
</script>
The call to dojo.io.bind sends a request containing the form data to the ajax.jsp page, and whenever the results are returned, one of the JavaScript functions "load" or "error" defined in the data block in my_onclick_button() is called. This is just one example of processing form data and adding HTML below the form without refreshing the page - but it is a common use case for AJAX. This example assumes that I have both the dojo "src" directory and dojo.js in my top level web resources directory. The great thing about dojo is that it encapsulates all asynchronous event handling, offers a good supply of visual components (that map to HTML elements) and is simple to use no matter what programming language you are using for a web application.
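The form above posts to ajax.jsp, which is not shown. As a rough sketch of what such a handler has to do, here is the same idea as a plain Ruby function. The name ajax_fragment is hypothetical and this is not the ajax.jsp behind the example: it reads the submitted field, escapes it, and returns the HTML fragment that the "load" callback would place into the "ajxplaydiv" DIV element:

```ruby
require 'cgi'

# Hypothetical stand-in for ajax.jsp: params is a hash of submitted form
# values keyed by field name, as a Rails or Rack handler would receive them.
def ajax_fragment(params)
  text = params.fetch('input_test_form_text', '').to_s.strip
  return '<p>No text was entered.</p>' if text.empty?
  # Escape user input before it is inserted into the page through innerHTML.
  "<p>You entered: #{CGI.escapeHTML(text)}</p>"
end

puts ajax_fragment('input_test_form_text' => 'rice <soup>')
puts ajax_fragment({})
```

Escaping matters here because the client assigns the returned text directly to innerHTML.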
Labels: AJAX, Java, Javascript, Ruby, search
Sunday, April 30, 2006
Information: organization vs. overload
I will get to the topic in the title, but first: This month's issue of the Communications of the ACM has a great series of articles on exploratory search: lots of good ideas on organizing sets of search results rather than single documents, clustering vs. categorization, etc. Another good read: the May issue of Wired magazine has good coverage of vblogs and online video - the sort of grass-roots publishing that I like :-)
There is so much information to absorb and use for any type of knowledge worker that it takes time and effort to stay up to date with what we need to do our jobs. Much of my work involves writing custom software (usually layered on open source) for information management in specific industries/applications (large scale search, document categorization and repository maintenance, AI style data mining, agent technology to assist by bringing important things to the user's attention, etc.) but I find it ironic that I cannot seem to set aside the time to write much custom code for my own information needs (take care of customers first!). And, so far, it always seems to take custom code to solve specific information management problems. From what I have seen, there is not yet any silver bullet.
I have some ideas for exactly what tools I want for my own work flow and how I might "productize" them, but for now I have an ad hoc system using Subversion repositories, local directories organized by topic and augmented with local search, and del.icio.us to organize bookmarks for material on the web. If I can set aside the time, I would like to integrate more of what I use in my own work flow.
What about search? Well, search is not information management. If/when semantic web technologies become more widely used, then software agents will be able to treat the web as an information source and be able to do research either without human intervention, or at least be valued assistants. CEOs of companies have well trained staff to filter and organize information - what will the effects be on society and the economy in the future when most people will have free or inexpensive software agents that can compete with well trained human staff? A nice thought but there will always be selective advantages to better information management systems.
Labels: knowledge management, search
