Saturday, May 10, 2008

Programming: sometimes simpler is better

I recently chose a development environment for a spare time project: I am re-working some of my old algorithms and miscelanious code (in several different programming languages) for extracting semantic information from plain text after reading through the excellent book The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. I have been working on information extraction for about 20 years (very much part time), and although most of the material in this book was familiar I found the book to be an excellent reference and a good summary for the state of the art in information extraction techniques. I have blogged before about the excellent Reuters/ClearForest system - the authors were principles at ClearForest.

I chose for this project the combination of Gambit-C Scheme, Emacs, and a few customizations of the Gambit-C Emacs code. For "mostly thinking" projects like my information extraction library, I like simplicity: a simple clean programming language and an environment that provides good editing and debugging support but otherwise stays out of my way. Professionally, I do a lot of work with Common Lisp (either Franz + ELI + Emacs, or SBCL + Slime + Emacs) but since I am basically just experimenting with algorithms I felt like using something light weight. I thought about using Ruby (with either the excellent NetBeans support or TextMate) but I like the ease with which Gambit-C Scheme can be used to build native applications or libraries (compiles to intermediate C) and I will probably want to share my information extraction program (perhaps a free and commercial version) but not release the source code. The performance of compiled Gambit-C code is also very good.

Labels: ,


Saturday, April 19, 2008

Using JSON for communicating between Ruby and Lisp or Scheme

JSON is very much lighter weight than XML and is meeting a need for easily calling some Scheme code from a Ruby program. The Scheme code I am using is old, I wrote it years ago to extract entities from plain text. Since the Scheme program starts very quickly, I am able to simply start a Scheme interpreter as a separate process using back quotes to capture any output to stdout in a Ruby string variable:
require 'json'
s = `gsi extraction.scm -e '(get-proper-names "President Bush moved to South America.")'`
# parse JSON text...
Here I am using Gambit-C Scheme interpretively. It is very easy to write any structured data in JSON format in the Scheme (or Lisp) code and parse it on the Ruby side. It is also easy to speed things up and compile my program:
gsc -c extraction.scm
gsc -link extraction.c
gcc -o extraction extraction.c extraction_.c -lgambc -I/Library/Gambit-C/current/include/ -L/Library/Gambit-C/current/lib/
which both makes the program faster and reduces the startup time down to a few milliseconds. Since the extraction program starts very quickly, this solution is good enough. However, I have been looking at an alternative idea in case I ever need to use a Lisp or Scheme program that has a long startup time. I have not done this yet, but here are my ideas for Gambit-C Scheme:This looks easy enough, but getting memory allocation and deallocation correct might be difficult. For now, I am calling my old Scheme code from Ruby the easy way :-)

Labels: , ,


This page is powered by Blogger. Isn't yours?

Subscribe to Posts [Atom]