Friday, February 09, 2007

Semantic Web: through the back door with HTML and CSS

I wrote a blog entry on this subject today on my AI blog.



Saturday, December 16, 2006

Public web applications and knowledge workers

Public web applications, especially those that allow exporting my data in easily processed formats, have been the most important change in the way I use computers since a string of earlier milestones: moving from punched cards to a DEC-10 in the 1970s, buying my second home computer in 1978 (serial #71 Apple II), getting a Xerox Lisp Machine from my company SAIC in 1982, and starting to use the Internet in 1985. I now use GMail, Google Calendar, Blogger.com, Flickr, Google Documents, Gliffy.com, Google Reader, Picasa web photos, and del.icio.us as a regular part of my work process and for entertainment.

The key thing is that all of these public web services allow you to export your data for archival backup, use in utility scripts and programs, etc. Most also support, in addition to manual data export, web service interfaces.

The only work that I perform "locally" is programming in Emacs+Common Lisp, programming in various languages in Eclipse, and writing large documents using either LaTeX or OpenOffice.org. Even for "local work", all of my working materials are stored on leased managed servers in Subversion repositories.



Wednesday, October 18, 2006

More on personal information management

Public APIs for web portals like del.icio.us, "Google Office", Google Reader (referenced on the web, but I have not seen it available yet), Flickr, etc. offer something special to computer scientists that is not available to ordinary computer users: the ability to "re-purpose" your own data and meta-data that happens to live on public web portals.

While there are very good systems like Piggy Bank for creating your own meta-data store (in RDF) for web sites that you visit, this requires extra work. The sweet spot for automating the collection and use of our own meta-data and data is being able to automatically use information that you may already have: your own del.icio.us tags, tags you have applied to RSS feed items in Google Reader, etc. Most of us use public web portals as part of our work flow, but there are definitely unexplored possibilities for customizing our own knowledge management environments.
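As a rough illustration of re-purposing tag data, here is a minimal Python sketch that turns an exported bookmark list into a tag index. The XML layout (a `post` element with `href` and space-separated `tag` attributes), the sample URLs, and the name `build_tag_index` are assumptions made for this example, not the documented format of any particular service.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical export: the <post href="..." tag="..."> layout is assumed
# for illustration only.
SAMPLE_EXPORT = """
<posts user="example">
  <post href="http://example.org/rdf-intro" description="RDF intro" tag="rdf semweb"/>
  <post href="http://example.org/lisp-tips" description="Lisp tips" tag="lisp"/>
  <post href="http://example.org/owl-notes" description="OWL notes" tag="semweb owl"/>
</posts>
"""

def build_tag_index(export_xml):
    """Map each tag to the sorted list of bookmarked URLs that carry it."""
    index = defaultdict(set)
    root = ET.fromstring(export_xml.strip())
    for post in root.iter("post"):
        # Tags are assumed to be space-separated inside one attribute.
        for tag in post.get("tag", "").split():
            index[tag].add(post.get("href"))
    return {tag: sorted(urls) for tag, urls in index.items()}

tag_index = build_tag_index(SAMPLE_EXPORT)
print(tag_index["semweb"])
```

An index like this could then feed local search or be merged with tags from other services, which is the kind of customization the paragraph above has in mind.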



Sunday, September 03, 2006

Disconnect between thinking about a problem and programming

The Subtext programming system has been getting some buzz, and I think that this is worthwhile if it makes us think about the disconnect between thinking about problems and writing software to solve these problems.

While state of the art IDEs like IntelliJ for Java and Visual Studio for .NET languages provide a comfortable working environment, I must say that both Java and the .NET languages are poor choices for many programming tasks.

Scripting languages like Ruby and Python help with this thinking vs. programming disconnect in one important way: for small programming tasks, very short programs are sufficient and we can keep track of both the problem and the program at the same time.

What about large projects? There are two good alternatives in programming languages: Common Lisp and Smalltalk.

Common Lisp lends itself really well to growing your own application specific language (using macros if you like, and functions). Once you build up an application specific language, a lot of the complexity of even complex programs goes away. Even more importantly, domain specific languages should help close the gap between thinking about problem solutions and programming these solutions.

The downside of Common Lisp is that while Emacs based IDEs are effective environments, even with add on code browsers, I find exploring large Common Lisp software projects to be tedious.

Smalltalk implementations generally have great code browsers because the simplicity and regularity of the language make it easier to automatically process the structure and semantics of code. Smalltalk blocks and closures, as in Ruby, allow many concise coding tricks - shorter programs are easier to understand and modify.



Thursday, August 10, 2006

OpenCyc 1.0, AI in general

I noticed on Slashdot that OpenCyc 1.0 has been released. I spent a short while reading comments and realized how different my own views on AI are from many Slashdot commentators.

To me, AI is all about writing software that makes decisions given uncertain and sometimes contradictory information. AI is about modeling problem domains and working both within that model and changing the model as new information becomes available. AI is about using problem domain models to provide human users with useful, interesting, and unexpected results by matching a model of a user's inquiry. AI is about solving the game of Go: the branching complexity of the game is so great that having perfect information is not enough.

So, a tool like OpenCyc is not really a match to my personal view of what AI development is: Cyc and OpenCyc try to define an ontology of real world common sense knowledge. I appreciate the decades of hard work, and I have myself spent many hours experimenting with earlier versions of OpenCyc - so kudos for the 1.0 delivery.

Still, I tend to view "AI problems" as being problems restricted to narrow domains but still made very difficult or impossible by uncertainty, missing information, and time or memory constraints on algorithms.



Sunday, August 06, 2006

Globally unique identifiers

I really enjoyed listening to Tim Bray's talk on developing the ATOM specification on ITConversations. He made a lot of interesting points, but the one that resonated most was ATOM's requirement for a globally unique identifier for every feed and entry. With more syndication, we all see lots of duplicate material. Examples of duplication can readily be seen on rojo.com (used to be my customer, and I still enjoy their site a lot) and technorati.com: we end up with many URIs that refer to the same material.

It is possible to write software that detects duplicate feeds, but comparing two articles is not an inexpensive operation, and when comparing a very large number of feeds, the O(N^2) runtime is painful. I have experimented with a much less accurate algorithm: hash NGRAMs of articles and check for duplication using a hash lookup. I have found that this gives poor results - at least in my experiments. If you do partial matching of NGRAMs, you are back to O(N^2). (If anyone knows a good way to handle this, let me know :-)
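To make the NGRAM hashing idea concrete, here is a minimal Python sketch. It is not the code from my experiments: the word-level 5-grams, the SHA-1 hash, the 0.5 overlap threshold, and the function names are all assumptions picked for illustration. Each article's NGRAM hashes go into an inverted index, and only articles that share at least one hash are ever compared.

```python
import hashlib
from collections import defaultdict

def ngram_hashes(text, n=5):
    """Return the set of hashes of all word n-grams in the text."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha1(g.encode("utf-8")).hexdigest() for g in grams}

def candidate_duplicates(articles, n=5, threshold=0.5):
    """articles: dict of id -> text. Return sorted (id_a, id_b) pairs whose
    shared n-gram count is at least `threshold` of the smaller article."""
    index = defaultdict(set)   # n-gram hash -> ids of articles containing it
    hashes = {}
    for art_id, text in articles.items():
        hashes[art_id] = ngram_hashes(text, n)
        for h in hashes[art_id]:
            index[h].add(art_id)
    shared = defaultdict(int)  # (id_a, id_b) -> number of shared n-grams
    for ids in index.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1
    pairs = []
    for (a, b), count in shared.items():
        smaller = min(len(hashes[a]), len(hashes[b]))
        if smaller and count / smaller >= threshold:
            pairs.append((a, b))
    return sorted(pairs)

articles = {
    "a": "the quick brown fox jumps over the lazy dog today",
    "b": "the quick brown fox jumps over the lazy dog tonight",
    "c": "completely different text about semantic web and rdf triples here",
}
duplicates = candidate_duplicates(articles)
print(duplicates)
```

The hash lookup only finds exact NGRAM matches, so lightly edited copies can slip through, and NGRAMs that are common to many articles still produce many pair comparisons. Both limits are consistent with the poor results described above.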

Globally unique identifiers help solve many duplication problems and make it easier to implement container relations, and in general ATOM just seems to be a better and more scalable platform than RSS 2.0 for complex new applications.
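With entry identifiers available, duplicate detection can become a set lookup instead of a text comparison. The Python sketch below assumes well-formed Atom 1.0 documents and uses made-up feeds; `unique_entries` is a hypothetical helper name, and real aggregators would also have to cope with publishers that reuse or change ids.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Two made-up feeds: FEED_B re-syndicates an entry that is also in FEED_A.
FEED_A = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><id>tag:example.org,2006:post-1</id><title>First post</title></entry>"
    "<entry><id>tag:example.org,2006:post-2</id><title>Second post</title></entry>"
    "</feed>"
)
FEED_B = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><id>tag:example.org,2006:post-1</id>"
    "<title>First post (syndicated copy)</title></entry>"
    "</feed>"
)

def unique_entries(feed_documents):
    """Return (id, title) for each entry the first time its atom:id is seen."""
    seen = set()
    result = []
    for doc in feed_documents:
        for entry in ET.fromstring(doc).iter(ATOM_NS + "entry"):
            entry_id = entry.findtext(ATOM_NS + "id")
            if entry_id is None or entry_id in seen:
                continue  # skip entries without an id and repeats
            seen.add(entry_id)
            result.append((entry_id, entry.findtext(ATOM_NS + "title")))
    return result

unique = unique_entries([FEED_A, FEED_B])
print(unique)
```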



Wednesday, July 19, 2006

Good point: disinformation and the Semantic Web

I wish that I had gone to the AAAI conference this year. I am keenly interested in the application of AI techniques to the Semantic Web, and Tim Berners-Lee gave a keynote speech largely on the Semantic Web.

After Berners-Lee's talk, Peter Norvig in the question period posed the problem of people publishing fake data in much the same way they try to cheat to increase the page ranking of their web sites. I had not thought of that problem, and it is a tough problem to deal with: what happens to trust mechanisms when some people actively try to fake the meta data on their web sites?

While I was walking to lunch with Norvig a few years ago, I brought up a related problem: assume that for narrow domains of discourse (e.g., political news, financial news, etc.) you could largely automate the creation of RDF from natural language text on web sites. I personally believe that this is achievable right now, with a lot of effort. The problem that I posed at lunch was (besides the technical challenges of dealing with potentially trillions of RDF triples) the problem of dealing with lots of conflicting information while factoring in different levels of trust.
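One very simple way to picture "conflicting information plus levels of trust" is trust-weighted voting over extracted statements. The Python sketch below is only an illustration under strong assumptions: the trust weights are given rather than learned, unknown sources count as zero, and the sites, facts, and the name `resolve_conflicts` are invented. It does nothing about the harder problem of sources that fake their meta data.

```python
from collections import defaultdict

def resolve_conflicts(statements, trust):
    """statements: iterable of (subject, predicate, object, source).
    trust: dict of source -> weight in [0, 1]; unknown sources count as 0.
    Return {(subject, predicate): (best_object, total_trust)}."""
    votes = defaultdict(lambda: defaultdict(float))
    for subject, predicate, obj, source in statements:
        votes[(subject, predicate)][obj] += trust.get(source, 0.0)
    return {
        key: max(candidates.items(), key=lambda item: item[1])
        for key, candidates in votes.items()
    }

# Invented example data: three sites disagree about one fact.
STATEMENTS = [
    ("AcmeCorp", "ceo", "Alice", "site-a.example"),
    ("AcmeCorp", "ceo", "Bob", "site-b.example"),
    ("AcmeCorp", "ceo", "Bob", "site-c.example"),
]
TRUST = {"site-a.example": 0.9, "site-b.example": 0.3, "site-c.example": 0.4}

resolved = resolve_conflicts(STATEMENTS, TRUST)
print(resolved[("AcmeCorp", "ceo")][0])
```

Here two low-trust sources are outvoted by one high-trust source. Note that a scheme like this is exactly what publishers of fake data would try to game.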



Sunday, June 11, 2006

Different document types, different work flows

I keep everything that I do in a Subversion repository. Even though Subversion can diff binary files like Word documents or OpenOffice.org ZIP file enclosures for documents, I still like as many of my design artifacts as possible to be plain text. Now, I do keep lots of binary files in repositories, especially when working on book projects, but I do have a strong preference for text files.

I also like my design artifacts to look good, even if I am the only person who sees them. Two highly recommended tools are AbiWord and OmniGraffle because their default file formats are plain XML text files.

I admit that disk space and network bandwidth are close to free now but I still like to keep a project directory small. By using design tools that have small file footprints, most projects (source files, build scripts, tests, and design artifacts) are small and tidy.



Thursday, May 25, 2006

New PowerLoom site

Thanks to reader Vinodh Das for pointing this out to me: the PowerLoom web site has been updated and as one of the developers told me, PowerLoom is now released under an open source license. PowerLoom is a great system - if you are interested in AI, logic, reasoning systems, etc., then check it out.



Sunday, May 21, 2006

Dealing with Knowledge Artifacts that are still in paper form

When my wife and I lived in Solana Beach, California, my home office wall had a set of bookshelves about 20 feet wide. When we moved to Sedona, my home office shrank to a 10x12 foot room. I also went from an ocean view to a mountain view - the change in views is fine but I miss the library shelf space! Since I consider myself, like many people, to be a "knowledge worker", I thought that it would be fun to talk about how I deal with storage problems for physical artifacts like ACM and AAAI journals, books, etc. Please send me your ideas via email, and I will add them to my list here.

Fortunately, most journals are also available online, and articles can be copied for personal use. Before throwing out old journals I take a quick look for articles that might be of use in the future and I do a web search including the journal name and the article name. Articles in the ACM Portal or AAAI Digital Library (for example) can be copied locally for personal use by members after logging in. I used to keep journals, in paper form, almost forever but now having just high (possible) value articles stored on my local file system and indexed for search is good enough. I usually just save plain text, but if figures look especially useful I save them also.

Books are more of a problem. When we moved 7 years ago, I reduced the size of my technical library from about 400 books to about 150. Now when I purchase new books, I try to get rid of an equal number, either giving them to my local library or selling them at a local used bookstore. A few times a year I go to reference a book that I have let go, but in general, I think that my technical library might be more useful with fewer books because I can find things very quickly.

Anyway, local storage works well for knowledge artifacts that other people create - usually storage and archival for personal use is allowed. For stuff that I produce (except for my published books that are owned by my publishers), I prefer public web storage.

I find that del.icio.us is a fantastic resource for organizing bookmarks for both knowledge artifacts on the web and for fun stuff that I might want to find again.

For fun stuff: I used to keep travel and family photographs on my KnowledgeBooks.com web site, but now I keep the best pictures on Flickr. I am tempted to start storing video clips (and I have some great stuff like dancers in India and Africa, etc.) on video.google.com when I have time.



Tuesday, May 09, 2006

Integrating a semantic network with a reasoning system

For a long term AI project, I am using Common Lisp and CLOS to model customer application specific nodes in a semantic network. This morning I worked out the non-obvious (to me!) bits for integrating my stuff with the Loom reasoning system by deriving Loom concepts from my CLOS classes. Cool stuff!



Sunday, April 30, 2006

Information: organization vs. overload

I will get to the topic in the title, but first: This month's issue of the Communications of the ACM has a great series of articles on exploratory search: lots of good ideas on organizing sets of search results rather than single documents, clustering vs. categorization, etc. Another good read: the May issue of Wired magazine has good coverage of vblogs and online video - the sort of grass-roots publishing that I like :-)

There is so much information to absorb and use for any type of knowledge worker that it takes time and effort to stay up to date with what we need to do our jobs. Much of my work involves writing custom software (usually layered on open source) for information management in specific industries/applications (large scale search, document categorization and repository maintenance, AI style data mining, agent technology to assist by bringing important things to users' attention, etc.) but I find it ironic that I cannot seem to set aside the time to write much custom code for my own information needs (take care of customers first!). And, so far, it always seems to take custom code to solve specific information management problems. From what I have seen, there is not yet any silver bullet.

I have some ideas for exactly what tools I want for my own work flow and how I might "productize" them, but for now I have an ad hoc system using Subversion repositories, local directories organized by topic and augmented with local search, and del.icio.us to organize bookmarks for material on the web. If I can set aside the time, I would like to integrate more of what I use in my own work flow.

What about search? Well, search is not information management. If/when semantic web technologies become more widely used, then software agents will be able to treat the web as an information source and be able to do research either without human intervention, or at least be valued assistants. CEOs of companies have well trained staff to filter and organize information - what will the effects be on society and the economy in the future when most people will have free or inexpensive software agents that can compete with well trained human staff? A nice thought but there will always be selective advantages to better information management systems.



Monday, April 10, 2006

Working backwards on the Semantic Web

I just had a bit of an insight: I think that many people, including myself, may be taking the wrong approach to working on the Semantic Web. I think that dealing with XML serializations of RDF, RDFS, and OWL is just plain wrong. Tools like Protege offer a frame-like UI that makes these formats a lot easier to work with (and free description logic engines like FaCT++ help by checking for consistency). However...

I have had a little free time today to work on a pet project that Obie inspired: write Ruby wrapper code for making it easier to deal with RDF/RDFS/OWL by loading files and automatically mirroring classes, etc. I would work in Protege, then write Ruby code to consume the RDF/RDFS/OWL files so that I could work in a decent language. OK, fine.

However, this all still seems more than a little wrong to me. Since the Semantic Web is largely about ontologies and knowledge representation, why turn our backs on decades of AI research? Why not work with knowledge representation systems written in Lisp (or Prolog, Ruby, etc.) and have a back end that serializes to XML/RDF/RDFS/OWL as required? Really, use the best notation possible for all of the human-intensive work.

While Protege is a terrific tool, I still think that using older technologies like KEE, Loom, PowerLoom, etc. with optimal programming environments makes a lot of sense. Any language with good introspection (like Ruby or Common Lisp) would work for supporting XML serialization when required.
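To show what "serialize from the language's own classes" could look like, here is a minimal sketch that uses introspection to emit RDFS class statements. Python stands in for Ruby or Common Lisp here, the `http://example.org/kb#` namespace and the three sample classes are hypothetical, and N-Triples is used instead of RDF/XML only to keep the output short.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
BASE = "http://example.org/kb#"  # hypothetical namespace for this example

# A tiny model written as ordinary classes (invented for illustration).
class Document:
    pass

class Article(Document):
    pass

class BlogPost(Article):
    pass

def classes_to_ntriples(classes):
    """Emit one rdfs:Class statement per class, plus rdfs:subClassOf
    statements derived from each class's direct base classes."""
    lines = []
    for cls in classes:
        subject = f"<{BASE}{cls.__name__}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{RDFS_CLASS}> .")
        for parent in cls.__bases__:
            if parent is not object:  # skip the implicit root class
                lines.append(
                    f"{subject} <{RDFS_SUBCLASS_OF}> <{BASE}{parent.__name__}> ."
                )
    return lines

triples = classes_to_ntriples([Document, Article, BlogPost])
print("\n".join(triples))
```

The human-intensive modeling stays in the programming language, and the serialization is generated on demand.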



Wednesday, April 05, 2006

Interesting: Bill Gates's work flow; knowledge management

I enjoyed this article by Gates on CNN on his personal work flow. The best part was on SharePoint. A few years ago one of my customers hired me to write a SharePoint clone that was tailored for their work flow - I used Java (JSPs, custom tag libraries, and Prevayler for persistence) in a fairly agile way but I would like to redo that project, with lessons learned, using Rails, and use AI text mining technologies to help automatically organize information (or suggest organizing hints).

Knowledge management is something that I am keenly interested in because it ties together lots of technologies that I am interested in: ontologies, knowledge representation, data constraints, server side technologies, and natural language processing with text mining, etc.



Monday, March 27, 2006

We *really* need semantic attributes on web links

Scoble complains about the lack of a 'do not follow' link attribute on Tailrank: link to an article that you are complaining about and the linked article gets a (slightly) higher search rank, and more people read it.

Sure, this is a problem, but the real solution is setting a standard (could be grass roots) for adding optional attributes to HTML links (I wrote about this in "More on link types and the Semantic Web" a few months ago).

Jon Udell has a great idea for combining CSS style ids with semantic information. The microformats people have good ideas for using the rel attribute to specify a relationship to a link (license, help, type-of, friend, etc.).

What I am most interested in is having enough usable information on the web to enable me to write software tools that can automatically extract information on trust relationships between information sources on the web. However, the range of applications is just about infinite - given more metadata.
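As a small example of the kind of tool this would enable, the Python sketch below collects the rel values attached to links on a page, using only the standard library's `html.parser`. The sample markup and the class name `RelLinkCollector` are made up for illustration. Turning the collected values into trust relationships is left open.

```python
from html.parser import HTMLParser

class RelLinkCollector(HTMLParser):
    """Collect (href, [rel values]) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if href:
            # rel holds a space-separated list of link types; it may be absent.
            rels = (attributes.get("rel") or "").split()
            self.links.append((href, rels))

# Made-up page fragment for illustration.
SAMPLE_HTML = (
    '<p><a href="http://example.org/license" rel="license">License</a> '
    '<a href="http://example.org/alice" rel="friend met">Alice</a> '
    '<a href="http://example.org/plain">Plain</a></p>'
)

collector = RelLinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```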



Tuesday, March 14, 2006

Useful tool for searching millions of RSS feeds; some of my own projects

I have been using Rojo.com recently (both because they are a customer of mine and because their service is very useful). I also use Technorati to specifically search blogs, but Rojo has a few results optimizations that make their service a bit more useful to me.

I have been working on a customized information portal KBSportal.com for about a year. I took down the Java version a few months ago when I started on a newer version written in Ruby and Rails. I was hoping to have it back online by now, but working on my new Ruby book and consulting for a few customers takes priority.

I created a 'knowledge based' recipes/cooking portal CJsKitchen.com last year and I also want to work on this technology more this year. I originally started to use the USDA nutrition database, then stopped when I realized how inaccurate estimating the nutritional content of recipes is given variations in cooking and quality of ingredients. Anyway, I want to take another shot at providing nutritional information along with displayed recipes. One cool thing about CJsKitchen.com is that you can keep a personal database on the web site of the ingredients you have on hand - and optionally only see recipes that you have the ingredients to make. I also have a first cut at an AI recipe agent to help you use up ingredients that you have.
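The "only show recipes I can make" feature boils down to a subset test between a recipe's ingredients and what is on hand. The Python sketch below is not the CJsKitchen.com code; the recipes, the function names, and the rule for suggesting recipes that use up ingredients (prefer makeable recipes that consume the most pantry items) are assumptions for illustration.

```python
def makeable_recipes(recipes, on_hand):
    """Return the sorted names of recipes whose ingredients are all on hand.
    recipes: dict of name -> iterable of ingredient names."""
    pantry = {item.lower() for item in on_hand}
    return sorted(
        name
        for name, ingredients in recipes.items()
        if {i.lower() for i in ingredients} <= pantry
    )

def use_up_suggestions(recipes, on_hand):
    """Rank makeable recipes by how many pantry items they consume."""
    names = makeable_recipes(recipes, on_hand)
    return sorted(names, key=lambda name: (-len(set(recipes[name])), name))

# Invented sample data.
RECIPES = {
    "omelette": ["eggs", "butter", "cheese"],
    "pancakes": ["eggs", "flour", "milk", "butter"],
    "toast": ["bread", "butter"],
}
PANTRY = ["Eggs", "butter", "cheese", "bread"]

can_make = makeable_recipes(RECIPES, PANTRY)
suggestions = use_up_suggestions(RECIPES, PANTRY)
print(can_make, suggestions)
```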



Wednesday, January 11, 2006

More on link types and the Semantic Web

I wrote a few weeks ago about the need for a less formal way to specify link type information. One reader pointed me to the Microformat rel tag. Actually, the rel tag is a W3C recommendation but the number of accepted link types is small.

One problem with the rel tag is that it does not support name/value pairs. For example, I might want a <a href=...> tag to have an attribute for how much I agree with the linked page. Something like this: <a href=... agreewith="3" ...> where the numeric constant might be assumed to be in the range [-10,10]. Still, even rel tag values like disagreewith, agreewith, etc. would be useful.

I think that eventually the Semantic Web will, in some form, catch on but I think its eventual success will come from small grass-roots efforts that are simple to implement and become de-facto standards if they become widely used. I believe that the most important semantics for linked web sites is what the level of trust or agreement is. For example, if a very large number of people link to a site that they believe to contain incorrect information, the dubious site's page rank will be high and the dubious site may be taken to contain accurate information. Widespread use of trust attributes on links would create new possibilities and opportunities for people writing software agents.
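To illustrate why agreement information matters, this Python sketch compares a plain inbound link count with a score that sums the agreement values on those links. The `agreewith` attribute is the hypothetical one proposed above (it is not part of HTML), the link data is invented, and summing raw values is only the simplest possible scoring rule.

```python
from collections import defaultdict

def link_scores(links):
    """links: iterable of (source, target, agreewith), with agreewith
    assumed to be an integer in [-10, 10].
    Return (inbound link counts, agreement-weighted scores) per target."""
    counts = defaultdict(int)
    weighted = defaultdict(int)
    for _source, target, agreewith in links:
        if not -10 <= agreewith <= 10:
            raise ValueError(f"agreewith out of range: {agreewith}")
        counts[target] += 1
        weighted[target] += agreewith
    return dict(counts), dict(weighted)

# Invented data: many people link to a site in order to disagree with it.
LINKS = [
    ("blog-1.example", "dubious.example", -8),
    ("blog-2.example", "dubious.example", -9),
    ("blog-3.example", "dubious.example", -7),
    ("blog-1.example", "reliable.example", 6),
    ("blog-4.example", "reliable.example", 8),
]

counts, weighted = link_scores(LINKS)
print(counts)
print(weighted)
```

By raw link count the dubious site looks more important, while the agreement-weighted score ranks it well below the reliable one.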


