Friday, February 09, 2007

Semantic Web: through the back door with HTML and CSS

I have spent a lot of time over the last 10 years working on technology for harvesting semantic information from a variety of existing sources. I was an early enthusiast for Semantic Web standards like RDF and, later, OWL. The problem is that too few web site creators invest the effort to add Semantic Web metadata to their sites.

I believe that as web site creators adopt CSS and drop HTML formatting tags like <font ...> (HTML should be used for structure, not formatting!), writing software that understands the structure of web pages will get simpler. Furthermore, meaningful id and class attribute values in <div> tags will act as a crude but probably effective source of metadata. For example, a <div> tag whose id or class attribute value contains the string "menu" is likely to hold navigational information, which can be ignored or be of value depending on the requirements of your knowledge gathering application.
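As a rough sketch of the id/class heuristic described above, the following Python uses only the standard library's html.parser module. The hint words in NAV_HINTS and the names ContentExtractor and extract_content are illustrative assumptions, not part of any existing tool, and real pages would need a more forgiving treatment of malformed markup.

```python
from html.parser import HTMLParser

# Hypothetical hint words; tune these for the sites you actually scrape.
NAV_HINTS = ("menu", "nav", "sidebar", "footer")


class ContentExtractor(HTMLParser):
    """Collect text, skipping any <div> whose id or class looks navigational."""

    def __init__(self):
        super().__init__()
        self.div_stack = []  # one bool per open <div>: True means "skip"
        self.chunks = []     # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr_map = dict(attrs)
        # Join the id and class values; a valueless attribute arrives as None.
        label = " ".join(
            (attr_map.get(name) or "") for name in ("id", "class")
        ).lower()
        is_nav = any(hint in label for hint in NAV_HINTS)
        # A div nested inside a skipped div is skipped too.
        inherited = bool(self.div_stack) and self.div_stack[-1]
        self.div_stack.append(is_nav or inherited)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        skipping = bool(self.div_stack) and self.div_stack[-1]
        text = data.strip()
        if text and not skipping:
            self.chunks.append(text)


def extract_content(html):
    """Return the text fragments found outside navigational divs."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

An application that wants the navigation links instead (for crawling, say) could invert the test in handle_data and keep only the text inside the matching divs.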

Just as extracting semantic information from natural language text is very difficult, analyzing the structure and HTML/CSS markup of web pages to improve web scraping software is difficult. That said, HTML + CSS is likely to be much simpler to process in software than plain HTML with formatting tags. BTW, I am in the process of converting all of my own sites to use only CSS for formatting. I have been writing HTML with embedded formatting since my first web site in 1993, so it is time for an update in methodology.

Comments:
Using HTML just for metadata is easier to write and produces cleaner source.
 
But are CSS and good HTML formatting enough?

Surely XHTML holds more promise?

I attended a seminar a couple of years ago, hosted by British Telecom and a local university, and the idea of the Semantic Web was very appealing, but very far off, for reasons beyond just simple mark-up.

As a rule, I work as close as I can to the web standards, not just to stay on the right side of the law, but also for accessibility and so that my website is SEO'd from the get-go.

This is a huge topic...
 