Java trick for reading OpenOffice.org files
I have read a few times about people having problems reading the zipped XML files inside OpenOffice.org documents. The problem is that the SAX parser cannot locate a local copy of office.dtd, which the DOCTYPE declaration in the XML refers to. I have been using a kluge to get around this for a long time and have not had any problems with it: when reading the input stream for the ZIP file entry named "content.xml", skip past the second ">" character. That discards the XML declaration and the DOCTYPE declaration, so the parser never goes looking for the DTD:
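import java.io.InputStream;
import javax.xml.parsers.SAXParser;

// zf is the open ZipFile and zipEntry is its "content.xml" entry.
InputStream r = zf.getInputStream(zipEntry);
// Skip the XML declaration and the DOCTYPE by reading past the second '>'.
// Give up after 500 bytes, or earlier if the stream ends.
for (int i = 0, count = 0; i < 500 && count < 2; i++) {
    int b = r.read();
    if (b == -1) break;
    if (b == '>') count++;
}
SAXParser p = saxFactory.newSAXParser();
p.parse(r, new OpenOffice.OpenOfficeSaxHandler());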
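To try the trick without an actual OpenOffice.org file, here is a small self-contained sketch that applies the same skip to an in-memory string. The class name PrologSkipDemo, the helper names, and the sample document are all made up for illustration; the sample is only modeled on what a content.xml prolog looks like.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class PrologSkipDemo {

    // Made-up sample: XML declaration, DOCTYPE pointing at office.dtd, tiny body.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE office:document-content PUBLIC "
        + "\"-//OpenOffice.org//DTD OfficeDocument 1.0//EN\" \"office.dtd\">\n"
        + "<office:document-content xmlns:office=\"http://openoffice.org/2000/office\">"
        + "<office:body>Hello</office:body>"
        + "</office:document-content>";

    /**
     * Reads up to and including the second '>' in the stream.
     * Returns false if the stream ends or maxBytes are read first.
     */
    static boolean skipProlog(InputStream in, int maxBytes) throws IOException {
        int count = 0;
        for (int i = 0; i < maxBytes; i++) {
            int b = in.read();
            if (b == -1) return false;            // end of stream
            if (b == '>' && ++count == 2) return true;
        }
        return false;
    }

    /** Skips the prolog, parses the rest, and returns the character data. */
    static String textAfterSkip(String xml) throws Exception {
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        if (!skipProlog(in, 500)) {
            throw new IOException("did not find two '>' characters in the first 500 bytes");
        }
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textAfterSkip(SAMPLE));
    }
}
```

The skip relies on the file starting with exactly an XML declaration followed by a DOCTYPE, with no ">" inside either one. A DOCTYPE with an internal subset, or a file with no XML declaration, would throw the count off.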
Hopefully in the future people having this problem will find this post when doing a web search and save themselves a little time. Another good alternative is to make office.dtd available on your system and put it on your classpath.
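For the classpath alternative, one possible way to wire it up (my assumption, not something I have tested against a real office.dtd) is to override resolveEntity in the SAX handler and hand the parser the DTD from the classpath. The class name ClasspathDtdDemo, the helper names, and the idea of keeping the DTD files at the classpath root are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class ClasspathDtdDemo {

    // Made-up sample modeled on a content.xml prolog.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE office:document-content PUBLIC "
        + "\"-//OpenOffice.org//DTD OfficeDocument 1.0//EN\" \"office.dtd\">\n"
        + "<office:document-content xmlns:office=\"http://openoffice.org/2000/office\">"
        + "<office:body>Hello</office:body>"
        + "</office:document-content>";

    /** Last path segment of a system ID, so a full URI maps to a bare file name. */
    static String fileName(String systemId) {
        return systemId.substring(systemId.lastIndexOf('/') + 1);
    }

    /** Assumed layout: DTD files sit at the root of the classpath. Null if absent. */
    static InputStream fromClasspath(String name) {
        return ClasspathDtdDemo.class.getResourceAsStream("/" + name);
    }

    /** Parses xml, asking lookup for each external entity; returns the character data. */
    static String parseText(String xml, Function<String, InputStream> lookup) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                InputStream dtd = (systemId == null) ? null : lookup.apply(fileName(systemId));
                if (dtd == null) return null;      // let the parser use its default lookup
                InputSource src = new InputSource(dtd);
                src.setSystemId(systemId);         // keeps relative includes resolvable
                return src;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Needs office.dtd (and any files it includes) on the classpath.
        System.out.println(parseText(SAMPLE, ClasspathDtdDemo::fromClasspath));
    }
}
```

The lookup is passed in as a function so the same code can be tried with an in-memory stand-in for the DTD. If the lookup returns null, the parser falls back to its normal behavior, which is the failure described at the top of this post.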