
Clean XHTML and Brian Jones

All right! I admit it! I am able to write an XSL transformation that extracts a subset of WordprocessingML and transforms it into XHTML. This is a completely different direction from my previous development work. "Different" is an understatement: it potentially renders six months or more of work almost meaningless, and thousands of lines of code may never see the light of day.

As an update to a previous entry: Brian Jones of the Microsoft Office 12 team took a look at this new direction and is encouraging. He writes in response to my rough XSL sketch:

That’s a great start Bryan. I was going to post something similar, but hadn’t got around to it yet.

In answer to your earlier question, I definitely do not recommend loading the XHTML schema into Word and marking up the document, that’s duplicating too much information.

We designed the XML support so that you could leverage both WordML and your XML together. If there are features such as formatting, lists and tables that Word already supports, then you don’t need to mark that up. Instead, you can just take the subset of your schema that isn’t already represented by Word functionality, and only mark up with that.

Then you can just transform on the way out into your schema. At one point, I had an example of doing this for DocBook, but I can’t seem to find it anywhere. I’ll post it if I ever dig it up.

Brian is right. I really need to meditate on the words, “you can just take the subset of your schema that isn’t already represented by Word functionality, and only mark up with that…” In fact, most of the XHTML functionality that Word does not support consists of semantic elements like blockquote and acronym. These elements carry what we can call “metadata,” which cannot be represented by Word style data. Additionally, there is the “advanced” need to insert “raw” XHTML into a document. This need can be met by placing uniquely identifiable div placeholders in the document instead of raw markup, so that a backend component can find these placeholders and replace them with server-side XHTML snippets.
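To make the placeholder idea concrete, here is a minimal sketch in Python of what such a backend component might do. It is not the actual component, only an illustration under stated assumptions: the function name fill_placeholders, the placeholder id "raw-1", and the snippet content are all hypothetical, and the transformed document is assumed to be well-formed XHTML in the standard XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without a namespace prefix.
ET.register_namespace("", XHTML_NS)


def fill_placeholders(xhtml_text, snippets):
    """Fill <div id="..."> placeholders with server-side XHTML snippets.

    `snippets` maps a placeholder id (hypothetical, e.g. "raw-1") to a
    string of well-formed XHTML markup. Divs whose id is not a key in
    `snippets` are left untouched.
    """
    root = ET.fromstring(xhtml_text)
    # Collect the divs first so the tree is not modified while iterating.
    for div in list(root.iter(f"{{{XHTML_NS}}}div")):
        snippet = snippets.get(div.get("id"))
        if snippet is None:
            continue
        # Wrap the snippet so it parses as one element in the XHTML namespace.
        fragment = ET.fromstring(f'<div xmlns="{XHTML_NS}">{snippet}</div>')
        # Drop whatever the placeholder contained, then copy the snippet in.
        for child in list(div):
            div.remove(child)
        div.text = fragment.text
        div.extend(list(fragment))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = (
        f'<html xmlns="{XHTML_NS}"><body>'
        '<p>Written in Word.</p><div id="raw-1"/>'
        "</body></html>"
    )
    print(fill_placeholders(page, {"raw-1": "<blockquote><p>Raw XHTML.</p></blockquote>"}))
```

The point of this design is that the author in Word only ever types a short identifier; the raw markup lives on the server and is merged in after the transformation to XHTML.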

rasx()