April 27, 2009
Encarta nostalgia: SGML and the Semantic Web
I’m not going to mourn Encarta’s demise much. Wikipedia is too big, too fast, too useful, too much fun. But Encarta was an ambitious project that broke some ground. So pardon me if I sigh wistfully and indulge in a little moment of Encarta appreciation. Ahhhh.
When Encarta began, it was taken as validating this whole crazy CD-ROM approach to knowledge. It was searchable. It had multimedia. It let you do some slicing and dicing. It was breezy, at least compared to its hundred-pound competitors. But for my circle, the big news was below the surface: Encarta used SGML. It was, in fact, one of the first commercial SGML projects delivered into the hands of average customers.
SGML — Standard Generalized Markup Language — was the Semantic Web of its time: roughly the same arguments in its favor, roughly the same approach. This isn’t entirely accidental, for two reasons: 1. HTML is a form of SGML. 2. SGML got a lot of things right.
SGML was a way of specifying the structural elements of a document. In the case of an encyclopedia, elements might include volumes, articles, article titles, subheadings, body text, illustrations, captions, references, and see-also’s. You could also specify the metadata for each element: this illustration is of a dress, its topic is “clothing,” its era is 1920-1930. SGML also let you specify rules about what constitutes a valid instance of a document. For example, the rules might say that a valid encyclopedia article has to have one and only one title, it can have any number of illustrations, and every illustration has to have a caption. Once you have created a valid set of documents, you can then use your fancy-dancy computers to assemble views at will: Show me all the illustrations whose “topic” is “clothing” from the era 1920-1930. Etc. Incredibly useful.
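To make that concrete, here is a rough sketch in Python. It is not SGML and has nothing to do with how Encarta was actually built: Python's standard library has no SGML parser, so the sketch uses XML (SGML's simplified descendant) as a stand-in, and the element names, attribute names, and helper functions are all made up for illustration. It checks the three rules from the example above and then pulls out the clothing illustrations from 1920-1930.

```python
import xml.etree.ElementTree as ET

# A hypothetical encyclopedia article. Element and attribute names are
# invented for this sketch; a real SGML project would define them in a DTD.
ARTICLE = """
<article>
  <title>Flapper</title>
  <body>Body text goes here.</body>
  <illustration topic="clothing" era="1920-1930">
    <caption>A beaded evening dress</caption>
  </illustration>
  <illustration topic="music" era="1920-1930">
    <caption>A jazz band</caption>
  </illustration>
</article>
"""


def validate(article):
    """Return a list of rule violations; an empty list means 'valid'."""
    problems = []
    # Rule 1: one and only one title.
    titles = article.findall("title")
    if len(titles) != 1:
        problems.append(f"expected exactly one title, found {len(titles)}")
    # Rule 2: any number of illustrations is fine, so there is nothing to check.
    # Rule 3: every illustration has to have a caption.
    for i, illustration in enumerate(article.findall("illustration"), start=1):
        if illustration.find("caption") is None:
            problems.append(f"illustration {i} has no caption")
    return problems


def illustrations(article, topic, era):
    """Captions of illustrations whose metadata matches the topic and era."""
    return [
        illustration.findtext("caption")
        for illustration in article.findall("illustration")
        if illustration.get("topic") == topic and illustration.get("era") == era
    ]


article = ET.fromstring(ARTICLE)
print(validate(article))
print(illustrations(article, "clothing", "1920-1930"))
```

The point of the sketch is only that the query at the end is trivial once the documents are known to follow the rules. All the hard work is in agreeing on the rules.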
You haven’t heard about SGML (at least not much) for a few reasons.
First, industries that wanted to be able to share data tied themselves in knots trying to pin down the exact specifications for their documents. Endless and endlessly geeky arguments ensued about how exactly to encode a table of parts.
Second, outside of technical documentation designers, most people don’t think about documents in terms of their structural elements. Rather, they think of documents as a series of formatting decisions. SGML was not designed to capture format. From SGML’s point of view, the title of an article is simply an element called “title” and it’s up to someone else to decide whether titles are bolded, underlined, or printed in red. Now, let me hasten to add that people actually do think of documents in terms of their structure: We decide to make this piece of text bold because it’s the title. But we seem to be reluctant to note those decisions in terms of structure; we’d rather just drag-select the text and hit the bold-it key. That’s why Microsoft Word over the years has made “procedural markup” (drag-select-bold) more prominent in its UI than “declarative markup” (declare this paragraph to be a Title element, and then tell it how to format Titles).
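Here is one way to picture the declarative side, as a toy Python sketch. The `document`, the two stylesheets, and `render` are all hypothetical names invented for this illustration, and nothing here reflects how Word or any SGML tool works internally. The text is tagged with what it is, and a separate table decides how each kind of element looks.

```python
# Declarative markup: each piece of text records what it *is*, not how it looks.
document = [
    ("title", "Encarta"),
    ("body", "A CD-ROM encyclopedia."),
]

# Formatting decisions live somewhere else, one rule per element type.
shouting = {"title": str.upper, "body": str}
starred = {"title": lambda text: f"*{text}*", "body": str}


def render(document, stylesheet):
    """Apply the stylesheet's rule for each element and join the lines."""
    return "\n".join(stylesheet[role](text) for role, text in document)


print(render(document, shouting))
print(render(document, starred))
```

Swapping the stylesheet restyles every title at once without touching the document. With drag-select-bold, the fact that a given run of bold text was a title is never written down anywhere.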
Third, HTML swept the world. HTML is a set of SGML elements and rules specified by a certain Sir Tim. Because HTML is designed not for encyclopedia articles or for shopping lists but for anything that might be put on the Web, it has highly generic elements that do not reflect the content of particular types of pages: It has six levels of headings, two types of lists, one type of image, etc. The SGML folks initially sneered at this. It looked “brain dead” to them. The documents were too generic. There wasn’t enough semantics: That something is a second-level heading expresses its place in the document’s structure, but not the fact that it’s the name of a repair procedure or a list of ingredients. And, HTML seemed too interested in capturing formatting. That’s why newer versions of HTML want you to use <em> (em=emphasis) instead of the original <i>: the old way had you making a formatting decision (“Italicize it”) rather than a structural one (“The role this point plays is that of being emphatic, which the browser should visually express in the way it feels is proper”).
The other side of the coin is that HTML is way way way easier to use than having to design and then follow a set of SGML design rules, with specific elements, for every different sort of document you want to create. Its simplicity meant that people actually succeeded at it. Furthermore, it was in the interest of the browsers to forgive all errors: If browser X rejects a page because it didn’t follow HTML’s rules, you would be driven to see if browser Y could display the page. If Y could, you’d consider X, not the page, to be broken. The browser economics favored sloppiness and forgiveness, neither of which was a hallmark of SGML’s discipline-based culture.
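You can still feel the two cultures in ordinary tools. As a small, hedged illustration, the Python sketch below feeds the same sloppy markup to the standard library's forgiving `html.parser` and to its strict XML parser (standing in here for the SGML mindset, since the standard library has no SGML validator). The sample markup and the helper names `forgiving_text` and `strict_accepts` are invented for this example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Sloppy markup: the <b> and both <p> elements are never closed.
SLOPPY = "<p>An unclosed <b>bold tag<p>and a second paragraph"


class TextCollector(HTMLParser):
    """Collects whatever text the forgiving HTML parser manages to find."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def forgiving_text(markup):
    """Browser-style attitude: take what you can get and never complain."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


def strict_accepts(markup):
    """Discipline-style attitude: reject anything that is not well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(forgiving_text(SLOPPY))   # the text comes through despite the errors
print(strict_accepts(SLOPPY))   # the strict parser refuses the same input
```

The forgiving parser hands back the text without complaint, while the strict one rejects the page outright. If those were two browsers, it is easy to guess which one users would call broken.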
Now, as the great dialectical pendulum swings, the Semantic Web has arisen to remind us of the value of metadata. If it can avoid the perfectionism and discipline that left SGML as a tool for the few, it will add back in some of the smarts the loose ‘n’ low-hangin’ HTML usefully took out. As the name implies, the Semantic Web is more about expressing the structure of meaning and concepts in a field than about expressing the structure of documents. For an encyclopedia, you wouldn’t want to wait for the Semantic Web to create the entire web of meaning, because that web would have to be as wide as the topical coverage of the encyclopedia itself. You might instead want to come up with a set of standard document elements, perhaps applied somewhat loosely, with the ability to slather on rich layers of metadata, and then watch webs of semantics get spun. Which is pretty much exactly what we’re seeing at Wikipedia.
Meanwhile, Encarta remains an example — along with the Oxford English Dictionary and others — of the value of rigorously structured and metadated documents.