Semantic Wet Dreams

Some days ago I had a discussion about the Semantic Web with Henry Story (one of the creators of BabelFish, now working for Sun). While he thinks that the Semantic Web is attainable (and he is probably not the only one), I think it’s just a dream some people share, one that will not happen no matter how nice the dream itself might be. I was busy with other (more useful) things, like working on my thesis, so I didn’t have time to write these thoughts down properly until now. Moreover, before posting, I also wanted to read this: “The Semantic Web Revisited” (pdf, 2006). Sure, I already knew pretty well what this is all about (after all, my licentiate thesis was about “Semantic Web-Based Agent Communication”) and I had already read their previous vision statements. However, those statements were all at least 5 years old (which by web standards is ancient), so I expected great changes. Changes like coming back down from the clouds into reality. As you might expect from the title of this post, these changes did not happen. So let me summarize in just one phrase for those who don’t have time to waste reading all this:

It’s the same old bullshit!


First of all, what is the Semantic Web? In my opinion it is just an artificial concept that tries to solve artificial problems by using artificial intelligence techniques. The only relation to the real Web: Tim Berners-Lee invented both. Relation to real life: none. Relation to afterlife: none. Achievements so far: standards for ancient ideas, some of them already proven impractical.

The Semantic Web is an artificial concept because it does not exist, nor will it exist in the foreseeable future. It’s the vision of Tim Berners-Lee, and the dream of the AI research community (they are already well known for dreaming a lot). There is nothing wrong with dreaming, and there is nothing wrong with having a vision. However, the distance between dreams and reality can be infinitely long, and in this case I think the goal really is as remote as general AI (i.e. making machines think).

According to the newest vision paper, the problem the Semantic Web tries to solve is “a growing need for data integration” in fields like e-science and e-government, but not yet in business and commerce. There is no proof whatsoever (e.g., empirical studies) to support this claim, which makes it terribly weak, but also makes it quite hard to argue against. The fact that the domains stated start with e (part of a similar bubble, the .com bubble) makes things even fuzzier, and I am not even convinced that there is a problem at all. What I think is that, if this problem really exists, it is not comparable to the dozens of real problems the real Web is solving right now. Is the real Web a mess? Sure it is! But there is no reason to believe that a more “semantic” one, one about which machines could reason more easily, would be any different. And some machines reason pretty well about the real Web already: they are called search engines and have been around for years.

And before discussing the artificial intelligence techniques they are going to use, let’s look at what 50 years of AI research have given us so far. I don’t really know, but the paper mentions: “functional and logic programming methods, ways to understand distributed systems, pattern detection and data mining tools, approaches to inference, ontological engineering and knowledge representation”. It doesn’t look like much, but I will only focus on what I know about: programming languages. Functional programming? They gave us horrid LISP and are probably the only ones still using it after almost 50 years (it dates from 1958). Don’t think that they have much to do with the much more beautiful Haskell or ML, which were developed more recently on more solid foundations. Logic programming? Since I am working at the PS Lab it would be quite a bad idea to say something evil about it. I did program in Prolog for a course, and it was not that bad. It was quite a nice idea, but one that never really caught on. Functional programming languages, on the other hand, are nicer and are widely used in the research community.

However, maybe it’s also good to know what the AI research community did not give us, even if it is much more related to artificial intelligence than many of the things on that list (or maybe one day I will get the connection between AI and functional programming). No, I won’t tell you about strong AI (even though until the ’90s that was the promise), but about statistical learning, information retrieval and mathematical image processing. They are all fields with solid research results, all based on strong mathematical foundations (namely statistics), and, even if it makes little sense, they have very little to do with the AI research community.

So what are their achievements so far regarding the Semantic Web? They have built standards: URI (addresses for resources), RDF (triples), SPARQL (querying triples), OWL (ontologies), RIF (rules), and they probably won’t stop here. If that’s hard to follow, picture a birthday cake with a lot of layers. Such standards are of course nothing bad (the concepts behind most of them were floating around for a very long time in AI); however, having standards does not mean somebody will use them. You can see my recent post about the revival of HTML and how this might affect XHTML adoption, which is going very slowly anyway. And such standards already existed: anybody ever heard about FIPA? Thought so. What’s more interesting, however, is that even if everyone who is currently using a knowledge representation language switched to RDF, it would still make little difference. Who is using KIF and DAML now? Almost nobody. So they are hoping for some sort of network effect similar to that of the real Web, and also driven by the real Web. Well, needless to say, this is just wishful thinking and will most likely not happen, especially if one looks at the slow rate at which the really useful Web standards (like XHTML or SVG) are adopted by most users (i.e. Windows users). And even if all the standards gained enough acceptance (which is again very unlikely), there are still harder and harder problems to be solved as one goes up the birthday cake.

This does not mean that the problems for the lower layers of the cake are already solved. Not at all. Let’s start from the most foundational thing: the URIs (Uniform Resource Identifiers), which should identify resources. In theory at least. In practice it’s quite hard to assign meaningful URIs to resources like people or places. Most of the time the URI you assign to a resource is arbitrary, and someone else will most likely assign a different URI to the same thing. But let’s say assigning URIs is still possible. What makes things worse is that URIs are not identifiers in the strong sense. They are not unique and not permanent; actually they change pretty often (remember how often your email or website address has changed during your lifetime). So the most foundational brick of the Semantic Web is not that strong after all.
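To make the "different URI for the same thing" problem concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the two publishers, their `example.org` URIs, the property names, and the `merge` helper are all made up, and triples are just tuples rather than anything from a real RDF library.

```python
# Two hypothetical publishers describe the same person under different URIs.
# All URIs and property names below are invented for illustration.
site_a = {
    ("http://a.example.org/people/hstory", "name", "Henry Story"),
    ("http://a.example.org/people/hstory", "worksFor", "Sun"),
}
site_b = {
    ("http://b.example.org/id/42", "name", "Henry Story"),
    ("http://b.example.org/id/42", "created", "BabelFish"),
}


def subjects(triples):
    """Return the set of distinct subject identifiers in a set of triples."""
    return {s for s, _, _ in triples}


def merge(triples, same_as=None):
    """Merge triples, rewriting subjects through an optional alias map."""
    same_as = same_as or {}
    return {(same_as.get(s, s), p, o) for s, p, o in triples}


# Without extra knowledge, a machine sees two unrelated resources.
naive = merge(site_a | site_b)

# Only a hand-maintained alias map tells it they are the same person.
aligned = merge(
    site_a | site_b,
    same_as={"http://b.example.org/id/42": "http://a.example.org/people/hstory"},
)

print(len(subjects(naive)), len(subjects(aligned)))
```

The point of the sketch is that the alias map has to come from somewhere: somebody must notice the duplication and record it, and must do so again whenever one of the addresses changes.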

Then there are the nicest layers: XML, XML Schema and XQuery. While not really a part of the Semantic Web effort, they are used by the upper layers. I’m not going to talk about them here, since they are pretty well-established (not so well established if you look at the HTML vs XHTML struggle), and their strengths and weaknesses are more or less well-known. Moreover, they seem to be the strongest part of the whole construction (even though they are not a silver bullet), so let’s just focus on the weaker spots.

RDF, the knowledge representation language, is pretty simple: just triples. But who on earth would store their data as triples? Let me set this straight: nobody. Relational databases have stored our data for over 30 years, and they work damn well. So their only hope is people exporting their data from their relational databases as triples. My opinion: never going to happen. Go to a bank and ask them to export their data (not as triples, as anything). For many high-tech companies their data is one of their most valuable assets and will never be exported (and it’s not only for privacy reasons). As for the other, non-mission-critical data, you already have most of it online in broken HTML “tag soup”. The people who publish it cannot even be convinced to use XHTML. Convincing them to use an obscure machine-understandable language would be a lot harder, especially since the benefits of doing so are not evident at all.
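For those who have never seen what "exporting a relational database as triples" amounts to, here is a rough sketch using Python's built-in `sqlite3` module. The `account` table, the `http://bank.example.org` base URI and the `rows_to_triples` helper are hypothetical, and the URI scheme is just one naive choice among many, not a standard mapping.

```python
import sqlite3


def rows_to_triples(conn, table, key_column, base_uri):
    """Yield one (subject, predicate, object) triple per non-key cell.

    The table name is interpolated directly into the SQL, so it must come
    from trusted code, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        # The row's primary key becomes part of the subject URI.
        subject = f"{base_uri}/{table}/{record[key_column]}"
        for column, value in record.items():
            if column != key_column and value is not None:
                yield (subject, f"{base_uri}/{table}#{column}", value)


# A made-up, one-row bank table held in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 'Alice', 120.5)")

triples = list(rows_to_triples(conn, "account", "id", "http://bank.example.org"))
for triple in triples:
    print(triple)
```

Technically this is a dozen lines. The hard part is not the code, it is getting anyone who owns valuable data to run it and publish the result.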

The ontologies are, however, an even weaker part of the cake. The fact that some researchers are using ontologies does not mean that “the argument in favor of using ontologies has been won”; it hardly means anything. I already discussed ontologies on my blog one and a half years ago, when comparing them to tagging (Folksonomies vs. Ontologies). Since then things have become even clearer, and they don’t look good at all for ontologies. Tagging and searching skyrocketed, while ontologies remained as unattractive as they have always been. So before we see some examples, let me first tell you that in the newest Semantic Web vision paper they discuss folksonomies too. They think they are some kind of “broken ontologies”, and this really makes me laugh every time I think about it. It is like saying now (in 2006) that a PC is some sort of broken mainframe, in the hope that people will start using mainframes again. Sure, as Henry Story pointed out, folksonomies and ontologies are not really exclusive. However, folksonomies are not broken ontologies. Folksonomies are widely used, even though they appeared only some years ago, while ontologies never caught on even though they have existed since Aristotle.

“The Open Directory Project is the largest, most comprehensive human-edited directory of the Web”: a large ontology. “It is constructed and maintained by a vast, global community of volunteer editors”. I am a small editor for ODP and I know the whole principle behind it is flawed. You simply cannot classify web sites meaningfully into a general ontology, because there are way too many categories a site has to belong to (and by way too many, I mean hundreds if not thousands). The process is usually quite arbitrary, and in the end nobody browses the categories anyway, even when using ODP directly: ODP has search. And del.icio.us is now better anyway, after only three years of existence.
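The difference between filing a site under one directory path and tagging it freely can be sketched in a few lines of Python. The site, the category path, the tags and both helper functions are made up for illustration; this is not how ODP or del.icio.us actually store their data.

```python
from collections import defaultdict

# Directory-style classification: each site is filed under exactly one path.
directory = {
    "http://photos.example.org": "Arts/Photography/Sharing",
}

# Tag-style classification: any number of free-form labels per site.
tags = {
    "http://photos.example.org": {"photography", "social", "hosting", "community"},
}


def find_in_directory(directory, category):
    """Return the sites whose single category path contains the given category."""
    return {site for site, path in directory.items() if category in path.split("/")}


def build_tag_index(tagged):
    """Build an inverted index from each tag to the sites carrying it."""
    index = defaultdict(set)
    for site, labels in tagged.items():
        for label in labels:
            index[label].add(site)
    return index


tag_index = build_tag_index(tags)

# The site is findable only along the one path an editor picked for it...
print(find_in_directory(directory, "Photography"))
print(find_in_directory(directory, "Social"))

# ...but under every label anybody bothered to attach.
print(tag_index["social"])
```

Adding a fifth or a five-hundredth tag costs nothing, while adding a second category path means an editor has to decide where in the tree it belongs.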

Yahoo Directory: Yahoo hired the best people they could find to build its directory service (also an ontology). The result: who uses Yahoo Directory when there is Yahoo Search? Now when I go to Yahoo Directory I only get a search field. They entirely disabled browsing the directory (should they still call it a directory in this case?). And when searching for more obscure words you get: “No Directory Search results were found. Showing Web Search results for the term …”.

Cyc is “the world’s largest and most complete general knowledge base and commonsense reasoning engine”. They have one of the largest ontologies in the world, and it was recently open sourced. Did anyone care? No! Probably they finally realized it’s not worth anything, so they dumped it on the community under the Apache license. They worked more than 20 years on it! It was supposed to make true A.I. happen. Now it’s the last individual of an already extinct species. The bubble in AI burst a long time ago, just as the one in the Semantic Web will burst sooner or later. Everything that is built only of hot air will eventually burst.

File systems have always been the best example of a very specific ontology the user has to build. The response: desktop search is a technology that is currently skyrocketing. Why? Simply because people don’t really want the trouble of maintaining even a very small ontology. Usually the files end up in a terrible mess on the desktop and then in the garbage can. When I moved to a Mac, I first thought that programs like iTunes, which want to organize my music files themselves using metadata from the web, were way too intrusive. This was because I was still used to maintaining the ontology myself. But now I am thinking, why bother? Why not let iTunes organize (i.e. index) my music? Then I can use search to find whatever I am looking for.

OWL is just an XML format for ontologies. It is subject to all the problems of ontologies I mentioned above. I hardly know of any places where it’s meaningfully used. Every time I ask, people give me the FOAF example. The vision paper talks about domain-specific ontologies, for example in the life sciences. However, I think even domain-specific ontologies are too general and too hard to build. Take the Gene Ontology as an example: everybody has different ideas of what information should be stored there, and almost everybody has something to add. Getting the community to agree on something is very hard, because they all have different problems that they are trying to solve using different techniques. And if you consider the benefits versus the costs of building task-specific ontologies for every application you build, then you will probably come to the same conclusion I have. Ontologies are not overrated, they are conceptually broken.

Some people share (at least in part) the same opinion. Believe it or not, I did not read any of them before writing this; I just used search to find them:

(to be continued, someday, after this bubble bursts)

2 Responses to Semantic Wet Dreams

  1. […] So what might have been an alternative solution? Now I really don’t know any more. When I first posted this I thought that deprecating HTML and XHTML Transitional entirely (and maybe removing the validators for them, they are poor anyway) in favor of XHTML would have been an alternative. Then whoever would still want to publish “tag soup” online would not adhere to any standard, and whoever wants to render “tag soup” in a browser would not adhere to any standard (this is the current situation anyway, and it’s quite unlikely to change). However, maybe this could have been an incentive for everybody (web developers, web publishing software developers and web browser developers) to go away from “tag soup” and towards something more “meaningful”. Maybe I am wrong, but I don’t think that they can build their semantic wet dreams by having “tag soup” as a foundation. But well, who cares about their semantic dreams anyway? At least not me. Not any more. […]
