Words, Ideas, and Things: December 2012

Thursday, December 27, 2012

What Is The Semantic Web?

The Semantic Web is—or is hoped to be—the next revolution in the way the Internet is used, just as the World Wide Web was a revolution in the way the Internet was used. To get some perspective, we need to look back at history.

Before the Internet, computers existed as standalone machines, possibly with multiple monitor/keyboard terminals spread around a building. For long-distance connections, wired circuits (think modems) had to be brought up and then maintained throughout a session. Local networks existed, but each network vendor had its own incompatible system. There wasn't a standard way of communicating across networks.

The Internet began as a U.S. Department of Defense project to connect research universities. By the end of 1969, networks at four universities were connected to each other. In 1983, the communication standard of this inter-network ("between"-network) was changed to the TCP/IP protocol suite, which is still the basis of Internet communication today. With an IP address (e.g. 203.0.113.100) and a port (e.g. 25), a computer in California can connect to the email program on a computer in Germany and leave a message for a user there. Or, slightly more user friendly, a kid growing up in rural North Dakota could use the telnet application to connect to a domain name (genesis.cs.chalmers.se) along with a port (3011) to play a text adventure game running on a university server in Sweden. (They've changed the address a bit since I was in high school.)

There was useful and fun stuff going on before the World Wide Web, but it was hard to discover new resources. It's hard to believe now, but it was common in those days to learn about Internet sites by reading about them in books. Paper books! Sure, there were Gopher servers with manually-maintained hierarchical categories of Internet resources, but these directories didn't keep up very well and the resources didn't usually link to each other.

The World Wide Web began as an internal project at CERN, the particle physics research center on the border of Switzerland and France. Researchers needed a better way to organize their information in a busy environment with lots of job turnover, so Tim Berners-Lee proposed a solution for CERN intentionally designed to work on a global scale as well. He wrote:

"a 'web' of notes with links (like references) between them is far more useful than a fixed hierarchical system. When describing a complex system, many people resort to diagrams with circles and arrows. Circles and arrows leave one free to describe the interrelationships between things in a way that tables, for example, do not. The system we need is like a diagram of circles and arrows, where circles and arrows can stand for anything." (source)

He was serious about the "anything" part, but we'll get back to that. As implemented, the circles came to represent documents and the arrows became references to other documents. Web pages linking to other web pages! The notion of document interlinking had been around for decades, but the World Wide Web turned the idea into practical, worldwide reality.

Linked documents sounds a little boring, but programmers have found ways to make web "documents" very interactive. Many other Internet applications have migrated into the web browser. Gopher was replaced by Yahoo (before Yahoo became a tabloid). Home users are more likely to use web mail than a standalone mail client. Twitter and Facebook have largely replaced IRC and other instant messaging clients. Web services (and web mail, unfortunately) are used to transfer files instead of FTP. It's a good thing that applications like Skype and BitTorrent exist, or people might forget there's a difference between the Internet and the World Wide Web!

What's next?

Many great things happened after we started linking documents; what if we try linking finer-grained pieces of data in usable ways? That's the idea behind the Semantic Web.

Think of it this way: the World Wide Web allowed organizations and individuals to put their relatively static documents "out there" for the world to see. But what about database-generated content like library catalogs, or online store pricing, or current weather conditions? Web crawlers might be able to retrieve and usefully interpret some of this data, but that usually requires special per-site programming that breaks if the API or web formatting changes.

Getting Across Town, The Semantic Way

Here is an example of a web-published bus route:

http://lincoln.ne.gov/city/pworks/startran/routemap/weekday/route41.htm

An experienced bus rider can read this page and figure out how to plan a trip. A computer program would need help understanding how to parse all of this visually-structured data into precisely labeled information that it can reason about. Quick, what time does the last southbound bus leave "North Walmart" on Thursdays? It's not a trivial process to give that answer, even after we visually interpret the numbers as times in columns that correspond to bus stop locations on the map below. An even harder question might be: "I'm at arbitrary location X and want to reach location Y; what bus route gives me the shortest total walking distance?" In this case, a human on the right website might still have to manually look through all bus route pages, narrow it down to a couple of likely shortest routes, then spend more time comparing the tradeoff between walking farther to the first bus stop or walking farther from the last bus stop.

What would be really neat is a way for bus services and street map services to publish their data on the web in a computer-friendly form that allows third party web apps to combine all of this information and calculate answers to such questions. Even better: a universal format so mash-ups from unexpected combinations of data sources are easier to make. I'm thinking of a music app that checks your GPS position and your destination so it can create a playlist that ends within thirty seconds before your final stop. Or an emergency flight plan app that cross-references ticket pricing options with weather predictions. Or a recipe web site that lets you mark missing ingredients and shows their pricing from the five closest stores. Or a personalized book recommendation site that filters by currently available titles in local public libraries. Or imagine searching the web for information on a brand-name drug, and the top results use the drug's generic name without mentioning the brand name.
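To make "computer-friendly form" a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the record layout, the "Downtown" stop, and all of the times are made up and are not taken from the real route page. The point is only that once each departure is a labeled record instead of a cell in a visual table, the "last southbound bus" question turns into a short filter.

```python
from datetime import time

# Hypothetical, made-up departures; not real timetable data.
departures = [
    {"stop": "North Walmart", "direction": "southbound", "days": "weekday", "time": time(6, 15)},
    {"stop": "North Walmart", "direction": "southbound", "days": "weekday", "time": time(17, 45)},
    {"stop": "North Walmart", "direction": "northbound", "days": "weekday", "time": time(18, 30)},
    {"stop": "Downtown", "direction": "southbound", "days": "weekday", "time": time(18, 5)},
]

def last_departure(records, stop, direction, days):
    """Return the latest departure time matching all three labels, or None."""
    times = [
        r["time"] for r in records
        if r["stop"] == stop and r["direction"] == direction and r["days"] == days
    ]
    return max(times) if times else None

print(last_departure(departures, "North Walmart", "southbound", "weekday"))
```

The walking-distance question would need stop coordinates and street data as well, which is exactly the kind of cross-source combination the paragraph above is wishing for.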

Many of these things are possible without semantic web technology; they just require more work to set up and don't tend to be very reusable. For example, Google Transit can help with bus route planning, if a city has formatted its data specifically for this Google web app and joined the transit partner program. But what if a new business wants to reuse this information in a creative way? What if Google cancels the Transit service? It would be preferable to have an open standard for open data.

Linked Data

What's the plan, then? Open existing relational databases to the public? Not exactly. The World Wide Web Consortium is pushing for another database model that's a more natural fit for the web: a graph-style data model. From the Wikipedia article:

"Compared with relational databases, graph databases are often faster for associative data sets, and map more directly to the structure of object-oriented applications. They can scale more naturally to large data sets as they do not typically require expensive join operations. As they depend less on a rigid schema, they are more suitable to manage ad-hoc and changing data with evolving schemas. Conversely, relational databases are typically faster at performing the same operation on large numbers of data elements."

In other words, graph databases are less efficient but more flexible (see also The Death of the Relational Database). For people who aren't math majors or computer programmers, "graph database" may sound like "graphical database." But what's meant is graph theory: a bunch of nodes and connections between nodes, usually visualized as circles and lines. A directed graph adds direction to those lines, so you get circles and arrows. Recall what Tim Berners-Lee wrote in his original proposal for the World Wide Web: "The system we need is like a diagram of circles and arrows, where circles and arrows can stand for anything." The World Wide Web is made of connections like this:

(http://en.wikipedia.org/wiki/Cat) --links to--> (http://www.catpert.com/)

Each URL (Uniform Resource Locator) is a circle and web links are the arrows. If you can imagine all URLs and all arrows between them as a gigantic diagram, you're visualizing the World Wide Web as one big directed graph.

Now imagine that the circles can stand for anything, not just web documents. Imagine that the arrows can stand for any relationship, not just navigation links.

(rain gauge #2,388) --detected rain depth--> (3 cm)
(rain gauge #2,388) --time since last emptied--> (60 min)
(rain gauge #2,388) --location--> (Millennium Stadium)
(Cardiff) --contains--> (Millennium Stadium)

A web app that has access to this information can now answer the question, "How much has it rained in Cardiff in the last hour?" "An average of 3 cm, as reported by 1 rain gauge." Or with more gauges it might be, "An average of 2.95 cm, as reported by 15 rain gauges." These (something) --related somehow--> (something) snippets of information, called triples, can combine into complex graphs of data. And, like web pages, this can happen across servers. The rain depth information could be on one server that only knows the gauge is in Millennium Stadium, while another server knows that Millennium Stadium is in Cardiff. In fact, it makes sense to reference a separate server with lots of geographical knowledge rather than trying to maintain geographical info on a specialized rain gauge server. If the geography server is updated, the rain server automatically and instantly benefits! This is an example of the synergy that can happen with linked data.
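Here is a rough Python sketch of that Cardiff question, using only the four triples from the example. The two lists stand in for the two servers; the function names and the plain-tuple representation are my own assumptions, not how any real linked data service works. As a simplification, a gauge only counts if it was emptied exactly 60 minutes ago, so that its depth corresponds to "the last hour."

```python
# Triples held by a (hypothetical) rain gauge server.
rain_triples = [
    ("rain gauge #2,388", "detected rain depth", 3),        # cm
    ("rain gauge #2,388", "time since last emptied", 60),   # min
    ("rain gauge #2,388", "location", "Millennium Stadium"),
]

# Triples held by a separate (hypothetical) geography server.
geo_triples = [
    ("Cardiff", "contains", "Millennium Stadium"),
]

def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) appears in triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def rain_in_last_hour(city):
    """Return (average depth in cm, gauge count) for gauges inside the city."""
    places = set(objects(geo_triples, city, "contains"))
    gauges = [s for s, p, o in rain_triples if p == "location" and o in places]
    depths = []
    for gauge in gauges:
        # Simplification: only gauges emptied exactly an hour ago are counted.
        if objects(rain_triples, gauge, "time since last emptied") == [60]:
            depths.extend(objects(rain_triples, gauge, "detected rain depth"))
    if not depths:
        return None, 0
    return sum(depths) / len(depths), len(depths)

average, count = rain_in_last_hour("Cardiff")
print(f"An average of {average:g} cm, as reported by {count} rain gauge(s).")
```

Notice that the rain list never mentions Cardiff at all. The answer only exists once the two sets of triples are joined on "Millennium Stadium."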

Wait, Where Are These Factoids?

Regular web links are in web pages and point to other web pages; we're used to that by now. But where are these triples located? They can be embedded into web page code in the form of RDFa. Graph databases called triplestores can also be put on the Internet and directly queried, much as a SQL database could be if it weren't hidden behind an intermediary website. In either case, typical Internet users won't "see" the Semantic Web directly as they see the World Wide Web's documents and links. The Semantic Web exists as a programming-oriented sibling or add-on to the World Wide Web, not as a replacement. Applications use the Semantic Web to enhance traditional web services.
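To give a feel for what "directly queried" means, here is a toy in-memory stand-in for a triplestore, written in plain Python. This is only a sketch: real triplestores are usually queried with the SPARQL query language rather than a function like this, and the match helper and its None-as-wildcard convention are my own invention for illustration.

```python
# A toy in-memory "triplestore" reusing the rain gauge example.
store = {
    ("rain gauge #2,388", "detected rain depth", "3 cm"),
    ("rain gauge #2,388", "location", "Millennium Stadium"),
    ("Cardiff", "contains", "Millennium Stadium"),
}

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o) for s, p, o in store
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )

# Everything known about the gauge:
for triple in match(store, subject="rain gauge #2,388"):
    print(triple)

# Everything that points at the stadium:
for triple in match(store, obj="Millennium Stadium"):
    print(triple)
```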

What Makes the Semantic Web "Semantic"?

In philosophy, linguistics, and computer science, semantics has to do with meaning in contrast to syntax (which has to do with structure or format). Remember Mad Libs?

The [adjective] outlaw [transitive past tense verb] a [common noun].

So long as these blanks are filled in with the specified parts of speech, the resulting sentence will be syntactically correct; it will have the right format for an English sentence. For example:

The lonely outlaw whistled a tune.
The law-abiding outlaw drank a mortgage.

The second sentence may have proper syntax, but it's nonsense. Because of their meaning, certain words and phrases don't go well together, at least not in a literal sense. Something else to consider:

This isn't a dog, it's a doberman pinscher.

Again, nothing wrong with the syntax, but a doberman pinscher is a type of dog. Another case:

There were witch trials in Salem.

The truth of this sentence depends (in part) on which Salem is meant. It's a true claim when referring to Salem, Massachusetts. It's false for Salem, Iowa...and many other Salems. In standalone databases, ambiguities and mismatched concepts like these aren't much of a problem. A database created for a certain purpose in a certain context has implicit restrictions on the meaning of its data. A Massachusetts newspaper database and an Iowa newspaper database are going to mean something different by just plain "Salem." What happens if we try to publish all of these databases on the web and expect the data to mesh well together? Chaos, unintentional humor, and a general lack of usefulness!

For this reason, the Semantic Web has to be about more than just publishing everyone's data as (subject) --predicate--> (object) triples. Here's a flawed set of triples:

(witch trials) --took place in--> (Salem)
(Tom) --born in--> (Salem)

Was Tom born in the same city that the witch trials took place in? We can't tell because we don't know if the two "Salem"s are the same, or which "Tom" is meant. To solve this problem, URIs (Uniform Resource Identifiers) are used, roughly like this:

(http://dbpedia.org/resource/Category:Salem_witch_trials)
--(http://sw.opencyc.org/2008/06/10/concept/en/eventOccursAt)-->
(http://dbpedia.org/resource/Salem,_Massachusetts)

(http://dbpedia.org/page/Thomas_Poulter)
--(http://dbpedia.org/ontology/birthPlace)-->
(http://dbpedia.org/resource/Salem,_Iowa)

In this case, the "Tom" in question was born in a different Salem. If the URIs had matched up, it would have been possible to draw a new conclusion along the lines of (Tom) --born where occurred--> (Salem witch trials). Why call these URIs rather than URLs? Because they don't necessarily correspond to a visitable web page, although it's considered best practice to make such a page available when possible. A URI can identify a resource (or a concept!) without necessarily providing a location.
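The difference between the flawed triples and the URI version is easy to see in a few lines of Python. This is just a sketch with plain strings standing in for identifiers; the variable names are mine, and the two place URIs are the ones used in the triples above.

```python
DBR = "http://dbpedia.org/resource/"

# Ambiguous plain-text version: the two "Salem"s compare equal by accident.
plain = {
    "witch trials": "Salem",
    "Tom": "Salem",
}

# URI version, using the place identifiers from the triples above.
trials_place = DBR + "Salem,_Massachusetts"
tom_birthplace = DBR + "Salem,_Iowa"

print(plain["witch trials"] == plain["Tom"])  # True, but misleading
print(trials_place == tom_birthplace)         # False: two different Salems
```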

Did you notice that the URIs above come from both dbpedia.org and opencyc.org? There isn't a single, authorized web domain for the URIs used in linked data. Different organizations can contribute to the pool of URIs. What if two organizations use different URIs for the same thing? There's a triple for that!

(http://dbpedia.org/resource/Salem,_Massachusetts)
--(http://www.w3.org/2002/07/owl#sameAs)-->
(http://sw.cyc.com/concept/Mx4rvViiFpwpEbGdrcN5Y29ycA)
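A program that wants to treat those two URIs as one thing has to do a little bookkeeping. Here is one possible sketch in Python, using a small union-find structure to group URIs connected by sameAs triples. The function names and the "smallest string wins" rule for picking a representative are assumptions of mine; real reasoners handle owl:sameAs in more sophisticated ways.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = [
    ("http://dbpedia.org/resource/Salem,_Massachusetts",
     SAME_AS,
     "http://sw.cyc.com/concept/Mx4rvViiFpwpEbGdrcN5Y29ycA"),
]

def build_canonical(triples):
    """Map every URI in a sameAs group to one representative URI."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for s, p, o in triples:
        if p == SAME_AS:
            a, b = find(s), find(o)
            if a != b:
                # Arbitrary choice: the lexicographically smaller URI is the root.
                parent[max(a, b)] = min(a, b)
    return {uri: find(uri) for uri in list(parent)}

canonical = build_canonical(triples)

def same_thing(a, b):
    """True if the two URIs are identical or linked through sameAs triples."""
    return canonical.get(a, a) == canonical.get(b, b)

print(same_thing("http://dbpedia.org/resource/Salem,_Massachusetts",
                 "http://sw.cyc.com/concept/Mx4rvViiFpwpEbGdrcN5Y29ycA"))
```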

What about mismatches between URIs for "doberman pinscher" and "dog"? As you might guess by now, a predicate (i.e. middle URI) can be used to say that a doberman is a type of dog. Then, hopefully, any computer program trying to decide if a given specimen is a dog won't stop at finding out that it's a "doberman pinscher"; it will check to see if doberman pinschers are dogs.
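A sketch of that "is it a dog?" check in Python might look like the following. I'm assuming the RDF Schema predicate rdfs:subClassOf as the "is a type of" relationship, and the example.org URIs (including the extra Mammal class) are placeholders I made up, not part of any real vocabulary.

```python
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
EX = "http://example.org/"  # placeholder namespace, not a real vocabulary

triples = [
    (EX + "DobermanPinscher", SUBCLASS_OF, EX + "Dog"),
    (EX + "Dog", SUBCLASS_OF, EX + "Mammal"),
]

def is_a(triples, thing, target):
    """True if `thing` equals `target` or reaches it by following subClassOf links."""
    seen, frontier = set(), [thing]
    while frontier:
        current = frontier.pop()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(
            o for s, p, o in triples if s == current and p == SUBCLASS_OF
        )
    return False

print(is_a(triples, EX + "DobermanPinscher", EX + "Dog"))  # True
print(is_a(triples, EX + "Dog", EX + "DobermanPinscher"))  # False
```

The links only run one way: every doberman is a dog, but not every dog is a doberman.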

To answer the original question, what makes the Semantic Web "semantic"? All of the background work done by ontologists to separate and combine concepts and to specify the relationships among them. The Semantic Web isn't just about breaking data out of individual databases; it's about publishing data in terms of these shared vocabularies and relationship schemes. For data to be useful (and reusable) in a giant, global database, the information that was implicit in the context and structure of local databases has to become explicit. The triple format does this for structure. Ontology work does this for meaning.

When Will "Semantic Web" Be a Household Name?

It probably won't ever be a term everyone knows. The semantic revolution is happening behind the scenes among scientific, business, and cultural heritage groups. If things go well, the Semantic Web will increasingly influence the average person's experience with traditional web sites and services. Even if today's technical implementation of the Semantic Web remains niche, I have no doubt that some of its motivating ideas will reappear in future technologies.

Related Reading

W3C Semantic Web (Standards Information)
Semantic University (Introductory lessons)
Linked Data: Evolving the Web into a Global Data Space (Online book)
Semantic Data Integration on Biomedical Data Using Semantic Web Technologies (Book Chapter)
Semantic Web Challenge (Yearly App Awards)
LinkedData.org (Guide to projects and resources)

Sunday, December 16, 2012

Lingo: Authority Control

http://www.flickr.com/photos/kamikazestoat/425526222

Some Delicious tags:

These are user-submitted tags to help other users find webpages on a given topic.

Suppose I just found some interesting Legend of Zelda alt art and want to link it on Delicious. Which tag do I use? legendofzelda is popular, but so is zelda. If I want everyone to see my link, I had better use both! Maybe this is good enough, but since there will still be people browsing through the other tags listed above, should I use all of them? How do I know I've even found them all? What if someone starts using the tag zeldaseries next week?

Hey, maybe someone should clean up this mess by designating an official tag for the Legend of Zelda video game series. We could call this the authorized tag. Here is a great three-part plan:

1. Decide on authorized tags for every distinct topic on Delicious.
2. Make sure that all current and future Delicious links use the authorized tags.
3. Enjoy finding all links related to a topic under one tag (and nothing unrelated)!

In Library Science lingo, steps one and two are called authority work: the behind-the-scenes work that needs to be done to have neatly organized access points to resources. Access points can be titles, names, or topics.

The Legend of Zelda: The Wind Waker (Video Game) -- a title
Miyamoto, Shigeru, 1952- -- a name
Sailing -- a topic

or a little older:

Dracula (Novel) -- a title
Stoker, Bram, 1847-1912 -- a name
Vampires -- a topic

A close synonym to authority work is authority control. I prefer to think of authority control as the goal of authority work. In other words, we do authority work to achieve a state of authority control (as in step three above). But it's more common to combine the concepts:

"Authority control is the process of bringing together all of the forms of name that apply to a single name; all the variant titles that apply to a single work; and relating all the synonyms, related terms, broader terms, and narrower terms that apply to a single subject heading." — Arlene Taylor, The Organization of Information (3rd edition), p. 44

It's not the most intuitive terminology. "Access point control" or "name deduplication" or "not having a pile of inconsistent labels" would all be better.

A Professionals Only Club?

Delicious is not likely to change its tagging system. Authority control has great benefits, but it takes a lot of extra time and effort. Delicious is fantastic for what it offers: quick-and-easy bookmark tagging and decent (if flawed) bookmark discovery.

Does this mean authority control is only in reach for professional librarians? Nope! I can think of a major Web 2.0 site that lets users participate in a kind of authority work: Wikipedia.

http://commons.wikimedia.org/wiki/File:Pommes-1.jpg

Quick! What are these called:

...pommes, chips, French fries? ...pommes frites, slap chips, Belgian fries?

Imagine separate Wikipedia articles for these variations and many more. Not desirable, to say the least. Wikipedia handles this situation by letting users decide on a single article title (e.g. French Fries) and creating redirects for alternate titles.
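The redirect idea boils down to a lookup table from variant names to the one authorized form. Here is a minimal Python sketch using the names from the list above; the resolve function and the case-insensitive matching are my own assumptions and say nothing about how Wikipedia actually implements redirects.

```python
# Variant names from the list above; "French fries" is the authorized form.
AUTHORIZED = "French fries"
variants = ["pommes", "chips", "pommes frites", "slap chips", "Belgian fries"]

# Every variant (and the authorized form itself) points at the authorized form.
redirects = {v.casefold(): AUTHORIZED for v in variants}
redirects[AUTHORIZED.casefold()] = AUTHORIZED

def resolve(title):
    """Return the authorized title for a known variant, else the input unchanged."""
    return redirects.get(title.strip().casefold(), title)

print(resolve("Slap Chips"))     # French fries
print(resolve("pommes frites"))  # French fries
print(resolve("poutine"))        # poutine (no redirect known)
```

The table is the easy part. The authority work is in deciding which form is authorized and keeping the list of variants up to date.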

Why does this work for Wikipedia but not for Delicious? Primarily because of the number of volunteer editors willing to do this kind of behind-the-scenes work for articles. Trying to keep Delicious links organized would be much more maddening with much less payoff.

Controlled Vocabulary Resources

Not every library or website needs to come up with its own authorized titles, names, or subjects. Here are some (more or less) publicly available lists that can at least serve as a starting point:

Library of Congress Subject Headings. A very broad and inclusive set of subject terms. Academic libraries tend to re-use these for their collections. Example: Ships. Smaller libraries often use the Sears List of Subject Headings instead.

Library of Congress Name Authority File. Example: Rice, Anne, 1941-. Also see Getty's Union List of Artist Names. Example: Mondrian, Piet (Dutch painter, 1872-1944).

Library of Congress' Thesaurus for Graphic Materials. Check the three "Browse By" links on the left. Example: Nitrate negatives. Getty's Art & Architecture Thesaurus. Example: Googie.

Individuals might prefer to use vocabularies like these rather than come up with their own blog tags, image tags, or music tags. You can look beyond the library and archives scene too. If I had a music review blog, I would probably use AllMusic's genre name hierarchy. Example: Americana. Right now this doesn't do a lot of good on one blog, but the growth of Semantic Web technologies may mean better use of authorized vocabulary on the public web in the future. Or the SEO leeches might just mess that up too. Either way, you can always visit your library and take advantage of the authority control someone worked so hard to set up there!

Friday, December 7, 2012

Nicomachean Ethics (Pt. 2)

[Series introduction and table of contents here.]

Book I, Chapter 4

In my comments on Chapter 2, I described Aristotle's "grand goal" as the political art. That wasn't quite right. What he was saying back then and reiterates here in Chapter 4 is that the highest of goods is the same as whatever the political art's goal is. He sees politics as the most encompassing activity in human life, so its goal would be the most encompassing goal. And what is the goal of the political art? Happiness.

All human activities are subordinate to politics and politics is aimed at happiness. Got it. Aristotle doesn't feel the need to argue for the answer of "happiness" because he takes it as universally accepted by both "the many" and "the refined." (Yes, he's just a tad elitist.) He does note that "the many" give a variety of explanations for what constitutes happiness, e.g. health, wealth, pleasure, etc.

"Certain others, in addition, used to suppose that the good is something else, by itself, apart from these many good things, which is also the cause of their all being good."

"Certain others" being Plato and friends, obviously. It's interesting how Aristotle puts some distance between himself and this view. Before he elaborates, however, he goes off on another tangent about arguing from principles vs. arguing to principles. Why does he do this? I think it's because he wants to excuse himself from starting with Plato's principles. He actually names Plato as someone who understood these two different directions of argument. He's tip-toeing around his audience's reverence for his own former teacher. Aristotle is firmly on the side of arguing to principles, which might sound bad until you realize he's trying to be more of a scientist than an ideologue; he wants to use induction to discover what the true principles are from "things known to us" rather than "things known simply."

"Perhaps it is necessary for us, at least, to begin from the things known to us."

See, he's not being arrogant by going his own way from Plato. He's being extra humble.

Book I, Chapter 5

There are three "especially prominent" ways of life:

The life of enjoyment. This is what "the many" choose to pursue, though some rulers do as well. Aristotle calls this "the life of fattened cattle." These people think happiness and pleasure are the same.

The political life. The "refined and active" live the political life by pursuing honor...or maybe virtue. Aristotle considers the possibility that honor is more of a reaction people have when they encounter a person with virtue, which would make virtue the primary goal. He's not quite happy with this result, however, since there are many cases where the exercise of virtue and happiness seem at odds.

"For it seems to be possible for someone to possess virtue even while asleep or while being inactive throughout life and, in addition to these, while suffering badly and undergoing the greatest misfortune. But no one would deem happy somebody living in this way, unless he were defending a thesis."

Funny! But I have to wonder if Aristotle is being overly dismissive of the possibility of being fulfilled and happy despite great suffering, because a person is so overwhelmingly interested in what they're accomplishing.

The contemplative life. A footnote here says that Aristotle doesn't get around to explaining the contemplative life until Book X, Chapters 6-8. I've already seen how easily distracted he is, but this has to be some kind of record! Is "sophistication" a Greek word meaning "disorganized"?

Book I, Chapter 6

Aristotle argues that good can't be a Platonic form (see the "Certain others..." block quote above) because, roughly:

For something to have a Platonic form, its expressions must pertain to a "common idea."
Good can pertain to both what something is and its relations to other things.
What something is is an essential property.
How something relates to other things is an accidental property.
A common idea can't be both essential and accidental.
Therefore good can't be a Platonic form.

He goes on to list other difficulties in understanding good as a single idea. But then he admits that maybe we can divide instances of good into "things good in themselves" and things that "are advantageous" so we can consider whether the multiplicities of good might only be a problem for the latter category (what philosophers today call "instrumental good"). Perhaps there is a single idea common to all things good in themselves. For example, what if the idea of good itself is the only thing that is good in itself? Aristotle calls this "pointless."

In order to avoid pointlessness, it must be the case that all instances of things that are good in themselves outwardly manifest good in a common way, "just as the definition of whiteness is the same in the case of snow and in that of white lead." Aristotle believes that "honor, prudence, and pleasure" are good in themselves because people pursue these things for their own sake (even if they also pursue them in an instrumental sense). He doesn't see how the good of honor and the good of pleasure, for example, manifest in a common way, so good can't be a Platonic form even if we set aside instrumental goodness.

Now Aristotle has a problem. Why the heck do we call all of these disparate things "good" if they don't share a common idea?

"For they are not like things that share the same name by chance. Is it by dint of their stemming from one thing or because they all contribute to one thing? Or is it more that they are such by analogy?"

He doesn't have a ready answer. Instead, he points back at the Platonists and accuses them of having problems explaining how totally abstract forms and concrete human action interact with each other. Reminds me of physicalists in philosophy of mind who defend themselves by pointing out issues with Cartesian dualism.

I wonder what Aristotle would have made of Paul Ziff's book, Semantic Analysis. It seems to me that Ziff answered the question by discovering that things are never good in themselves and it's the other category that can fold neatly into a single idea.

Quotes from: Bartlett, R.C. & Collins, S.D. (2011). Aristotle's Nicomachean ethics: A new translation. Chicago: The University of Chicago Press.