Semantic Wikipedia Search

Master Thesis - Applied Computer Science
Albert-Ludwigs-Universität Freiburg im Breisgau
SUSI: Wikipedia Search Using
Semantic Index Annotations
Albert-Ludwigs-Universität Freiburg im Breisgau, Faculty of Engineering
Supervisor: Prof. Dr. Hannah Bast
Reviewers: Prof. Dr. Hannah Bast, Prof. Dr. Georg Lausen

We present Susi, a system for efficient semantic search on the English Wikipedia.
Susi combines full-text and ontology search. For example, for the query penicillin scientist, Susi recognizes that scientist is a type of person, and returns a list of names of scientists that are mentioned along with the word penicillin. We argue that neither full-text search alone nor ontology search alone is able to answer these kinds of queries satisfactorily. Along with the list of entities matching the query (the names of the scientists in the example), Susi also provides excerpts from the text as evidence.
The data structure behind Susi is an index for the CompleteSearch search engine.
This index is enriched by constructs derived from facts from the Yago ontology. The challenge was to do this in a way that keeps the index small and enables fast query processing times. We present an annotation style, specifically designed to eliminate the index blowup associated with adding semantic information. In our experiments on the complete English Wikipedia (26 GB XML dump), Susi achieved average query times of around 200 milliseconds with an index blowup of only 42% compared to ordinary full-text search.
We also examine result quality by comparing the contents of hand-compiled Wikipedia lists like "List of drug-related deaths" against the output of Susi for corresponding semantic queries (drug death person). We come up with a simple typification of the kinds of errors that can occur. One of our findings is that the vast majority of false positives is due to omissions on the side of the Wikipedia lists, while the vast majority of false negatives is due to omissions in the Yago ontology.
We present Susi, a system for efficient semantic search on the English Wikipedia. Susi combines full-text search with search in ontologies. For a query penicillin scientist, for example, Susi recognizes that scientist is a particular kind of person, and accordingly finds the names of scientists that are mentioned together with the word penicillin. We argue that neither full-text search nor search in ontologies alone can answer these kinds of queries satisfactorily. Along with the list of entities matching the query, Susi also presents excerpts from the full text as evidence.

The data structure behind Susi is an index for the CompleteSearch search engine. This index is extended by constructs derived from the Yago ontology that enable the desired semantic search. The challenge was to keep the index small while enabling fast response times to queries. In our experiments on the English Wikipedia (26 GB XML dump), Susi achieves average response times of 200 milliseconds with an index that is only 42% larger than for conventional full-text search.

We also examine, by comparison with hand-compiled Wikipedia lists such as "List of drug-related deaths", the quality of the answers that Susi delivers for corresponding semantic queries (drug death person). We present a simple classification of possible errors into categories. One of our findings is that the majority of false positives can be traced back to missing entries in the Wikipedia lists, while the majority of false negatives can be traced back to missing entries in the Yago ontology.
1. Introduction

Imagine the following, simple question: Which scientists have been involved with penicillin? While this seems to be a question that could easily be answered with the help of Wikipedia, imagine a query for a search engine that is supposed to fulfill this purpose. Traditional search in the Wikipedia documents is not able to reflect the semantics of our query. Finding the word scientist is not what we want. Instead, we are interested in instances of the class scientist.
We follow the idea of combining full-text and ontology search. In this thesis we discuss the creation of our system Susi that enables semantic search on the English Wikipedia. Since the term semantic search is quite loose, we will discuss the exact problem in the following. Section 1.1 introduces the motivation for work on that topic and points out which aspects are being addressed. Subsequently, section 1.2 explicates the thesis' contributions. Section 1.3, finally, gives an overview of the structure of the rest of this document.
1.1. Motivation

The idea of a Semantic Web, where computers are supposed to understand the meaning behind the information on the Internet, has been around for years. Modern approaches usually involve the usage of some database that provides a formal representation of the semantic concepts of a domain of interest. There is also great work that accomplishes semantic search in this sense; some of it is described further in chapter 3. But while the technology works great, a query's result can only be as good as the database.
In general, many resources carry their information somewhere in the text. Humans can understand it well, but it starkly differs from the formal, structured representations that are used by existing approaches. Although information and facts can be successfully extracted from full-text, the amount and depth of extracted information is always limited. Any extraction tries to distinguish important facts from less important ones. Usually this revolves around a specialization on some domain of interest. The majority of unspecific information remains hidden.
Search engines are without doubt the most popular way to find certain pieces of information in a huge amount of text. However, they usually do not try to grasp the semantics of a query. They rather aim to deliver results based on the occurrences of query words in documents¹. While this concept clearly has proven itself in practice, we can still find some limitations.
Think of the query penicillin scientist. This query should return scientists that are involved with penicillin in some way: first and foremost its discoverer Alexander Fleming, but also the Nobel laureates that accomplished its extraction and many other scientists that are involved in one way or the other. For ordinary queries this task is typically solved quite well by search engines. However, the difficult part of this particular kind of query is understanding the word scientist. Important hits may revolve around documents where penicillin is mentioned close to an instance of a scientist. Usually this is a name, but a pronoun that references a scientist is also very likely. Additionally, there may be a mention of some chemist, biologist or whatever specific kind of scientist.
Consider the following example text from figure 1.1:

    Alexander Fleming was a bacteriologist. He discovered penicillin by accident.
Figure 1.1.: Example Text Excerpt
While being obvious for a human reader, the information that this is a hit for a scientist that has something to do with penicillin is not obvious for a computer that does not understand human language. First of all, this may be an excerpt of a huge document that mentions many different scientists in several contexts. Hence, one somehow has to preserve the fact that Alexander Fleming is mentioned very close to penicillin, or ideally that the pronoun "he" refers to him. Secondly, one has to know that the particular Alexander Fleming that is mentioned here is a bacteriologist or biologist. And finally, it has to be clear that a bacteriologist or biologist also is a scientist.
Luckily, the required information on types of persons (or of all entities in a wider sense) is manageable. It is much more likely that semantic ontologies exist that contain this kind of fact than that there are ontologies that cover basically everything that is present somewhere in the full-text. Hence, combining the power of full-text search with knowledge from semantic ontologies should enable new possibilities. Naturally, there are lots of challenging questions involved and only some of them can be dealt with at a time. Chapter 2 points out which of them are taken into consideration for this thesis, which approaches are chosen to realize them, and discusses this choice.
¹ Modern search engines are able to find similar words in the sense of error-tolerant search, handle synonyms, and more. However, these refinements are not relevant for the following argument and are not discussed further here.
1.2. Contributions

This thesis presents the power of the combination of ontology and full-text search.
Contributions include:

• Susi, a fully operational system based on an index for the English Wikipedia that allows the CompleteSearch search engine to handle queries that feature semantic categories.
• Heuristics for simple entity recognition in Wikipedia pages.
• Development, implementation and comparison of different concepts for annotating entities with their semantic categories.
• An implementation that fits the modular architecture of the CompleteSearch search engine, including:
  – A parser for the Wikipedia markup that recognizes and handles entities.
  – Tools that extract relevant knowledge from Yago and combine it with Wikipedia entities.
  – Integration into the CompleteSearch engine by providing files that stick to the standards used by CompleteSearch.
• An evaluation framework for semantic queries. Both general difficulties with semantic search and those specific to Susi are identified and classified into groups of problems.
While we are able to present a demo version that works nicely, there are still many things to improve. This thesis contains an outlook on possible future improvements. Some of them already include concrete solutions that can be put into practice without much effort; others only give an outlook on more complex problems that should be addressed in the future. However, all of them do contribute towards semantic Wikipedia search.
1.3. Structure of this Thesis
This document is separated into six chapters. Each chapter is supposed to provide the answer to a specific question.
Why do we do all this? This first chapter has introduced the problem and pointed out why we expect interesting possibilities for semantic search that have not been researched yet. Also, we have seen the contributions we wish to make by the construction of our system Susi and its discussion in this document.
What exactly do we do? In the second chapter, we define the scope of this thesis and hence the scope of Susi. We want to make clear which aspects we want to cover now and which ones are postponed, and give reasons for that choice.
What have others done? Chapter 3 presents what other researchers have contributed to the field. We identify and briefly examine related work concerning semantic search and point out where our approach differs.
How do we do it? The fourth chapter is the main chapter of this thesis. We present our solution step by step and try to give a general understanding of all aspects relevant to Susi. Finally, we reconsider a simplified example and try to see all of the steps discussed orchestrated and working together in action.
How well did we do it? Chapter 5 contains an evaluation of our work. We examine both quality and performance of Susi on an experimental basis. We identify problems within the current version and relate the observed issues to them.
How can we improve? The final chapter contains a discussion. After a short conclusion, we present possible future work as a list. Whenever possible, we also try to state solutions for the problems to be tackled in the future and give an estimation of how much effort is involved for each point.
Throughout this document, we try to establish the query penicillin scientist as what could be called a "running example". Whenever possible, we relate the current topics to this example and demonstrate their purpose by showing how they contribute to processing this query.
2. Scope of this Thesis
This chapter points out the scope of this thesis. First, we give a reason for choosing the general approach of full-text search with additional semantic categories and describe why we implement this as an index with semantic annotations. Secondly, we discuss why this thesis is limited to the full English Wikipedia. Finally, we explain why the Yago ontology has been chosen as the source of the semantic knowledge that is used to enhance the index we build.
2.1. Choice of the general Approach
Since the term semantic search is quite inaccurate, numerous approaches arise from research in this area. Usually, they even aim to achieve different goals. Some of them are described in detail in chapter 3. In this chapter, however, we discuss which approach is followed by this thesis and why we chose it.
Recall our goal as it was discussed in section 1.1. We want to solve queries of the form penicillin scientist, similar to the way various search engines solve such queries. That means that we want to receive a list of documents¹ that match our query words. Specifically, we want to receive documents that match the meaning of the words. As discussed earlier, a sentence that is a proper hit should contain the words penicillin and chemist, or penicillin and the name Alexander Fleming, and so on.
In related work, the authors distinguish between two kinds of search: navigational search, which is supposed to find a document that is relevant for the query, and research search, which is supposed to gather information on an object denoted by the query. However, there is also a third, interesting kind of search. Consider the example above again. It is entirely conceivable to understand the query in the sense of: "Which scientist has something to do with penicillin?". Likewise, imagine a query penicillin discovered scientist to find out who actually discovered penicillin. If we simply regard our search results and additionally distinguish which exact word occurrence led to a hit, i.e. the word "biologist", the name "Alexander Fleming" or the name "Howard Florey"², we can easily provide answers to this kind of search, too. Doubtless, enabling this kind of search can be really useful and hence it is the main aim of this thesis.

¹ For the sake of this argument it does not matter if the term "documents" refers to documents in a classical sense or if it only refers to some unit of words; a sentence, for instance.
² Howard Florey is a scientist involved in the extraction of penicillin as a drug.
Note that our goal is to exceed simply giving an answer to the query. The matching sentence in the document acts as evidence for that fact.
In order to reach this goal, we need to know about the semantic relations between certain words on the one hand, and on the other hand, to comprise the knowledge about co-occurrences of words in our document collection. In the example above, this refers to facts like "a chemist is a scientist" or "Alexander Fleming is a scientist", and to the fact that the word "penicillin" co-occurs with some sort of scientist in our documents. We use two sources of input: a document collection with lots of facts (see section 2.2 for why the English Wikipedia was chosen for that purpose) and an ontology that provides the semantic information we need (see section 2.3 for why Yago was chosen). Naturally, we need to combine this information in a way that enables really efficient queries, since the amount of data will be huge. Basically, we can distinguish two approaches:

1. Adding the information from the document collection's text to the ontology and keeping the ontology in a well-suited data structure that allows efficient queries.

2. Constructing an index on the words in the document collection, just like common search engines do, and somehow adding the additional knowledge from the ontology.
As pointed out before, no valuable information should be lost and we do not want to restrict ourselves to a certain domain. Hence, we end up with a huge amount of data.
The current version of the English Wikipedia, at the time of writing this thesis, has more than 1.4 billion word occurrences in about 130 million sentences. Assume the semantic ontology is represented in Rdf (Resource Description Framework)³, which is a very reasonable format that is used by many well-known ontologies, including Yago. If we really were to model co-occurrence with all words that possibly matter, we would end up with billions of triples. This is quite large considering that this year's Semantic Web challenge encourages researchers to present useful end-user applications that use their so-called "billion triples" data set with 3.2 billion triples.
Hence, we know for a fact that the resulting ontology would be of critical size.
Classical search engines and their indices, on the other hand, operate on even larger data sets without problems. Just think of the big web search engines as an extreme example. They create an index that allows looking up, within a few milliseconds, in which documents some word occurs. For queries with multiple words, they can intersect the document lists. See section 4.1.1 for more detailed information on this topic.

³ The Resource Description Framework (RDF) is a W3C specification for modelling facts. A single fact is represented as a subject-predicate-object triple, e.g. "albert_einstein - isA - physicist". Conceptually, multiple facts form a directed, labeled graph. See the W3C specification for further details.
Consequently, we want to use such an index and somehow incorporate knowledge from our ontology.
The information from the ontology can be added to the index in various ways. We use annotations that simply are additional words written to the index. Details are discussed in section 4.6. Anyway, all techniques that are around for search engines obviously still apply for such an augmented index; in particular prefix search, which is of high importance for our approach as discussed later.
Still, the additional annotations may possibly blow up the index size and hence spoil everything. The index blowup is one of the main issues addressed by this thesis, and the steps that provide a solution are outlined in chapter 4.

2.2. Limitation on the English Wikipedia
The previous section explains the need for a document collection. The English Wikipedia is great in both ways: rewarding to perform semantic search on, and comparatively easy to use.
First of all, it is a huge data set and full of interesting facts. After all, it is the largest encyclopedia ever assembled. Hence, being able to perform semantic search on this source is without a doubt really valuable. Secondly, from our perspective, several facts make it really easy to work with. Documents have a common structure that is a lot clearer than the structure of an arbitrary collection of web pages, for instance. There is always a title, headlines and paragraphs, and the whole Wikipedia is available as an XML dump. The size of this dump currently exceeds 26 GB, so we are able to deal with a decent collection size. Last but not least, entity recognition in full-text is immensely simplified for correctly maintained Wikipedia pages, since the first mention of some entity is always supposed to be linked to the entity's article. This fact allows simple heuristics to perform decently in recognizing entities, whereas more complex rules can be added to improve the recognition even further. More details on entity recognition are given in section 4.4.

2.3. Incorporation of the YAGO Database
In addition to the document collection as the main source of facts, we previously pointed out the need for an ontology that contains facts on the relations between words, e.g. "a chemist is a scientist", "a scientist is a person" or "a person is a living thing". This kind of knowledge is not easy to gather and requires an immense amount of work that only humans can do. Fortunately, there is a project called WordNet that contains exactly this kind of information. WordNet is described further in section 4.1. Since there is no alternative to WordNet at the time of writing this thesis, an ontology has to make use of it in order to suit our needs. Yago does fulfill that criterion. Additionally, Yago explicitly establishes the relation between WordNet's entities and Wikipedia categories and ultimately even contains "isA"-facts for entities, e.g. the fact that "Alexander Fleming is a microbiologist". The actual content and structure of Yago is described later in this document and not of importance here. Summing up, we want to make clear that Yago offers everything our search requires from an ontology, and its closeness to the English Wikipedia highly facilitates its incorporation.
3. Related Work
Considering semantic search in general, most research features search in ontologies in Rdf format. Usually this goes hand in hand with a query language. So in a way, all kinds of engines that enable a query language for Rdf graphs can be seen as related work; Rdf-3x is presented exemplarily. Related work with exactly the same goal as this thesis, on the other hand, is quite rare. The only publication we know of, Ester, is discussed in the beginning of this chapter. Subsequently, we regard another approach: in his PhD thesis, F. Suchanek describes the Yago ontology and its automatic construction and growth. Although this is no search engine at all, it, first of all, served as an important foundation for this thesis. Secondly, enabling semantic search by querying ontologies requires automated construction of the latter, and hence projects like Yago are indispensable for this approach and should be mentioned here.
3.1. ESTER
Of all existing work, Ester (Efficient Search on Text, Entities, and Relations) is the project that shares the most similarities with Susi. The fundamental understanding of semantic search, i.e. what queries look like and what results are desired, goes hand in hand with the goal of this thesis. On top of that, the combination of Yago and the English Wikipedia is also used for Ester. In order to accomplish its goal, Ester reduces the problem of semantic search to two operations: prefix search and joins. In the corresponding publication, the authors precisely carry out how Ester works in general. This is not repeated in this document. Instead we consider an example query, examine how Ester solves it, and do some foreshadowing on which aspects are handled differently in this thesis.
Consider our example query penicillin scientist. Ester uses an index with word-in-document occurrences that allows efficient prefix search. For penicillin*, the first part of the query, a simple look-up yields a list of pairs of words that start with the prefix penicillin and the documents in which they occur. For our example query, a slightly different query is actually processed, i.e. penicillin person:* (the reason for this extension is pointed out below), which returns the following list:

l1 = {(w, d) | w is a word that starts with the prefix "person:"; d is a list of documents in which w and the word "penicillin" occur}

The second part of the query is more interesting, since the additions for handling semantics come into play here. In general, Ester uses the following concept: occurrences of entities are recognized in the text, and special words, marking an occurrence of the recognized entity, are written to the index. Additionally, facts about each entity are located at a certain position in the document that describes the entity itself.
For semantic queries, the entity occurrences are joined with the entity's facts.
So now let us examine this principle in practice and see how the second part of our query is handled. First of all, Ester has to decide if there is a semantic class that fits the query word scientist. Therefore, it launches a query of the form baseclass:scientist*. In this case, the query would be successful and the class person would be found. This class is used as the level of detail on which joins are performed. On a side note, a wise choice of these base classes is necessary to avoid very large lists when performing the joins. In a next step, the query class:scientist - is_a - person:* (where - is a proximity operator, which relates to the particular position in the documents where the facts about entities are stored) yields another list of word-in-document pairs, such that:

l2 = {(w, d) | w is a word that starts with the prefix "person:"; d is a list of documents in which w occurs}

One of those w's will be a word like person:alexanderfleming, which likely in fact co-occurs with the word penicillin in some documents. Other w's will be words like person:aristotle that do not co-occur with penicillin. Likewise, l1 contains numerous persons that co-occur with the word penicillin but do not necessarily have to be scientists. Hence, the final step in solving the query is a join of the lists l1 and l2 that uses w as join attribute. The resulting list, finally, is the desired answer to the semantic query.
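To make the join step concrete, the following sketch (our own illustration under assumed data, not Ester's actual code) joins two word-sorted lists of (word, documents) pairs on the word attribute, keeping the co-occurrence documents from l1 as evidence:

# l1: persons co-occurring with "penicillin"; l2: persons that are scientists.
# Both lists are sorted by word, so a simple merge join suffices.
def join_on_word(l1, l2):
    result, i, j = [], 0, 0
    while i < len(l1) and j < len(l2):
        (w1, docs1), (w2, docs2) = l1[i], l2[j]
        if w1 == w2:
            result.append((w1, docs1))  # evidence documents come from l1
            i, j = i + 1, j + 1
        elif w1 < w2:
            i += 1
        else:
            j += 1
    return result

l1 = [("person:alexanderfleming", [3, 17]), ("person:aristotle", [99])]
l2 = [("person:alexanderfleming", [1]), ("person:howardflorey", [2])]
print(join_on_word(l1, l2))  # [('person:alexanderfleming', [3, 17])]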
The approach used in this thesis is quite similar to the ideas behind Ester. However, there are some significant differences. First of all, entity recognition in the text is done differently. Secondly, we regard sentences as documents instead of whole Wikipedia articles. This improves precision, as pointed out in chapter 5. Nevertheless, the most interesting difference is that we aim to reduce the problem even further: instead of using prefix search and joins, we show that we can solve this problem with prefix search only. This can easily be done by always writing the facts next to an occurrence of an entity. Unfortunately, this blows up the index size, and therefore Ester uses joins to avoid this phenomenon. The techniques carried out in chapter 4, however, can be used to significantly reduce the blowup and hence solve semantic queries with prefix search only, without letting the index size explode.
3.2. YAGO, LEILA and SOFIE
As we have already discussed at several points in this document, Yago is an ontology and not a search engine itself. However, huge parts of its construction are automated. This means that automatically retrieving facts from full-text, especially the English Wikipedia, is part of the research done for Yago. Hence, even though the goal is different, there is a relation between the work concerning the construction of Yago and this thesis.
In his PhD thesis, titled Automated Construction and Growth of a Large Ontology, the author presents the concepts behind Yago. Instead of going into detail too much, let us focus on the overall structure. There are three building blocks:

Yago is the name of the ontology that is constructed. This is also the part that is used by this thesis' work to enrich full-text search with additional semantic knowledge.

Leila is a tool that extracts information from text. It uses linguistic analyses to understand the structure of sentences rather than seeing them as a sequence of words only.

Sofie validates new facts, ensuring that they do not contradict existing facts. This ensures that only decent facts are added to Yago.
Using the public demo, it is easy to convince oneself that this composition of tools works pretty well, and hence Yago has been chosen as the ontology used by Susi.
As discussed elsewhere in this document, many of the concepts from this line of work can be examined in order to improve the entity recognition. However, the construction of an ontology like Yago is a totally different goal from what this thesis aims to achieve. First of all, the produced output obviously is different, and secondly, Yago limits itself to a fixed set of facts that are actually extracted and added to the ontology. For example, there are facts like isA or bornInYear. Other than that, this thesis' search engine aims for a lot more arbitrary facts with the flavor of hasSomethingToDoWith that relate to co-occurrences of words.
3.3. RDF-3X
The most popular query language for Rdf is Sparql. Since Rdf is the most common format to represent semantic ontologies, a Sparql engine can also be regarded as performing semantic search. Rdf-3x is such a Sparql engine that is able to process even complex queries on large ontologies in milliseconds. In the following we briefly look at the basic ideas behind Rdf-3x. Those ideas are based on cleverly constructed indices which might be useful for any kind of search engine.
Recall that Rdf represents facts as subject-predicate-object triples. Sparql supports conjunctions of triple patterns, which correspond to select-project-join queries in a relational engine. In order to solve a Sparql query, it is therefore necessary to support retrieval of triples with zero to two variable attributes and joins between Rdf triples. So first of all, the internal storage of the triples is an issue. While some approaches favor a table for each distinct predicate, Rdf-3x uses a single table with three columns for subject, predicate and object. For that table, Rdf-3x maintains indices for each possible permutation of subject, predicate and object. This enables fast look-up of patterns by simple range scans on the correct index, whereas the ordering of the result is already known depending on the index used. Consequently, joins on any join attribute can also be performed efficiently.
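As a rough illustration of the permutation-index idea (our own toy rendering, not Rdf-3x's actual storage engine, which works on compressed, id-encoded triples), keeping the triple table sorted once per permutation turns every pattern with fixed components into a range scan:

from bisect import bisect_left, bisect_right
from itertools import permutations

triples = [
    ("albert_einstein", "isA", "physicist"),
    ("alexander_fleming", "isA", "biologist"),
    ("albert_einstein", "bornInYear", "1879"),
]

# one sorted copy of the reordered triples per permutation of (S, P, O)
indices = {perm: sorted(tuple(t[i] for i in perm) for t in triples)
           for perm in permutations(range(3))}

def scan(perm, prefix):
    # range scan: all reordered triples that start with the fixed prefix
    idx = indices[perm]
    lo = bisect_left(idx, prefix)
    hi = bisect_right(idx, prefix + ("\uffff",) * (3 - len(prefix)))
    return idx[lo:hi]

# pattern (?s, isA, ?o): use the predicate-first index (P, S, O)
print(scan((1, 0, 2), ("isA",)))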
Apart from that, Rdf-3x speeds up complex queries with multiple joins by processing them in a clever way. First of all, statistical histograms help to predict the result size of a join. This allows ordering joins in a way that they can be performed more efficiently. Secondly, the authors present ways how Rdf-3x speeds up joins even further, using methods of efficient list intersection that remember gaps in sorted join candidates.
In summary, we can see that Rdf-3x enables very fast search in large ontologies by processing Sparql queries. But as discussed earlier, this thesis aims for search also in full text and not only in an ontology. Additionally, Sparql is a query language that is not as intuitive as a query to a search engine ideally is. Still, some concepts, like the join heuristics for instance and the general idea of building multiple indices, are interesting and can probably be useful for semantic search in full text as well.
3.4. DBpedia
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. It is an ontology in Rdfs format that combines many concepts from other ontologies or extracts them directly from Wikipedia. It is very similar to Yago since it uses the class and type hierarchy from it. Many facts in DBpedia are obtained by extraction from the info boxes on Wikipedia pages.
There is also an application that uses DBpedia in order to offer a faceted Wikipedia search. While this application provides great search on the relations from the DBpedia ontology with a sophisticated user interface, there are some notable differences to our work. First of all, the faceted search does not provide evidence: the facts are directly taken from the ontology and their origin is no longer known. Therefore, hits are the entities' Wikipedia articles and not all pages that contain a suitable fact. Secondly, semantic search is limited to the relations offered by the ontology. The only full-text that can be searched is the short abstract in the beginning of each entity's Wikipedia article. Additionally, this search is limited to the abstracts of the entities that are not filtered out by some semantic facet. Consequently, a query for penicillin faceted by the type scientist returns much fewer hits than Susi does for the same query, because all facts where a scientist is mentioned with penicillin outside of the abstract, and all facts in a document other than the one for the scientist itself, are missed. Finally, the public demo of the application appears to have response times of tens of seconds for many queries. This may be acceptable for some use-cases, but is not what we want to focus on. However, the great user interface is an inspiration for future work.
4. Susi: Wikipedia Search Using Semantic Index Annotations

In the previous chapters, we have seen a motivation for semantic search, defined what exactly we want to achieve, and had a look at research that is already present in this area. This chapter now covers the way that actually solves our problem. We present a system Susi (Wikipedia Search Using Semantic Index Annotations) that enables semantic search on a combination of the full-text of the English Wikipedia and the Yago ontology. We compare possible choices whenever there are multiple available, especially when it comes to supplying entities with semantic facts. We can distinguish several methods to add the desired information to the search index. By actually implementing them, we are able to provide concrete numbers for their application on the full English Wikipedia database. Hence, we can easily and precisely compare them.
However, at first, we want to recall necessary foundations. Afterward, we discuss all steps towards a semantic search engine individually. Finally, we reconsider our example query penicillin scientist and reconstruct how it is actually solved. This demonstrates the worth of every step presented previously.
4.1. Foundations

This first section briefly recalls the foundations for Susi. We only want to name the important building blocks behind the underlying search engine and describe the general idea behind those concepts. For more details please refer to the material in the references.
First of all, we have a look at the Hyb index that allows super-fast prefix search. Secondly, we regard WordNet, which provides information on English words and their relations. A physicist, for example, is a scientist, a person, an organism and more. The effort put into the creation of WordNet is crucial for Susi. This section finally ends with a short description of CompleteSearch, a fully operational search engine that features the Hyb index and is the engine behind Susi.
4.1.1. Prefix Search and the HYB Index
The Hyb index is an index for search engines that allows fast prefix search. Thus, it enables auto-completion search as known from various applications. But prefix search is also important for other features; for example, synonyms can be supported. As an example consider the words novice and beginner. Instead of simply writing the words themselves, one can write something like syngrp1:novice and syngrp1:beginner to the index. A prefix query for syngrp1:* should now find occurrences of any of the words. Likewise, for semantic search it is very useful to have a powerful tool like prefix search. E.g., we have seen in chapter 3 that Ester is built around two operations, one of them being prefix search. On top of that, Susi even reduces semantic search to prefix search only.
However, a big challenge for prefix search is that multiple words match a query, and it is therefore required to merge the document lists of all those words in order to provide the result. The Hyb index stores precomputed doc lists in a very clever way. Just like an encyclopedia may be split across multiple books, the Hyb index divides the possible prefixes into multiple blocks and stores each of them independently. Depending on the query, multiple blocks can be read and united, or the doc lists read can be filtered so that they fit a more precise query. Note that sufficiently small block sizes will lead to negligible costs for filtering.
Table 4.1.: HYB example: list of word-document occurrences per block.
Table 4.1 illustrates this principle of dividing the index into several blocks. Note that this is only an example; in fact, numerical IDs are stored for words and documents, for obvious efficiency reasons. The division into blocks overcomes the problem that storing all those precomputed lists leads to a much larger index. The authors of the original publication mathematically show that the Hyb index does not use more space than a conventional inverted index. Hence, the Hyb index is a very powerful way to enable prefix search in an efficient way.
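A toy example of the principle (an assumed, much simplified layout; the real Hyb index additionally compresses its blocks): each block stores one merged posting list for a whole range of words, and a prefix query reads the covering block and filters it down to the matching words:

# word-range blocks with precomputed (word, document) posting lists
blocks = {
    ("syn", "t"): [("syngrp1:beginner", 4), ("syngrp1:novice", 7),
                   ("system", 7)],
}

def prefix_query(prefix):
    for (low, high), postings in blocks.items():
        if low <= prefix < high:  # this block covers the queried prefix
            return [(w, d) for (w, d) in postings if w.startswith(prefix)]
    return []

print(prefix_query("syngrp1:"))  # both synonym words with their documents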
4.1.2. WordNet

WordNet is a semantic lexicon for the English language, developed at the Cognitive Science Laboratory of Princeton University. It associates words with certain classes and provides relations such as subClassOf. This relation is exactly what we need for our semantic search, since it contains the fact that a scientist also is a person. Conceptually, the subClassOf relation in WordNet spans a directed acyclic graph with a single root node called entity. Although we only make indirect use of WordNet through Yago, most of the interesting facts for our work are actually inherited from WordNet, which therefore deserves to be mentioned here in addition to Yago.
4.1.3. CompleteSearch

CompleteSearch is a fully functional search engine that supports a variety of features. It uses the Hyb index, which has been discussed previously, and implements all necessary components to allow searching by end-users. There are multiple research projects on different areas of search engines that blend into CompleteSearch. The same thing is true for Susi. CompleteSearch's modular architecture allows substituting certain components and keeping the rest. Hence, we can provide a special index for the English Wikipedia (plus some minor modifications to the user interface) that will enable semantic search, while no changes are necessary to the basic operations of the search engine itself.
4.2. Enabling Semantic Wikipedia Search
As discussed above, Susi is built into the CompleteSearch engine. Actually, the semantic Wikipedia is nothing else but a database built from a document collection to search on. Semantic information is added during construction and treated as if it was part of the original documents. For CompleteSearch, such a database consists of three parts: the search index, the vocabulary and a structured version of the original text that allows displaying excerpts for all hits. Since the vocabulary can directly be found in the index, the interesting parts are the construction of the file for the excerpts and especially of the search index itself.
4.2.1. Index Construction
Recall the Hyb index that has been discussed in the section on necessary foundations. Although the Hyb index structures its data into several blocks, the smallest units of information remain postings. Such a posting is basically a word-in-document pair, while CompleteSearch has additional support for scores and positions. So while the actual index consists of multiple blocks, the input to index construction is just a set of postings. In the following, we can imagine the search index as a set of lines. Note that in practice, word-ids are used instead of the actual words and the input is passed in binary format. This is absolutely crucial to performance, but for the sake of all arguments and explanations we can always imagine the readable format from table 4.2. For Susi, we want to see each sentence (or row in a list or table) as a document. This leads to the following behavior of the search engine: a hit for all query words is a sentence that contains all of these words, supposing this is a fact in the Wikipedia that the query attempts to find. Actually, using sentences is just a very basic heuristic for determining the scope of facts. For future work we might consider trying to find better units by linguistic analysis; see chapter 6 for further remarks.

Table 4.2.: The index as a set of postings.
Positions momentarily only play a minor role, and scores are currently even unused. However, they may soon be put to use to realize new features.
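A readable stand-in for this input (with hypothetical field names; recall that the real input uses word-ids and a binary format) could look as follows, with the sentence id acting as the document id:

from collections import namedtuple

Posting = namedtuple("Posting", ["word", "document", "score", "position"])

postings = [
    Posting("alexander",  document=1, score=0, position=1),
    Posting("fleming",    document=1, score=0, position=2),
    Posting("discovered", document=2, score=0, position=2),
    Posting("penicillin", document=2, score=0, position=3),
]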
4.2.2. Enabling Excerpts
In addition to the index, we also need to produce a file that contains the necessary information for providing excerpts for each hit. For the basic search features, this is quite easy. We simply write a line with <doc-id> <TAB> u:<Wikipedia URL> <TAB> t:<title> <TAB> H:<original text> into a file. The CompleteSearch engine is then able to display excerpts that belong to a hit in a nice format, provide a link to the corresponding Wikipedia article, and also highlight the query words.
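A minimal sketch of producing such a file (field layout as described above; the exact escaping rules and file naming of CompleteSearch are assumptions here):

def write_excerpt_line(out, doc_id, url, title, text):
    # one line per document: id, URL, title, and the original text
    out.write("\t".join([str(doc_id), "u:" + url, "t:" + title,
                         "H:" + text]) + "\n")

with open("wikipedia.docs", "w") as out:
    write_excerpt_line(out, 1,
                       "http://en.wikipedia.org/wiki/Alexander_Fleming",
                       "Alexander Fleming",
                       "Alexander Fleming was a bacteriologist.")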
However, Susi contains artificial words that carry the semantic additions. Highlighting those requires special effort, because the artificial words should not be visible and the corresponding entities should be highlighted instead. Fortunately, the CompleteSearch excerpt generator already supports special rules for artificial words. This is best explained with an example. Consider the string A B C D, where A, B and C are preceded by a marker character that makes them invisible in the excerpts, and D is preceded by a marker that causes it to be displayed in their place. If a query now matches any of the first three words, the D will be highlighted instead by the excerpt generator. So for Susi, we can simply hide our artificial words for the semantics and highlight the corresponding entity instead.
The following screenshot is taken from the current version of Susi. It shows a query for the entity Alistair Sinclair that is supposed to co-occur with the words geometric and algorithm.
Figure 4.1.: Highlighting Alistair Sinclair, geometric and algorithm in the excerpts.
Not only does figure 4.1 demonstrate how well the highlighting of words like "his" works, it also shows that Susi properly identifies entities. As a fun fact, we also get to see that even Wikipedia authors seem to like to copy from Wikipedia.
Both the index and the file with the excerpt information are constructed while parsing the Wikipedia dump. The process takes several files as input which are constructed beforehand. For example, we use a redirect map that maps links to redirecting Wikipedia pages to a distinct entity name for every entity, a list of pronouns (words which will reference some entity), and many more. Apart from the information for parsing, the input also contains the data extracted from Yago. The format of this data is discussed later in this section.
The software that builds the index and the excerpts file consists of three classes. The structure is very simple but still applies a clear separation of concerns:

Figure 4.2.: The Semantic Wikipedia Parser
Figure 4.2 illustrates this concept. The class SemanticWikipediaParser (SWP) handles parsing the markup and recognizes words, sections, links (to entities), tables, math and more. According to what is found, the SemanticWikipediaParser calls methods of the EntityAwareWriter (EAW).
Deciding what is actually written to the output lies in the responsibility of the EntityAwareWriter. It provides methods that handle entity occurrences through links, redirects, entities with renaming, beginnings of new documents, new sentences or new sections, to give some examples. The EAW then reacts accordingly. Additionally, it also decides on the format of the things written to the index and the excerpts file. Just think of the rules for highlighting words in the text that reference an entity in place of it. However, the exact things to write for some word depend on the context. For example, reacting to the word "he" will be different in the beginning of a document about tomatoes, where no person has been mentioned yet, and in a document about Albert Einstein, where the pronouns will likely reference a person. The word "apple" will require different things written to the index if it is found in a document about Apple Computers or in an article on Isaac Newton where an apple falling from a tree is mentioned.
This context information is stored in the EntityRepository (ER). The EntityRepository knows entities that previously occurred in the same section, the document entity, the last entity that occurred at any point in time, and entities that relate to "active" section headings (the scope depends on settings passed to the parser call). Whenever the EAW is supposed to write some word, it can query the EntityRepository to determine if the word might refer to an entity and which entity this would be.
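The following heavily condensed sketch shows this separation of concerns (class names as in figure 4.2; the fields and method names are abbreviations of ours, not the actual interfaces):

class EntityRepository:
    # context: which entities are currently "near"
    def __init__(self):
        self.page_entity = None
        self.recent = {}        # word from an entity title -> entity
        self.last_entity = None

class EntityAwareWriter:
    # decides what is actually written to index and excerpts file
    def __init__(self, repository):
        self.repository = repository

    def on_entity_link(self, entity):
        self.repository.last_entity = entity
        for word in entity.lower().split("_"):
            self.repository.recent[word] = entity
        # ... write the entity word and its annotations to the index

class SemanticWikipediaParser:
    # recognizes markup constructs and calls the writer's methods
    def __init__(self, writer):
        self.writer = writer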
4.3. Wikipedia Markup Parsing
Building a search index for the English Wikipedia obviously requires access to the content of the Wikipedia in a proper format. Fortunately, dumps are being made and provided for download on a regular basis. These dumps are available in an XML format that sticks to a reasonably simple schema. The content of every Wikipedia page can be found in a single XML tag, formatted directly in Wikipedia markup. Hence, parsing the Wikipedia dump consists of two steps: parsing the XML for the page tags and subsequently parsing the markup of each page. While there is nothing special to parsing the Wikipedia XML, maybe except for the fact that using a Sax strategy is absolutely mandatory due to the sheer size of the data, parsing the markup is special and therefore discussed in the following.
There are some challenges to parsing the Wikipedia markup:

1. Wikipedia accepts messy pages. An error in the markup will result in some formatting issues or a link not working correctly, but the page will still be displayed in general. Sometimes, minor mistakes will not even have an effect at all. Consequently, there are many pages with flawed markup in the dump, and a parser may not rely on opening tags to eventually get closed, and so on.

2. Wikipedia is nice to its writers. There are many ways of doing the same thing. For example, there are special templates for tables that are usually used, but HTML tables work as well. Thus, a perfect parser would have to take all possibilities into account.
Due to the challenges listed above, the parser written for Susi is not fully mature yet. However, it properly processes almost everything. The remaining difficulties will have to be addressed in the future. The parser follows a very simple principle: avoid going through the same passage more than once. It usually reacts to the start of some construct and behaves accordingly. There are lots of possible elements that can be found, and we will not discuss all of them here, but examine a few examples instead:

Internal Links may be the most important construct to parse. These internal links point to some other Wikipedia page and thus to the entity that is described on that page. Internal links are the easiest and yet most reliable way to recognize entities. Usually they consist of the entity title in double brackets, e.g. [[Albert Einstein]], but variations are possible. [[Albert Einstein|Einstein]] or [[Albert Einstein|That physics guy]] can be used to reference the same entity but display only "Einstein" or "That physics guy" instead. These two were just chosen as examples; more similar variations are possible and a parser has to react accordingly. A small sketch of this recognition follows after this list.
Sections and Section Headlines have to be found because they are important for entity recognition; see the section on entity recognition for further details. Apart from that, we can supply the parser with a list of section headlines that indicate sections that will always be skipped during parsing. One example is "References", which is usually ignored.
Math and similar Structures are ignored altogether. Usually they do not contain any meaningful text that could be of use for Susi.
Wikipedia Templates are usually skipped since there are so many of them and even user-defined ones are possible. However, some really important ones, like tables, have to be parsed in order not to lose valuable information.
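As announced above, here is a rough sketch of recognizing internal links, assuming only the common [[Target]] and [[Target|display text]] forms (the actual parser handles more variations as well as broken markup):

import re

LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")

def parse_links(markup):
    # yield (entity, displayed text) for every internal link found
    for match in LINK.finditer(markup):
        target, display = match.group(1), match.group(2)
        yield target, display if display else target

text = "[[Albert Einstein]] and [[Albert Einstein|That physics guy]]"
print(list(parse_links(text)))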
While those examples and more cases are already implemented in the parser, there are still others to be added in the future. Those that might have been of use at some point already are listed in chapter 6.

4.4. Entity Recognition
Semantic search is about finding facts, facts about entities. So if we are looking at a system like Susi, where semantic search is performed on a combination of ontologies and full-text, entity recognition in the full-text is obviously necessary and even a key aspect. In general, this is a really challenging problem, and solutions usually involve linguistic analysis of the text (done by Sofie, which is supposed to find facts for Yago) or analysis based on statistical co-occurrence (as has been tried for Ester). However, entity recognition is much easier on Wikipedia pages. At least if the pages live up to Wikipedia's standards, the first occurrence of any entity should always be linked to the page describing that entity. This allows for simple, yet excellently precise heuristics that exploit the structure of Wikipedia pages.
The entity recognition performed for Susi follows exactly this idea.
• First of all, direct links to Wikipedia pages are always occurrences of the linked entities. Technically we have to take redirecting Wikipedia pages into account (e.g. the page Einstein redirects to Albert_Einstein), but a map with all redirecting pages is easily constructed by parsing the Wikipedia dump once.
• Secondly, further occurrences of an entity usually are not linked anymore. Additionally, those occurrences do not necessarily match the linked entity entirely. The most common example revolves around persons that are mentioned by their last name only. Consider a passage like "Einstein thought that...". If the entity Albert_Einstein has just been mentioned, it is clear that Einstein in fact refers to that entity.
• Finally, there are expressions referring to another entity. While those expressions are called anaphora in linguistics and their strict definition depends on theories, they are again best understood by some example. The Wikipedia article on Albert Einstein contains the following excerpt: "He received the 1921 Nobel Prize in Physics for his services to Theoretical Physics.". Both he and his reference the entity. Anaphora handled by Susi are exactly those pronouns.
Now that we have seen the different kinds of words that may represent entity occurrences, we still have to associate them with the correct entities whenever we discover these words during parsing. Obviously, the correct choice always depends on the context of the current page. Therefore, we keep track of several entities when building the index for Susi:

• Entities inside the current section. This case is quite intuitive. If an entity has just been mentioned, it is a possible match. This can be seen in the following sentence: Although Einstein had early speech difficulties, he was a top student in elementary school. Obviously, the word he references the Einstein who has just recently been mentioned. Therefore we keep all entity occurrences in a map which has all words from their title as separate keys and the entity as value.
• The page entity. A Wikipedia page is always dedicated to one certain entity. This entity is always referenced by other expressions throughout the whole document.
• Entities referenced by section headlines. Just like the document entity, an entity in the headline of the current section can also be seen as present everywhere in the section. While this feature has been implemented for Susi, practice has shown that it is not really helpful, and therefore it is currently deactivated, or to be precise, it has been set to a level where only the document entity is handled in a special way.
These cases fill the map of currently "near" entities. This map is cleared whenever a new section is discovered (the document entity remains in the map in that case) or if a new document starts. Additionally, we always remember the last entity that occurred separately, because it may be more likely that this particular entity is referenced. This information can be used to associate word occurrences with entities and thus recognize entities in the text. In a nutshell, we can describe the strategy used by Susi for this association by the following algorithm:

Algorithm 1. For every word occurrence, check if it should be treated as an anaphora. For pronouns, assume that he/his/him/she/her/they and so on will always reference persons, and it/its and so on will always reference non-person entities. Decide if the document entity is suited. Alternatively, check if the last entity fits the type of the pronoun. For other words that exceed a minimum word length, check if any recent entity contains that word.
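A minimal rendering of Algorithm 1 (the pronoun sets, the length threshold and the is_person check are our assumptions for illustration, not the production values):

PERSON_PRONOUNS = {"he", "his", "him", "she", "her", "they"}
THING_PRONOUNS = {"it", "its"}
MIN_WORD_LENGTH = 4

def resolve(word, page_entity, last_entity, recent, is_person):
    # recent maps words from recently seen entity titles to entities;
    # is_person decides, e.g. via Yago classes, if an entity is a person
    w = word.lower()
    if w in PERSON_PRONOUNS or w in THING_PRONOUNS:
        wants_person = w in PERSON_PRONOUNS
        for candidate in (page_entity, last_entity):
            if candidate is not None and is_person(candidate) == wants_person:
                return candidate
        return None
    if len(w) >= MIN_WORD_LENGTH:
        return recent.get(w)    # e.g. "einstein" -> Albert_Einstein
    return None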

This heuristic works surprisingly well, but there still are some problems that would require more sophisticated entity recognition strategies. Those are discussed in the section on future work. Chapter 5, on the evaluation of Susi, also examines which failures can be blamed on entity recognition; in fact, there are not many such failures, even though such a simple heuristic is used.
4.5. Incorporation of YAGO
This section discusses the main challenge during the development of Susi, i.e. associating entities with their semantic classes. For example, we want our search to know that Albert Einstein is a physicist and thus return occurrences of Albert Einstein as hits of queries for physicists. As we have discussed in previous chapters, Yago contains this kind of knowledge. It associates Wikipedia pages, and hence entities, with their semantic classes. Regarding these facts from Yago, we distinguish two categories:

Definition 1. Let Classes denote all semantic classes from Yago (which inherits them from WordNet) that some entity belongs to. They may range from interesting, specific ones like "scientist" to very generic ones like "entity", which is a class that every entity belongs to trivially.
However, these classes can play a special role for an entity:

Definition 2. Leaves is a term we use to describe special facts. Consider the fact that Albert Einstein is a physicist, again. This fact is directly contained in Yago. However, being a physicist implies several other things: we know that an entity that is a physicist also has to be a scientist, a person and so on. A leaf is a class that is directly associated with the entity and not implied by some other class.
Yago provides facts about the leaves in a designated file that directly associates them with entities. Therefore it is easy to use this kind of knowledge, although we have to consider special cases where Yago lists a leaf for some entity that is already implied by another one. We will address this issue later on.
Concerning the implied classes, on the other hand, Yago provides a subclass relation called subClassOf. This relation is crucial for the ideas discussed in this section.
We therefore introduce operators that should be intuitive and increase readability:

Definition 3. We write "⊂" (read: "subclass of") when the following holds: (a ⊂ b) ⇔ (a, b) ∈ subClassOf. For example, we can write scientist ⊂ person.

Obviously, a subclass relation should be transitive. Accordingly, what we called implied facts earlier actually means that we require the set of an entity's classes to be closed under the subclass relation.
Definition 4. We write "⊂c" to denote that one class is a subclass of another, regardless of whether they are directly in relation. Thus, we can use the following recursive definition: (a ⊂c b) ⇔ (a ⊂ b) ∨ ∃a′. (a ⊂ a′) ∧ (a′ ⊂c b). This allows writing something like scientist ⊂c entity.
Technically, we can easily construct a closed set of an entity's classes by the following algorithm:

Algorithm 2. Start with the set of leaves for that entity, which is directly available from Yago. Continuously add all direct parent classes to the set until nothing new is added.
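Algorithm 2 directly translates into a straightforward fixed-point computation (a sketch with a toy subClassOf relation; the production version produces the special format described in the next section):

def close_under_subclass(leaves, parents):
    # parents maps a class to its direct superclasses (subClassOf)
    closed, frontier = set(leaves), list(leaves)
    while frontier:
        cls = frontier.pop()
        for parent in parents.get(cls, ()):
            if parent not in closed:
                closed.add(parent)
                frontier.append(parent)
    return closed

parents = {"physicist": ["scientist"], "scientist": ["person"],
           "person": ["organism"], "organism": ["entity"]}
print(close_under_subclass({"physicist"}, parents))
# {'physicist', 'scientist', 'person', 'organism', 'entity'}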

In practice, we use a more complicated algorithm because we require the result to be of a very special format. This format can be used to annotate entities with their semantic classes in an efficient way. The next section presents how this is done in detail.
4.6. Entity Annotations
In the previous section we have seen what kinds of facts are supposed to be extracted from Yago and have to be made available for search queries. Somehow those classes need to be associated with the entities they describe and with the occurrences of those entities in the text. This section presents different ways of doing this and explains which approach has been chosen for Susi.
4.6.1. Naive Annotation
The simplest way is writing all facts next to an entity. If we imagine that every occurrence of a scientist in the text is annotated with the word scientist, a query for that word will find all occurrences of Albert Einstein amongst others. If we look for a simple way to find out which exact scientist is found in the hit for our query, we can just add the entity's name after each fact. This means we write a word scientist:alberteinstein instead. A prefix query scientist:* will consequently deliver the entities that are scientists as the last part of the query completions with hits.
This simple idea works perfectly fine, but unfortunately it requires lots of additional words to be written to the index. This causes a blow-up of the index that is not acceptable for really large document collections like the whole English Wikipedia (concrete numbers are presented later in this section). Thus, more clever ways are required to achieve the same thing with less space consumption.
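For illustration, naive annotation amounts to emitting one extra index word per class in the closed set of an entity (a sketch; the actual word normalization rules are simplified here):

def naive_annotations(entity, classes):
    # one artificial word per class, ending with the entity name
    name = entity.replace("_", "").lower()
    return [cls + ":" + name for cls in classes]

print(naive_annotations("Albert_Einstein",
                        ["physicist", "scientist", "person"]))
# ['physicist:alberteinstein', 'scientist:alberteinstein',
#  'person:alberteinstein']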
4.6.2. ESTER style
The research work for Ester features a neat idea to provide the semantic information with very low space consumption. As mentioned in the chapter on related work, Ester writes all the facts available for an entity only once, in the document describing that entity. Entity occurrences somewhere in the text are annotated with a single join attribute instead. Queries containing semantic categories, like penicillin scientist, are then processed using joins.
This strategy is really space efficient, but the additional join can be a costly operation that may slow down the query processing times.
4.6.3. Path-Style Annotations
The way of annotating entities with semantic facts that is actually used in the index of Susi differs from both and is designed to be superior. The idea is based on the observation that many facts implicitly belong to an entity that is of a certain type.
Just think of the entity Albert Einstein who is, amongst others, a physicist. This single fact implies that he also is a scientist, a person, an organism and many other things. A great example is a vegetarian, which is implicitly all of: eater, consumer, user, person, organism, living thing, whole, object, physical entity, entity. Obviously, a single semantic class can carry lots of information about an entity.
In Yago, the subclass relation spans a directed acyclic graph. However, this graph is very tree-like, since the only violation of the tree properties is that there are classes that have two or more independent super-classes. As an example, consider the class woman. A woman is both an adult and a female. However, the classes female and adult are not subclasses of each other. For the sake of the following arguments, imagine the subclass relation to span a "broken" tree. Apart from the fact that some nodes have more than one parent, the model of a tree fits well. All types are really subclasses of some other type, except for entity. This is intended by WordNet, but due to bugs in Yago, some type entries seem to be missing in the current version (e.g. the type health professional). Their subclasses (e.g. doctor) may end up without a parent. In this case, we simply add an entry to the subClassOf relation that makes the orphaned classes direct subclasses of the root class entity. This design decision leads to the following fact:

Fact 1. The class entity is the only class that is not a subclass of some other class.
Thus, all other classes have super-classes and the chains eventually end at entity.

We can make some statements on the subclass tree. Let a and b be classes and thus nodes in the tree:
• a is a parent of b iff b ⊂ a; respectively, b is a child of a.
• a is an ancestor of b iff b ⊂c a; respectively, b is a descendant of a.
The following figure now shows a tiny excerpt of the whole subclass tree:

Figure 4.3.: YAGO Paths Example
The idea of path-style entity annotations is to write, for each class an entity belongs to, all paths from the root to the leaves. Consequently, it is no longer necessary to write all those facts that are implied by the leaf as additional words. Consider figure 4.3 and let us say we recognize an entity that is an organism (e.g. Bacillus Thuringiensis, to pick one at random that is not associated with a subclass of organism, like persons or animals are). We now write the additional word :e:entity:physicalentity:object:whole:livingthing:organism:bacillusthuringiensis instead of all those artificial words like organism:bacillusthuringiensis or livingthing:bacillusthuringiensis and so on. Examples involving persons lead to even longer paths, so that it is hard to actually draw the tree for them. Assume we discover an occurrence of the entity Albert Einstein during parsing. If we now want to cover the fact that he is a physicist, we can write something like :e:entity:physicalentity:object:whole:livingthing:organism:person:scientist:physicist:alberteinstein to the index.
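A minimal sketch of how such a path word could be assembled; the function name path_word is hypothetical, and the actual parser may differ in details:

def path_word(path, entity):
    """Join a root-to-leaf path and an entity name into one index word.

    path   -- list of classes from the root to a leaf
    entity -- normalized entity name, e.g. "bacillusthuringiensis"
    """
    return ":e:" + ":".join(path) + ":" + entity

print(path_word(["entity", "physicalentity", "object", "whole",
                 "livingthing", "organism"], "bacillusthuringiensis"))
# -> :e:entity:physicalentity:object:whole:livingthing:organism:bacillusthuringiensis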
For now, ignore the prefix :e: of the long word, which is only there for technical reasons. More interestingly, we can now use prefix search to find this word, which ends with alberteinstein, as a completion. Doing this, we are free to choose which of the contained types we are looking for. The word is a viable completion for the queries :e:entity:* or :e:entity:physicalentity:object:whole:livingthing:organism:person:* and so on. The viability of this idea, in the sense that it really does save a significant amount of space compared to the naive way, has been confirmed experimentally. Table 4.3 reports, for the naive annotations (facts) and for the paths, the number of (entity, word) pairs, the number of entities, and the average, median and maximum number of facts per entity.

Table 4.3.: Space Consumption of Entity Annotations

a) We use the term leaf to denote an entity's fact that has no subclasses that can also be associated with that entity.
b) This number can be less than the number of leaves due to the removal of redundant facts.
c) Some entities are merged due to case insensitivity. See future work on plans to fix this.
d) The number of entities for the naive style is lower, since entities that can only be associated with the generic fact entity are eliminated here, while they may or may not be contained in the paths. This does not matter during index construction, because the same thing is written to the index for entities without facts from Yago or entities with the trivial fact only.
e) This number can be less than the number of leaves due to the removal of redundant facts.
f) Factor of size increase compared to not annotating entities at all. Compares the number of index items instead of the index file size.
g) Building an index with the leaves only makes no sense, since we would then no longer be able to process queries for implied classes.
Table 4.3 compares the path-style annotations to the naive way. Note that there are actually fewer paths per entity than there are leaves. This is related to the fact that leaves are taken from Yago, which directly derives them from Wikipedia categories. This may lead to redundancy. The most common example is probably the class person, which is present as a leaf for many entities while it is implied by the class that reflects the entity's occupation.
Since we now have to write far fewer words next to each entity compared to the naive way, the index blowup caused by adding the semantic facts is negligible. On top of that, the HYB index enables very fast queries. Query processing times constitute one of the aspects that are evaluated in chapter 5. Now that we have seen that this idea can effectively decrease the size of the index, we want to confirm that this strategy delivers correct results. Therefore we need to make the things we have seen in the previous illustration formal.
Definition 5. The operator closure(class) denotes the smallest set that exhibits deductive closure under the subclass relation and contains class. This is simply the class itself and all implied classes. Formally this means that:
closure(c) := {c} ∪ {x | c ⊂c x}.
For example, it holds that closure(scientist) = {scientist, person, organism, . . . , entity}.
Obviously, this definition can be used to characterize the set of classes of an entity e by building the union over all closures of its leaves.
Definition 6. The classes of an entity e correspond to the union of the closed sets containing its leaves:
classes(e) = ⋃_{c ∈ leaves(e)} closure(c).
For example, classes(alberteinstein) = {vegetarian, scientist, person, . . . , entity}.
This set of classes should now be represented by the long words we write to the index. These words correspond to paths in the tree that is spanned by the subclass relation. Thus, we also use the term path to describe them.
Definition 7. A path is a word of the following form: It starts at the universal class entity and reaches to an arbitrary class that is some entity's leaf. We write a path as a sequence of classes separated by colons. An example path could be entity:physicalentity:object:whole:livingthing:organism:person.
The operator paths(class) denotes all possible paths from the universal root class entity to the class that serves as argument for the operator. There may be multiple paths to a single class, because our tree does not really fulfill the tree properties and there are those few counterexamples where a node has more than one parent.
Formally, we want that:
paths(c) = {c1 : c2 : · · · : cn | cn = c ∧ c1 = entity ∧ ∀i < n . (ci+1 ⊂ ci)}.
As the classes of a path p, we simply collect all classes that occur along it. We write classes(p) = {ci | c1 : c2 : · · · : cn = p}.
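The following Python sketch enumerates paths(c) by recursive ascent through the direct super-classes; parents is an invented fragment of the subclass DAG, reusing the woman example from above, and the names are ours rather than the actual Susi code:

def paths_to(c, parents, root="entity"):
    """All paths from the root class to class c, written root-first.

    parents -- dict mapping a class to the set of its direct superclasses;
               the subclass DAG is assumed to be acyclic.
    """
    if c == root:
        return [[root]]
    result = []
    for p in parents.get(c, ()):
        for path in paths_to(p, parents, root):
            result.append(path + [c])
    return result

parents = {"woman": {"adult", "female"}, "adult": {"person"},
           "female": {"person"}, "person": {"entity"}}
for path in paths_to("woman", parents):
    print(":".join(path))
# prints entity:person:adult:woman and entity:person:female:woman
# (order may vary)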
The previous definitions can be used to show that the path-style annotations deliver the correct result.

Theorem 1. The path-style annotations deliver the correct result.

The desired annotations for an entity e are exactly all of its classes. According to definition 6, the set of those classes corresponds to the union of all closed sets containing a leaf of e. The path annotations written for e are exactly all paths leading to leaves of the entity. Thus, we can show the following lemma instead, to prove correctness of the theorem:

Lemma 1. closure(c) = classes(paths(c)). The set of classes found along all paths to a class c is exactly the smallest set containing c that is closed under the subclass relation (closure(c)).
Proof. We show that closure(c) = classes(paths(c)) by showing containment between the two sets in both directions.

"⊇": This direction follows directly from the definition of a path. Consider an arbitrary path p = c1 : c2 : · · · : cn. The definition enforces that ∀i < n . ci+1 ⊂ ci. Therefore, for every i, either ci = cn or cn ⊂c ci, and hence ∀i . ci ∈ closure(cn). Consequently, closure(c) ⊇ classes(paths(c)).
"⊆": The other direction is not as obvious. We need to recall fact 1, which states that entity is the universal super-class and all other classes are subclasses of entity. Now consider an arbitrary class c and the set closure(c). By definition 5, all classes in this set are either c itself or super-classes of c: ∀ci ∈ closure(c) . (ci = c) ∨ (c ⊂c ci). Therefore, every class ci is an ancestor of c in the tree. Fact 1 now ensures that every path from c to some ci can be continued to the root class entity, since entity has to be an ancestor of ci and hence such a path exists. Since the definition of paths(c) collects all those possible paths between entity and c, and classes(p) collects all classes that occur along a path, we can be sure that closure(c) ⊆ classes(paths(c)).
We have shown that using all paths according to definition 7 delivers the correct result. However, our goal is to minimize the index blowup caused by the entity annotations. Maintaining correctness is a necessity that has to be kept in mind, but not the ultimate goal. While we have shown that the paths neither lose real facts nor add wrong ones, we have not shown that they produce minimal overhead. In fact, the paths are not minimal at all. We will see that we can safely eliminate some of the paths and still retain the same expressive power. Those paths that can be eliminated arise from the following situation in the tree:

Figure 4.4.: Structure causing unnecessary paths

Looking at figure 4.4 and the paths from the root to the node labeled G, we see that such double diamond patterns may lead to paths that do not contain any new class.
Four paths can be collected:
We see that while four paths are possible, the paths from 1. and 4., or alternatively from 2. and 3., already contain all classes. At first sight, it looks like one of the pairs might suffice for annotating an entity that has G as one of its leaves. However, we still want to be able to find all classes by prefix search, and using only one of the pairs of paths leads to the following problem: Let us consider the pair of paths from 1. and 4. (the argument is just as valid for the other pair). If an entity that has a leaf G was now annotated with the words A:B:D:E:G:<entity> and A:C:D:F:G:<entity>, we would have to query for E with the prefix query A:B:D:E:*, while a prefix query with A:C:D:E:* could not be used.
Unfortunately, this is hard to decide when all we know is that we want to query for E. The two possibilities A:B:D:E:* and A:C:D:E:* appear to be equally suited, yet only one of them matches the annotation.
Hence, we need some kind of agreement on which paths to use for queries. For the current version of Susi, we chose the following heuristic:
that we apply the following ordering: Shorter paths take precedence over longer paths
and equally long paths are ordered lexicographically. All paths that do not add any
new class, can be dropped and hence can we reduce the number of paths without
loosing any power of the search.

for all c ∈ Classes do
    sort paths(c), shorter paths first, lexicographically on equal length
    seen ← {} (hash set of strings)
    newPaths ← {}
    for all p ∈ paths(c) do
        foundNewClass ← false
        for all e ∈ classes(p) do
            if e ∉ seen then
                seen ← seen ∪ {e}
                foundNewClass ← true
            end if
        end for
        if foundNewClass = true then
            newPaths ← newPaths ∪ {p}
        end if
    end for
    paths(c) ← newPaths
end for
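A runnable Python version of this pseudocode could look as follows; the function and variable names are ours, not taken from the Susi sources:

def prune_paths(paths):
    """Algorithm 3: drop paths that contribute no new class.

    paths -- list of paths for one class, each path a list of class names.
    Shorter paths are processed first; ties are broken lexicographically.
    """
    paths = sorted(paths, key=lambda p: (len(p), p))
    seen = set()
    kept = []
    for p in paths:
        found_new_class = False
        for c in p:
            if c not in seen:
                seen.add(c)
                found_new_class = True
        if found_new_class:
            kept.append(p)
    return kept

# The double-diamond example from figure 4.4:
four = [["A", "B", "D", "E", "G"], ["A", "B", "D", "F", "G"],
        ["A", "C", "D", "E", "G"], ["A", "C", "D", "F", "G"]]
print(prune_paths(four))
# keeps the first three paths; the fourth adds no new class and is dropped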
Unfortunately, this heuristic does not necessarily lead to an ideally decreased index size. In fact, for the example above we can only eliminate one of the four paths and not two. This still leaves room for improvements and is also discussed in chapter 6, when we have a look at future work. However, for now we successfully eliminate some paths, and we can safely claim that the path-style annotations always decrease the index size compared to naive annotations.
Lemma 2. Using the path-style annotations with this heuristic for improvement,
the index is always smaller than it is with naive annotations.

Note 1. Note that the measurements from table 4.3 show that it is in fact significantly smaller (the blowup compared to no annotations drops from a factor of 4.16 to 1.15) and that the improvements discussed here contribute only slightly to that observation. Most of the benefit is obtained from the idea of using paths itself and from the fact that the graph spanned by the subclass relation is very tree-like. For a real tree, the path annotations would be even better and no improvements would be necessary.
Proof. Consider any class other than entity (for the class entity, the paths trivially lead to the same result as the naive annotations). In this case, the first path will definitely contain multiple classes, i.e. entity and the class itself and possibly more. All further paths add at least one new class. Since the naive way needs a posting for each class, we can be sure that we need strictly fewer postings for all entities that have leaves other than entity, and exactly the same amount (1) for those that only belong to the class entity.
Another improvement we make is the elimination of unnecessary leaves. It may happen that Yago associates an entity with multiple leaves, where one leaf is actually already implied by another one. For that purpose, we post-process the final map that associates entities with paths. In that map, we look at each entity's paths and eliminate those that are only a prefix of some other path. This is very simple but works perfectly for filtering out paths caused by unnecessary leaves.
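A sketch of this post-processing step, assuming paths are represented as lists of class names (in Susi they are strings, where the prefix test works analogously):

def drop_prefix_paths(paths):
    """Remove paths that are a proper prefix of another path.

    Such paths stem from redundant leaves, e.g. the leaf person
    when the entity also has the leaf scientist.
    """
    def is_prefix(short, long):
        return len(short) < len(long) and long[:len(short)] == short

    return [p for p in paths
            if not any(is_prefix(p, q) for q in paths)]

paths = [["entity", "person"], ["entity", "person", "scientist"]]
print(drop_prefix_paths(paths))
# -> [['entity', 'person', 'scientist']]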
4.7. Realization of User Queries
The last section has introduced very long words to represent paths in the tree of semantic classes. We have seen that prefix search can be used to query the index with those words efficiently. However, the queries were extremely long. Anyone who was about to use Susi would never want to type those extraordinarily long words for simple queries like our running example, penicillin scientist. On top of that, one would also have to know the correct prefixes and thus the paths in the tree. For queries for the class scientist, a path with no fewer than eight classes is necessary. This is not acceptable for any user.
Fortunately, CompleteSearch already has a powerful user-interface that is easily extended for Susi. We add so-called translations for classes that provide the user-interface with the corresponding paths. Therefore we add a special document to the index that contains words with a prefix :t:. For scientist, for instance, we add a word of the form :t:scientist::e:.:person:scientist. The CompleteSearch user-interface can internally send additional queries for words with the :t: prefix. The returned completions then relate to possible paths.
At that point we distinguish two cases:

1. The :t: query has exactly one completion and hence a unique translation. For example, the query scientist only has the one meaning that we expect, and hence the corresponding path is the unique translation.

2. The :t: query has more than one completion / translation. The query person leads to that case, since person can mean the organism person that we expect, but also refers to the grammatical category called person.
In the first case, the user-interface immediately replaces the word in the query by its translation. Hence, it transforms it into a semantic query and directly displays the result for the translated query. In the second case, the user-interface lets the user choose a suitable translation. Therefore it displays possible translations in a dedicated box, and translations can be chosen by clicking them. For the displayed value in the box, it appends the last class on the path before the class itself, hence displaying entries like person, the ORGANISM.
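The following sketch illustrates the translation lookup as we understand it; search stands in for the CompleteSearch completion request, and the translation word for the grammatical category person is invented for the example:

def translate(word, search):
    """Resolve a query word to full path prefixes via :t: words."""
    prefix = ":t:" + word + ":"
    completions = search(prefix)
    paths = [c[len(prefix):] + "*" for c in completions]
    if len(paths) == 1:
        return paths[0]   # unique translation: rewrite the query silently
    return paths          # ambiguous: let the user pick a translation

# Stub index with one unambiguous and one ambiguous translation word:
index = [":t:scientist::e:.:person:scientist",
         ":t:person::e:.:organism:person",
         ":t:person::e:.:category:person"]
search = lambda p: [w for w in index if w.startswith(p)]
print(translate("scientist", search))  # -> ':e:.:person:scientist*'
print(translate("person", search))     # -> two alternatives to choose from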
This concept works well for the :t: queries that are supposed to find translations, but comes in just as handy for the real semantic queries. Since all paths annotating entities do end with the entity name itself (:e:.:scientist:alberteinstein:), the same principle can be used to refine queries by instance, i.e. by actual entities. Consequently, Susi is able to provide the box from figure 4.5 for the query penicillin scientist.

Figure 4.5.: Refinements by Instance
Clicking on such a refinement will limit the displayed hits to those that are evidence for facts concerning this concrete entity.
4.8. Software Architecture
This section summarizes how Susi is built into the CompleteSearch search engine and what technology is used. The whole chain described here is defined and automated in a Makefile. While all steps can be performed individually, they can also be processed together, and required artifacts are created whenever they are needed.
Building Susi follows these overall steps:

• Generate the redirect map by parsing the Wikipedia dump once. This is done by a simple SAX parser that fulfills exactly this purpose.

• Extract the facts from Yago and construct the paths according to the idea of the subclass tree as discussed above. This is done in three steps by scripts:
  – Extract the entity → leaves association from Yago into a map.
  – Construct the paths by starting with the root class entity and continuously appending all direct subclasses of the rightmost element to the right-hand side. Also apply the improvements discussed above.
  – Replace each leaf in the entity → leaves map by all paths leading to that class. Again apply the improvements discussed above.

• Now parse the Wikipedia dump and write the postings for the index and the file for the excerpts.

• Convert the postings for the index to binary format, sort them, and generate the vocabulary and the HYB index, and compress the file for the excerpts.

• Launch the CompleteSearch server with suitable arguments and pass the index, the vocabulary and the excerpts file as data base.
Independent from this chain, there are several unit tests, generators for test data, and the evaluation framework discussed in chapter 5. Of course, the CompleteSearch engine consists of many components itself, but these are not discussed here.
4.9. Exemplary Execution of a Query
Now that we have discussed every aspect of Susi separately, we should be able to understand the whole way from the content of the text and our query to the final result. Recall the example query penicillin scientist once more. Additionally, we want to use the text from the very beginning of this document, simplified even further to save some writing in the following.
Alexander Fleming was Scottish. He discovered penicillin.
Figure 4.6.: Fictional Text Excerpt
For the text presented in figure 4.6, the postings in the index will look somewhat like this:

Table 4.4.: Example Index Content
Now let us examine what happens when the query penicillin scientist is typed into the user-interface. First of all, the user-interface will try to find translations for the word scientist. This will succeed, and it will internally replace scientist by the prefix :e:entity:.:scientist:*, thus transforming it into a semantic query. (In fact, the entity Alexander Fleming is associated with more classes. However, we leave them out to keep things readable. Also, penicillin will most likely be associated with a number of semantic classes, such as drug.)
When the CompleteSearch engine processes this query, it will find the word penicillin in a list of documents. This list will include document 13 because of the posting that is contained in table 4.4. Secondly, it will look for the prefix :e:entity:.:scientist:*, which will return a number of completions, each with a list of documents where this particular completion is found. One of them should be :e:entity:.:scientist:biologist:alexanderfleming, with a doc-list including document 13, too. Now the doc lists for both query words are intersected, eliminating occurrences of scientists in documents that have nothing to do with penicillin and vice versa.
Thus, CompleteSearch will be able to tell that the completion :e:entity:.:scientist:biologist:alexanderfleming leads to a hit and that this hit is found in document 13. Hence, it is able to display Alexander Fleming, the biologist as a possible refinement by instance and to present an excerpt for our document 13 as a hit. The internal representation of the excerpt of document 13 will contain the string :e:entity:.:alexanderfleming He discovered penicillin. Consequently, the completion will match the hidden part and the word he is highlighted instead. The word penicillin matches right away and can be highlighted easily.
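The core of this step is a plain intersection of posting lists. A minimal sketch with made-up doc lists (the real engine operates on compressed HYB blocks, not on Python dictionaries):

def intersect_docs(word_docs, completion_docs):
    """Keep only completions whose documents also contain the other word."""
    result = {}
    for completion, docs in completion_docs.items():
        common = sorted(set(docs) & set(word_docs))
        if common:
            result[completion] = common
    return result

penicillin_docs = [7, 13, 42]
scientist_completions = {
    ":e:entity:.:scientist:biologist:alexanderfleming": [13, 99],
    ":e:entity:.:scientist:physicist:alberteinstein": [57],
}
print(intersect_docs(penicillin_docs, scientist_completions))
# -> {':e:entity:.:scientist:biologist:alexanderfleming': [13]}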
Figure 4.7.: Result returned for the query penicillin scientist
Like this, Susi enables CompleteSearch to present Alexander Fleming as a completion and thus as an answer to the query. Additionally, the excerpt is presented as evidence, and the words that made the query match are highlighted, as can be seen in figure 4.7.
5. Evaluation

The evaluation of Susi has two different goals. First of all, we want to demonstrate the performance of Susi with respect to both quality and efficiency. While it is always nice to present decent results, the current state of Susi makes the second goal much more important: we can examine any discovered deficits in detail, find out what causes them, and react accordingly by fixing minor bugs or revising concepts that do not work out as well as intended.
5.1. Quality
Examining the quality of a search engine usually revolves around the question: did the expected documents get returned? Although Susi delivers something similar in form of the evidence provided for each fact, we rather want to answer the question: did the expected entities get returned? This makes sense because the evidence will always be the passage of the text that contains the found fact. As long as the entity really fits the fact, the evidence is likely to be correct as well.
We perform several queries, compare the result to a set of expected entities, and compute precision, recall and F-measure. Finally, we discuss the outcome. We examine queries with poorer results in depth and try to determine the predominant reasons for misbehavior.
5.1.1. Experimental Setup
The quality measurements have been executed on a machine where no running version of Susi was available. Consequently, queries are processed over the network, although this had only practical reasons. Basically, the evaluation framework communicates with Susi via HTTP and receives a result in XML format, which is then parsed for a list of entities.
As expected entities, our ground truth, we chose Wikipedia lists. There are many Wikipedia lists that are sometimes generated from external resources and often created by hand. Those lists may concern pretty much any topic. To name some examples, there is a list of bridges, a list of British monarchs and a list of plants with edible leaves. Each of the chosen lists is parsed by a dedicated Perl script. For most of them, the relevant entities can easily be matched by a regular expression. Note that most existing approaches to semantic search could not even process all of those queries, because some lists revolve around very specific facts that usually are not reflected by relations in an ontology. Just consider the list of drug-related deaths: an ontology is very unlikely to list the cause of death for every person.
In order to evaluate a query, we join both lists, the expected one and the one Susi returned, and mark each entity with one of the following:

True, which is used to mark entities that are in both lists. Everything works as expected for them.

False-Pos for "false positives", which is used for entities that are returned by Susi but not in the ground truth.

False-Neg for "false negatives", which is used for entities that are in the ground truth but not returned by Susi.
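The bookkeeping behind these markers and the derived measures can be sketched as follows (a simplified model of the evaluation framework, not its actual code):

def evaluate(expected, returned):
    """Mark entities as True / False-Pos / False-Neg and compute scores."""
    expected, returned = set(expected), set(returned)
    true = expected & returned
    false_pos = returned - expected
    false_neg = expected - returned
    precision = len(true) / len(returned) if returned else 0.0
    recall = len(true) / len(expected) if expected else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

print(evaluate({"a", "b", "c"}, {"b", "c", "d"}))
# -> (0.666..., 0.666..., 0.666...)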
The queries that are supposed to recreate those lists are chosen by hand. Sometimes we use multiple queries for the same list in order to compare them.
Wikipedia List as Ground Truth | Query
Computer Scientists |
Computer Scientists | computer science .:person:scientist:*
Regions of France |
Regions of France | regions of france .:location:region:*
Political Authors |
EMI Artists | emi :.:creator:artist:*
Saturated Fatty Acids | saturated fatty .:compound:acid:*
Drug Related Deaths | drug death .:organism:person:*
Drug Related Deaths | drug died .:organism:person:*
Drug Related Deaths | .:substance:agent:drug:* died .:organism:person:*
Presidents of the United States | united states .:corporateexecutive:president:*
Presidents of the United States | united states elected .:corporateexecutive:president:*

Table 5.1.: Quality Evaluation Queries
Table 5.1 contains all lists used for evaluation and the corresponding queries. Additionally, we assign IDs to the queries so that we can easily refer to them later. Note that the queries contain our long, artificial paths and had to be abbreviated. The full ones can be found in the appendix.
Susi, or actually CompleteSearch, already includes some kind of ranking for the returned completions and hence the found entities. This rating simply reflects the number of different positions of the completion in the text that are matched by our search query. Consequently, we are able to examine the top-k entities returned. This is an interesting measure, because an entity at the bottom of the result list is usually not perceived at all, or at least not as important as the top ones. For this document we included the measurements for top-500.
For each query we collect the total numbers and calculate precision, recall and F-measure. With the current settings, Susi performs in the way shown in table 5.2.

Table 5.2.: Quality Evaluation - All Completions
The statistics for each query are shown. True counts all entities that are in the list and also get returned by Susi. False counts those that get returned but are not in the list, and missing counts those that are in the list but are not returned by Susi. Note that the total statistics are influenced by the extreme outliers. In the following section, we examine why some queries perform well while others do not, and determine common reasons for failure.
Table 5.3.: Quality Evaluation - Top 500 Completions

The statistics for each query with top-500 retrieval are shown. Note that the top 500 completions do not necessarily reflect 500 distinct entities. For example, scientist:biologist:aristotle and scientist:mathematician:aristotle are two completions describing the same entity. Apart from that, the principle is identical to the style of table 5.2.
In order to correctly interpret the results, we have to pay special attention to queries Q2.1 and Q4. For each of those queries, there is coincidentally a Yago category that should directly reflect the Wikipedia list used as ground truth. However, the results are not close to 100% precision and recall at all. Consequently, we note that there has to be a mismatch. Further observation shows that sometimes Yago does not completely classify all entities, but on the other hand, the Wikipedia lists themselves may have lots of flaws. The most common issue is that manually created lists are far from being complete. Hence, many entities returned by Susi are actually correct but counted as false positives. Additionally, the Wikipedia lists may also contain false positives themselves. For example, the list of cocktails lists drinks that are not actual cocktails but would commonly be classified as liqueurs.
Keeping this issue in mind, we still notice significant differences between the queries that cannot be related to flaws in the Wikipedia lists. Naturally, we want to examine possible reasons for this observation. Therefore, we pick some queries and analyze them in detail. Fortunately, we do not only have total and relative numbers for our evaluation, but also the particular entities that are classified as either True, False-Pos or False-Neg. Thus, we can look at both false positives and false negatives, go through the Wikipedia documents, check Yago, and finally state a reason for each mismatch between the result from Susi and the Wikipedia list. In order to establish a systematic classification, we define several classes of common errors. Subsequently, we can assign one of those classes to each incorrectly returned entity. We use the following classes:

L List Problem.
There is a problem with the Wikipedia list used as ground truth for that query. Either the entity should be in the list but actually is missing, or it is in the list while it does not deserve to be.
Example: Thomas Hobbes missing in the Wikipedia list of political writers.
Y Yago Problem.
There is a mistake in Yago. Usually this means that an entity misses a class it actually belongs to. This typically happens when Wikipedia articles are not assigned to categories, the source for huge parts of the classifications in Yago.
Example: Bill Clinton not being classified as president.
A Abstract Entity Problem.
The entity is some abstract entity like computer scientist instead of the name of a concrete person. While the page entity computer scientist is a valid instance of scientist and of a person, lists usually only contain concrete persons or likewise concrete entities. Currently, Susi does not distinguish between abstract and concrete entities.
Example: The entity Journalist is returned as a political writer.
R Referral Problem.
The fact refers to some other entity, but both are mentioned in the same sentence.
Example: John Doe saw on the television that Jane Smith discovered a cure for cancer. John Doe would accidentally get associated with the cure for cancer.
Q Query Problem.
There is no suitable way to express the precise semantics of a list as a query. Often this is caused by imprecise Yago categories.
Example: Only single persons are classified as musicians or artists. Bands are classified as groups at best, while group is a very general class that contains things entirely unrelated to musicians, such as institutions, organizations and many more.
E Entity Recognition Problem.
Some passage in the text is recognized as the wrong entity.
Example: Herman Einstein was a salesman and engineer. Herman, who is Albert's father, might mistakenly get recognized as Albert Einstein himself.
W Incomplete Wikipedia.
The entity in the list references an entity that does not have its own Wikipedia page.
Example: There is no page for Tridecylic acid, while it occurs in the list of saturated fatty acids. In the list, there is a link to a nonexistent page.
D Disambiguation Problem.
An ambiguous word is resolved to the wrong entity.
Example: The word wardrobe describes many things: a piece of furniture, a type of clothing and more. An occurrence of this word might get recognized as the wrong entity.
F Fact not Found.
The fact cannot be found in Wikipedia apart from the list itself. This means it is definitely not on the entity's page (confirmed manually) and probably not located on some other page either (hard to check manually).
Example: Victoria Beckham is in the list of EMI artists. However, no other place could be found where it is stated that she really released something under this label.
N Negation Problem.
Negations often contain the same words as the positive case, plus an additional not.
Example: This fatty acid is not saturated.
S Synonymy.
The fact is expressed differently. This is a common problem for search engines in general and not specifically related to semantic Wikipedia search.
Example: He committed suicide vs. he shot himself.
Z Others.
This label classifies problems that could not be assigned to any of the above categories.
These categories may also serve as a basis for any further evaluation that is performed on a larger scale. For now, we can only examine a few example queries. First of all, we want to have a look at the list of political writers, which led to poor values. Let us recall the statistics for this query.
Table 5.4.: Statistics for Query Q5 (political writers)
Since there are actually quite few entities that Susi could not find, we are able to regard all of them (false negatives). Additionally, we want to look at the topmost 30 of the false positives, among them:

Alireza Jafarzadeh, Andy Croft (Writer), Charles E. Silberman, Winston Churchill, Gheorghe Alexandrescu, Niccolò Machiavelli, Jean Edward Smith, Theodore Roosevelt, Józef Mackiewicz, Lewis Grassic Gibbon, Plato (Computer System), Vladimir Tismaneau, George Washington, Otto von Bismarck, Benjamin Franklin

Table 5.5.: Evaluation of Query Q5 (political writers)
Table 5.5 gives us some insight on what went wrong with query Q5, which is supposed to recreate the list of political writers. We can see that the false positives contain some abstract entities, while most of them are writers that actually should be in the list. The entity Plato is a special case. As we can see, the ground truth somehow contains the entity Plato, the Computer System. However, this is not an error in the Wikipedia list; instead there seems to be a problem with case insensitivity of the redirect map. The Wikipedia page PLATO in all upper case redirects to the page of the computer system. Plato, on the other hand, is the page for the philosopher as expected. Unfortunately, the redirect map redirects occurrences of Plato to the page of the computer system. This issue can easily be fixed as discussed in chapter 6, Future Work. Apart from that, the false positives only confirm how incomplete the hand-made Wikipedia lists can be, and we should note that abstract entities are in fact an issue. The false negatives, however, are more interesting. While there are some lesser-known writers whose Wikipedia pages do not contain all necessary information, the predominant problem is Yago categories. Many philosophers are not classified as writers although they published several books or articles.
As a second example, we want to examine the query for the regions of France. Especially query Q3.1 shows terrible precision. The first guess should be that the word region is ambiguous. While the Wikipedia list contains the 26 political regions of France, the area between two arbitrary villages, a city or even another country can also be described as a region. In order to confirm this theory, we look at the top 20 false positives.
Of the top 20 false positives, seventeen are labeled with Q, the class region being too general. The remaining three are labeled with A as abstract entities: Departments of France, Communes of France, and Regions of France, the list itself.

Table 5.6.: Top 20 False Positives for Q3.1 (regions of France)
Table 5.6 clearly confirms our theory. The class region is too general and not limited to the political regions from the list. However, we observe a second phenomenon: the word France also causes problems. While the list uses it to describe the region, the word also describes the entity France. Hence, many regions, especially countries, that have some relation with the country France are contained in the result of our query. Either way, it is safe to say that the bad precision is caused by our ambiguous query.
Finally, we want to consider another example. In order to regard something new, we pick a query with many false negatives. Query Q6, which is supposed to recreate the list of EMI artists, has the worst recall with only 30%. We examine 20 false positives and 20 false negatives for this query.
Alice (Italian Singer), Wolfgang Amadeus Mozart, Berlin Philharmonic, Ludwig van Beethoven

Table 5.7.: Evaluation of Query Q6 (EMI artists)
Our results listed in table 5.7 give an obvious answer to why this query has a lower recall than others. When we want to express a class that describes artists or musicians, we cannot find a suitable Yago category. Bands, ensembles, etc. do not share a class with solo artists. This is why the query does not return groups, and hence the recall suffers immensely. The false positives can also be related to an interesting issue. Many musicians are returned that were never signed at EMI during their solo career. However, they are part of bands that are. This leads to the Beatles being one of the false negatives, while their members occur as false positives. Another issue is that composers get returned whose work has been performed by some EMI artist. This is a referral problem, caused by using mere co-occurrence in a sentence for identifying semantic facts.
In summary, we see that most analyses lead to a discovery that allows really easy fixes and consequently improves the quality of the results returned by Susi. Obviously, this makes further evaluation an important goal for future work.
5.2. Performance

Apart from quality, there is a second very important aspect to our evaluation. Performance and efficiency are absolutely crucial. We have already seen the path-style annotations that are supposed to achieve better performance. While we have seen that they are in fact very space efficient, we now want to measure and examine response times for queries in practice. Especially since fine-tuning some settings may have a huge impact on performance, we really want an evaluation of concrete queries in order to spot cases where existing heuristics seem to fail.
5.2.1. Experimental Setup
All our experiments were run on an Intel Xeon X5560 2.8GHz machine with 16 processors, 40 GB of main memory, and running Linux. For the evaluation, an existing framework has been used which is part of the CompleteSearch project. It provides detailed statistics for each query as well as a summary. The complete statistics can be found in the appendix.
The queries from table 5.8 have been used for this part of the evaluation. Note that Q1 to Q10.2 denote the queries from the quality evaluation, whose performance is evaluated, too.
computer science :e:.:organism:person:*

Table 5.8.: Queries for the Performance Evaluation
We have extended the queries by two warm-up queries. The & symbol marks queries that are not supposed to contribute to the statistics. Those warm-up queries fill a specific part of Susi's history that is always kept in the cache. In production mode, Susi will always have these caches filled, too. However, we also evaluate our queries with cold caches once, in order to gain more insight into cache effects and possible problems.
Table 5.9.: Performance Evaluation
The measurements presented above provide insight into the current state of Susi's performance. We see that most queries are processed within only tens of milliseconds. However, there are also queries that require about a whole second to finish, which is not really acceptable. This behavior is most likely related to suboptimal block boundaries for the HYB index.
Just look at the query for American presidents, which performed poorly. The reason is that the current heuristic uses one block for each direct subclass of person. However, in the case of the class president, the content is located in a block for all leaders, which has about 6 million entries. Subsequently, only words that are actual completions of president have to be filtered. In fact, we have examined this further and were able to conclude that there are 5,930,590 leader vs. 1,343,830 scientist occurrences and 164,812 leader vs. 36,782 scientist completions. So we can be pretty sure that this block is in fact too huge to deliver the desired performance results.
In conclusion, we can say that it would be wise to divide the vocabulary further. Especially since no queries required scanning lots of blocks, it is safe to say that further division should not harm the performance. However, fine-tuning is postponed for now and counted towards future work instead.
The second thing we notice concerns query P3, which queries for vegetarians. Being a subclass of person, the entries can be filtered from the history. In this particular case, however, the cold queries show that actually reading the corresponding block from disk instead of filtering is faster by an order of magnitude. CompleteSearch is configured based on the expectation that filtering from history is always faster than reading blocks from disk and decompressing them. However, the different nature of the artificial words used by Susi renders this expectation wrong under some circumstances. Consequently, this is another issue that should be regarded in the future.
6. Conclusion

This final chapter concludes the work done for Susi. Additionally, it features a list of pieces of future work. For some of the issues, we can already propose possible solutions. On top of that, we try to judge the necessary effort involved whenever possible.
6.1. Conclusion

We have introduced an approach to semantic search that combines full-text with ontology search and presented its application in form of our system Susi. Susi enables search that directly finds facts in the English Wikipedia and provides excerpts from the text as evidence. While existing approaches to semantic search restrict themselves to search on a fixed set of attributes or relations, being able to access all information in the text obviously exceeds those restrictions.
Our evaluation has shown that we already achieve query processing times of fractions of a second. However, there are still some outliers that leave room for improvement. Queries that take over half a second are not really acceptable. Fortunately, those queries can easily be accelerated by fine-tuning, such as choosing more suitable block boundaries for the HYB index.

On the other hand, the quality evaluation has shown that precision and recall are not entirely satisfying yet. However, a huge portion of the unexpected outcomes can be related to mistakes in, and especially incompleteness of, the Wikipedia lists that have been used as ground truth. Still, there are several issues that spoil precision and recall and that should be tackled in order to improve the quality of the results delivered by Susi. Especially distinguishing abstract and concrete entities should help a lot. Fortunately, detailed examination of our results has shown that for each query with bad statistics there usually is one distinct problem to blame. This observation allows focusing on issues whose resolution will really improve the system.
6.2. Future Work
Throughout this thesis we have presented all relevant aspects of the creation of Susi and identified several issues that currently require further improvements. We now want to summarize future work as a list. Note that we try to state possible solutions and estimate the effort for them. The list items are ordered by decreasing priority.
1. Add case sensitivity to the redirect map. Wikipedia URLs and the resulting redirects are case sensitive, too. For example, PLATO redirects to the entity PLATO_(Computer_System) while Plato points to the philosopher's article. This issue has high priority because it is a real bug and not a failure of a heuristic, and hence is pretty easy to fix. Estimated effort: 1-2 days, since the parser has to be modified in order to preserve the case of recognized entities a bit longer, too.
2. Stop parsing Wikipedia pages called Template:. Those pages do not contain interesting facts and spoil both the results themselves and especially the excerpts. Estimated effort: a few minutes, plus rebuilding the index.
3. Add a ranking to the excerpts. While the completions are already ranked by frequency and deliver a really convenient output, the excerpts are ranked by document ID. Consequently, displaying excerpts for lowly ranked completions happens a lot. Frequently, this even involves list-like documents or Wikipedia template pages that deliver really messy excerpts. Estimated effort: 1 week or more, since a whole new concept of relating the excerpts' ranking to the ranking of the completions has to be established.
4. Distinguish abstract entities (physicist) from concrete ones (Albert Einstein). This could be done by using different prefixes and adapting the UI accordingly. The entities can actually be distinguished very well. Yago uses different files for entities that are derived from WordNet and for those that are extracted from Wikipedia articles. Abstract entities are usually contained in WordNet, and hence the ones extracted from articles can be assumed to be concrete entities. There is already a script that is able to flag abstract or concrete entities, which has been used for measurements. For the current version of Susi, it is not included in the make-chain. Estimated effort: 1 day for changes to the created index, multiple days or weeks for the necessary UI remake to support the distinction.
5. Quality evaluations at a larger scope. With help of Amazon Mechanical Turk or similar systems, we could evaluate a lot more queries and gain even more insight.
6. Forbid some queries in the UI. For example, typing :e:entity can be deadly for the system because it is a prefix of half of the index, which is just too much to process. So just like the UI prevents prefix queries that consist of only one letter, those queries should be avoided as well, and a prefix of :e:entity:alberteinstein: should only be processed as soon as it reaches a certain length or perhaps the second colon. Estimated effort: 1 day.
7. Add more relations from Yago or some other ontology like DBpedia. While it is nice to express some semantic relations by mere co-occurrence, useful relations are already present in existing ontologies. Using them should be a source of highly precise facts.
8. Prefixes for semantic classes may currently have more completions than there are actual entity occurrences that fit. For example, the prefix .organism:person: will have multiple completions for each occurrence of Albert Einstein, e.g. physicist:alberteinstein, .vegetarian:alberteinstein and so on. We have taken measurements, and for the prefix person the current version has 2.59 times more completions than there are actual occurrences. Consequently, this is not critical at the moment, but it is still a huge possible source of inefficiency of the path-style annotations. Thus, the path-style annotations still leave room for improvements.
9. Refine the scope of facts. Simply choosing sentences, items in lists and rows in tables works for many cases, but at the same time ignores many constructs in human language. Consequently, some facts are lost while false ones are added.
10. Experiment with the pronoun recognition. If both the document entity and the last entity from the text match the type of the pronoun, it is hard to decide which one to choose. Currently, no real experiments have been done. One could easily experiment with the order for a few hours. However, in the long run, linguistic analysis may be far superior, but inferring pronoun references that way involves much more work.
11. Rework the decision when to filter query results from the history and when to scan blocks from disk. For the query .:person:user:eater:vegetarian:*, scanning only takes about 10ms, whereas filtering may take more than 500ms if the CompleteSearch engine filters from the huge result set for the query .:person:*. The block that would be scanned (user:*), on the other hand, is really small.
12. Noun phrase recognition. For example, an entity therapy center should be recognized as one thing so that it does not match the sole word therapy at some other point. For names, on the other hand, matching only the first or last name is sometimes desired. However, sometimes one also has to take care not to confuse relatives or siblings. While this sounds important, almost no concrete case occurred during evaluation where an error could be related to this or the following phenomenon.
13. Disambiguation between abstract entities. Since abstract entities are not related to Wikipedia articles in Yago, it is hard to associate the right one with an occurrence in the text. For example, there are five WordNet entities for wardrobe, meaning the piece of furniture, the clothing and so on. While they are distinguished in WordNet and Yago, it is hard to tell which one to choose if a Wikipedia entity called wardrobe is discovered.
Apart from the improvements in the above list, there is also more to address in the future. While the above list contains pieces of future work that contribute further to the current goal behind Susi, we can also extend the goal by some facets.

First of all, we may want to extend the ontology part of the search. Currently, Susi only uses information about the semantic classes of entities, while Yago actually provides a lot more relations such as bornIn, hasGdp and so on. Including some or all of those relations would allow restricting entities not only by class but also by attributes.

Apart from that, the creation of a user-interface dedicated to semantic search is an interesting project for the future. The current UI has been created for traditional search applications. The extensions made to support semantic queries are possible due to its modular architecture, which revolves around multiple independent boxes. However, most of those extensions are mere hacks, and a system dedicated to semantic queries would be a lot better suited.
Either way, there are lots of topics that can be addressed in the future. Querying Susi already is real fun and delivers interesting results, but there is still much to do until all semantic queries can be processed perfectly.
Acknowledgments

I am heartily thankful to my supervisor, Hannah Bast, whose guidance and support from the initial to the final level enabled me to develop an understanding of the used technology and the subject itself.

I also want to thank all friends that helped me a lot by proofreading this document: Elmar Haussmann, Nils Schmidt, Daniel Schauenberg, Alexander Gutjahr, Linda Kelch, Florian Bäurle and Christian Simon.
Lastly, I offer my regards to all of those who contributed to CompleteSearch.
A. Appendix
A.1. Full Queries of Quality Evaluation
region of france :e:entity:physicalentity:object:location:region:*

A.2. Full Queries of Performance Evaluation
search engine :e:entity:alberteinstein:*

Bibliography

Bast, H., Chitea, A., Suchanek, F. M., and Weber, I. 2007. Ester: efficient search on text, entities, and relations. In SIGIR, W. Kraaij, A. P. de Vries, C. L. A. Clarke, N. Fuhr, and N. Kando, Eds. ACM, 671–678.
Bast, H. and Weber, I. 2006. Type less, find more: fast autocompletion search with a succinct index. In SIGIR, E. N. Efthimiadis, S. T. Dumais, D. Hawking, and K. Jaervelin, Eds. ACM, 364–371.
Bast, H. and Weber, I. 2007. The CompleteSearch engine: Interactive, efficient, and towards IR & DB integration. In CIDR. www.cidrdb.org, 88–95.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. 2009. DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web 7, 3, 154–165.
Klyne, G. and Carroll, J. J. 2004. Resource description framework (RDF): Concepts and abstract syntax. W3C recommendation, W3C. Feb.
Celikik, M. and Bast, H. 2009. Fast error-tolerant search on very large texts. In SAC. 1724–1731.
Guha, R. V., McCool, R., and Miller, E. 2003. Semantic search. In WWW.
Hahn, R., Bizer, C., Sahnwaldt, C., Herta, C., Robinson, S., Buergle, M., Duewiger, H., and Scheel, U. 2010. Faceted Wikipedia search. In Business Information Systems, W. Aalst, J. Mylopoulos, N. M. Sadeh, M. J. Shaw, C. Szyperski, W. Abramowicz, and R. Tolksdorf, Eds. Lecture Notes in Business Information Processing, vol. 47. Springer Berlin Heidelberg, 1–11.
Miller, G. A. 1994. WordNet: A lexical database for English. In HLT. Morgan Kaufmann.

Neumann, T. and Weikum, G. 2008. RDF-3X: a RISC-style engine for RDF. PVLDB 1, 1, 647–659.
Neumann, T. and Weikum, G. 2009. Scalable join processing on very large RDF graphs. In SIGMOD Conference, U. Cetintemel, S. B. Zdonik, D. Kossmann, and N. Tatbul, Eds. ACM, 627–640.
Prud'hommeaux, E. and Seaborne, A. 2008. SPARQL query language for RDF. W3C recommendation, W3C. Jan. http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/.
Suchanek, F. 2009. Automated construction and growth of a large ontology. Ph.D. thesis.
Suchanek, F. M., Kasneci, G., and Weikum, G. 2007. Yago: a core of semantic knowledge. In WWW, C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, Eds. ACM, 697–706.
I hereby declare that I have written this thesis independently, that I have used no sources or aids other than those indicated, and that all passages taken verbatim or in spirit from published writings have been marked as such. Furthermore, I declare that this thesis has not been prepared, not even in part, for any other examination.
