Exploring possibilities for improving search and usability of our digital content is one of the core activities of the research department of the Koninklijke Bibliotheek (KB). One way in which we aim to achieve this is by enriching our collections with extracted or related information, from internal as well as external sources. These enrichments can be of many kinds: from an extracted genre or sentiment to geographical coordinates or a related movie on the web. Our current focus is on enriching the historical newspapers collection with linked named entities, i.e. names of persons, locations and organisations mentioned in the newspaper articles that are linked to corresponding resource descriptions in international knowledge bases such as DBpedia, Wikidata and VIAF.
We have set up a generic enrichment infrastructure, consisting of an enrichment database and a number of services, that is able to store any type of enrichment for any object from the KB collections, without modifying the original data. Generally speaking, the enrichment database contains links between identifiers of objects from our collections and related identifiers.
In the case of linked named entities, the database links identifiers of newspaper articles to records representing entities. Each record combines all known links to descriptions of that entity in a thesaurus or external knowledge base. If available, we include some metadata about a link, such as its provenance and confidence.
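To make the shape of such a record concrete, here is a minimal sketch in Python. It is not the KB's actual schema: the class names, field names and the identifiers used in the usage note are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Link:
    """One link from an entity record to an external description."""
    target: str                         # identifier in a thesaurus or knowledge base
    provenance: Optional[str] = None    # who or what produced the link, if known
    confidence: Optional[float] = None  # linker confidence, if known


@dataclass
class EntityRecord:
    """Combines all known links for a single entity."""
    name: str
    links: List[Link] = field(default_factory=list)


class EnrichmentStore:
    """Hypothetical in-memory stand-in for the enrichment database.

    Enrichments are keyed by object identifier, so the original
    collection data is never modified.
    """

    def __init__(self) -> None:
        self._by_object: Dict[str, List[EntityRecord]] = {}

    def add(self, object_id: str, record: EntityRecord) -> None:
        self._by_object.setdefault(object_id, []).append(record)

    def get(self, object_id: str) -> List[EntityRecord]:
        # Return a copy so callers cannot alter the stored list.
        return list(self._by_object.get(object_id, []))
```

With this sketch, `store.add("example:article:0001", EntityRecord("Augustus", [Link("example:kb:augustus", "automatic", 0.91)]))` attaches one entity record to a made-up article identifier, and `store.get("example:article:0001")` retrieves it when the article is indexed or presented.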
Entity linking process
Named entities are automatically recognized in the articles by named entity recognition software. We generate a set of candidate links for each entity by querying an entity index constructed out of DBpedia dumps. For each candidate, a number of features are determined from properties of the name itself as well as contextual information, such as date of birth and profession. A machine learning model that was trained on a manually annotated set of articles selects the best candidate, if any, based on the feature values. Although our software has reached an accuracy of over 85%, we encourage users to correct any remaining errors and add missing links. This user feedback also serves as additional training data for the entity linking software.
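The selection step can be illustrated with a small sketch. It is a simplification under stated assumptions, not the trained model described above: the two features, the hand-set weights and the threshold are placeholders, and the candidate URIs in the usage note are invented.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Candidate:
    uri: str                          # hypothetical knowledge base identifier
    label: str                        # name of the entity in the knowledge base
    birth_year: Optional[int] = None  # contextual information, if available


def features(mention: str, article_year: int, cand: Candidate) -> List[float]:
    """Compute two illustrative features for one candidate."""
    # Feature 1: string similarity between the mention and the candidate label.
    name_sim = SequenceMatcher(None, mention.lower(), cand.label.lower()).ratio()
    # Feature 2: is the birth year compatible with the publication year?
    if cand.birth_year is None:
        date_ok = 0.5  # unknown, so neither evidence for nor against
    else:
        date_ok = 1.0 if cand.birth_year <= article_year else 0.0
    return [name_sim, date_ok]


# Placeholder weights and threshold standing in for a trained model.
WEIGHTS = [0.7, 0.3]
THRESHOLD = 0.6


def select_best(mention: str, article_year: int,
                candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if none scores above the threshold."""
    best, best_score = None, THRESHOLD
    for cand in candidates:
        score = sum(w * f for w, f in zip(WEIGHTS, features(mention, article_year, cand)))
        if score > best_score:
            best, best_score = cand, score
    return best
```

For a mention "Willem Drees" in an article from 1950, a candidate labelled "Willem Drees" with birth year 1886 scores higher than one labelled "Willem Drees jr." with birth year 1922, so the first is selected. When no candidate scores above the threshold, the function returns `None`, which corresponds to the "if any" in the description above.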
When an article is accessed for indexing or presentation, the associated enrichments can be retrieved from the enrichment database. Presentation software may show links to resource descriptions or relevant context information from these descriptions, such as an abstract or image of an entity. Indexing identifiers for the linked named entities along with the newspaper articles opens up new possibilities for (semantic) search. Users are able to search for articles containing entities that possess certain (combinations of) properties, for example articles about Roman emperors. Our software obtains the identifiers of the entities with the property of being a Roman emperor from Wikidata in the background and then uses them to query the enriched newspaper index.
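The two-step lookup in the Roman emperor example can be sketched as follows. This is an illustration only: the index field name `entity_id` and the Lucene-style query syntax are assumptions, not a description of the KB index, and the Wikidata identifiers in the usage note should be verified before use. The sketch builds and parses the data but leaves the HTTP request itself to the caller.

```python
from typing import Dict, List, Optional

# Public Wikidata SPARQL endpoint to which the generated query would be sent.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_sparql(property_id: str, value_id: str, limit: int = 1000) -> str:
    """Build a query for all items that have the given property value."""
    return (
        "SELECT DISTINCT ?item WHERE { "
        f"?item wdt:{property_id} wd:{value_id} . "
        f"}} LIMIT {limit}"
    )


def parse_bindings(response: Dict) -> List[str]:
    """Extract entity URIs from a SPARQL JSON results document."""
    return [b["item"]["value"] for b in response["results"]["bindings"]]


def build_index_query(entity_uris: List[str],
                      field: str = "entity_id") -> Optional[str]:
    """Turn entity URIs into a boolean query on a (hypothetical) index field."""
    if not entity_uris:
        return None  # nothing to search for
    quoted = " OR ".join(f'"{uri}"' for uri in entity_uris)
    return f"{field}:({quoted})"
```

A call such as `build_sparql("P39", "Q842606")` would ask for items that hold a given position, assuming here that `P39` is "position held" and `Q842606` is "Roman emperor" in Wikidata. The URIs returned by the endpoint are passed through `parse_bindings` and then `build_index_query` to obtain a query for the enriched newspaper index.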
To demonstrate this functionality, we have created an online research portal where users can explore the available enrichments and experiment with semantic search in the historical newspaper collection. The portal supports full SPARQL queries against Wikidata, but also offers a number of user-friendly forms of semantic querying, e.g. by automatically generating a “best guess” SPARQL query from a conventional search string. There are also links to extra services, including a page for removing and adding enrichments per article.
Theo van Veen and Juliette Lonij