WebART: enabling Scholarly Research in the KB Web Archive

WebART is not about art, but about Web Archive Retrieval Tools

Archives are assembled with the intention that the information be (re)used in some way or another. The Dutch Web Archive, as collected by the KB, is no exception to this rule. Established in 2007, the Web Archive now harvests some 5,100 websites on a regular basis. The archive holds more than 8 terabytes of data.
The harvested data are a lot more complex than a digitised book or journal. A single website can contain all sorts of files: text, software, pictures, movies, links to other websites, etc. This raises the question of what kinds of questions researchers will have for the web archive. And how can the KB organise the data in such a way that researchers can actually find the information they are looking for?
Researchers are also looking into these issues. Jaap Kamps of the University of Amsterdam initiated a project (WebART) which brings together the web archive and researchers. It is a collaborative effort between the KB, the University of Amsterdam and the Centrum Wiskunde en Informatica (CWI), the Dutch national research institute for mathematics and computer science.

Hugo Huurdeman of WebART: "Scholarly research in web archives is a very young discipline"

Presently, the KB Web Archive is available in the reading rooms only; the interface is the Wayback Machine developed by the Internet Archive

Results of the WebARTist search engine

Perhaps the acronym of the project (WebART) is a bit confusing, because the project has nothing to do with the fine arts. WebART stands for web archive retrieval tools, explains researcher Hugo Huurdeman. On behalf of the University of Amsterdam, Hugo spends quite a bit of his project time virtually in the KB Web Archive, while he is physically embedded in the Research Department of the KB.

Websites have very unique dynamics

"Websites represent an entirely new category of information," says Hugo. "An important difference with books and journals is the time factor. Google only brings you to the last version of a website; in the web archive you can find the history – not only of the texts, images, etc., but also of the way information is created and shared in the internet age, the dynamics of the web. The 'new media' inspire a brand-new research discipline."

"For example: in a trial we collected Dutch websites about the situation in Syria, and analysed the links. It turned out that there were marked differences between news sites in the way they gathered their information. Some relied on (semi)official sources only, while others relied on user-generated content. In addition it transpired that most of the news about Syria originated from outside Syria. Such phenomena are interesting to study."
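
As a rough illustration of the kind of link analysis Hugo describes, the following Python sketch counts the external hosts that a single archived page links to. The sample page, the host names and the function name outbound_hosts are hypothetical, and this is not the WebART team's method; it is only one simple way to tally where a site points.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def outbound_hosts(base_url, html):
    """Count links per external host, ignoring links to the page's own host."""
    own_host = urlparse(base_url).hostname
    collector = LinkCollector(base_url)
    collector.feed(html)
    hosts = (urlparse(link).hostname for link in collector.links)
    return Counter(host for host in hosts if host and host != own_host)


# Hypothetical page content, invented for this example.
page = """
<a href="/about">About</a>
<a href="http://ministry.example/press">Press release</a>
<a href="http://video.example/watch?v=1">Eyewitness video</a>
<a href="http://video.example/watch?v=2">Another video</a>
"""

counts = outbound_hosts("http://news.example/syria", page)
for host, n in counts.most_common():
    print(host, n)
```

Comparing such counts across news sites, and labelling each host as (semi)official or user-generated, would be one way to surface the differences in sourcing mentioned in the quote.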

Tools for using a web archive

One needs specific tools to be able to access and analyse a web archive in a meaningful way. Developing such tools has been the focus of the first year of the WebART project. "We have developed a type of full-text search engine that can take into account the time factor in locating specific texts or images [it is called WebARTist – another acronym that has nothing to do with the fine arts]. And we developed all sorts of filters, e.g., specific versions of a site, specific periods, specific (UNESCO) categories, links between pages, etc."
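
To make the idea of a time-aware search with filters more concrete, here is a minimal sketch in Python. The Capture record, its fields, the sample data and the search function are all assumptions made for illustration; the article does not describe how WebARTist is actually built.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Capture:
    """One harvested version of a page (hypothetical record layout)."""
    url: str
    harvested: date
    category: str
    text: str


def search(captures, query, start=None, end=None, category=None):
    """Return captures that contain every query term, newest first.

    Optional filters restrict the harvest date range and the category.
    """
    terms = query.lower().split()
    hits = []
    for cap in captures:
        if start is not None and cap.harvested < start:
            continue
        if end is not None and cap.harvested > end:
            continue
        if category is not None and cap.category != category:
            continue
        words = set(cap.text.lower().split())
        if all(term in words for term in terms):
            hits.append(cap)
    return sorted(hits, key=lambda cap: cap.harvested, reverse=True)


# Invented sample data: two versions of one site plus one other site.
captures = [
    Capture("http://example.nl/", date(2009, 5, 1), "government",
            "annual report on water management"),
    Capture("http://example.nl/", date(2012, 5, 1), "government",
            "new report on flood defences"),
    Capture("http://example.org/", date(2012, 6, 1), "news",
            "report on the elections"),
]

results = search(captures, "report",
                 start=date(2012, 1, 1), category="government")
for cap in results:
    print(cap.harvested, cap.url)
```

The point of the sketch is that each version of a page is a separate searchable record with its own harvest date, which is what lets a query be narrowed to a specific period or version. A real engine would use an inverted index rather than scanning every capture.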

Interaction between web archive and research community

Hugo's Amsterdam colleague Anat Ben-David concerns herself with the more theoretical issues in the project (What type of research questions could a web archive answer?) and most of the programming work is carried out by Thaer Sammar of the CWI. At the KB, staff from the Collections and Product Support departments are involved, in addition to Research staff. Such interaction between supply and demand is essential for enabling meaningful scholarly research in the web archive.

"For instance, during the trial the researchers complained that the KB Web Archive was a bit too proper." This assessment is no surprise, of course. The KB so far has harvested mostly official and semi-official websites. It has opted for a selective approach to web archiving rather than harvesting the entire .nl domain, as this is more in line with the KB's remit and available (scarce) resources. But selection policies may change over time, as both the KB and researchers gain more experience with the web archive.

Reuse limited by copyright law

One important hindrance to reusing the data in the web archive must be mentioned here: the Copyright Act. The present Act (1912) is still based on printed information. Until 70 years after the death of an author, photographer, etc., the rights holders must grant permission for any type of copying or reuse. And each version of a website may have many rights holders (photographer, journalist, designer, writers, etc.). How the KB handles these issues will be discussed in a future article.

Facts and figures

WebART is a project under the CATCH umbrella (Continuous Access to Cultural Heritage), funded by the Dutch Organisation for Scientific Research, NWO. Duration: 2012-2016.
Project staff: (KB) Hildelies Balk, Paul Doorenbosch, René Voorburg, Victor-Jan Vos, (University of Amsterdam) Jaap Kamps (project lead), Richard Rogers, Hugo Huurdeman, Anat Ben-David, (CWI) Arjen de Vries, Thaer Sammar.