How does web archiving work?
The archiving of websites involves a number of steps. After the websites to be archived have been selected, the next step is to collect them (this is called harvesting or crawling) using specially developed software. This is quite similar to what the crawlers used by search engines like Google do, except that a web archive crawler actually tries to harvest all the files in a single website. Taking the domain to be archived (www.kb.nl, for example) as its starting point, the crawler begins at the homepage and follows every link it finds. In the example shown below, the crawler starts on the index page and follows the references to 1a, 2a, 1b and so on until the entire original website is harvested.
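The link-following step can be illustrated with a small sketch. It is not the KB's crawler: the site is a hypothetical in-memory dictionary whose page names follow the example (index, 1a, 2a and so on), and the breadth-first visiting order is an assumption, since the order in which a real crawler works through its queue may differ.

```python
from collections import deque

# Hypothetical website: each page maps to the pages it links to.
SITE = {
    "index": ["1a", "1b"],
    "1a": ["2a", "index"],
    "1b": ["2b"],
    "2a": [],
    "2b": ["1a"],
}


def crawl(site, start):
    """Return the pages in the order a breadth-first crawl harvests them."""
    harvested = []
    seen = {start}            # pages already queued, so none is fetched twice
    queue = deque([start])
    while queue:
        page = queue.popleft()
        harvested.append(page)
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return harvested


order = crawl(SITE, "index")
```

Keeping track of the pages already seen is what stops the crawler from looping when pages link back to one another, as 1a and index do here.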
The goal of the KB is to archive all the files that constitute a single selected website (insofar as this is technically possible and there are no security barriers). It is also possible to cap both the amount of data and the number of files harvested from a single website. This approach is used when making a snapshot. The crawler can be configured to work within these limits. In the example, the crawler limits itself to the documents within the indicated domain (kb.nl).
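A minimal sketch of such limits is given here, assuming a list of candidate files with known sizes. The function names, the limit values and the example URLs are hypothetical and do not reflect the KB's actual crawler configuration.

```python
from urllib.parse import urlparse


def in_scope(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def select_within_limits(candidates, domain, max_files, max_bytes):
    """Keep in-scope files until a file-count or data-volume limit is reached.

    candidates is a list of (url, size_in_bytes) pairs in crawl order.
    """
    kept, total = [], 0
    for url, size in candidates:
        if not in_scope(url, domain):
            continue  # outside the indicated domain: never harvested
        if len(kept) >= max_files or total + size > max_bytes:
            break     # a limit has been reached: stop harvesting
        kept.append(url)
        total += size
    return kept, total


# Hypothetical candidates and limits, for illustration only.
candidates = [
    ("https://www.kb.nl/", 1000),
    ("https://example.org/page", 500),
    ("https://www.kb.nl/a", 2000),
    ("https://kb.nl/b", 4000),
]
kept, total = select_within_limits(candidates, "kb.nl", max_files=10, max_bytes=5000)
```

In this run the file on example.org is skipped because it is out of scope, and the last file is not harvested because it would exceed the data limit.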
Generally speaking, websites are built from a large number of individual files. The Heritrix crawler, which is the one used by the KB, “wraps” all these individual files in a kind of “container”, making the archived version of the site easier to manage. Each file in this wrapping is described by metadata. These metadata record the file format, the date and time of the crawl and the size of the file.
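The idea of a container with metadata for each file can be sketched as follows. This is a simplified illustration and not the file format Heritrix actually writes: the Record class, its field names and the example files are assumptions chosen to mirror the metadata listed above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Record:
    """One harvested file plus the metadata that describes it (hypothetical)."""
    url: str
    file_format: str
    crawled_at: str   # date and time of the crawl, ISO 8601
    size: int         # size of the file in bytes
    payload: bytes


def wrap(files, crawled_at):
    """Bundle (url, file_format, payload) triples into one list of records."""
    return [
        Record(url, file_format, crawled_at.isoformat(), len(payload), payload)
        for url, file_format, payload in files
    ]


# Hypothetical harvested files, for illustration only.
container = wrap(
    [
        ("https://www.kb.nl/index.html", "text/html", b"<html></html>"),
        ("https://www.kb.nl/logo.png", "image/png", b"\x89PNG"),
    ],
    datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

Because every record carries its own description, the container can be managed as a single object while each file inside it stays identifiable.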
Before the crawled websites can be stored in the e-Depot they undergo a quality control check. This first involves looking at the completeness and the quality of the harvested sites to see whether any parts are missing and whether the links work. Then data are collected concerning the various file formats and versions of those formats. These data are of primary importance for presenting the sites in the future. This information is stored as technical metadata.
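Two of these checks, finding links that point to missing parts and counting the file formats, might look like the sketch below. It assumes a harvested site represented as a dictionary and treats every link as internal to the site. The data structure and function name are hypothetical, and the KB's actual quality control is not described in this much detail.

```python
from collections import Counter


def quality_report(pages):
    """Return (broken_links, format_counts) for a harvested site.

    pages maps each harvested URL to {"format": str, "links": [urls]}.
    A link counts as broken when its target was not harvested.
    """
    broken = sorted(
        {link for page in pages.values() for link in page["links"] if link not in pages}
    )
    formats = Counter(page["format"] for page in pages.values())
    return broken, dict(formats)


# Hypothetical harvest result, for illustration only.
pages = {
    "/index.html": {"format": "text/html", "links": ["/about.html", "/logo.png"]},
    "/about.html": {"format": "text/html", "links": ["/index.html", "/missing.pdf"]},
    "/logo.png": {"format": "image/png", "links": []},
}
broken, formats = quality_report(pages)
```

The list of broken links points to the parts that are missing, and the format counts are the kind of information that would be stored as technical metadata.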
Then the archived websites are catalogued and indexed. The cataloguing is done automatically as much as possible, ensuring that the websites included in the web archive are searchable via the KB’s central catalogue. The websites are also full-text indexed, enabling the user to conduct a free text search in the archive. The user searches the index with the help of a search engine specifically developed for the web archive. The result of the search is called up from the e-Depot and presented in an interface that not only shows the requested version of the site but also makes it possible to consult earlier and later versions of the site by means of a time bar.
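Full-text indexing and the retrieval of several versions of a site can be illustrated with a small inverted index. This is a sketch under simple assumptions (plain-text pages, one entry per archived version, whole-word matching) and not the search engine developed for the web archive. All names and example data are hypothetical.

```python
import re
from collections import defaultdict


def build_index(versions):
    """Map each word to the set of (url, date) versions that contain it.

    versions is a list of (url, date, text) triples with ISO 8601 dates.
    """
    index = defaultdict(set)
    for url, date, text in versions:
        for word in re.findall(r"\w+", text.lower()):
            index[word].add((url, date))
    return index


def search(index, word):
    """Return matching versions, oldest first, as a time bar would show them."""
    hits = index.get(word.lower(), set())
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Hypothetical archived versions, for illustration only.
versions = [
    ("www.kb.nl", "2012-03-01", "Opening hours of the library"),
    ("www.kb.nl", "2010-06-15", "Welcome to the library"),
    ("www.kb.nl", "2011-01-20", "Web archive news"),
]
index = build_index(versions)
results = search(index, "Library")
```

Sorting the hits by date is what allows an interface to offer earlier and later versions of the same site next to the one requested.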