How does web archiving work?
The KB uses a set of open-source tools designed specifically for web archiving under the umbrella of the International Internet Preservation Consortium (IIPC).
Archiving websites involves a number of steps. Once the websites to be archived have been selected, the next step is to collect them (this is called harvesting or crawling) using specially developed software. This software is quite similar to the software used by search engines such as Google, except that a web archive crawler tries to harvest all the files within a website. Taking the domain to be archived as its basis (www.kb.nl, for example), the crawler starts at the homepage and follows all the links it finds from there.
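The link-following step can be illustrated with a minimal sketch. It is not how Heritrix is implemented; it only shows the general idea of starting at a homepage and following links that stay within the same domain. The names crawl, LinkExtractor and SITE are hypothetical, and the in-memory SITE dictionary stands in for real HTTP requests so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href/src targets found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(seed_url, fetch):
    """Breadth-first crawl limited to the host of seed_url.

    fetch is an assumed callable: URL -> page text, or None if unavailable.
    Returns the harvested URLs in the order they were visited.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    harvested = []
    while queue:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        harvested.append(url)
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, link))
            # Stay inside the domain being archived; skip known URLs.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return harvested


# Hypothetical stand-in for a live website (URL -> page text).
SITE = {
    "http://www.kb.nl/": '<a href="/about">About</a> <img src="logo.png">'
                         ' <a href="http://example.org/">Elsewhere</a>',
    "http://www.kb.nl/about": '<a href="/">Home</a> <a href="/contact#top">Contact</a>',
    "http://www.kb.nl/logo.png": "",
    "http://www.kb.nl/contact": "",
}

harvested_urls = crawl("http://www.kb.nl/", SITE.get)
```

In this sketch the link to example.org is never visited, because it lies outside the domain being archived. A production crawler also has to deal with matters this example leaves out, such as binary files, politeness delays and pages generated by scripts.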
The goal of the KB is to archive all the files that constitute a single website (insofar as this is technically possible and not impeded by security barriers). Generally speaking, a website consists of a large number of individual files. The Heritrix crawler, which is the one used by the KB, “wraps” all these individual files into a kind of “container”, making the archived version of the site easier to manage. Each file in this wrapper is described by metadata, which record the file format, the date and time of the crawl and the size of the file.
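The idea of a container holding many files, each preceded by its metadata, can be sketched as follows. This is a deliberately simplified, made-up layout for illustration only; it is not the container format that Heritrix actually writes. The function names write_record and read_records and the header field names are assumptions introduced for this example.

```python
import io
from datetime import datetime, timezone


def write_record(container, url, file_format, payload, crawl_time):
    """Append one harvested file, preceded by its metadata, to the container."""
    header = (
        f"URL: {url}\n"
        f"Crawl-Date: {crawl_time.isoformat()}\n"
        f"Format: {file_format}\n"
        f"Size: {len(payload)}\n"
        "\n"  # a blank line separates the metadata from the file itself
    )
    container.write(header.encode("utf-8"))
    container.write(payload)
    container.write(b"\n")  # marks the end of the record


def read_records(container):
    """Yield (metadata dict, payload bytes) for every record in the container."""
    while True:
        line = container.readline()
        if not line:
            return  # end of container
        metadata = {}
        while line.strip():
            key, _, value = line.decode("utf-8").rstrip("\n").partition(": ")
            metadata[key] = value
            line = container.readline()
        # The Size field says exactly how many bytes belong to the file.
        payload = container.read(int(metadata["Size"]))
        container.readline()  # consume the newline that ends the record
        yield metadata, payload


# Wrap two files into a single in-memory container, then read them back.
container = io.BytesIO()
crawl_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
write_record(container, "http://www.kb.nl/index.html", "text/html",
             b"<html>\nhello</html>", crawl_time)
write_record(container, "http://www.kb.nl/logo.png", "image/png",
             b"\x89PNG\n\n", crawl_time)
container.seek(0)
records = list(read_records(container))
```

Storing the size in the metadata lets the reader extract each file exactly, even when the file itself contains line breaks, and keeping everything in one container means a whole site can be moved or copied as a single object.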