The KB's e-Depot, a dedicated archiving environment for long-term preservation of digital objects, has been in operation since 2003. One marked difference between the present day and 2003 is that the volume of objects to be handled and preserved has increased dramatically. This results in new challenges for the digital library.
You just cannot check millions of objects manually
What can happen when a JP2 file is corrupted.
Johan van der Knijff shares his KB office with colleague Clemens Neudecker and his collection of vintage vinyl LP covers.
Structure of JP2
Johan van der Knijff of the KB's Research Department: "A few years ago the KB decided to migrate millions of older TIFF files to JPEG 2000 (aka JP2). JP2 files are much smaller, so migrating the files would save storage space and money. I was asked to design part of the workflow for the migration, including some type of quality check. The most widely used tool to check the technical qualities of JP2 was JHOVE. But colleagues at the British Library in London had warned me that JHOVE did not work well. They had carried out some spot checks during digitization projects and had discovered that JHOVE passed corrupt files. So I did some experimenting by deliberately damaging a few images. JHOVE passed them. It even passed a 2 MB image which I had reduced to 4 KB. Obviously, something had to be done, because errors such as these can cause substantial problems, especially in the long term."
"We had already known that not all software produces valid JP2's, but this time the issue is about managing a large migration process. One can imagine that a minimal dip in the power supply, which might go entirely unnoticed, can damage hundreds or even thousands of files. We want to be able to isolate those files, reliably and automatically, because manually checking millions of files is just not an option."
An alternative JP2-checker
Some odd afternoon – or so he tells us – Johan decided to have a look at the JP2 standard documentation. He discovered a box-like structure which looked like it would be rather straightforward to check (the technical details are in his blog post announcing the first rudimentary software). He ended his post on a modest note: "I'm curious to hear if anyone finds [this software] useful at all."
It only took 45 minutes for Paul Wheatley of the British Library to reply: "Excellent work, Johan. We will definitely be testing this out." And more was to come. A lively online debate ensued about what the software should look like and how it could be developed. There seemed to be demand for the tool. Johan: "I decided it would be worth my while to spend more time on it." This work fitted nicely within the framework of the European SCAPE project which had just begun and which focuses on issues of scale in digital preservation (SCAPE stands for SCAlable Preservation Environments). Both the KB and the BL are partners in SCAPE.
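To give an impression of why the box-like structure of JP2 lends itself to automated checking, here is a minimal sketch in Python. It is not jpylyzer's code, and the function names `list_boxes` and `check_structure` are made up for this illustration. It relies only on the general layout of a JP2 file: a fixed 12-byte signature box, followed by boxes that each start with a 4-byte length and a 4-byte type, with the image data in a codestream box that should end with an end-of-codestream marker. A real validator checks far more than this.

```python
import struct

# The fixed 12-byte signature box that every JP2 file starts with.
JP2_SIGNATURE = b"\x00\x00\x00\x0cjP  \r\n\x87\n"


def list_boxes(data: bytes):
    """Walk the top-level boxes. Returns (boxes, problems), where each
    box is a tuple (type, offset, length actually present in the data)."""
    boxes, problems = [], []
    offset = 0
    while offset < len(data):
        remaining = len(data) - offset
        if remaining < 8:
            problems.append(f"truncated box header at offset {offset}")
            break
        length, box_type = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if length == 1:
            # Length 1 means the real length is in the next 8 bytes.
            if remaining < 16:
                problems.append(f"truncated box header at offset {offset}")
                break
            (length,) = struct.unpack(">Q", data[offset + 8:offset + 16])
            header = 16
        elif length == 0:
            # Length 0 means the box runs to the end of the file.
            length = remaining
        if length < header:
            problems.append(f"invalid box length at offset {offset}")
            break
        if length > remaining:
            problems.append(
                f"box {box_type!r} at offset {offset} runs past end of file")
            boxes.append((box_type, offset, remaining))
            break
        boxes.append((box_type, offset, length))
        offset += length
    return boxes, problems


def check_structure(data: bytes):
    """Return a list of problems found; an empty list means none found."""
    problems = []
    if not data.startswith(JP2_SIGNATURE):
        problems.append("missing JP2 signature box")
    boxes, walk_problems = list_boxes(data)
    problems.extend(walk_problems)
    types = [box_type for box_type, _, _ in boxes]
    # File Type, JP2 Header and Contiguous Codestream boxes are required.
    for required in (b"ftyp", b"jp2h", b"jp2c"):
        if required not in types:
            problems.append(f"missing required box {required!r}")
    # The codestream should end with the end-of-codestream marker FF D9.
    for box_type, offset, length in boxes:
        if box_type == b"jp2c":
            if data[offset + length - 2:offset + length] != b"\xff\xd9":
                problems.append("codestream does not end with EOC marker")
    return problems
```

Calling `check_structure(open("image.jp2", "rb").read())` on a file that was cut off halfway would typically report a box that runs past the end of the file, which is the kind of damage described above. The file name is, of course, just an example.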
From prototype to operational tool
Johan emphasizes that the development of a tool like this is not a one-man effort. "My first bit of code was a bit clumsy and unnecessarily lengthy in places, so my KB colleague René van der Ark volunteered to clean it up. We named the tool jpylyzer (JP2-analyzer) and published a prototype on GitHub – an online collaboration platform for software developers. During SCAPE meetings colleagues from England, Austria and Portugal joined in to further develop and improve the tool." At iPRES in Toronto, the annual digital preservation conference, Johan's demo of the jpylyzer tool attracted quite a bit of attention (see photo).
Jpylyzer's take-up by the community
"In the spring of 2012 the KB made use of jpylyzer in the migration of its TIFFs to JP2. Around the same time the British Library ran 25 million JP2s through jplyzyer and discovered some 680 faulty files. At the request of the Wellcome Library, the German company Intranda incorporated jpylyzer in Goobi – a much used suite for digitization workflows. The Wellcome Library uses jpylyzer to check JP2s produced by third parties. Other jpylyzer users include the British National Archives and the Danish Statsbiblioteket."
What about jpylyzer's future?
Johan: "A tool like this needs more than a single individual or the KB to support and develop it. That's why we published jpylyzer under an open source licence. Anyone can use it and contribute to its development. If you want, you can even incorporate jpylyzer into your own software or projects."
"Jpylyzer is now supported by the Open Planets Foundation (OPF), an international organisation established to support and develop tools emerging from European projects; both the KB and BL are members of OPF. On a day-to-day basis, I am still responsible for support and development, but in the long run it is quite conceivable that someone else takes over. Through our OPF network, jpylyzer has also been adopted by the Ubuntu/Debian community – groups of developers around the Ubuntu and Debian operating systems. Such involvement contributes to the sustainability of jpylyzer and also increases visibilityoutside the limited circle of digital libraries/archives."
- Jpylyzer home page at the Open Planets Foundation, with links to the software, the manual and relevant blog posts
- Johan van der Knijff, A simple JP2 file structure checker, OPF blog post, 1 September 2011
- Johan van der Knijff, A prototype JP2 validator and properties extractor, OPF blog post, 14 December 2011
- David Tarrant, Johan van der Knijff, ‘Jpylyzer: analysing JP2000 files with a community-supported tool’, iPRES 2012 proceedings
- KB Research's GitHub site