Long-term archiving at the Österreichische Mediathek

The Österreichische Mediathek is continuously working on digitizing its analogue holdings and collecting analogue and digital audio and video recordings. As of January 2024, the Österreichische Mediathek's digital archive contains around 300,000 digital objects. In order to keep these files intact and playable in the long term, various measures must be taken, as explained in the following text. These include the redundant storage of copies, the regular migration of storage media, the continuous and repeated checking of the integrity of digital files and the checking of file formats for obsolescence. Digital long-term archiving is therefore an ongoing work process and not a one-off action.

Long-term archiving against the background of loss

Magnetic tape-based audiovisual archives such as the Österreichische Mediathek are essentially characterized by the fact that the playability of the originals in the analogue archive is under threat. Digitization is the only way to keep the sources of the collection usable in the long term. Media archives are generally driven by the threat of losing their holdings; in a nutshell: what is not digitized in time is lost. Embedded in a long-term archiving strategy, the digitization of analogue sound recordings represents the first major migration step in a long series of future migrations.

Against the background of loss, not only digitization but also long-term archiving is of particular importance. While the aim of preservation digitization is to create a digital image of the analogue source that comes as close as possible to the original, the task of long-term archiving is to preserve the digital objects for all future generations and to assess and reduce the risk of digital loss as far as possible.

Concept of long-term archiving at the Österreichische Mediathek

Eternal migration

How can a digital object be preserved for all time? The goal of preserving digital objects for all eternity seems unimaginable. It can only succeed if the essence of the file is preserved while its form ("format") and its information carrier are continually updated and adapted. The concept of "eternal migration" describes precisely this strategy. It essentially comprises two components: carrier migration and format migration. In carrier migration, the data is transferred bit for bit to another storage medium. Checksums can be used to verify that the copy on the new storage medium is bit-for-bit identical to its source. Such migrations of the storage medium are time-consuming and expensive, and yet they form the basis of all long-term archiving.
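
A carrier migration step with checksum verification can be sketched roughly as follows. This is a minimal illustration, not the Mediathek's actual tooling: the function names `sha256_of` and `migrate_file` are hypothetical, and SHA-256 is simply one possible choice of checksum algorithm.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, target: Path) -> str:
    """Copy source to target and confirm that both have the same checksum."""
    expected = sha256_of(source)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    actual = sha256_of(target)
    if actual != expected:
        raise IOError(f"checksum mismatch after copying {source} to {target}")
    return expected
```

The copy is only accepted once the checksum of the file on the new carrier matches the checksum computed from the source.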

In addition to a carrier migration that takes place every few years, it is also important to ensure that the files remain permanently usable in the form in which they are held; in an AV archive, this means that they can be played. This requires a forward-looking view of the digital collection and early recognition of potential risks with regard to format obsolescence.

Four pillars of long-term archiving

The long-term archiving strategy of the Österreichische Mediathek rests on four pillars:

1. Storage

The basis of any long-term archiving is the storage infrastructure. The Österreichische Mediathek stores four identical, verified copies of each digital object on three different types of storage: hard disk storage, magnetic tape storage (an LTO tape library) and external offline LTO magnetic tapes. This strategy follows the "3-2-1 backup rule", which calls for at least three copies of the data, on at least two different storage technologies, with at least one copy kept at a different location (disaster prevention).

Spatial distribution protects the data holdings from damage that could occur to infrastructure and buildings (fire, water ingress, etc.). In addition to the data pools at different sites in Vienna, one copy of the data is stored in the central backup system of the federal government (ZAS) in St. Johann im Pongau.
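
The storage rule can be expressed as a simple check over an inventory of copies. The following is a sketch under stated assumptions: the `Copy` record, the function `satisfies_3_2_1` and the site labels are hypothetical placeholders, not the Mediathek's real inventory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    medium: str    # storage technology, e.g. "disk" or "lto-offline"
    location: str  # site where the copy is kept (hypothetical labels)
    offline: bool = False


def satisfies_3_2_1(copies: list[Copy], primary_location: str) -> bool:
    """At least 3 copies, 2 storage technologies, 1 copy away from the primary site."""
    media = {c.medium for c in copies}
    offsite = [c for c in copies if c.location != primary_location]
    return len(copies) >= 3 and len(media) >= 2 and len(offsite) >= 1


# Hypothetical inventory: four copies on three types of storage.
copies = [
    Copy("disk", "site-a"),
    Copy("lto-library", "site-a"),
    Copy("lto-offline", "site-b", offline=True),
    Copy("lto-offline", "site-c", offline=True),
]
print(satisfies_3_2_1(copies, primary_location="site-a"))
```

Four copies on three types of storage at several sites, as described above, exceed the minimum that the rule requires.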

A key aspect of the Österreichische Mediathek's long-term archiving strategy is to store the complete data set offline several times in the form of LTO tapes. This is an efficient way - not only from an ecological point of view - to protect the data stock from external influences (e.g. malware).

Eternal storage?

As described above, the Österreichische Mediathek relies on regular carrier migration for its long-term archiving. This means that every few years the entire data stock of the Österreichische Mediathek is copied to a different storage medium - de facto an up-to-date storage system with new hardware. Carrier media that promise a long physical shelf life (e.g. optical data carriers made of robust materials such as gold, analogue (micro)film as a carrier of digital data, ceramic-based data carriers, ...) are presented time and again. Storage on such carriers is not part of the archiving concept of the Österreichische Mediathek. On the one hand, transferring the large amounts of digital data typical of audiovisual media to such carriers would be very time-consuming. On the other hand, the durability of the media is not the only consideration: hardware that can read the information on the media must also remain available. In this respect, the question of obsolescence arises here as well - especially with storage media that are not widely used.

Another aspect is that the digital long-term archive of the Österreichische Mediathek is a living archive in which data is constantly being adapted, supplemented and updated, which means that a static repository designed to last 100 years or more makes little sense.

Carrier migration - Every year again...

Since damage to the hardware of hard disk storage is to be expected after a certain period of time, a migration of the data must be planned every few years. In addition to renewing hardware, there is another cycle to consider: the updating of LTO tapes. LTO stands for "Linear Tape-Open". These are ½-inch magnetic tapes for backing up digital data. Roughly every two to three years a new generation of LTO tapes is released, which also stores a larger amount of data. The main reason for keeping the data on LTO up to date is to ensure that it remains readable with current tape drives. Each generation of LTO tape drives is compatible with the previous generation of tapes - i.e. these tapes can be read and written to. Tapes from two generations back cannot be written to, but in many drive generations they can still be read. To avoid the risk that the archive's own data tapes can only be read by old devices, it is necessary to switch to the latest generation of LTO tapes and drives at the appropriate intervals.
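
The compatibility rule can be illustrated with a small sketch. It assumes that a drive writes to its own and the previous tape generation and reads one generation further back; actual backward compatibility varies between LTO generations, so the defaults `write_back=1` and `read_back=2` are assumptions to be checked against the documentation of the drive in question. All function names are hypothetical.

```python
def drive_can_write(drive_gen: int, tape_gen: int, write_back: int = 1) -> bool:
    """True if a drive of generation drive_gen can write tapes of generation tape_gen."""
    return drive_gen - write_back <= tape_gen <= drive_gen


def drive_can_read(drive_gen: int, tape_gen: int, read_back: int = 2) -> bool:
    """True if a drive of generation drive_gen can read tapes of generation tape_gen."""
    return drive_gen - read_back <= tape_gen <= drive_gen


def unreadable_generations(tape_gens: list[int], drive_gen: int, read_back: int = 2) -> list[int]:
    """Tape generations in the holdings that the given drive can no longer read."""
    return sorted(g for g in set(tape_gens) if not drive_can_read(drive_gen, g, read_back))
```

Under these assumptions, `unreadable_generations([3, 4, 5, 6], drive_gen=6)` returns `[3]`: generation 3 tapes would have to be migrated before generation 6 drives become the only ones available.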

2. Integrity

In addition to securing and regularly renewing the storage and its infrastructure, the core task of long-term archiving is to guarantee the integrity of the data. This involves continuously verifying the integrity of the data from the moment it enters the digital archive. Checksums such as "MD5" or "SHA" are suitable tools for this. An algorithm computes a short, fixed-length value (the checksum) from the complete bit sequence of a file. The checksum can be recalculated after each copying process and at regular intervals and compared with the initial checksum. This makes it possible to reliably detect whether the data in a file has been changed and is therefore corrupted.

If data corruption has occurred, the checksum can be used to identify an intact copy of the file that can replace the corrupt file.
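
Replacing a corrupt file with an intact copy can be sketched as follows. This is an illustrative example only: the helper names `find_intact_copy` and `repair` are hypothetical, and SHA-256 stands in for whichever checksum algorithm an archive actually uses.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional, Sequence


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_intact_copy(copies: Sequence[Path], expected: str) -> Optional[Path]:
    """Return the first copy whose checksum matches the stored one, if any."""
    for candidate in copies:
        if candidate.is_file() and sha256_of(candidate) == expected:
            return candidate
    return None


def repair(corrupt: Path, copies: Sequence[Path], expected: str) -> bool:
    """Overwrite a corrupt file with an intact copy and re-verify the result."""
    intact = find_intact_copy([c for c in copies if c != corrupt], expected)
    if intact is None:
        return False  # no intact copy left: this needs human attention
    shutil.copyfile(intact, corrupt)
    return sha256_of(corrupt) == expected
```

The stored checksum acts as the reference: any copy that still reproduces it can serve as the source for the repair.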

Check data integrity

The integrity of the digital objects must be checked from the moment they enter the archive. Ideally, a checksum should be provided as soon as the data is created, which serves to confirm the integrity of the data from the outset. This is often not the case when digital objects are handed over by external producers and collectors. In such cases, checksums must be created when the files are transferred into the archive (the "ingest"). All files in the digital archive must be provided with checksums so that the status of the files can be checked automatically and regularly. This enables a prompt response to data errors such as "bit flip" or "bit rot".
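
Creating checksums at ingest and re-checking them later can be sketched as a manifest of relative paths and digests. This is a minimal illustration with hypothetical function names (`build_manifest`, `verify_manifest`); it does not describe the Mediathek's inbox system.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file below root (by relative path) to its checksum."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum has changed."""
    failed = []
    for relative, expected in manifest.items():
        path = root / relative
        if not path.is_file() or file_digest(path) != expected:
            failed.append(relative)
    return failed
```

Running `verify_manifest` at regular intervals returns an empty list as long as every file is unchanged; a single flipped bit in a file is enough for that file to appear in the result.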

For ingest, the Österreichische Mediathek uses a system consisting of several so-called "inboxes". These inboxes have the task of checking digital objects that are to be written to the archive with regard to the validity and integrity of the data.

The Österreichische Mediathek checks the data stock regularly from the moment it enters the digital archive.

For this purpose, the Österreichische Mediathek has its own archive analysis and monitoring system: MEDIAS. MEDIAS is based on the "Search-IT" software.

3. Consistency

Storage and data integrity are the backbone of long-term archiving and enable the exact preservation of bits: "bitstream preservation". In practice, a key issue in long-term archiving is the question of data packaging: how should digital data be named and in which structures (files and folders) should the data be stored?

The well-known reference model OAIS ("Open Archival Information System" - ISO 14721) is a useful conceptual aid here.[1] This model distinguishes between three information packages: "Submission Information Package (SIP)", "Archival Information Package (AIP)" and "Dissemination Information Package (DIP)". The key issue in long-term archiving in this area is the definition of AIPs for different groups of holdings and the reconciliation and maintenance of data consistency in practice - especially under changing conditions and requirements in digital archiving.

To address these questions, the MEDIAS program is used to run systematic queries across the entire digital collection in order to check whether all AIPs in the digital archive correspond to the current AIP definition. Especially in a very old digital archive such as that of the Österreichische Mediathek (started in 2000), there are different generations of AIPs. Such differences in the nature of the data must be documented so that appropriate measures can be taken and AIPs - if necessary and appropriate - can be brought up to date.
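
A consistency check of this kind can be sketched as follows. The AIP definition used here (one folder per AIP containing a video file, a metadata file and a checksum file) is purely hypothetical and serves only as an illustration; it is neither the Mediathek's actual AIP definition nor a description of how MEDIAS works.

```python
from pathlib import Path

# Hypothetical AIP definition: file suffixes that every package must contain.
REQUIRED_SUFFIXES = frozenset({".mkv", ".xml", ".md5"})


def missing_parts(aip_dir: Path, required: frozenset = REQUIRED_SUFFIXES) -> set[str]:
    """Return the required suffixes for which the AIP folder has no file."""
    present = {p.suffix.lower() for p in aip_dir.iterdir() if p.is_file()}
    return set(required) - present


def nonconforming_aips(archive_root: Path, required: frozenset = REQUIRED_SUFFIXES) -> dict[str, set[str]]:
    """Map the name of each AIP that deviates from the definition to what it lacks."""
    report = {}
    for aip_dir in sorted(p for p in archive_root.iterdir() if p.is_dir()):
        missing = missing_parts(aip_dir, required)
        if missing:
            report[aip_dir.name] = missing
    return report
```

The resulting report documents which packages belong to an older generation and what would be needed to bring them up to the current definition.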

4. Functionality

The question of the functionality of a digital object is essentially the question of preserving the "essence" of a file: can this object be used in its present form for its intended purpose, or is there an identifiable risk in this respect in the foreseeable future? As a sound and video archive, the Österreichische Mediathek is confronted with an abundance of files in various formats. The first step in tackling the problem of impending format obsolescence is to know which formats are in the digital archive. This step alone is not an easy one, as common resources for format identification such as PRONOM only treat audiovisual formats superficially. When analyzing AV files, it is important to consider the container format as well as the encoding of the individual streams (audio stream, video stream). Helpful tools for this are ffprobe and MediaInfo. These analysis programs can be used to identify the respective formats in the digital archive and to carry out a risk assessment on that basis.
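
Reading the container format and the stream encodings with ffprobe can be sketched as follows. The example assumes that ffprobe is installed and on the PATH; the function names `probe` and `summarize` are hypothetical, and the sample report at the end is made-up illustration data in the shape of ffprobe's JSON output.

```python
import json
import subprocess


def probe(path: str) -> dict:
    """Run ffprobe on a file and return its JSON report (requires ffprobe on PATH)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarize(report: dict) -> dict:
    """Reduce an ffprobe report to the container format and the codec of each stream."""
    streams = report.get("streams", [])
    return {
        "container": report.get("format", {}).get("format_name"),
        "video": [s.get("codec_name") for s in streams if s.get("codec_type") == "video"],
        "audio": [s.get("codec_name") for s in streams if s.get("codec_type") == "audio"],
    }


# Made-up sample report for illustration; summarize(probe("file.mkv")) would be the real call.
sample = {
    "format": {"format_name": "matroska,webm"},
    "streams": [
        {"codec_type": "video", "codec_name": "ffv1"},
        {"codec_type": "audio", "codec_name": "pcm_s24le"},
    ],
}
print(summarize(sample))
```

Collecting such summaries for all files gives an overview of which container and codec combinations occur in the archive, which is the starting point for a risk assessment.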

For born-digital objects, the Österreichische Mediathek first archives the digital original in its original form. For potentially endangered formats, a further lossless copy suitable for long-term archiving (archive copy) must also be made. Formats that are used at the Österreichische Mediathek to create lossless archive copies are FFV1/PCM/MKV for video and PCM/BWF for audio.
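
Creating an FFV1/PCM/MKV archive copy could, for example, be done with FFmpeg. The sketch below only builds the command line; the chosen options (FFV1 version 3, 24-bit PCM audio) are assumptions made for illustration and not the Mediathek's documented settings, and the function name `ffv1_mkv_command` is hypothetical. Whether 24-bit PCM is lossless depends on the bit depth of the source audio.

```python
def ffv1_mkv_command(source: str, target: str, audio_codec: str = "pcm_s24le") -> list[str]:
    """Build an ffmpeg command that writes FFV1 video and PCM audio into a Matroska file."""
    if not target.lower().endswith(".mkv"):
        raise ValueError("the archive copy is expected to be a Matroska (.mkv) file")
    return [
        "ffmpeg",
        "-i", source,
        "-c:v", "ffv1",       # lossless FFV1 video encoder
        "-level", "3",        # FFV1 version 3 (assumed setting)
        "-c:a", audio_codec,  # uncompressed PCM audio (assumed bit depth)
        target,
    ]


print(" ".join(ffv1_mkv_command("original.mov", "archive_copy.mkv")))
```

The resulting list can be passed to `subprocess.run`, and the archive copy should then be checked before it is accepted into the archive.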

Digital long-term archiving
Marion Jaks, Mag.a
+43 1 5973669-7162