Long Term Data Preservation


Category Archive

The following is a list of all entries from the standards category.

WARC file format becomes an ISO standard

WARC, an extension of the ARC file format, used for archiving web material, has been made an ISO standard.

The WARC format offers new possibilities, notably the recording of HTTP request headers, the recording of arbitrary metadata, the allocation of an identifier for every contained file, the management of duplicates and of migrated records, and the segmentation of the records. WARC files are intended to store every type of digital content, whether retrieved by HTTP or by another protocol.

Standardization offers a guarantee of durability and evolution for the WARC format. It will help web archiving enter the mainstream activities of heritage institutions and other branches, by fostering the development of new tools and ensuring the interoperability of collections. Several applications are already WARC compliant, such as the Heritrix [http://crawler.archive.org/] crawler for harvesting, the WARC tools [http://code.google.com/p/warc-tools/] for data management and exchange, the Wayback Machine [http://archive-access.sourceforge.net/projects/wayback/], NutchWAX [http://archive-access.sourceforge.net/projects/nutch/] and other search tools [http://code.google.com/p/search-tools/] for access. The international recognition of the WARC format and its applicability to every kind of digital object will provide strong incentives to use it within and beyond the web archiving community.

- Abby Grotke, IIPC Communications Officer, Library of Congress

See the IIPC press release.
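
For readers who have not seen a WARC file, here is a rough sketch of how a single record could be put together, based on my reading of the WARC 1.0 layout: a version line, named header fields, a blank line, the content block, and two trailing line breaks. The function name build_warc_record and the example.org URL are made up for illustration; this is not the API of any of the tools listed above, and real archiving should go through one of them.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(record_type, target_uri, payload, content_type):
    """Serialize one WARC/1.0 record as bytes (hypothetical helper, not a library API)."""
    headers = [
        ("WARC-Type", record_type),
        # Every record gets its own identifier; a UUID URN is one common choice.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # Length of the content block in bytes, not counting the trailing line breaks.
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers) + "\r\n"
    return head.encode("utf-8") + payload + b"\r\n\r\n"


# Record the HTTP request headers sent to a placeholder site.
request = b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"
record = build_warc_record(
    "request", "http://example.org/", request, "application/http; msgtype=request"
)
print(record.decode("utf-8"))
```

The same helper could be called with a different WARC-Type (for example "response" or "metadata") and a different content block, which is how one file can hold requests, responses and arbitrary metadata side by side.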


Tim Berners-Lee on the semantic web

This Wednesday (07/09/02008), Sir Tim Berners-Lee did a short interview with the BBC’s Today programme, in which he gave a brief yet easy-to-understand description of the semantic web. It’s definitely something that all librarians and archivists should be thinking about and planning for. We’ve done plenty of playing catch-up with Web 2.0, and I think we should take advantage of the fact that our gears are running fast enough now that we can start innovating for the future.


PDF is now an ISO standard

According to a DaniWeb blog post, the Adobe Portable Document Format (PDF) has been approved as a standard by the International Organization for Standardization (ISO) [ISO International Standard - ISO 32000-1]. This bodes well for the Digital Preservation Coalition’s April 2008 report on using PDF/A as a preservation format. One of the concerns about using this format was its lack of standard status, so Wednesday’s announcement gives the PDF format one more notch on its belt toward becoming a widely accepted standard for digital preservation.

One of my greatest misgivings about using PDF for preservation was its proprietary nature, but Kevin Lynch, Chief Technology Officer at Adobe, states that “By releasing the full PDF specification for ISO standardization, we are reinforcing our commitment to openness.” It’s a good trend and one that I hope continues.