Network Working Group                                           A. Boyko
Internet-Draft                                       Library of Congress
Expires: September 25, 2008                                     J. Kunze
                                              California Digital Library
                                                               L. Madden
                                                              J. Littman
                                                     Library of Congress
                                                          March 24, 2008


                 The BagIt File Package Format (V0.93)
      http://www.ietf.org/internet-drafts/draft-kunze-bagit-00.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 25, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Boyko, et al.          Expires September 25, 2008               [Page 1]

Internet-Draft                    BagIt                       March 2008

Abstract

   This document specifies BagIt, a hierarchical file package format for
   the exchange of generalized digital content.  A "bag" has just enough
   structure to safely enclose a brief "tag" and a payload but does not
   require any knowledge of the payload's internal semantics.  This
   BagIt format should be suitable for disk-based or network-based file
   package transfer.  One important use case is the possibility of
   eventual safe return of a received bag.
   Tag information consists of a small number of top-level reserved file
   names, checksums for transfer validation, and optional small metadata
   blocks.

1.  Introduction

   BagIt is a hierarchical file package format for the exchange of
   generalized digital content.  A "bag" has just enough structure to
   safely enclose a brief "tag" and a payload but does not require any
   knowledge of the payload's internal semantics.  This BagIt format
   should be suitable for disk-based or network-based file package
   transfer.  Use cases include long-term storage and the possibility of
   eventual safe return of a received bag.  Tag information consists of
   a small number of top-level reserved file names, checksums for
   transfer validation, and optional small metadata blocks.

   The name BagIt is inspired by the "enclose and deposit" method
   [ENCDEP], sometimes referred to as "bag it and tag it".

   In this document the word "directory" is used interchangeably with
   the word "folder", and all examples conform to Unix-based filesystem
   conventions, which should translate easily to Windows conventions
   after substituting the path separator ('\' instead of '/').  The
   BagIt format itself places no limitations on file and path lengths,
   so implementors thinking about maximal interoperation may wish to
   consider the issues listed in the Interoperability section of this
   document.

2.  BagIt Package Layout

   A "bag" consists of a base directory containing a sub-directory named
   "data/" that holds the payload and a set of top-level files
   comprising the "tag".  The base directory may have any name and the
   "data/" directory may contain an arbitrary file hierarchy.

      <base directory>/
      |   manifest-<algorithm>.txt
      |   bagit.txt
      |   [optional additional tag files]
      \--- data/
            |   [optional file hierarchy]

   The "tag" consists of one or more files named
   "manifest-_algorithm_.txt", a file named "bagit.txt", and zero or
   more additional files.  In top-level text files with ".txt"
   extension, each line should be terminated by a newline (LF) or
   carriage return plus newline (CRLF); in practice cautious programmers
   will also accept a carriage return by itself (CR) as a line
   terminator.  In all such tag files, text is assumed to be Unicode
   encoded as UTF-8 [RFC3629].

   The "bagit.txt" file should consist of exactly two lines,

      BagIt-Version: M.N
      Tag-File-Character-Encoding: UTF-8

   where M.N identifies the BagIt major (M) and minor (N) version
   numbers, and UTF-8 identifies the character set encoding of tag
   files.

2.1.  File Manifest

   One or more manifest files must be present.  A manifest is a
   top-level file with a name of the form manifest-_algorithm_.txt,
   where _algorithm_ is a string specifying a cryptographic checksum
   algorithm, such as

      manifest-md5.txt
      manifest-sha1.txt

   Implementors of tools that create and validate bags are strongly
   encouraged to support at least two widely implemented checksum
   algorithms: "md5" [RFC1321] and "sha1" [RFC3174].

   A manifest contains a complete list of payload files that must be
   present in a fully constituted bag.  Each line of a file
   manifest-_algorithm_.txt has the form

      CHECKSUM FILENAME

   where FILENAME is the pathname of a payload file relative to the base
   directory and CHECKSUM is a base64-encoded checksum calculated
   according to _algorithm_ over the file's contents.  Because pathnames
   are relative to the base directory, every payload FILENAME listed
   begins "data/...".  Any tag (top-level) FILENAME may optionally
   appear in a manifest.  One or more linear whitespace characters
   (spaces or tabs) separate the CHECKSUM and FILENAME.
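
   The manifest line format above can be sketched in code.  The
   following Python fragment is a minimal illustration, not part of the
   format: the function names are hypothetical, the checksum is treated
   as an opaque string, and it assumes that leading and trailing
   whitespace on a line is not significant.  Only the first run of
   spaces or tabs is treated as the separator, so that a FILENAME
   containing spaces survives intact.

```python
import re


def parse_manifest_line(line):
    """Split one manifest line into (checksum, filename).

    Hypothetical helper: the checksum is returned as an opaque string
    and is neither decoded nor verified here.
    """
    # Assumption: whitespace at either end of the line is not significant.
    text = line.strip(" \t\r\n")
    # Only the first run of spaces/tabs separates CHECKSUM from FILENAME;
    # any later whitespace belongs to the filename itself.
    parts = re.split(r"[ \t]+", text, maxsplit=1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"malformed manifest line: {line!r}")
    return parts[0], parts[1]


def parse_manifest(text):
    """Return a dict mapping each FILENAME to its CHECKSUM."""
    entries = {}
    # Accept LF, CRLF, or a lone CR as the line terminator.
    for line in re.split(r"\r\n|\r|\n", text):
        if line.strip(" \t"):
            checksum, filename = parse_manifest_line(line)
            entries[filename] = checksum
    return entries
```

   Splitting only once is a deliberate choice here: splitting on every
   whitespace run would break payload names such as
   "data/Collection Overview.txt".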

   Sometimes it is desirable to record a checksum for a tag file without
   listing it in a manifest, but using instead an accompanying _tag
   checksum file_.  Applications include recording a checksum for the
   file manifest itself or for a tag file added after the manifest was
   received.  The name of a tag checksum file for tag file _tfname_ has
   the form _tfname_._algorithm_.  For example, a tag checksum file
   using MD5 over "manifest-sha1.txt" would have the name

      manifest-sha1.txt.md5

   A tag checksum file contains a single line having the same form
   (CHECKSUM FILENAME) and semantics as a file manifest.  In essence, it
   is a one-line manifest listing the base64-encoded checksum calculated
   according to _algorithm_ over the contents of _tfname_.

2.2.  Valid Bags and Complete Bags

   A bag is considered _valid_ if it is _complete_ and if each CHECKSUM
   in every manifest can be verified against the contents of its
   corresponding FILENAME.  A bag is considered _complete_ if every
   manifest covers the same set of files and every file in the payload
   is listed in every manifest.  This means that a bag is complete (a)
   if the set of files listed in any one manifest is identical to the
   set of files listed in every manifest, (b) if every payload and tag
   file listed in every manifest is present, and (c) if every file
   present in the payload is listed in every manifest.  Hence tag files
   do not need to be listed in the manifest(s), but in a complete bag
   any tag files appearing in one manifest must appear in all manifests.

   For reasons of efficiency, a bag may be sent with a list of files to
   be fetched and added to the payload before it can meaningfully be
   checked for completeness.  An optional top-level file named
   "fetch.txt", if present, contains such a list.  Each line of
   "fetch.txt" has the form

      URL LENGTH FILENAME

   where URL identifies the file to be fetched, LENGTH is the number of
   octets in the file (or "-" to leave it unspecified), and FILENAME
   identifies the corresponding payload file.  One or more linear
   whitespace characters (spaces or tabs) separate these three values,
   and any such characters in the URL must be hex-encoded.

   Because "fetch.txt" lists files that are absent from a sent bag,
   receivers that are storing completed bags will want some way to
   record that the bag no longer needs completing, such as renaming this
   file (e.g., to "fetch-orig.txt") or changing a database flag.
   Receipt of a bag is not final until all such files are fetched.  The
   receiver of a bag with a "fetch.txt" tag file is expected promptly to
   complete the bag by fetching all URL-identified components as the
   sender is not bound to make the absent components available
   indefinitely.

   It is often practical to transmit a bag with "holes", that is, with a
   "fetch.txt" file, since it obviates the need for the sender to create
   a large serialized copy of the content and stage that content until
   the bag is transferred to the receiver.  Also, this method allows a
   sender to construct a bag from components that are either a subset of
   logically related components (e.g., the localized logical object
   could be much larger than what is intended for export) or assembled
   from logically distributed sources (e.g., the object components for
   export are not stored locally under one filesystem tree).

3.  Other BagIt Metadata: package-info.txt

   Any other tag files are considered to be package information separate
   from the payload content.  The "data/" directory is the custodial
   focus of a bag, and the top-level files comprising the tag are
   intended to facilitate and document the transfer.
   The tag could also be used to help in returning the bag to its sender
   at some point in the future.

   Tag information is optional.  If present, tag information at a
   minimum consists of a package-info.txt file.  This is a text file
   intended primarily for human readability using email-style headers
   [RFC2822].  It is recommended that lines not exceed 79 characters in
   length.  As mentioned earlier, text is assumed to be Unicode encoded
   as UTF-8.

   The package-info.txt file contains metadata elements describing the
   overall package.  It looks like this.

      Source-Organization: Spengler University
      Organization-Address: 1400 Elm St., Cupertino, California, 95014
      Contact-Name: Edna Janssen
      Contact-Phone: +1 408-555-1212
      Contact-Email: ej@spengler.edu
      External-Description: Uncompressed greyscale TIFF images from the
           Yoshimuri papers colle...
      Delivery-Date: 2008-01-15
      External-Identifier: spengler_yoshimuri_001
      Package-Size: 260 GB
      Bag-Group-Identifier: spengler_yoshimuri
      Bag-Count: 1 of 15
      Internal-Sender-Identifier: /storage/images/yoshimuri
      Internal-Sender-Description: Uncompressed greyscale TIFFs created
           from microfilm and are...

   All elements are provided as clues to ease handling on the sender and
   receiver ends.  No particular relationship between the sender
   organization and the payload content is assumed; for example, the
   sender may be a content aggregator, redistributor, collector,
   curator, or producer.  Reserved element names are case-insensitive
   and defined as follows.

   Source-Organization
      Organization transferring the content.

   Organization-Address
      Mailing address of the organization.

   Contact-Name
      Person at the source organization who is responsible for the
      content transfer.

   Contact-Phone
      International format telephone number of person or position
      responsible.

   Contact-Email
      Fully qualified email address of person or position responsible.

   External-Description
      A brief explanation of the contents and provenance.

   Delivery-Date
      Date (YYYY-MM-DD) that the content is being transferred.

   External-Identifier
      A sender-supplied identifier for the package.  This identifier
      must be unique across the sender's content, and if recognizable as
      belonging to a globally unique scheme, the receiver should make an
      effort to honor reference to it.

   Package-Size
      Size or approximate size of the package being transferred,
      followed by an abbreviation such as MB (megabytes), GB, or TB; for
      example, 42600 MB, 42.6 GB, or .043 TB.

   Bag-Group-Identifier
      (optional) A sender-supplied identifier for the set, if any, of
      bags to which it logically belongs.  This identifier must be
      unique across the sender's content, and if recognizable as
      belonging to a globally unique scheme, the receiver should make an
      effort to honor reference to it.

   Bag-Count
      (optional) Two numbers separated by "of", in particular, "N of T",
      where T is the total number of bags in a group of bags and N is
      the ordinal number within the group; if T is not known, specify it
      as "?" (question mark).  Examples: 1 of 2, 4 of 4, 3 of ?, 89 of
      145.

   Internal-Sender-Identifier
      (optional) An alternate sender-specific identifier for the content
      and/or package.  This value may be useful to senders who may
      retrieve the content in the future.  For instance, it might
      contain values that are relevant to the re-use of the content at
      the sender's organization.

   Internal-Sender-Description
      (optional) A sender-local prose description of the contents of the
      package, to assist in later use if returned to the sender.

   Arbitrary other package metadata elements may follow these elements.
   Such elements could be used to describe the payload in ways intended
   for the sender in case of bag return.

4.  Network Transfer and Serialization

   When sending a bag over a network, in some scenarios it is convenient
   for the sender first to serialize the filesystem hierarchy
   representing the bag (the outermost base directory) into a
   single-file archive format such as TAR or ZIP.  After receiving the
   resulting aggregate file, which we will call a _serialization_, the
   receiver deserializes it to recreate the filesystem hierarchy.
   Several rules govern the serialization of a BagIt bag and apply
   equally to TAR or ZIP archive files:

   1.  One and only one bag is contained in one serialization.

   2.  The serialization has the same name as the bag's base directory,
       but with an extension added to identify the format; for example,
       the receiver of "mybag.tar.gz" expects the corresponding base
       directory to be created as "mybag".

   3.  A bag is never serialized from within its base directory, but
       from the parent of the base directory (where the base directory
       appears as an entry).  Thus, after a bag is deserialized in an
       empty directory, a listing of that directory shows exactly one
       entry.  For example, deserializing "mybag.zip" in an empty
       directory causes the creation of the base directory "mybag" with
       the payload and all the tag files deserialized beneath it.

   4.  One un-archiving (deserialization) step produces a single base
       directory bag with the top-level structure as described in this
       document without requiring an additional un-archiving step.  For
       example, after one un-archiving step it would be an error for the
       "data/" directory to appear as "data.tar.gz".  TAR and ZIP files
       may appear inside the payload beneath the "data/" directory,
       where they would be treated opaquely along with any other payload
       file or directory.

   When packaging a bag in an archive file format, care must be taken to
   ensure that the format's restrictions on file naming, such as
   allowable characters, length, or character encoding, will support the
   receiver requirements of the bag being packaged.
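
   The serialization rules above lend themselves to a mechanical check
   before a serialization is unpacked.  The Python fragment below is a
   rough sketch under stated assumptions, not a definitive validator:
   the function names and the list of archive extensions are
   hypothetical (this document names only TAR and ZIP), and the check
   works on a plain list of archive member names using "/" as the
   separator.

```python
import os

# Hypothetical list of recognized extensions; the draft names only the
# TAR and ZIP formats and does not enumerate file extensions.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar", ".zip")


def expected_base_directory(archive_path):
    """Return the base directory name implied by the archive name (rule 2)."""
    name = os.path.basename(archive_path)
    for suffix in ARCHIVE_SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"unrecognized archive extension: {name!r}")


def split_member(name):
    """Split a member name into path components, ignoring '.' and empty parts."""
    return [part for part in name.split("/") if part not in ("", ".")]


def check_serialization(archive_path, member_names):
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    base = expected_base_directory(archive_path)
    members = [split_member(name) for name in member_names]

    # Rules 1-3: exactly one top-level entry, named after the archive.
    top_level = {parts[0] for parts in members if parts}
    if top_level != {base}:
        problems.append(
            f"expected a single top-level entry {base!r}, "
            f"found {sorted(top_level)}"
        )

    # Rule 4: "data" must already appear beneath the base directory,
    # not as a nested archive such as "data.tar.gz".
    if not any(parts[:2] == [base, "data"] for parts in members):
        problems.append(f"no 'data' entry found beneath {base!r}")

    return problems
```

   In practice the member names might come from the Python standard
   library, for example zipfile.ZipFile(path).namelist() or
   tarfile.open(path).getnames().  Checking names before extraction
   lets a receiver reject a malformed serialization without writing
   anything to disk.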

   The mechanics of sending and receiving bags over networks are out of
   scope of the present document and may be facilitated by protocols
   such as [GRABIT].

5.  Example Bag

   Here's a bag of material resulting from a hypothetical web harvest.
   Lines of file content are shown in parentheses beneath the file name,
   with long lines continued indented on subsequent lines.  This bag is
   not completely retrieved, of course, until every component listed in
   the fetch.txt file is retrieved.

   mybag/
   |
   |   manifest-md5.txt
   |    (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt)
   |    (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt)
   |
   |   fetch.txt
   |    (http://WB20.Stanford.Edu/gov-06-2006-ARC/gov-20060601-oth-050019.arc.gz
   |       26583985 gov-20060601-oth-050019.arc.gz)
   |    (http://WB20.Stanford.Edu/gov-06-2006-ARC/gov-20060601-img-100002.arc.gz
   |       99509720 gov-20060601-img-100002.arc.gz)
   |    ( ................................................................ )
   |
   |   package-info.txt
   |    (Source-organization: California Digital Library)
   |    (Organization-address: 415 20th Street, 4th Floor, Oakland, CA.
   |       94612)
   |    (Contact-name: A. E. Newman)
   |    (Contact-phone: +1 510-555-1234)
   |    (Contact-email: alfred@ucop.edu)
   |    (External-Description: The collection "Local Davis Flood Control
   |       Collection" includes captured California State and local
   |       websites containing information on flood control resources for
   |       the Davis and Sacramento area.  Sites were captured by UC Davis
   |       curator Wrigley Spyder using the Web Archiving Service in
   |       February 2007 and October 2007.)
   |    (Delivery-date: 2008-04-15)
   |    (External-identifier: ark:/13030/fk4jm2bcp)
   |    (Package-size: about 22Gb)
   |    (Internal-sender-identifier: UCDL)
   |    (Internal-sender-description: University of California Davis
   |       Libraries)
   |
   |   bagit.txt
   |    (BagIt-version: 0.9)
   |    (Tag-File-Character-Encoding: UTF-8)
   |
   \--- data/
      |
      |   Collection Overview.txt
      |    (... narrative description ...)
      |
      |   Seed List.txt
      |    (... list of crawler starting point URLs ...)
      ....

6.  Interoperability: Windows and Unix File Naming

   Besides the fundamental difference between path separators ('\' and
   '/'), generally, Windows filesystems have more limitations than Unix
   filesystems.  Windows path names have a maximum of 255 characters,
   and none of these characters may be used in a path component:

      < > : " / | ? *

   Windows also reserves the following names: CON, PRN, AUX, NUL, COM1,
   COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3,
   LPT4, LPT5, LPT6, LPT7, LPT8, and LPT9.  See [MSFNAM] for more
   information.

7.  Security Considerations

   The BagIt package format poses no direct risk to computers and
   networks.  Implementors of tools that complete bags by retrieving
   URLs listed in a "fetch.txt" file need to be aware that some of those
   URLs may point to hosts, intentionally or unintentionally, that are
   not under control of the bag's sender.  Checksum algorithms are
   designed to protect against corruption and spoofing in bag transfer,
   but they are not a guarantee.

8.  References

   [ENCDEP]   Tabata, K., "A Collaboration Model between Archival
              Systems to Enhance the Reliability of Preservation by an
              Enclose-and-Deposit Method", 2005.

   [GRABIT]   NDIIPP/CDL, "The GrabIt Package Exchange Protocol", 2008.

   [MSFNAM]   Microsoft, "Naming a File", 2008.

   [RFC1321]  Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
              April 1992.

   [RFC2822]  Resnick, P., "Internet Message Format", RFC 2822,
              April 2001.

   [RFC3174]  Eastlake, D. and P. Jones, "US Secure Hash Algorithm 1
              (SHA1)", RFC 3174, September 2001.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

Appendix A.  Change History

   (This appendix to be removed in the final draft.)

A.1.  Changes from V0.91 Draft, 2008.03.14

   Added a "Security Considerations" section, as required for
   Internet-Drafts.

   Per Andy, added "tag checksum file" concept for files that aren't in
   the manifest (eg, the manifest itself, or files added after manifest
   was received and you don't want to re-build the manifest).

   Added Andy's MD5 and SHA-1 references.  Decided for now to leave out
   the TAR reference to avoid POSIX/GNU religious wars, and for symmetry
   left off the ZIP reference too.  Also, softened encouragement to
   serialize bags.

   Tightened wording in metadata section to split elements into
   recommended and optional.  It is now recommended that lines not
   exceed 79 characters in length.  Dates should use YYYY-MM-DD format.
   Added country code to phone number in example.

   Small edits to Abstract.  Formatting changes to this Appendix.

A.2.  Changes from V0.9 Draft, 2008.03.14

   Any run of one or more horizontal whitespace characters can separate
   values in manifest.txt and fetch.txt.

   Reduced protocol discussion to a one-sentence reference and changed
   the enclosing section name to Network Transfer and Serialization.

   Removed second method of conveying checksums (for tag files) based on
   preliminary acceptance per call of 3/14.  Andy will try to confirm
   that this preserves the group consensus from call of 3/12.

   Miscellaneous small edits.

A.3.  Changes from 2008.03.12 Draft

   Added Bill LeFurgy's edits (removing "disposable" and "archival"
   words), and a statement that BagIt can provide a way to store
   exchanged content.

   To network transfer section added requirements of serialization (is
   that the right word) for single-archive format files (per E.
   Hetzner).

   Added back to bagit.txt a clarified Tag-File-Character-Encoding
   statement, and suggested that defensive programmers accept an
   isolated CR as line terminator (per S. Abrams).

   Wrestled with tag file checksums as per call of 3/12 and Justin's
   suggestions, and think there's now a far simpler approach given that
   the manifest lists files relative to the base directory.

   Left in but softened the idea of altering the name "fetch.txt" after
   files have been fetched.  This is important because the presence of
   the filename is essentially an instruction.

   More tightening of archive file guidelines.

A.4.  Changes from 2008.03.04 Draft

   According to the spirit of Justin's version, tightened wording and
   definitions, returned to a manifest that doesn't straddle a list of
   files and URLs, defined concepts of "complete" and "valid", removed
   the dubious Character-Encoding from bagit.txt.

   BagIt is clearly for exchange between storage systems A and B, but I
   think it's too much for BagIt to be about how something is stored on
   system A or B.  I think it suffices to say that BagIt is designed to
   facilitate the possibility of eventual safe return of a received bag.
   With that in mind, BagIt clearly provides a natural way to store a
   received bag, but trying to dictate local storage layout doesn't buy
   us anything, may discourage potential adopters, and is impossible to
   police.
   And as validation goes, BagIt isn't more useful for long-term ongoing
   validation than any other self-contained structure, because trusted
   checksums are generally not stored in the same structure (or bag) as
   the stuff they're validating (ie, they should be held distant from
   the content to be validated).

   Reworded to reduce formalisms, eg, "profiles", "conformance", heavily
   layered and numbered sections.  Also, I'd like to keep extensibility
   less explicit than profiles suggest, which is an invitation to
   complexity and non-interoperation in my experience.

   Per 3/7 phone call, eliminated 3 metadata elements: Access-Level,
   Metadata-Included-In-Package, and Metadata-Description.

   Per Erik Hetzner's feedback, tightened archive file guidelines.

   Miscellaneous wording improvements and typos.

A.5.  Changes from 2008.02.28 Draft

   Per phone call of 2/29, changed holes.txt to urls-md5.txt.

   Tried tightening language around the concept of a "manifest", which
   is really the sum of the files list and the URLs list.  A bag isn't
   completely received until all the URLs (and files) are accounted for.
   Propose changing manifest.txt to files-md5.txt, and declaring that
   "the manifest" = files-md5.txt + urls-md5.txt.

   Please review the network transfer considerations section.  I tried
   to consolidate and shorten some discussion that may have been vague
   or redundant, but I may have removed something that was important.

   Per Stephen Abrams' suggestion, there's now a bagit.txt file that
   specifies the BagIt version number and UTF-8 encoding.

   Metadata elements do not contain spaces (more RFC822-compliant, per
   Stephen); declared them to be case-insensitive.

A.6.  Changes from 2008.02.20 Draft

   Based on 2/22 phone call, looked for new way of expressing file
   manifest separate from URL manifest.
   The proposal here is to have the file manifest revert to the original
   format, and add a new optional manifest file that specifies the
   "holes".  A bag with "holes" isn't complete until the "holes are
   filled".  To help fill the holes, each manifest line has a checksum,
   length, filename, and URL.  The "filename" in this case is necessary
   in case a bag is returned (so the original sender knows what
   component is what), but the "filename" does not specify where the
   receiver must store the content.

   Added language to the definition of External Identifier to allow it
   to suggest a globally unique reference: "if the identifier is
   recognizable from a globally unique scheme, the receiver should make
   an effort to permit the package to be referenced by this identifier."

   Added an Example section.

   Added a concept of Bag Group and Bag Count (within a group) as per
   2/22 phone call and comments from Mark Phillips.

   As per conference call, stopped using the terms "packager" or
   "producer" for the metadata, using instead "source organization" and
   "internal sender identifier" and "internal sender description".

   Incorporated Andy's prose on archive formats and transfer
   parallelism.

   Miscellaneous edits arising from Mark Phillips' comments.

Authors' Addresses

   Andy Boyko
   Library of Congress
   101 Independence Avenue SE
   Washington, DC  20540
   USA

   Fax:   +1 202-707-1957
   Email: aboy@loc.gov


   John A. Kunze
   California Digital Library
   415 20th St, 4th Floor
   Oakland, CA  94612
   US

   Fax:   +1 510-893-5212
   Email: jak@ucop.edu


   Liz Madden
   Library of Congress
   101 Independence Avenue SE
   Washington, DC  20540
   USA

   Fax:   +1 202-707-1957
   Email: emad@loc.gov


   Justin Littman
   Library of Congress
   101 Independence Avenue SE
   Washington, DC  20540
   USA

   Fax:   +1 202-707-1957
   Email: jlit@loc.gov

Full Copyright Statement

   Copyright (C) The IETF Trust (2008).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).