Network Working Group                                            A. Main
Internet-Draft: draft-main-magic-00                        Black Ops Ltd
Category: Best Current Practice                             October 2001
Expires: April 2002

                  Care and Feeding of Magic Numbers

Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Abstract

   This memo describes techniques for the use of magic numbers in a
   multimedia context, for the in-band identification of digital file
   formats. Specific recommendations are made concerning the use of
   magic numbers in newly developed file formats.

1 Introduction

   There have historically been four main ways to determine the format
   of a digital data object. In decreasing order of desirability, they
   are:

   1. Explicit indication in metadata accompanying the object. (E.g.,
      the "Content-Type" header in a MIME message indicates the format
      of the body [MIME-MSG].)

   2. From context, i.e., the way in which an object is being used.
      (E.g., passing a file to the `gunzip' program indicates that the
      file should be in `gzip' format.)

   3. Inference from examination of the data: different data formats
      look different.

   4. Implicit indication from the name under which a file is stored,
      in contexts where it is conventional to name files in a way that
      indicates their format.

   In operating systems that do not keep type metadata with a file,
   method 1 is not usually possible. For example, in Unix all files are
   typeless octet strings. In such operating systems, the collective
   wisdom has been to use a combination of methods 2 and 3 to support
   each other.

   More generally, out-of-band identification mechanisms (1, 2, and 4)
   are often not possible, not least because metadata tends to become
   detached from primary data. In-band identification (method 3) is the
   only file format identification mechanism that it is always possible
   to attempt.

   Because many non-textual file formats include some kind of
   fixed-format header, method 3 usually consists of examination of the
   beginning of the object to see what its header looks like. A
   convention has arisen of aiding this type of format identification by
   including in file formats header fields whose primary purpose is to
   assist in identifying the file format. These are known as `magic
   numbers'.

   Although the MIME system uses explicit type indication throughout,
   those developing MIME recognised the utility of other means of
   recognising file formats. [MIME-REG] section 2.2.9 encourages MIME
   media type registration documentation to include details of magic
   numbers and file naming conventions, among other optional data.
   Experience has shown the wisdom of this recommendation: it is not
   uncommon that, once a digital object has left the control of
   metadata-preserving MIME-based Internet protocols, its attendant type
   information is discarded in one way or another. There is also a
   problem in many cases when a file enters the realm of MIME-based
   protocols, of attaching the correct MIME type metadata.
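
   As an illustration of header examination (method 3) being used to
   attach MIME type metadata to an object that arrives without any, the
   following is a minimal sketch rather than a complete implementation.
   The helper name guess_media_type() and the table @KNOWN_FORMATS are
   assumptions made for this example; the table holds only a few
   well-known leading octet sequences and is far from exhaustive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative table only: each entry pairs the leading octets of a
# well-known format with the corresponding MIME media type.
my @KNOWN_FORMATS = (
    [ "\x89PNG\x0d\x0a\x1a\x0a", "image/png"  ],
    [ "GIF87a",                  "image/gif"  ],
    [ "GIF89a",                  "image/gif"  ],
    [ "\xff\xd8\xff",            "image/jpeg" ],
);

# guess_media_type() is a hypothetical helper.  It returns the media
# type of the first table entry whose magic number is found at offset
# zero, or the generic "application/octet-stream" if nothing matches.
sub guess_media_type {
    my ($data) = @_;
    for my $entry (@KNOWN_FORMATS) {
        my ($magic, $type) = @$entry;
        return $type if substr($data, 0, length $magic) eq $magic;
    }
    return "application/octet-stream";
}

print guess_media_type("\x89PNG\x0d\x0a\x1a\x0a\x00\x00\x00\x0dIHDR"), "\n";
print guess_media_type("plain text, no magic number\n"), "\n";
```

   A match here is only a guess that later processing should be
   prepared to reject; the fallback type makes no claim about the
   format at all.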

   With the recent general increase in the popularity of multimedia
   applications, and the corresponding proliferation of new media types,
   magic numbers are becoming more widely useful than ever. It therefore
   seems prudent to offer to the Internet community at large the common
   wisdom among Unix software engineers concerning magic numbers.

1.1 Requirements Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT",
   "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
   interpreted as described in [REQ-TERM].

1.2 Numerical Notation

   In this document, all numbers are given in decimal, except where
   otherwise indicated. A prefix "0x" indicates hexadecimal.

2 Use of Magic Numbers

   There are two basic purposes for which a magic number can be used: to
   guess the format of a file where it is not previously known, and to
   confirm a file format that was specified by other means. Other modes
   of use are combinations or extensions of these two.

2.1 Confirmation of File Format

   In the context of file format confirmation, a magic number isn't
   magical at all: it's simply a header field with fixed contents. Any
   file without the magic number trivially fails a header validity
   check. A magic number test used to confirm the file format is thus
   merely a partial file validity check, performed on a header field
   intended specifically for the purpose.

   By its nature, a magic number test is not a complete test of the
   validity of the data in a file. Even a complete syntactic check of a
   file cannot conclusively confirm that the file was originally
   intended to be interpreted in the file format that it is presently
   purported to be in. Therefore, strictly speaking, a magic number
   cannot positively confirm identification of a file format; it can
   only with certainty negate an identification.
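
   The confirmation test described above might be sketched as follows.
   This is an illustration under stated assumptions, not a prescribed
   implementation: confirm_format() is a hypothetical helper, and the
   octets in $EXAMPLE_MAGIC belong to an imaginary format invented for
   this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical 8-octet magic number for an imaginary format; it is not
# taken from any real specification.
my $EXAMPLE_MAGIC = "\xd7\x1e\xa3\x4b\x9c\xf2\x05\xe8";

# confirm_format() is an illustrative helper.  A false result negates
# the identification with certainty; a true result only means that the
# header field is consistent with the purported format.
sub confirm_format {
    my ($path, $magic) = @_;
    open(my $fh, "<", $path) or die "$0: Can't open $path: $!\n";
    binmode($fh);
    my $header = "";
    defined(read($fh, $header, length $magic))
        or die "$0: Can't read $path: $!\n";
    close($fh);
    return $header eq $magic;
}

# Check every file named on the command line against the example magic.
for my $path (@ARGV) {
    print "$path: ", (confirm_format($path, $EXAMPLE_MAGIC)
        ? "consistent with the purported format"
        : "not in the purported format"), "\n";
}
```

   A file shorter than the magic number simply fails the comparison,
   which is the desired outcome for a header validity check.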

   The usefulness of a magic number for the purpose of file format
   confirmation comes from its ability to provide a high degree of
   confidence in the format identification. The degree of confidence in
   a confirmation is directly correlated with the degree of probability
   that the magic number will not be found in a file of any other
   format. This confidence can therefore be increased by carefully
   engineering the magic number to increase the probability of correctly
   detecting a file format misidentification. See section 3 for more
   information.

2.2 Guessing File Format

   Magic numbers aren't magic: they can't generate a file format
   identification ex nihilo. All they can really do is confirm an
   existing guess about a file format, and even that is only
   probabilistic, as described in the previous section.

   Therefore, the process for guessing a file format using magic numbers
   consists of testing the file against a series of possible file
   formats' magic numbers to see which it matches. It is beneficial, in
   any context where correct file type identification is important, to
   minimise the number of file formats considered: the confidence in a
   magic-number-based file type identification depends on the chance of
   a file of unknown type having none of the magic numbers considered,
   and this chance decreases as the number of magic numbers used
   increases. The range of file formats to test against is inherently a
   context-dependent choice: most contexts will have a small number of
   meaningful file formats to be considered.

   For example, Unix operating systems have traditionally used magic
   numbers in the headers of their executable program file formats. When
   asked to execute a file, the operating system tests the purported
   executable against the various executable format magic numbers it
   knows about.
   When a match is found, this provides some assurance that the
   purported executable is indeed an executable appropriate for this
   system, as well as determining which type of executable file it is
   and therefore how to go about executing it. Incidentally, as a result
   of this procedure, the error message for an improperly-formatted
   executable file has on some versions of Unix been "Bad magic".

   There is a problem with the basic technique for guessing file formats
   based on magic numbers: it is possible for a single file to match the
   magic number requirements of more than one file format. In such a
   case, magic numbers would give no further insight into the file type.
   Among a set of cooperating file formats it is possible to completely
   avoid this problem by making their magic number requirements mutually
   incompatible; this is trivially achieved by giving them different
   magic number values to be stored at the same location in the file.

2.3 Detection of Corruption

   Magic numbers can be used to detect certain types of mangling of the
   data in a file, giving early indication that the data of interest in
   the file is not intact. This is really a special application of the
   technique of confirming an expected file format.

   Although file corruption due to transmission errors is now almost
   entirely a thing of the past, there are still some types of
   corruption that can occur due to mistakes, and that magic numbers can
   help to detect.

   Transmission of binary file formats through paths intended for text
   is as much a problem as it ever was. In addition to some octet values
   that just aren't handled by text gateways, and the historically-known
   problems of text being reformatted en route, this kind of
   misconfiguration can subject a file to unwanted character set
   conversions and newline format conversions.
   A magic number, particularly if it contains non-ASCII octet values,
   is likely to be damaged by such conversions.

   Magic numbers can also help to detect endianness errors. If a magic
   number is read as a numeric field, and the reader is interpreting
   numeric fields using a different endianness from that with which the
   file was written, then the magic number will appear to be incorrect,
   thus avoiding a potential silent misinterpretation of the rest of the
   file.

   The magic number for the PNG image format [PNG] takes this usage of
   magic numbers to an extreme. 4 of the first 8 octets of the PNG file
   format are intended, at least in part, to detect text-related
   manglings.

3 Putting Magic Numbers into New File Formats

   As should already be apparent, it is useful in several situations to
   have some chance of successful in-band file format identification. To
   this end, each new file format where it may be useful SHOULD have
   some kind of magic number.

   Magic numbers are useful in digital data objects, including
   particularly media objects, that are expected to be visible in more
   than one context. Formats for objects with more specialised use, such
   as the packets of a networked protocol, have less need for magic
   numbers. However, the same magic number techniques can still be
   reasonably used in such cases if there is no conflicting requirement,
   for example to make the packet as small as possible.

   A basic requirement for the usefulness of magic numbers is that
   different file formats with magic numbers MUST have different magic
   numbers.

   New magic number values SHOULD be completely unrelated to
   pre-existing magic numbers. It is common for the magic numbers of
   related file formats to be chosen to be similar, for example by
   having adjacent numeric values. Doing this reduces the effectiveness
   of the magic numbers, by making it more likely that arbitrary data of
   some other type will match one of the range of magic numbers.
   The chance of accidental collision of magic number with magic number
   or magic number with real data is minimised by having magic numbers
   for different file types be completely unrelated, and this is
   therefore RECOMMENDED.

3.1 Magic Numbers in Binary File Formats

   This section applies to file formats where the underlying format
   consists of a string of bits, which for convenience we divide into
   octets.

3.1.1 Recommended Placement

   It is desirable that as many file formats as possible should be
   mutually incompatible. This is achieved by them having different
   magic numbers at the same location within the file.

   By far the most common location for a magic number within a file is
   the very beginning, offset zero. This is also the most logical
   location for it, and also the easiest to read from. Therefore, any
   new binary file format SHOULD place its magic number at the very
   beginning of the file, offset zero.

3.1.2 Recommended Length

   Historically, many magic numbers have been very small, often only 2
   octets. At the time of writing, the most popular size is 4 octets.
   Both of these sizes are rather small. There is almost never a real
   need to save a few octets in a file header; mass storage is orders of
   magnitude cheaper than it was when 2-octet magic numbers were
   popular. It seems more worthwhile to spend a few more octets to
   minimise the likelihood of accidental magic number collision.
   Therefore, considering the increasing popularity of 64-bit computing
   hardware, new magic numbers SHOULD be 8 octets (64 bits) in length.

   Note: being a file compression format is no excuse for skimping on
   the magic number! If an object being compressed is so small that an
   extra few octets of magic number is really significant, then
   compression overheads will probably render the compression useless
   anyway.
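
   The effect of magic number length on accidental collisions can be
   made concrete with a little arithmetic. The sketch below assumes,
   purely for illustration, that the leading octets of an unrelated file
   are uniformly random (real data is not) and that every format under
   consideration has a distinct magic number of the same length at the
   same offset. The helper name false_match_chance() and the figure of
   100 formats are choices made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Chance that uniformly random leading octets match any one of $formats
# distinct magic numbers of $octets octets each, all tested at the same
# offset.  The matches are mutually exclusive, so the chances add.
sub false_match_chance {
    my ($octets, $formats) = @_;
    return $formats / 2 ** (8 * $octets);
}

# Compare the historical 2- and 4-octet sizes with the recommended 8.
for my $octets (2, 4, 8) {
    printf "%d octets, 100 formats: %.3g\n",
        $octets, false_match_chance($octets, 100);
}
```

   Under these assumptions, testing against 100 formats gives about one
   chance in 655 of a false match with 2-octet magic numbers and about
   one in 43 million with 4-octet magic numbers. With 8 octets the
   chance becomes negligible. The same calculation shows why section 2.2
   advises keeping the set of candidate formats small.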

3.1.3 Nature of the Magic Number

   Any usable file format specification should specify the layout of a
   file right down to the octet level, and the magic number field is no
   exception. It is not sufficient to merely specify a 64-bit (or
   however large) number and state that it is stored at a particular
   offset within the file; it is necessary to specify exactly what the
   octet values are in the magic number field. Of course, if a file
   format specification first establishes a convention for the
   representation of numerical fields (big endian, little endian, or
   anything else), then simply specifying a large number to place in the
   magic number field will be sufficiently unambiguous.

   To summarise: file format specifications MUST specify the contents of
   the magic number field sufficiently clearly to determine the exact
   sequence of octets that fill that field. This specification SHOULD be
   in the form of an explicit list of octet values.

3.1.4 Selecting a Magic Number

3.1.4.1 Requirements

   The basic requirement on a magic number is that it look different
   from as many other file formats as possible. This can be divided into
   two requirements: it should be different from all other magic
   numbers, and it should look different from non-magic-number data
   formats (principally text formats).

   There is a popular but misguided technique of selecting meaningful
   ASCII character values to make up a magic number. For example, a
   popular Unix archival file format uses the ASCII characters "!<arch>"
   as its magic number. This kind of magic number is very poor, because
   by definition the magic number test can be satisfied by a plain ASCII
   text file. In many cases, the sequence of characters chosen has been
   one particularly likely to occur naturally at the beginning of a text
   file.

   There is, of course, no technical requirement for the first few
   octets of a binary-format file to contain text characters.
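
   The point made in section 3.1.3, that a bare number does not by
   itself determine the octets stored in the file, can be demonstrated
   with a short sketch. The 32-bit value used here is hypothetical, and
   octet_list() is a helper invented for this example; pack() with the
   "N" and "V" templates is the standard Perl way of writing a 32-bit
   integer in big-endian and little-endian order respectively.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same hypothetical 32-bit number under two numeric conventions.
my $number        = 0xd71ea34b;
my $big_endian    = pack("N", $number);   # most significant octet first
my $little_endian = pack("V", $number);   # least significant octet first

# Render an octet string as an explicit list of octet values, the form
# of specification that section 3.1.3 recommends.
sub octet_list {
    my ($octets) = @_;
    return join(" ", map { sprintf("0x%02x", ord($_)) } split(//, $octets));
}

print "big endian:    ", octet_list($big_endian), "\n";
print "little endian: ", octet_list($little_endian), "\n";
```

   The two octet sequences are reverses of each other, so a
   specification that gives only the number leaves two incompatible
   readings open, while an explicit octet list leaves none.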

3.1.4.2 Magic Numbers for Related File Formats

   Historically, some file formats have been deliberately ambiguous
   about octet ordering in numerical fields. They have used the native
   ordering on whatever system the file was intended to be used on.
   Often the magic number field was handled the same way: it was a
   numeric field, written in the native numeric format, and so
   recognition of the magic number indicated implicitly that the reader
   was reading in the right numeric format.

   Another view is that such file formats actually define two (or more)
   variant file formats, differing only in numeric format and in the
   contents of the magic number field. This leads to the use of the
   magic number field to detect the numeric format that should be used
   to interpret the rest of the file. Designing file formats like this
   is not recommended, but the accompanying magic number technique is
   good.

   Where a file format has variants that, apart from the magic number,
   differ only in the format of numeric fields, the contents of the
   magic number field MAY be varied in the same way, but in any case
   MUST vary in some way.

3.1.4.3 Recommended Selection Criteria

   To give the best possible chance of a magic number being different
   from other magic numbers, and to look as little like other structured
   data formats as possible, magic numbers SHOULD be selected randomly.
   Randomness of cryptographic strength is not necessary, but the
   randomness source should be statistically unbiased.

   To avoid accidentally generating a magic number that happens to look
   like a textual file format or is in other ways weak, randomly
   selected magic numbers SHOULD be filtered according to the following
   criteria:

   o There should be no adjacent identical octets.
     Non-random data is relatively likely to have such patterns, and
     this requirement also ensures that the magic number can't possibly
     be unchanged if the file is improperly byte-swapped or similarly
     mangled.

   o At least 50% of the octets should have the most significant bit
     set. This ensures that the magic number cannot be mistaken for
     ASCII text, and is highly unlikely to look like text in any ASCII
     extension character set (such as ISO-8859-1), where most of the
     text tends to be in the ASCII range. It also ensures that mangling
     that strips off the most significant bit of each octet will be
     detected.

   o At least 75% of the octets should be outside the ASCII printable
     range. This minimises the chance of clashing with an
     ASCII-compatible character set.

   o There should be at least one octet in the ASCII printable range; at
     least one in the non-ASCII printable range of the ISO-8859
     character sets; and at least one that is a control character in the
     ISO-8859 character sets, other than 0x09, 0x0a, 0x0c, and 0x0d
     (which are the only control characters that commonly occur in plain
     text).

   o The magic number should not be a valid substring of UTF-8.
     Fortunately UTF-8 is quite highly structured, by design, so it is
     easy to eliminate the possibility of a clash.

   o The octet-reverse of the magic number should also meet all of the
     above criteria. This is to support the dual octet ordering
     technique described in section 3.1.4.2.

   These filtering rules provide some 1.16*2^62 acceptable 8-octet magic
   numbers (approximately 29.0% of all 64-bit values), and 1.16*2^29
   acceptable 4-octet magic numbers (14.5% of all 32-bit values).

3.1.4.4 Magic Number Selection Program

   This Perl program can be used to generate high-quality magic numbers
   using the generation rules given in the previous section.

   #!/usr/bin/perl -w

   $length = $ARGV[0] || 8;
   $length >= 4 or die "$0: Magic must be at least 4 octets\n";
   open(STDIN, "/dev/urandom")
       or die "$0: Can't open /dev/urandom: $!\n";

   sub not_utf8($) {
       ($_[0]."\x80\x80\x80\x80\x80") !~ /\A[\x80-\xbf]{0,5}(
           [\x00-\x7f]|
           [\xc0-\xdf][\x80-\xbf]|
           [\xe0-\xef][\x80-\xbf]{2}|
           [\xf0-\xf7][\x80-\xbf]{3}|
           [\xf8-\xfb][\x80-\xbf]{4}|
           [\xfc-\xfd][\x80-\xbf]{5}
       )*\x80{0,5}\z/sx;
   }

   while(1) {
       sysread(STDIN, $magic, $length)
           or die "$0: /dev/urandom: $!\n";
       length($magic) == $length or die "$0: Short read\n";
       # no repeated octets
       $magic =~ /(.)\1/s and next;
       # at least 50% high-half
       $_ = $magic;
       $high = 0;
       s/[\x80-\xff]/$high++, "h"/seg;
       next unless $high*2 >= $length;
       # at least 75% not ASCII printable
       $_ = $magic;
       $asc = 0;
       s/[\x20-\x7e]/$asc++, "a"/seg;
       next if $asc*4 > $length;
       # at least one ASCII printable
       $magic =~ /[\x20-\x7e]/s or next;
       # at least one high-half ISO-8859 printable
       $magic =~ /[\xa0-\xff]/s or next;
       # at least one ISO-8859 control character
       $magic =~ /[\x00-\x08\x0b\x0e-\x1f\x7f-\x9f]/s or next;
       # not a substring of UTF-8
       not_utf8($magic) or next;
       not_utf8(reverse($magic)) or next;
       last;
   }

   $magic =~ s/(.)/sprintf("0x%02x ", ord($1))/seg;
   $magic =~ s/ $/\n/;
   print $magic;

3.2 Magic Numbers in Textual File Formats

   This section applies to file formats where the underlying format
   consists of a string of characters, which are in turn encoded as
   plain text using some charset.

3.2.1 Recommended Placement

   Similar considerations apply as apply with binary file formats. The
   most common location, the most logical, and the easiest to read from,
   is the very beginning of the file. Therefore, any new textual file
   format SHOULD place its magic number at the very beginning of the
   file, offset zero.
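
   Looking back at the binary selection rules of section 3.1.4.3, the
   same filters can also be applied as a checker for a value that has
   already been chosen. The following is a sketch only:
   failed_criteria() is a hypothetical helper, the rule names it
   returns are invented for this example, and the candidate octets are
   not taken from any real format. The UTF-8 test reuses the regular
   expression from the program in section 3.1.4.4.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same UTF-8 substring test as the generator in section 3.1.4.4.
sub not_utf8 {
    my ($octets) = @_;
    return ($octets . "\x80\x80\x80\x80\x80") !~ /\A[\x80-\xbf]{0,5}(
        [\x00-\x7f]|
        [\xc0-\xdf][\x80-\xbf]|
        [\xe0-\xef][\x80-\xbf]{2}|
        [\xf0-\xf7][\x80-\xbf]{3}|
        [\xf8-\xfb][\x80-\xbf]{4}|
        [\xfc-\xfd][\x80-\xbf]{5}
    )*\x80{0,5}\z/sx;
}

# Returns the list of section 3.1.4.3 rules that $magic fails; an empty
# list means the value passes every filter.
sub failed_criteria {
    my ($magic) = @_;
    my $length = length $magic;
    my $high = () = $magic =~ /[\x80-\xff]/g;   # octets with top bit set
    my $asc  = () = $magic =~ /[\x20-\x7e]/g;   # ASCII printable octets
    my @failed;
    push @failed, "adjacent identical octets" if $magic =~ /(.)\1/s;
    push @failed, "under 50% high-bit octets" if $high * 2 < $length;
    push @failed, "over 25% ASCII printable"  if $asc * 4 > $length;
    push @failed, "no ASCII printable octet"  if $asc == 0;
    push @failed, "no high-half printable"    if $magic !~ /[\xa0-\xff]/;
    push @failed, "no unusual control octet"
        if $magic !~ /[\x00-\x08\x0b\x0e-\x1f\x7f-\x9f]/;
    push @failed, "valid UTF-8 substring"
        if !not_utf8($magic) || !not_utf8(scalar reverse $magic);
    return @failed;
}

# Hypothetical candidate value, checked and reported.
my $candidate = "\xd7\x1e\xa3\x4b\x9c\xf2\x05\xe8";
my @failed = failed_criteria($candidate);
print @failed ? "rejected: " . join(", ", @failed) . "\n" : "acceptable\n";
```

   Only the UTF-8 rule is checked on the reversed value, as in the
   generator, because the other rules give the same answer for a value
   and its octet-reverse.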

3.2.2 Selecting a Magic Number

3.2.2.1 Recommended Selection Criteria

   As with binary magic numbers, textual magic numbers SHOULD be
   selected randomly.

   To avoid accidentally generating a magic number that happens to look
   like natural text, randomly selected textual magic numbers SHOULD be
   filtered according to the following criteria:

   o There should be no adjacent identical characters. Non-random data
     is relatively likely to have such patterns.

   o There should be at least one non-alphanumeric character.

   The set of characters from which the magic number is generated
   depends on the requirements of the particular file format: different
   formats have different underlying character sets, and different
   readability and editability constraints. Magic numbers SHOULD be
   selected from as wide a character set as is possible subject to such
   requirements.

3.2.2.2 Recommended Length

   It is RECOMMENDED that the length of a textual magic number be chosen
   to match the number of magic numbers available in binary formats.
   This length necessarily varies with the character set to which the
   magic number is limited.

   In the case of selecting a magic number from the ISO-646 graphical
   characters, which have the best possible chance of being
   representable in any character set encountered in practice, there are
   82 characters available. This yields potential information content of
   6.36 bits per character. The filtering rules in section 3.2.2.1
   provide some 1.25*2^63 acceptable 10-character magic numbers
   (approximately 84% of all 10-character sequences), and 1.24*2^31
   acceptable 5-character magic numbers (72% of all 5-character
   sequences).

3.2.2.3 Magic Number Selection Program

   This Perl program can be used to generate high-quality textual magic
   numbers using the generation rules given in section 3.2.2.1.
   It uses only ISO-646 graphical characters, which should be acceptable
   to the widest possible variety of applications; when designing a file
   format that requires non-ISO-646 characters anyway, it may be desired
   to adapt this program to use a correspondingly wider selection of
   characters.

   #!/usr/bin/perl -w

   $length = $ARGV[0] || 10;
   $length >= 1 or die "$0: Magic must be at least 1 character\n";
   open(STDIN, "/dev/urandom")
       or die "$0: Can't open /dev/urandom: $!\n";

   $charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
              "abcdefghijklmnopqrstuvwxyz".
              "0123456789!\"%&'()*+,-./:;<=>?_";

   while(1) {
       sysread(STDIN, $magic, $length)
           or die "$0: /dev/urandom: $!\n";
       length($magic) == $length or die "$0: Short read\n";
       # map octets to characters without bias: octet values at or
       # above 3*length($charset) would over-represent the first few
       # characters, so they are marked "#" and the attempt is dropped
       $magic =~ s/(.)/ord($1) >= 3*length($charset) ? "#" :
           substr($charset, ord($1)%length($charset), 1)/seg;
       $magic =~ /#/ and next;
       # no repeated characters
       $magic =~ /(.)\1/s and next;
       # at least one non-alphanumeric
       $magic =~ /[^A-Za-z0-9]/s or next;
       last;
   }

   print $magic, "\n";

4 Security Considerations

4.1 Magic Numbers as a Validity Test

   As explained in section 2.1, a positive magic number test provides no
   assurance that a file is actually a valid instance of the file format
   it appears to be. A magic number test is not a substitute for a
   complete syntactic check, and so MUST NOT be relied on as a validity
   test.

4.2 Eavesdropping Considerations

   The presence of a magic number in a file format can give an
   eavesdropper additional clues about the nature of data being
   intercepted, or may give an eavesdropper something convenient to
   search for in intercepted data if they want to find a particular type
   of data. In any situation where eavesdropping is a concern, the use
   of strong encryption is RECOMMENDED.

4.3 Interaction with Encryption

   Where data is encrypted, the presence of a string of octets of fixed
   value, particularly at the very beginning of a data stream, can
   provide an opportunity for an attacker to apply known-plaintext
   attacks. Good ciphers are designed to resist such attacks; such
   resistance becomes absolutely essential when dealing with data that
   is as predictable as a magic number.

   In theory, the capability for an attacker to specify a new file
   format that includes a lengthy magic number opens up the possibility
   of a very slow chosen-plaintext attack. This is made possible by the
   lack of expectation that a magic number be in any way meaningful;
   this is the type of risk that leads cipher and hash algorithm
   designers to use mathematically significant constants instead of
   apparently random values in their algorithms. This possibility is
   difficult to exploit, and is in most contexts less of a concern than
   direct chosen-plaintext attacks where the attacker chooses the
   content (rather than the form) of data to be encrypted. In either
   case, the risk is mitigated by the use of good ciphers that are
   designed to resist chosen-plaintext attacks.

   Generally, all these concerns about known patterns in secret data
   already exist in any structured data; a magic number is merely the
   simplest and most extreme case. Good ciphers, chaining modes, and
   cryptographic protocols are all intended to remain secure under
   situations of partially known or chosen plaintext.

5 Acknowledgements

   Some of the magic number selection rules in section 3.1.4.3 are due
   to Eric S. Raymond.

6 References

   [MIME-MSG]  N. Freed, N. Borenstein, "Multipurpose Internet Mail
               Extensions (MIME) Part One: Format of Internet Message
               Bodies", RFC 2045, November 1996.

   [MIME-REG]  N. Freed, J. Klensin, J. Postel, "Multipurpose Internet
               Mail Extensions (MIME) Part Four: Registration
               Procedures", BCP 13, RFC 2048, November 1996.

   [PNG]       T. Boutell, "PNG (Portable Network Graphics)
               Specification Version 1.0", RFC 2083, March 1997.

   [REQ-TERM]  S. Bradner, "Key words for use in RFCs to Indicate
               Requirement Levels", BCP 14, RFC 2119, March 1997.

7 Author's Address

   Andrew Main
   Black Ops Ltd
   36 Cannon Hill Road
   Coventry
   CV4 7DE
   United Kingdom

   Phone: +44 7887 945779
   EMail: zefram@fysh.org