IP Performance Working Group M. Mathis Internet-Draft Google, Inc Intended status: Experimental Oct 15, 2012 Expires: April 18, 2013 Curating Internet Measurement Data draft-mathis-ippm-data-curation-00.txt Abstract This document describes the nomenclature, requirements and framework for handling per-sample metadata for a live archive of Internet measurements. By archive, we mean that once captured, the data MUST NOT be altered or deleted. By live, we mean that new data from ongoing measurements are being continuously appended to the archive. The purpose of the metadata is to support the use of the Scientific Method in the study the archived data. Under the principle of full disclosure, the Scientific Method requires that later researchers be able to repeat earlier studies using the original data, but refined by new insights (such as improved calibration) gained from other intervening studies. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on April 18, 2013. Copyright Notice Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents Mathis Expires April 18, 2013 [Page 1] Internet-Draft IPPM Meta Data Oct 2012 (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 2. Background . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3. Definitions and Assumptions . . . . . . . . . . . . . . . . . . 4 4. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 6 5. References . . . . . . . . . . . . . . . . . . . . . . . . . . 7 5.1. Normative References . . . . . . . . . . . . . . . . . . . 7 5.2. Informative References . . . . . . . . . . . . . . . . . . 7 Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 7 Mathis Expires April 18, 2013 [Page 2] Internet-Draft IPPM Meta Data Oct 2012 1. Introduction This [very preliminary Internet-Draft will eventually] describe the nomenclature, requirements and framework for handling per-sample metadata for a live archive of Internet measurements. By archive, we mean that once captured, the data MUST NOT be altered or deleted. By live, we mean that new data from ongoing measurement are being continuously appended to the archive. Without loss of generality, we model such an archive as a large sparse table where each row is a sample (likely but not required to be a singleton as defined in [RFC2330]) and each column is a measurement parameter or result. The purpose of the metadata is to support the use of the Scientific Method in the study the live data archive. Under the principle of full disclosure, the Scientific Method requires that later researchers have access to the same tools and raw data set used by previous researchers. The difficulty being that the archive itself has been extended in the time between current researchers' access, and previous researchers' conclusions. A capability must exist to exclude all new data from these investigations. Furthermore, a great deal of knowledge is often gained by detecting systematic error in extant data and retroactively recalibrating well after it was collected. It is extremely important to be able to repeat earlier inquiry using legacy data refined by later calibration studies. The metadata is to be placed in additional columns that are subject to the same rules as the data itself: columns can be added, but once added they MUST NOT be altered or deleted. This note describe the nomenclature, framework and requirements for handling the metadata. [At this time our goal is to recruit additional authors.] 2. Background The advent of large scale online storage and high performance cloud computing has created the opportunity for data mining Internet measurements. Data mining can be described as detecting patterns in large data sets to infer knowledge beyond the scope of the original data or measurement. In order for inferred results to meet the formal criterion as a scientific endeavour, the data mining analysis MUST be repeatable by later researchers who MUST be able to re-examine the same raw data with the same or similar tools. The additional researcher can confirm the conclusions or consider the possibility of alternative explanations for the patterns detected in the data. Mathis Expires April 18, 2013 [Page 3] Internet-Draft IPPM Meta Data Oct 2012 To support data mining as a scientific endeavour the data set must be archival: the only permitted alteration to the data set is to append more data: additional rows as more measurements are performed; additional columns as new annotations are added. Once captured, the data MUST NOT be deleted or altered under any conditions. All data is subject to measurement error. One common use of of data mining is detecting and retroactively compensating for measurement error. In the case of Internet measurement there is also the opportunity for operational events such as topology changes or routing updates to introduce subtle errors or changes in the calibration of the measurements. As a consequence, it is desirable to be able to tag rows with metadata indicating noteworthy operational events. Although most such annotations are entirely benign, occasionally it might be discovered that some of the data is "tainted" when something goes wrong such that the data can not be trusted without additional processing. An important use case for an archived data set is supporting the following type of scenario: researcher A makes a claim about a pattern observed while mining the data. At a later time researcher B discovers some systematic error present in some of the data and develops a method to recalibrate the data to compensate for the systematic errors. Researcher C wants to reconfirm A's results by running the following series of experiments: A's original analysis of the same rows as A used, to confirm the procedure. A's original analysis of the same rows as A used, using data recalibrated by B. Differential version of A's algorithm to extrapolate how recalibration might affect other results. A's original analysis on the data to date, including new rows since A's conclusion. A's original analysis on the data to date, using data recalibrated by B. This process of continuous re-evaluation is the foundation of the Scientific Method. 3. Definitions and Assumptions Data is assumed to be named (indexed) by N-tuples. Without loss of generality, we further assume a 2 dimensional sparse table where each measurement is a single row, and columns that are parameters or results associated with each measurement, as might be implemented in an SQL like database. These assumptions give us some linguistic shorthand: each row is a singletons[RFC2330], and each column the Mathis Expires April 18, 2013 [Page 4] Internet-Draft IPPM Meta Data Oct 2012 parameters, result and metadata data associated with each singleton. In a live archive, the number of rows continues to grow as more data is collected. This note describes conventions and practices for adding additional columns to support robust scientific method. Other ways to organize the data are completely acceptable. Using some other data organization would change the terminology but not the principles outlined in this document. Metadata can be added to each row, by adding additional columns to the table. Metadata MUST be subject to the same rules as the measurement data: once added it can not be deleted or altered, otherwise an experiment that used the metadata can not be repeated. To minimize the long term cost of the metadata, it should be designed carefully such that as much as possible each metadata column will remain useful for the life of the data set. It is explicitly permitted for metadata to be added to existing rows after the data itself has been archived. Each metadata column must be properly defined: at a minimum a textual description and date the column was first added. Other useful attributes include an indication of who or what authority or process is responsible for adding metadata to new rows, as the archive is extended by live measurement. It is tempting to try to replace the textual description of the metadata by some formal language specification. However our intuition is that such an effort is likely to be unbound and potentially non-convergent. Therefore, we only provide a couple of examples. A chronology is a metadata tag for which there are well defined "before" and "after" primitives. Examples include measurement time and archive time, as well as several platform properties: OS version and versions of all component software. The important property of a chronology is that it is useful to specify rows as ranges. While it is tempting to think that measurement time might be sufficient metadata, consider the difficulty in representing a scenario where a software version change affected some detail of the measurement parameters. If this tool is deployed across a very large fleet of nodes, then making sense of the data requires being able to join the per node update logs with the archived data. In general it would be far easier to label the data with the measurement tool version used for measurement. The researcher then just has to filter the data by comparing tool version. Mathis Expires April 18, 2013 [Page 5] Internet-Draft IPPM Meta Data Oct 2012 Another example of a chronology might be to tag measurements with pointers to entries in an operations log that might be tracking topology and configuration changes. This would permit fast exploration of questions such as, "did some network change affect performance in a detectable way?" A another class of meta data is data recalibration, for instance using an independent means to compute systematic errors present in measurement parameters. Either updated values or error offsets could be stored as metadata. This gives all researchers access to both the raw, uncorrected values and the adjusted or recalibrated values. Similar to recalibration is addressing tainted data, for example if some event made the data unreliable or inaccurate as pertains to determining some specific metric in a way that can not be calibrated. While it is tempting to entirely discard such data, doing so would invalidate the archive for other sorts of studies which might not be affected by inaccuracy, for example investigating questions of user self selection. 4. Requirements This section defines a metadata formalism that would permit a data archive to implement computer science grade lock semantics on the data. This is likely to be overkill for nearly all applications, however it does permit an implementer clearly understand where there are assumption that might create potential for race conditions, for example by not having a strong way to assure that empty cells are not changed after they have been accessed. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. Data in an archive MUST NOT be altered or deleted. The Archive MUST make a distinction between an empty cell (never written) and a cell containing a null data. e.g. a null string or a null pointer. Each cell is permitted to have one transition from empty to non-empty, and no other transitions. And many many more.... 5. References Mathis Expires April 18, 2013 [Page 6] Internet-Draft IPPM Meta Data Oct 2012 5.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. 5.2. Informative References [RFC2330] Paxson, V., Almes, G., Mahdavi, J., and M. Mathis, "Framework for IP Performance Metrics", RFC 2330, May 1998. Author's Address Matt Mathis Google, Inc 1600 Amphitheater Parkway Mountain View, California 93117 USA Email: mattmathis@google.com Mathis Expires April 18, 2013 [Page 7]