LMAP WG                                                       D. Goergen
Internet-Draft                                                  R. State
Intended status: Informational                  University of Luxembourg
Expires: January 16, 2014                                     V. Gurbani
                                               Bell Labs, Alcatel-Lucent
                                                           July 15, 2013


   Aggregating large-scale measurements for Application Layer Traffic
                      Optimization (ALTO) Protocol
                       draft-goergen-lmap-fcc-00

Abstract

   Analyzing and aggregating large-scale broadband measurements is
   essential to study trends and derive network analytics.  These trends
   and analyses could be made available through well defined protocols
   such as the Application Layer Traffic Optimization (ALTO) protocol.
   However, ALTO requires network information to be distilled and
   abstracted in the form of a network map and a cost map.  We describe
   our methodology for analyzing the United States Federal
   Communications Commission's (FCC) Measuring Broadband America (MBA)
   dataset to derive the required topology and cost maps suitable for
   consumption by an ALTO server.

Goergen, et al.         Expires January 16, 2014                [Page 1]

Internet-Draft                  ALTO Maps                      July 2013


Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 16, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Challenges in data analysis
   3.  Geo-locating the units
   4.  Conclusions and future work
   5.  IANA Considerations
   6.  Security Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Authors' Addresses

1.  Introduction

   Measuring broadband performance is increasingly important as
   communications continue to move towards the Internet.  Internet
   service providers (ISPs), national agencies and other entities gather
   broadband data and may provide some, or all, of the dataset to the
   public for analysis.  As [I-D.seedorf-lmap-alto] notes, two extremes
   are prevalent when presenting large-scale data.  One is in the form
   of charts, figures, or summarized reports amenable to easy and quick
   consumption.  The other is the release of raw data in the form of
   large files containing tables formatted as values separated by a
   delimiter.  While the former is indispensable to acquire a summary
   view of the dataset, it does not suffice for additional analysis
   beyond what is presented.  Conversely, the problem with the latter
   option (raw files) is that the unsuspecting user perusing them is
   lost in the deluge of data.

   [I-D.seedorf-lmap-alto] argues that a reasonable middle ground
   between the two extremes may be a protocol that allows a constrained
   set of user-driven ad-hoc queries on the dataset.  It further
   proposes the Application Layer Traffic Optimization (ALTO) protocol
   [I-D.ietf-alto-protocol] as the protocol of choice for such
   reasoning on the dataset.  A necessary prerequisite for using ALTO
   is abstracting the network information into a form that is suitable
   for consumption by the protocol.  The implication of using ALTO is
   that data from any large-scale measurement effort must first be
   distilled into two maps: a topology map and a cost map.  Further
   analysis and ad-hoc queries can subsequently be performed on the
   normalized dataset.

   In the United States, the Federal Communications Commission (FCC)
   has embarked on a nationwide performance study of residential
   wireline broadband service [fcc].  Our aim is to analyze the raw
   datasets from this study and to create a topology map and a cost map
   from them.  ALTO queries aimed at these maps will enable users and
   interested parties to fulfill the use cases listed in Section 2 of
   [I-D.seedorf-lmap-alto].

2.  Challenges in data analysis

   The FCC Measuring Broadband America (MBA) study consisted of 7,782
   volunteers spread across the United States with adequate geographic
   diversity.  Volunteers opted in to the study, but each volunteer
   remained anonymous.  An opaque integer (unit_id) represents a
   subscriber in the raw dataset.  This unit_id remains constant for
   the duration of the study and uniquely identifies a volunteer
   subscriber, even if the subscriber switches ISPs.  The methodology
   used is described in more detail in [fcc].

   The dataset consisted of 12 tables, each table corresponding to the
   data drawn from a certain performance test.  For the analysis we
   present in this document we focus on the "curr_dns" table, which
   contains the time taken for the ISP's recursive DNS resolver to
   return a DNS A RR for a popular website domain name.  This test was
   run approximately hourly over each 24-hour period, and produced
   about 75-78 million records per month.  This resulted in a typical
   file size in the range of 6-7 GBytes per month.  We note that the
   "curr_dns" table is one of the smaller tables in the dataset.

   The first challenge, therefore, was to obtain computing resources
   commensurate with a dataset consisting of millions of records spread
   across gigabyte-sized files.  To analyze this volume of data we used
   the canonical Map-Reduce computational paradigm on a Hadoop cluster
   (more details on the methodology are outlined in Section 3).

   A second, more pressing challenge was to identify the geographic
   location of the unit_ids generating the data.  In order to derive a
   topological map and impose costs on the links, it is important to
   know the physical locations of the unit_ids that contributed the
   measurements.  However, in the MBA dataset, the population is
   anonymized and the individual subscriber reporting the measurement
   data is simply referred to by an opaque integer.  Therefore, an
   important task was to use the information in the public tables to
   reveal a coarse location of the subscriber.

   We outline the methodology we used to do so in the next section.  We
   stress that this methodology does not identify the specific location
   of a subscriber, who still remains anonymous.  Instead, it simply
   locates the subscriber in a larger metropolitan region.  This level
   of granularity suffices for our work.

3.  Geo-locating the units

   To geo-locate the units, we note that broadband subscriber devices
   are likely to be configured using DHCP by their ISP.  Besides
   assigning an IP address to the subscriber device, DHCP also supplies
   the DNS name servers that the subscriber device uses for DNS
   queries.  In most installations, these DNS name servers are located
   in close physical proximity to the subscriber device.  The FCC
   technical appendix states that the DNS resolution tests were
   targeted directly at the ISP's recursive resolvers, both to
   circumvent caching and to avoid the effect of users configuring the
   subscriber device to bypass the ISP's DNS resolvers.  Therefore, a
   reasonable approximation of a subscriber's geo-location is the
   geographic location of the DNS name server serving the subscriber.
   We use this heuristic to geo-locate a subscriber.
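
   As a minimal sketch of this heuristic, the Python fragment below
   assigns a unit the location of the DNS name server it queried most
   often.  GEO_TABLE is a hypothetical stand-in for an IP geo-location
   database (this document does not name one), the server addresses
   are made up, and preferring the most frequently queried server is
   an assumption rather than part of the methodology described here.

```python
from collections import Counter

# Hypothetical stand-in for an IP geo-location database:
# DNS server IP -> (place, latitude, longitude).
# The addresses are illustrative; the places and (rounded)
# coordinates are borrowed from Table 1.
GEO_TABLE = {
    "12.34.56.78": ("Madison, WI", 43.0731, -89.4012),
    "23.45.67.89": ("Quincy, MA", 42.2529, -71.0023),
}


def locate_unit(dns_servers, geo_table=GEO_TABLE):
    """Approximate a unit's location by that of its DNS server.

    dns_servers lists the name server IPs seen for one unit_id,
    with repeats.  The most frequently queried server that the
    table knows about wins (an assumed tie-breaking rule).
    Returns None if no server can be located.
    """
    for ip, _count in Counter(dns_servers).most_common():
        if ip in geo_table:
            return geo_table[ip]
    return None


# One unit that queried a known server twice and an unknown one once.
print(locate_unit(["12.34.56.78", "10.0.0.1", "12.34.56.78"]))
```

   The result is only as precise as the underlying database, which is
   consistent with the coarse, metropolitan-level granularity sought
   here.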

   Thus our first, very simple filter consisted of obtaining a mapping
   from a unit_id (representing a subscriber) to the one or more DNS
   name servers to which the unit_id sends DNS requests.  It turned out
   that while this was a necessary step, it was not a sufficient one.
   The raw data needed to be further processed to reduce
   inconsistencies and remove outliers.  A number of interesting
   artifacts were uncovered during further processing of the data.
   These artifacts informed the selection of the unit_ids for further
   analysis.

   The artifacts are documented below.

   o  A handful of unit_ids were geo-located in areas outside the
      contiguous United States, such as Ukraine, Poland or the United
      Kingdom.  We theorize that the subscribers corresponding to the
      unit_ids geo-located outside the contiguous United States had
      simply configured their devices to use alternate DNS servers,
      probably located outside the United States.  We removed these
      records before conducting our analysis.

   o  We also observed a noticeable number of non-ISP DNS resolvers,
      especially Google's 8.8.8.8 and 8.8.4.4 and OpenDNS's
      208.67.222.222 and 208.67.220.220.  These four public DNS servers
      are geo-located in California.  We removed these records to
      ensure that the specific location that these resolvers
      represented was not oversampled.

   o  We noticed that a large number of unit_ids were being geo-located
      in Potwin, Kansas.  Intrigued as to why there appeared to be a
      large population of Internet users being located in a small rural
      community in Kansas, we investigated further.  It appears that
      Potwin, Kansas is the geographical center of the United States and
      a number of ISPs have chosen to establish data centers in or
      around the Potwin area.  These ISPs generally locate their primary
      or secondary DNS name servers in Potwin-area data centers, thus
      accounting for the popularity of Potwin as an Internet
      destination.  We continue to investigate how to minimize the
      impact of such natural aggregation points that, if not accounted
      for, will skew our results in an unwarranted direction.

   o  We observed some unit_ids changing ISPs during the observation
      period.  This is a normal occurrence and to the extent that the
      unit_id is geo-located in the same geographical area after the
      change in ISP, we do not exclude such unit_ids from further
      analysis.
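
   The first two removal rules in the list above can be sketched in
   Python as follows.  The record layout (unit_id, nameserver, country)
   and the sample rows are hypothetical, and a country code is used as
   a simplification: a strict "contiguous United States" test would
   also need the state.  The Potwin aggregation and ISP-switching cases
   are deliberately not handled in this sketch.

```python
# The four public resolvers named in the list above.
PUBLIC_RESOLVERS = frozenset({
    "8.8.8.8",
    "8.8.4.4",
    "208.67.222.222",
    "208.67.220.220",
})


def keep_record(record):
    """Return True if a geo-located record should be kept.

    record is a dict with hypothetical keys "unit_id",
    "nameserver" (DNS server IP) and "country" (ISO code of the
    server's geo-location).
    """
    # Drop well-known public resolvers so that their location
    # is not oversampled.
    if record["nameserver"] in PUBLIC_RESOLVERS:
        return False
    # Assumption: the country code approximates "inside the
    # United States"; Alaska and Hawaii are not excluded here.
    if record["country"] != "US":
        return False
    return True


# Made-up sample rows for illustration.
records = [
    {"unit_id": 872, "nameserver": "12.34.56.78", "country": "US"},
    {"unit_id": 885, "nameserver": "8.8.8.8", "country": "US"},
    {"unit_id": 898, "nameserver": "23.45.67.89", "country": "UA"},
]
kept = [r for r in records if keep_record(r)]
print([r["unit_id"] for r in kept])
```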

   Subsequent filters extracted the stable unit_ids from our dataset.
   In order to determine which unit_ids are stable, i.e., remain
   constant with respect to their geographic location over the
   observation period from January to December 2012, we extracted for
   each unit_id the IP address of each DNS name server it consulted.
   This was done by applying the Map-Reduce paradigm to the DNS
   dataset, yielding for each unit_id the set of distinct DNS servers
   it queried.  This was repeated for each month of the observation
   period.  The resulting sets were cleaned of private IP addresses and
   the other artifacts discussed above.  The cleaned set consisted of
   about 8000 distinct unit_ids.
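
   The per-month extraction step can be sketched as follows.  This is a
   single-process Python illustration of the map and reduce steps, not
   the Hadoop job itself.  The column names unit_id and nameserver are
   assumed for illustration and may not match the actual "curr_dns"
   schema, and the sample rows are made up.

```python
import ipaddress
from collections import defaultdict


def map_record(record):
    """Map step: emit one (unit_id, DNS server IP) pair per row."""
    yield record["unit_id"], record["nameserver"]


def reduce_servers(pairs):
    """Reduce step: distinct, non-private DNS servers per unit_id."""
    servers = defaultdict(set)
    for unit_id, ip in pairs:
        # Private addresses (e.g. a home router acting as a DNS
        # forwarder) carry no geographic information.
        if not ipaddress.ip_address(ip).is_private:
            servers[unit_id].add(ip)
    return dict(servers)


def servers_for_month(records):
    """Run map and reduce over one month of rows."""
    pairs = (pair for rec in records for pair in map_record(rec))
    return reduce_servers(pairs)


# Made-up rows standing in for one monthly file.
rows = [
    {"unit_id": 872, "nameserver": "12.34.56.78"},
    {"unit_id": 872, "nameserver": "12.34.56.78"},
    {"unit_id": 872, "nameserver": "192.168.1.1"},
    {"unit_id": 885, "nameserver": "23.45.67.89"},
]
print(servers_for_month(rows))
```

   Running the same function over each of the twelve monthly files
   gives one unit_id-to-servers mapping per month.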

   In order to determine the stability of each unit_id, we summed the
   occurrences of each DNS server IP address across the monthly files
   covering the whole observation period.  If the IP address of a DNS
   server occurred 12 times, the unit_id accessed the same DNS server
   in every month and therefore remained stable over the observation
   period.  The resulting stable unit_ids, around 1500, will be used
   for further analysis.  Assuming a 99% confidence level and a margin
   of error of +/- 3 percentage points, a sample of 1494 unit_ids is
   required.  With our stable set of about 1500 unit_ids, we are now
   positioned to perform further analysis on the dataset to create the
   full topology and cost maps.
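
   The stability test and the sample-size figure can be illustrated
   with the Python sketch below.  The input layout (one unit_id-to-
   servers mapping per month) is an assumption.  This document does not
   state how the figure of 1494 was computed; the sketch assumes the
   common sample-size formula with a finite population correction,
   z = 2.58 for 99% confidence, a proportion of 0.5, and a population
   of 7,782 volunteers, under which the result is approximately 1494.

```python
from collections import Counter


def stable_units(monthly_servers, months=12):
    """Return the unit_ids that used one DNS server every month.

    monthly_servers is a list with one dict per month, each
    mapping unit_id -> set of DNS server IPs seen that month.
    A unit is stable if some IP occurs in all `months` dicts.
    """
    counts = {}
    for month in monthly_servers:
        for unit_id, ips in month.items():
            # Each IP is counted at most once per month because
            # ips is a set.
            counts.setdefault(unit_id, Counter()).update(ips)
    return {
        unit_id
        for unit_id, seen in counts.items()
        if any(n == months for n in seen.values())
    }


def required_sample(population, z=2.58, margin=0.03, p=0.5):
    """Sample size with finite population correction (assumed)."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    return n0 / (1 + (n0 - 1) / population)


# Made-up data: unit 1 keeps server "a" all year, unit 2 switches
# to server "b" in the last month.
year = [{1: {"a"}, 2: {"a"}}] * 11 + [{1: {"a"}, 2: {"b"}}]
print(stable_units(year))
print(round(required_sample(7782)))
```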

   Table 1 presents a sample of the geographic location data that we
   have uncovered for unit_ids.  A complete list of identified units
   superimposed on the geographical map of the United States is
   available at http://cdb.io/13UOHgD.

         +---------+-----------------+--------------------------+
         | Unit ID | City, State     | Latitude/Longitude       |
         +---------+-----------------+--------------------------+
         | 872     | Morganville, NJ | 40.35950089,-74.26280212 |
         |         |                 |                          |
         | 885     | Madison, WI     | 43.07310104,-89.40119934 |
         |         |                 |                          |
         | 898     | Foley, AL       | 30.40660095,-87.68360138 |
         |         |                 |                          |
         | 7969    | Manteca, CA     | 37.79740143,-121.2160034 |
         |         |                 |                          |
         | 8024    | Quincy, MA      | 42.25289917,-71.00229645 |
         +---------+-----------------+--------------------------+

                     Sample unit identification tuples

                                  Table 1

4.  Conclusions and future work

   Identification of the geographic location of the unit_ids generating
   the performance data is essential in order to continue the work.  We
   have presented a methodology and some early results in identifying a
   geographic location.  This location, although coarse, suffices for
   our future work that will consist of further data mining and analysis
   to create appropriate ALTO network and cost maps.

5.  IANA Considerations

   This document does not contain any IANA considerations.

6.  Security Considerations

   Our analysis does not invalidate any security artifacts.  All of our
   analysis was performed on publicly available data.  However, we do
   note that some privacy may have been lost as a result of our
   analysis.  In the raw dataset, the unit identifiers are opaque
   integers with no immediate correlation with a geographic location.
   After our analysis, while the unit identifiers still remain opaque,
   they are nonetheless correlated to a specific, though coarse,
   geographic location.

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

7.2.  Informative References

   [I-D.ietf-alto-protocol]
              Alimi, R., Penno, R., and Y. Yang, "ALTO Protocol",
              draft-ietf-alto-protocol-17 (work in progress), July 2013.

   [I-D.seedorf-lmap-alto]
              Seedorf, J., Gurbani, V., and E. Marocco, "ALTO for
              Querying LMAP Results", draft-seedorf-lmap-alto-01 (work
              in progress), July 2013.

   [fcc]      United States Federal Communications Commission,
              "Measuring Broadband America", Accessed July 12,
              2013, http://www.fcc.gov/measuring-broadband-america.

Authors' Addresses

   David Goergen
   University of Luxembourg

   Email: david.goergen@uni.lu


   Radu State
   University of Luxembourg

   Email: radu.state@uni.lu


   Vijay K. Gurbani
   Bell Labs, Alcatel-Lucent

   Email: vijay.gurbani@alcatel-lucent.com
