Network Working Group                                             S. Rao
Internet-Draft                                                S. Nagaraj
Intended status: Experimental                                       Grab
Expires: January 14, 2021                                       S. Sahib
                                                                R. Guest
                                                              Salesforce
                                                           July 13, 2020


                 Personal Information Tagging for Logs
                          draft-rao-pitfol-02

Abstract

   Software systems typically generate log messages in the course of
   their operation.  These log messages (or 'logs') record events as
   they happen, thus providing a trail that can be used to understand
   the state of the system and help with troubleshooting issues.  Given
   that logs try to capture state that is useful for monitoring and
   debugging, they can contain information that can be used to identify
   users.  Personal data identification and anonymization in logs are
   crucial to ensure that no personal data is being inadvertently logged
   and retained, which would make the logging system run afoul of laws
   around storing private information.  This document focuses on
   exploring mechanisms that can be used by a generating or intermediary
   logging service to specify personal or sensitive data in log
   message(s), thus allowing a downstream logging server to potentially
   enforce any redaction or transformation.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any


Rao, et al.             Expires January 14, 2021                [Page 1]

Internet-Draft                   PITFoL                        July 2020


   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 14, 2021.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Motivation and Use Cases  . . . . . . . . . . . . . . . . . .   4
   4.  Challenges with Existing Approaches . . . . . . . . . . . . .   4
   5.  Proposed Model  . . . . . . . . . . . . . . . . . . . . . . .   5
     5.1.  Defining the log privacy schema . . . . . . . . . . . . .   5
     5.2.  Typical Workflow  . . . . . . . . . . . . . . . . . . . .   7
     5.3.  Log Processing and Access Control . . . . . . . . . . . .   7
   6.  Examples  . . . . . . . . . . . . . . . . . . . . . . . . . .   8
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .   9
   8.  Security Considerations . . . . . . . . . . . . . . . . . . .   9
   9.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  10
   10. Normative References  . . . . . . . . . . . . . . . . . . . .  10
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  10

1.  Introduction

   Logs capture the state of a software system in operation, thus
   providing observability.  However, because of the amount of state
   they capture, they can often contain sensitive user information
   [link: twitter storing passwords].  Personal data identification and
   redaction is crucial to make sure that a logging application is not
   storing and potentially leaking users' private information.  There
   are known techniques that help discover and extract sensitive data;
   for example, a regular expression or lookup rule can be defined to
   match a person's name, credit card number, email address and so on.
   In addition, data-dictionary-based training models can analyze logs,
   predict the presence of sensitive data and subsequently redact it.
   This document proposes an approach and framework for creating logs
   with personal information tagged, thus marking a step towards
   privacy-aware logging.  Once personal information is identified in a
   log, it has to be appropriately tagged at source.  Personal data
   tagging is especially important in cases where log data is flowing in
   from disparate sources.  In cases where tagging at source is not
   possible (e.g., log data generated by a legacy application, IoT
   device, web server or firewall), a centralized logging server can be
   tasked with making sure the log data is tagged before it is passed on
   downstream.  Once the logs are tagged, the logging application can
   use anonymization techniques to redact the fields appropriately.
   While the proposal described here can be applied to any data deemed
   sensitive in a log, this document specifically discusses and
   illustrates tagging of personal information in logs.

2.  Terminology

   *Personal data:* RFC 6973 [RFC6973] defines personal data as "any
   information relating to an individual who can be identified, directly
   or indirectly."  This typically includes information such as IP
   addresses, username, email address, financial data, passwords and so
   on.  However, the definition of personal data varies heavily by what
   other information is available, the jurisdiction of operation and
   other such factors.  Hence, this document does not focus on
   prescriptively listing what log fields contain personal data but
   rather on what a tagging mechanism would look like once a logging
   application has determined which fields it considers to hold personal
   data.

   *Structured logging:* Most applications generate logs in a
   unidimensional format that twines together logic status and input
   data.  This makes log output largely free-flowing and unstructured,
   without specific delimiters, making it hard to segregate personal
   information from other text in the log.  Structured logging refers to
   a formal arrangement of logs with specific identifiers of personal
   information and semantic information to enable easy parsing and
   identification of specific information in the log.

   *Privacy Sensitivity Level:* The sensitivity level defines the degree
   of sensitivity of a data item in a log template or schema.  Levels
   are enumerated on a scale of 1 to 5.  Following the registry in
   Section 5.1, 1 denotes a high risk of leaking private information
   and 5 denotes normal (low) risk.


3.  Motivation and Use Cases

   Most systems like network devices, web servers and application
   services record information about user activity, transactions,
   network flows, etc., as log data.  Logs are incredibly useful for
   various purposes such as security monitoring, application debugging,
   investigations and operational maintenance.  In addition, there are
   use cases of organizations exporting or sharing logs with third party
   log analyzers for purposes of security incident response, monitoring
   and business analytics, where logs can be a valuable source of
   information.  In such cases, there are concerns about potential
   exposure of personal data to unintended systems or recipients.

4.  Challenges with Existing Approaches

   While methods of detecting personally identifiable information (PII)
   are continuously evolving, most approaches rely on regular
   expressions, data or dataset based training models, pattern
   recognition, checksum matching or custom logic.

   *Inconsistent Representation:* When applications, services or devices
   log personal information, there is no consistency in the
   representation of the information.  For example, the name of a user
   is often logged either as "fullname" (e.g., John Doe) or as
   "firstname" (John) and "lastname" (Doe).

   *Context:* In most cases, what data is considered personal and
   sensitive is subjective, provisional and contextual to the data
   source or the application processing the data, which makes it hard to
   use automated techniques to identify personal data.  Even for a
   specific domain, it's controversial whether it is possible to
   definitively say that a piece of data is NOT identifying.

   *Disparate Types of Personal Data:* There are many disparate types of
   personal data, and they often require a multitude of approaches for
   detection.

   *Lack of standards:* There are no standards that govern formats of
   sensitive data, making automation difficult for most common use
   cases.

   *Detection Accuracy:* Most current PII detection tools employ
   regular-expression-based or other pattern recognition techniques to
   identify PII.  Due to the very nature of logs, most current
   implementations let administrators add redaction policies based on
   the 'likelihood' of detection, categorized as low, medium or high.
   Redacting at a low likelihood threshold causes many false positives,
   while redacting only at a high likelihood threshold lets PII leak,
   making a trade-off inevitable for organizations.
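
   The trade-off can be illustrated with a minimal sketch.  The two
   patterns, the sample log line and the function name below are
   assumptions made for illustration only; they are not taken from any
   particular detection tool.

```python
import re
from typing import List

# Hypothetical detection rules.  The loose rule flags any long run of
# digits; the strict rule only flags 16 digits written in 4-4-4-4 groups.
LOOSE_CARD = re.compile(r"\d{13,19}")
STRICT_CARD = re.compile(r"\b(?:\d{4}[ -]){3}\d{4}\b")


def find_matches(pattern: "re.Pattern", line: str) -> List[str]:
    """Return every substring of `line` that the pattern flags."""
    return pattern.findall(line)


# Illustrative log line: a timestamp plus a well-known test card number.
line = 'timestamp="20120418163258988000" card="4111111111111111"'

loose_hits = find_matches(LOOSE_CARD, line)    # flags the timestamp too
strict_hits = find_matches(STRICT_CARD, line)  # misses the card number
```

   In this sketch the loose rule produces a false positive on the
   timestamp, while the strict rule lets the unseparated card number
   through.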


5.  Proposed Model

   This section describes a reference model to enable tagging of
   personal information at source and extends it with role- or
   policy-based redaction that relies on the personal information
   annotated at source.  The figure below illustrates the proposed
   model.

     Log Template/Schema with personal data identifiers
                         |
                         V
                   Log library
                         |
                         V
                     Application
                         |
                         V
                 Generate annotated log
                         |
                         V
                     Log redaction +---  consumer / role based
                                   +--   sensitivity based

                              Figure 1: Flow

5.1.  Defining the log privacy schema

   We propose using structured logging where a log schema or a template
   defines standardized identifiers for every type of personal
   information and each log field is associated with a sensitivity level
   customized to a use case or log intent.

   Note that this is not to be confused with a log severity level (WARN,
   INFO...) - those are typically defined "dynamically" by the developer
   while defining the severity of a certain scenario.  A privacy
   sensitivity level is defined statically and is part of a log schema,
   associated with the log name and data type.


   +------------------+-----------------+----------------+-------------+
   | Name             | Abstract Data   | Description    | Sensitivity |
   |                  | Type            |                | [1-High     |
   |                  |                 |                | 5-Normal]   |
   +------------------+-----------------+----------------+-------------+
   | nationalIdentity | String          | National IDs   | 1           |
   |                  |                 | issued by      |             |
   |                  |                 | sovereign      |             |
   |                  |                 | governments.   |             |
   |                  |                 | E.g., SSN      |             |
   | drivingLicense   | String          | Driving        | 1           |
   |                  |                 | License number |             |
   | taxIdentity      | String          | Tax            | 1           |
   |                  |                 | identification |             |
   |                  |                 | numbers        |             |
   | creditCardNumber | String          | Credit cards   | 1           |
   | bankAccount      | String          | Bank account   | 1           |
   |                  |                 | number         |             |
   | dateOfBirth      | Date            | Date of Birth  | 2           |
   | personName       | String          | Person name    | 1           |
   | emailAddress     | String          | Email          | 2           |
   | phoneNumber      | Number          | Phone          | 1           |
   | zipCode          | Integer         | Zip codes      | 5           |
   | ipAddress        | ipv4Address     | IPv4 or IPv6   | 4           |
   |                  |                 | Address        |             |
   | dateTimeSeconds  | dateTimeSeconds | seconds        | 5           |
   | age              | Integer         | Age            | 2           |
   | ethnicGroup      | String          | Ethnic group   | 1           |
   | genderIdentity   | String          | Gender         | 1           |
   |                  |                 | identity       |             |
   | macAddress       | macAddress      | MAC Address    | 4           |
   +------------------+-----------------+----------------+-------------+

                 Personal Information Identifiers Registry

   If an organization already uses structured logging with a log schema,
   then a privacy sensitivity level can be an additional attribute for
   the schema.
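
   As a minimal sketch, the registry could be represented in code as
   follows.  The class name, the helper function and the choice of rows
   are illustrative assumptions; the identifiers, types and levels are
   copied from the table above, where 1 is the most sensitive level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacyField:
    """One registry row: identifier, abstract data type, sensitivity."""
    name: str
    data_type: str
    sensitivity: int  # 1 = high ... 5 = normal, as in the table header

    def __post_init__(self) -> None:
        if not 1 <= self.sensitivity <= 5:
            raise ValueError(f"sensitivity out of range: {self.sensitivity}")


# A few rows from the registry; a real schema would list all of them.
LOG_PRIVACY_SCHEMA = {
    field.name: field
    for field in (
        PrivacyField("personName", "String", 1),
        PrivacyField("emailAddress", "String", 2),
        PrivacyField("ipAddress", "ipv4Address", 4),
        PrivacyField("zipCode", "Integer", 5),
    )
}


def sensitivity_of(field_name: str) -> int:
    """Look up the statically defined sensitivity of a log field."""
    return LOG_PRIVACY_SCHEMA[field_name].sensitivity
```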

   The privacy sensitivity level for log types is intended to be defined
   by a centralized effort around privacy preservation in logs.  In
   other words, this mapping might be done by an organization's privacy
   team (which can include lawyers, engineers and privacy
   professionals).  The intention is that all logs generated by an
   organization should conform to this structured format, which would
   ease downstream processing of logs for access control and removal of
   sensitive information.


   If the log is being generated by a web server, then two approaches
   can be taken:

   1.  Modify log-format for the service: identify the log data type of
   each piece of log data generated, and tag it at generation (an
   example is provided in Section 6)

   2.  Add automated tagging in a centralized log aggregator: collect
   all the logs generated by different services and apply the annotation
   using the log schema at the aggregator
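
   The second approach might look roughly like the sketch below.  The
   alias table, the function name and the nested output layout are
   assumptions for illustration; only the registry identifiers and
   their levels come from Section 5.1.

```python
# Hypothetical mapping from field names used by individual services to
# the registry identifier and sensitivity level from the log schema.
FIELD_ALIASES = {
    "user_name": ("personName", 1),
    "fullname": ("personName", 1),
    "email": ("emailAddress", 2),
    "ip_addr": ("ipAddress", 4),
    "client_ip": ("ipAddress", 4),
}


def tag_record(record: dict) -> dict:
    """Annotate an already parsed log record at the aggregator.

    Fields with a known alias are renamed to their registry identifier
    and wrapped with their sensitivity level; all other fields pass
    through unchanged.
    """
    tagged = {}
    for key, value in record.items():
        if key in FIELD_ALIASES:
            identifier, level = FIELD_ALIASES[key]
            tagged[identifier] = {
                "value": value,
                "pii_sensitivity_level": level,
            }
        else:
            tagged[key] = value
    return tagged


tagged = tag_record(
    {"user_name": "XYZZY", "ip_addr": "10.0.1.21", "port": "55875"}
)
```

   Mapping several source names onto one identifier also addresses the
   inconsistent representation problem noted in Section 4.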

5.2.  Typical Workflow

   1.  The log privacy schema can be parsed into a structured logging
       library that is used by individual developer teams.  The
       intention is for developers not to log arbitrary data, i.e., they
       are asked to identify the data type of the state they want to
       preserve.

   2.  Any addition to the log schema would have to go through review of
       the privacy team that came up with the log schema.

   3.  Once a log is generated, tagged and stored, various kinds of
       access control techniques can be applied to govern who can access
       the logs.
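
   The first step of this workflow can be sketched as a small logging
   helper that refuses fields missing from the schema.  The class, its
   method and the JSON output layout are hypothetical; they illustrate
   the idea and do not describe an existing library.

```python
import json

# Hypothetical schema handed to the library: identifier -> level.
SCHEMA = {"personName": 1, "emailAddress": 2, "ipAddress": 4}


class PrivacyAwareLogger:
    """Sketch of a helper that only accepts schema-defined fields."""

    def __init__(self, schema: dict) -> None:
        self._schema = schema

    def format(self, event: str, **fields: str) -> str:
        """Return a JSON log line with every field tagged by level."""
        unknown = sorted(set(fields) - set(self._schema))
        if unknown:
            # New data types must first be added to the schema, which
            # requires review by the privacy team.
            raise KeyError(f"fields not in log privacy schema: {unknown}")
        tagged = {
            name: {
                "value": value,
                "pii_sensitivity_level": self._schema[name],
            }
            for name, value in fields.items()
        }
        return json.dumps({"event": event, "fields": tagged}, sort_keys=True)


logger = PrivacyAwareLogger(SCHEMA)
line = logger.format(
    "login_failed", personName="XYZZY", ipAddress="10.0.1.21"
)
```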

5.3.  Log Processing and Access Control

   1.  Consumer Role Based Access

       A.  Once the log is tagged, access to it can be based on a
           consumer's role and privilege level.

       B.  A consumer-role-based policy can define which sensitivity
           levels a consumer can access.

   2.  Case-based access

       A.  If there is a genuine case for which access to sensitive
           information is needed and granted by the legal department, a
           cryptographically signed token (e.g., a JWT) can be generated
           that allows a developer or user to access logs of a more
           sensitive level.  This access can be temporal in nature,
           i.e., the token will only be valid for a certain amount of
           time.

       B.  A transaction ID can also be propagated automatically
           throughout the request processing, to correlate different
           logs related to a single request.  Note that the notion of a
           "request" can vary based on what the application is doing.
           The idea is to have a single unifying ID that ties together
           the logs of a particular action.  If this is done, then the
           temporary token can be restricted to a particular request ID.

   3.  Redaction Techniques

       A.  Given that the log is tagged, an organization might choose to
           redact the more sensitive logs, e.g., ones above a certain
           sensitivity level or ones of a certain log type.

       B.  More sophisticated approaches can be developed, e.g.,
           completely redact log types username and email, but obfuscate
           the IP address so that a rough location can still be garnered
           from the log record.  In this way, techniques such as
           differential privacy can be used in tandem to have privacy
           guarantees for logs while still providing usefulness to
           developers.
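
   Role-based access and selective redaction can be combined as in the
   sketch below.  The role names, clearance values and record layout
   are assumptions for illustration.  Levels follow the registry in
   Section 5.1, where 1 is the most sensitive.  The IP handling is
   plain prefix truncation and is not a differential privacy mechanism.

```python
import ipaddress

# Hypothetical role policy: the most sensitive level each role may
# read.  A role sees a field only if the field's level is numerically
# greater than or equal to the role's clearance.
ROLE_CLEARANCE = {"privacy_officer": 1, "sre": 3, "analyst": 5}

REDACTED = "[REDACTED]"


def coarsen_ip(address: str) -> str:
    """Obfuscate an IPv4 address by keeping only its /24 network."""
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network)


def view_for_role(record: dict, role: str) -> dict:
    """Return the record as the given role is allowed to see it.

    `record` maps field name -> (value, level); untagged fields carry
    level None and are always visible.
    """
    clearance = ROLE_CLEARANCE[role]
    view = {}
    for name, (value, level) in record.items():
        if level is None or level >= clearance:
            view[name] = value
        elif name == "ipAddress":
            view[name] = coarsen_ip(value)  # keep a rough location
        else:
            view[name] = REDACTED
    return view


record = {
    "personName": ("XYZZY", 1),
    "emailAddress": ("xyz@foo.com", 2),
    "ipAddress": ("10.0.1.21", 4),
    "port": ("55875", None),
}
analyst_view = view_for_role(record, "analyst")
```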

6.  Examples

   An example based on the RFC 3164 [RFC3164] log format:

   Normal Log Output

   <120> Nov 16 16:00:00 10.0.1.11 ABCDEFG: [AF@0 event="AF-Authority
   failure" violation="A-Not authorized to object" actual_type="AF-A"
   jrn_seq="1001363" timestamp="20120418163258988000"
   job_name="QPADEV000B" user_name="XYZZY" job_number="256937"
   err_user="TESTFORAF" ip_addr="10.0.1.21" port="55875"
   action="Undefined(x00)" val_job="QPADEV000B" val_user="XYZZY"
   val_jobno="256937" object="TEST" object_library="CUS9242"
   object_type="*FILE" pgm_name="" pgm_libr="" workstation=""]

   Log Output with Personal Information Tagging

   <120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority
   failure" violation="A-Not authorized to object" actual_type="AF-A"
   jrn_seq="1001363" timestamp="20120418163258988000"
   job_name="QPADEV000B" {personName="XYZZY" pii_sensitivity_level=1}
   job_number="256937" {emailAddress="xyz@foo.com"
   pii_sensitivity_level=2} {ip_addr="10.0.1.21"
   pii_sensitivity_level=4} port="55875" action="Undefined(x00)"
   val_job="QPADEV000B" val_jobno="256937" object="TEST"
   object_library="CUS9242" object_type="*FILE" pgm_name="" pgm_libr=""
   workstation=""]
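
   A downstream consumer could process such a line roughly as sketched
   below.  The sketch assumes that every tagged field is delimited by
   matching curly braces, as in the personName field above; the
   function name and the threshold parameter are hypothetical.

```python
import re

# Matches one tagged field, for example:
#   {personName="XYZZY" pii_sensitivity_level=1}
TAGGED_FIELD = re.compile(
    r'\{(?P<name>\w+)="(?P<value>[^"]*)"\s+'
    r'pii_sensitivity_level=(?P<level>[1-5])\}'
)


def redact_line(line: str, max_redacted_level: int) -> str:
    """Redact tagged values whose level is <= max_redacted_level.

    Level 1 is the most sensitive, so max_redacted_level=2 hides
    levels 1 and 2 and leaves levels 3 to 5 readable.  The tagging
    wrapper is removed from the output in both cases.
    """
    def _replace(match: "re.Match") -> str:
        if int(match["level"]) <= max_redacted_level:
            return f'{match["name"]}="[REDACTED]"'
        return f'{match["name"]}="{match["value"]}"'

    return TAGGED_FIELD.sub(_replace, line)


sample = (
    'job_name="QPADEV000B" {personName="XYZZY" pii_sensitivity_level=1} '
    '{ip_addr="10.0.1.21" pii_sensitivity_level=4} port="55875"'
)
redacted = redact_line(sample, max_redacted_level=2)
```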


7.  IANA Considerations

   IANA can consider defining a new central repository for Personal
   Information name and identifier registries to be used in logging
   personal information.  The personal identifier registry would
   enumerate names and identifiers as described in Section 5.1.

8.  Security Considerations

   It is anticipated that developers will want additional log data types
   for capturing application logic, and might abuse an existing log type
   instead of going through the process of adding a new one.  In such a
   case, the log would be incorrectly tagged.  This can be mitigated by
   having stronger typing for the log data types, e.g., restricting an
   address to a certain string length instead of allowing arbitrary
   length.
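
   Stronger typing could be enforced with per-type validators, as in
   the following sketch.  The validator set and function names are
   assumptions; in particular, the five-digit rule for zipCode is a
   US-specific simplification chosen only for illustration.

```python
import ipaddress
import re


def is_ip_address(value: str) -> bool:
    """Accept only strings that parse as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


def is_zip_code(value: str) -> bool:
    """Accept exactly five digits (assumed US format)."""
    return re.fullmatch(r"\d{5}", value) is not None


# Hypothetical validator table keyed by registry identifier.
VALIDATORS = {"ipAddress": is_ip_address, "zipCode": is_zip_code}


def check_field(name: str, value: str) -> bool:
    """Return True only if `value` fits the declared type of `name`.

    Identifiers without a validator are rejected, so a field cannot be
    used to smuggle arbitrary text into the log.
    """
    validator = VALIDATORS.get(name)
    return validator is not None and validator(value)
```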

   Encouraging developers to think carefully about what kind of data
   they're logging is a good practice and will lead to fewer incidents
   of private data being inadvertently logged.  An organization might
   choose to have an unstructured log type for letting developers log
   data that truly do not fit anywhere else.  This is still better than
   not having structured privacy-aware logging, because the potential
   privacy leakage is isolated to one particular field and its use can
   be monitored.

   Having a mapping from log data type to privacy sensitivity will need
   continuous effort by a privacy team, which might be expensive for an
   organization.

   Log data is often collated, propagated, transformed, loaded into
   different formats or data models for purposes of analytics,
   troubleshooting and visualization.  In such cases, it is critical to
   ensure that personal information tagging and annotations are
   preserved and forwarded across format transformations.

   If the privacy marking or classification of a log changes, the new
   classification is applied to historical logs on their subsequent
   access.
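
   One way to achieve this is to resolve the level when a log is read
   rather than trusting the level recorded when it was written.  The
   sketch below assumes a hypothetical record layout and schema lookup;
   the fallback to the stored level is a design choice of the sketch,
   not something this document prescribes.

```python
# Hypothetical stored record: the value plus the level at write time.
stored_record = {"zipCode": {"value": "94105", "level_at_write": 5}}

# Current schema maintained by the privacy team (1 = most sensitive).
current_schema = {"zipCode": 5}


def effective_level(identifier: str, record: dict, schema: dict) -> int:
    """Return the level to enforce when the record is accessed.

    The current schema takes precedence; the level stored with the
    record is used only if the schema no longer lists the identifier.
    """
    return schema.get(identifier, record[identifier]["level_at_write"])


before = effective_level("zipCode", stored_record, current_schema)
current_schema["zipCode"] = 2  # classification changes after the write
after = effective_level("zipCode", stored_record, current_schema)
```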

   *TODO*: In case of logs that are not tagged or marked with personal
   information, an out-of-band mechanism to communicate log template or
   schema with personal data identifiers can be considered.  Such a
   mechanism can also be used to notify changes to privacy tagging or
   classification.


9.  Acknowledgements

   The authors would like to thank everyone who provided helpful
   comments at the mic at IETF 106 during the PEARG session.  Thanks
   also to Joe Salowey for thoughts on aspects of log transformations,
   change of privacy classifications and models for privacy marking.

10.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
              DOI 10.17487/RFC3164, August 2001,
              <https://www.rfc-editor.org/info/rfc3164>.

   [RFC6973]  Cooper, A., Tschofenig, H., Aboba, B., Peterson, J.,
              Morris, J., Hansen, M., and R. Smith, "Privacy
              Considerations for Internet Protocols", RFC 6973,
              DOI 10.17487/RFC6973, July 2013,
              <https://www.rfc-editor.org/info/rfc6973>.

Authors' Addresses

   Sandeep Rao
   Grab
   Bangalore
   India

   Email: sandeeprao.ietf@gmail.com


   Santhosh C N
   Grab

   Email: santoshcn1@gmail.com


   Shivan Sahib
   Salesforce

   Email: shivankaulsahib@gmail.com


   Ryan Guest
   Salesforce

   Email: rguest@salesforce.com

