Network Working Group                                             S. Rao
Internet-Draft                                                S. Nagaraj
Intended status: Experimental                                       Grab
Expires: January 14, 2021                                       S. Sahib
                                                                R. Guest
                                                              Salesforce
                                                           July 13, 2020

                 Personal Information Tagging for Logs
                          draft-rao-pitfol-02

Abstract

Software systems typically generate log messages in the course of their operation. These log messages (or 'logs') record events as they happen, thus providing a trail that can be used to understand the state of the system and help with troubleshooting issues. Given that logs try to capture state that is useful for monitoring and debugging, they can contain information that can be used to identify users. Personal data identification and anonymization in logs is crucial to ensure that no personal data is being inadvertently logged and retained which would make the logging system run afoul of laws around storing private information. This document focuses on exploring mechanisms that can be used by a generating or intermediary logging service to specify personal or sensitive data in log message(s), thus allowing a downstream logging server to potentially enforce any redaction or transformation.

Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on January 14, 2021.

Rao, et al.             Expires January 14, 2021                [Page 1]
Internet-Draft                   PITFoL                        July 2020

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Terminology
   3. Motivation and Use Cases
   4. Challenges with Existing Approaches
   5. Proposed Model
     5.1. Defining the log privacy schema
     5.2. Typical Workflow
     5.3. Log Processing and Access Control
   6. Examples
   7. IANA Considerations
   8. Security Considerations
   9. Acknowledgements
   10. Normative References
   Authors' Addresses

1. Introduction

Logs capture the state of a software system in operation, thus providing observability.
However, because of the amount of state they capture, logs can often contain sensitive user information [link: twitter storing passwords]. Identifying and redacting personal data is crucial to make sure that a logging application is not storing, and potentially leaking, users' private information. There are known techniques for discovering and extracting sensitive data: for example, regular expressions or lookup rules can be defined to match a person's name, credit card number, email address and so on. In addition, models trained on data dictionaries can analyze logs, predict the presence of sensitive data and subsequently redact it.

This document proposes an approach and framework for creating logs with personal information tagged, thus marking a step towards privacy-aware logging. Once personal information is identified in a log, it has to be appropriately tagged at source. Personal data tagging is especially important in cases where log data flows in from disparate sources. In cases where tagging at source is not possible (e.g., log data generated by a legacy application, IoT device, web server or firewall), a centralized logging server can be tasked with making sure the log data is tagged before it is passed on downstream. Once the logs are tagged, the logging application can use anonymization techniques to redact the fields appropriately.

While the proposal described here can be applied to any data deemed sensitive in a log, this document specifically discusses and illustrates tagging of personal information in logs.

2. Terminology

*Personal data:* RFC 6973 [RFC6973] defines personal data as "any information relating to an individual who can be identified, directly or indirectly." This typically includes information such as IP addresses, usernames, email addresses, financial data, passwords and so on.
However, the definition of personal data varies heavily depending on what other information is available, the jurisdiction of operation and other such factors. Hence, this document does not focus on prescriptively listing which log fields contain personal data, but rather on what a tagging mechanism would look like once a logging application has determined which fields it considers to hold personal data.

*Structured logging:* Most applications generate logs in a unidimensional format that intertwines logic status and input data. This makes log output largely free-flowing and unstructured, without specific delimiters, which makes it hard to separate personal information from other text in the log. Structured logging refers to a formal arrangement of logs with specific identifiers of personal information and semantic information, to enable easy parsing and identification of specific information in the log.

*Privacy Sensitivity Level:* The sensitivity level defines the degree of sensitivity of a data item in a log template or schema. The level is enumerated on a scale of 1 to 5. Following the registry in Section 5.1, 1 denotes high sensitivity (the highest risk of leaking private information) and 5 denotes normal sensitivity (the lowest risk).

3. Motivation and Use Cases

Most systems, such as network devices, web servers and application services, record information about user activity, transactions, network flows, etc., as log data. Logs are useful for purposes such as security monitoring, application debugging, investigations and operational maintenance.

In addition, organizations export or share logs with third party log analyzers for purposes such as security incident response, monitoring and business analytics, where logs can be a valuable source of information. In such cases, there are concerns about potential exposure of personal data to unintended systems or recipients.

4. Challenges with Existing Approaches

While methods of detecting personally identifiable information (PII) are continuously evolving, most approaches rely on regular expressions, models trained on data or datasets, pattern recognition, checksum matching or custom logic.

*Inconsistent Representation:* When applications, services or devices log personal information, there is no consistency in the representation of the information. For example, the name of a user is often logged either as "fullname" (e.g., John Doe) or as "firstname" (John) and "lastname" (Doe).

*Context:* In most cases, what data is considered personal and sensitive is subjective, provisional and contextual to the data source or the application processing the data, which makes it hard to use automated techniques to identify personal data. Even for a specific domain, it is controversial whether it is possible to definitively say that a piece of data is NOT identifying.

*Disparate Types of Personal Data:* There are many disparate types of personal data, and they often require a multitude of approaches for detection.

*Lack of standards:* There are no standards that govern formats of sensitive data, making automation difficult for most common use cases.

*Detection Accuracy:* Most current PII detection tools employ regular expression based techniques or other pattern recognition techniques to identify PII. Due to the very nature of logs, most current implementations let administrators add redaction policies based on the 'likelihood' of detection, categorized as low, medium or high. A low detection threshold causes many false positives, while a high detection threshold lets PII leak, making a trade-off inevitable for organizations.

5. Proposed Model

This section describes a reference model to enable tagging of personal information at source, and extends it to include role or policy based redaction that relies on the personal information annotated at source. The figure below illustrates the proposed model.

   Log Template/Schema with personal data identifiers
                        |
                        V
                   Log library
                        |
                        V
                   Application
                        |
                        V
              Generate annotated log
                        |
                        V
                  Log redaction
                    +--- consumer / role based
                    +--- sensitivity based

                     Figure 1: Flow

5.1. Defining the log privacy schema

We propose using structured logging where a log schema or template defines standardized identifiers for every piece of personal information, and each log field is associated with a sensitivity level customized to a use case or log intent. Note that this is not to be confused with a log severity level (WARN, INFO...): those are typically defined "dynamically" by the developer when deciding the severity of a certain scenario. A privacy sensitivity level is defined statically and is part of a log schema, associated with the log name and data type.

+------------------+-----------------+------------------------------+-------------+
| Name             | Abstract Data   | Description                  | Sensitivity |
|                  | Type            |                              | [1-High     |
|                  |                 |                              | 5-Normal]   |
+------------------+-----------------+------------------------------+-------------+
| nationalIdentity | String          | National IDs issued by       | 1           |
|                  |                 | sovereign governments,       |             |
|                  |                 | e.g., SSN                    |             |
| drivingLicense   | String          | Driving License number       | 1           |
| taxIdentity      | String          | Tax identification numbers   | 1           |
| creditCardNumber | String          | Credit cards                 | 1           |
| bankAccount      | String          | Bank account number          | 1           |
| dateOfBirth      | Date            | Date of Birth                | 2           |
| personName       | String          | Person name                  | 1           |
| emailAddress     | String          | Email                        | 2           |
| phoneNumber      | Number          | Phone                        | 1           |
| zipCode          | Integer         | Zip codes                    | 5           |
| ipAddress        | ipv4Address     | IPv4 or IPv6 Address         | 4           |
| dateTimeSeconds  | dateTimeSeconds | seconds                      | 5           |
| age              | Integer         | Age                          | 2           |
| ethnicGroup      | String          | Ethnic group                 | 1           |
| genderIdentity   | String          | Gender identity              | 1           |
| macAddress       | macAddress      | MAC Address                  | 4           |
+------------------+-----------------+------------------------------+-------------+

                Personal Information Identifiers Registry

If an organization already uses structured logging with a log schema, then a privacy sensitivity level can be an additional attribute of the schema.

The privacy sensitivity level for log types is intended to be defined by a centralized effort around privacy preservation in logs. In other words, this mapping might be done by an organization's privacy team (which can include lawyers, engineers and privacy professionals). The intention is that all logs generated by an organization should conform to this structured format, which would ease downstream processing of logs for access control and removal of sensitive information.

If the log is being generated by a web server, then two approaches can be taken:

1. Modify the log format for the service: identify the log data type of each piece of log data generated, and tag it at generation (examples are provided in Section 6).

2. Add automated tagging in a centralized log aggregator: collect all the logs generated by different services and apply the annotation using the log schema at the aggregator.

5.2. Typical Workflow

1. The log privacy schema can be parsed into a structured logging library that is used by individual developer teams. The intention is for developers not to log arbitrary data, i.e., they are asked to identify the data type of the state they want to preserve.

2. Any addition to the log schema would have to go through review by the privacy team that came up with the log schema.

3. Once a log is generated, tagged and stored, various kinds of access control techniques can be applied to determine who can access the logs.

5.3. Log Processing and Access Control

1. Consumer Role Based Access

   A. Once the log is tagged, access to it can be based on a consumer's role and privilege level.

   B. A consumer's role-based policy can define which sensitivity levels that consumer can access.

2. Case-based access

   A. If there is a genuine case for which access to sensitive information is needed and granted by the legal department, a cryptographically-signed token (e.g., a JWT) can be generated that allows a developer/user to access logs of a higher sensitivity level. This access can be temporal in nature, i.e., the token will only be valid for a certain amount of time.

   B. A transaction ID can also be propagated automatically throughout the request processing, to correlate different logs related to a single request. Note that the notion of a "request" can vary based on what the application is doing. The idea is to have a single unifying ID to tie a particular action together. If this is done, then the temporary token can be restricted to a particular request ID.

3. Redaction Techniques

   A. Given that the log is tagged, an organization might choose to redact the more sensitive logs, i.e.,
      ones more sensitive than a chosen level, or ones of a certain log type.

   B. More sophisticated approaches can be developed, e.g., completely redact log types username and email, but obfuscate the IP address so that a rough location can still be derived from the log record. In this way, techniques such as differential privacy can be used in tandem to provide privacy guarantees for logs while still keeping them useful to developers.

6. Examples

An example based on the RFC 3164 log format.

Normal Log Output

   <120> Nov 16 16:00:00 10.0.1.11 ABCDEFG: [AF@0
   event="AF-Authority failure"
   violation="A-Not authorized to object" actual_type="AF-A"
   jrn_seq="1001363" timestamp="20120418163258988000"
   job_name="QPADEV000B" user_name="XYZZY" job_number="256937"
   err_user="TESTFORAF" ip_addr="10.0.1.21" port="55875"
   action="Undefined(x00)" val_job="QPADEV000B" val_user="XYZZY"
   val_jobno="256937" object="TEST" object_library="CUS9242"
   object_type="*FILE" pgm_name="" pgm_libr="" workstation=""]

Log Output with Personal Information Tagging

   <120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0
   event="AF-Authority failure"
   violation="A-Not authorized to object" actual_type="AF-A"
   jrn_seq="1001363" timestamp="20120418163258988000"
   job_name="QPADEV000B"
   {personName="XYZZY" pii_sensitivity_level=1}
   job_number="256937"
   {emailAddress="xyz@foo.com" pii_sensitivity_level=2}
   {ip_addr="10.0.1.21" pii_sensitivity_level=4}
   port="55875" action="Undefined(x00)" val_job="QPADEV000B"
   val_jobno="256937" object="TEST" object_library="CUS9242"
   object_type="*FILE" pgm_name="" pgm_libr="" workstation=""]

7. IANA Considerations

IANA can consider defining a new central repository for Personal Information name and identifier registries to be used in logging personal information. The personal identifier registry would enumerate names and identifiers as described in Section 5.1.

8. Security Considerations

It is anticipated that developers will want additional log data types for capturing application logic, and might abuse an existing log type instead of going through the process of adding a new one. In such a case, the log would be incorrectly tagged. This can be mitigated by having stronger typing for the log data types, e.g., restricting an address to a certain string length instead of storing values of arbitrary length. Encouraging developers to think carefully about what kind of data they are logging is good practice and will lead to fewer incidents of private data being inadvertently logged.

An organization might choose to have an unstructured log type for letting developers log data that truly does not fit anywhere else. This is still better than not having structured privacy-aware logging, because the potential privacy leakage is isolated to one particular field and its use can be monitored.

Maintaining a mapping from log data type to privacy sensitivity will need continuous effort by a privacy team, which might be expensive for an organization.

Log data is often collated, propagated, transformed and loaded into different formats or data models for purposes of analytics, troubleshooting and visualization. In such cases, it is critical to ensure that personal information tagging and annotations are preserved and forwarded across format transformations.

If the privacy marking or classification changes for a log, then for historical logs the change of privacy classification is applied on subsequent access of the log.

*TODO*: In case of logs that are not tagged or marked with personal information, an out-of-band mechanism to communicate the log template or schema with personal data identifiers can be considered. Such a mechanism can also be used to notify changes to privacy tagging or classification.

9. Acknowledgements

The authors would like to thank everyone who provided helpful comments at the mic at IETF 106 during the PEARG session. Thanks also to Joe Salowey for thoughts on aspects of log transformations, change of privacy classifications and models for privacy marking.

10. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

[RFC3164] Lonvick, C., "The BSD Syslog Protocol", RFC 3164, DOI 10.17487/RFC3164, August 2001.

[RFC6973] Cooper, A., Tschofenig, H., Aboba, B., Peterson, J., Morris, J., Hansen, M., and R. Smith, "Privacy Considerations for Internet Protocols", RFC 6973, DOI 10.17487/RFC6973, July 2013.

Authors' Addresses

   Sandeep Rao
   Grab
   Bangalore
   India

   Email: sandeeprao.ietf@gmail.com

   Santhosh C N
   Grab

   Email: santoshcn1@gmail.com

   Shivan Sahib
   Salesforce

   Email: shivankaulsahib@gmail.com

   Ryan Guest
   Salesforce

   Email: rguest@salesforce.com
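
Illustrative sketch (non-normative)

The tagging at source described in Section 5.1 and the sensitivity-based redaction described in Section 5.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the names PII_SCHEMA, tag_fields and redact are hypothetical, only a subset of the registry is shown, the curly-brace tag notation follows the personName field in the Section 6 example, and the scale follows the registry table header in Section 5.1 (1 is the most sensitive, 5 is normal). The "clearance" value standing in for a consumer's role is likewise an assumption.

```python
import re

# Hypothetical subset of the Section 5.1 registry: identifier -> sensitivity
# (1 = most sensitive, 5 = normal, as in the registry table header).
PII_SCHEMA = {
    "personName": 1,
    "emailAddress": 2,
    "ipAddress": 4,
    "zipCode": 5,
}


def tag_fields(fields):
    """Render key/value pairs, wrapping registry identifiers in a PII tag."""
    parts = []
    for name, value in fields.items():
        level = PII_SCHEMA.get(name)
        if level is None:
            # Not in the registry: emit as an ordinary key="value" pair.
            parts.append(f'{name}="{value}"')
        else:
            parts.append(f'{{{name}="{value}" pii_sensitivity_level={level}}}')
    return " ".join(parts)


# Matches one tagged field, e.g. {personName="XYZZY" pii_sensitivity_level=1}
TAGGED = re.compile(r'\{(\w+)="([^"]*)" pii_sensitivity_level=(\d)\}')


def redact(line, clearance):
    """Mask tagged values that are more sensitive than the clearance.

    A consumer with clearance N may read levels N..5; values tagged with
    a lower number (more sensitive) are replaced by a fixed placeholder.
    """
    def mask(match):
        name, value, level = match.group(1), match.group(2), int(match.group(3))
        if level < clearance:
            value = "REDACTED"
        return f'{{{name}="{value}" pii_sensitivity_level={level}}}'

    return TAGGED.sub(mask, line)


record = tag_fields({
    "event": "AF-Authority failure",
    "personName": "XYZZY",
    "emailAddress": "xyz@foo.com",
    "ipAddress": "10.0.1.21",
    "port": "55875",
})
print(record)
print(redact(record, clearance=4))
```

Because the tags travel with the record, a downstream server can apply redact without re-detecting personal data. A role-based policy would only need to map each consumer role to a clearance value. Values containing a double quote are not handled by this sketch.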