Network Working Group                                              Z. Du
Internet-Draft                                                     Y. Fu
Intended status: Informational                              China Mobile
Expires: 12 January 2023                                    11 July 2022


    Computing Resource Representation in Computing Aware Networking
             draft-du-computing-resource-representation-01

Abstract

   This document describes how service-specific computing information
   can be encoded and how it can be signaled in the network.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 12 January 2023.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the
   document authors.  All rights reserved.










Du & Fu                  Expires 12 January 2023                [Page 1]

Internet-Draft  Computing Resource Representation in CAN       July 2022


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
   2.  Definition of Terms
   3.  Requirements of Computing Resource Representation and
           Signaling
     3.1.  Requirements of Computing Resource Representation
     3.2.  Requirements of Computing Resource Signaling
   4.  Representation of Computing Information
     4.1.  Representation of Computing Metric
       4.1.1.  Representing in a Single value
       4.1.2.  Representing in Multiple values
     4.2.  Example Process of Computing Load Information
   5.  Signaling of Computing Information
     5.1.  General Process of Informing
     5.2.  BGP Method in Informing
     5.3.  Other Methods in Informing
   6.  Conclusion
   7.  IANA Considerations
   8.  Security Considerations
   9.  Acknowledgements
   10. Contributors
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Authors' Addresses

1.  Introduction

   Traditionally, the network performs traffic engineering based only
   on network status.  With the trend toward computing and network
   convergence, several proposals aim to make the network aware of
   service information so that it can make better traffic steering
   decisions.  Computing-Aware Networking (CAN) steers traffic based on
   both network and computing status, and is considered a mechanism for
   computing and network convergence.






   In the traditional network architecture, the network is only
   responsible for delivering packets between servers and clients, and
   is not aware of the computing information.
   [I-D.liu-dyncast-ps-usecases] and [I-D.liu-dyncast-reqs] show that,
   when service instances are deployed at multiple geographical edge
   sites, CAN would achieve service equivalence and load balancing by
   considering both the service metrics and network metrics.

   However, how to notify service metrics in the network, how to
   represent computing resources, and how to signal computing resources
   to the network remain open questions.  Answering them is important
   for the network domain to learn about the computing domain.

   This document further explores how service metrics can be encoded
   and signaled.  Some requirements on service metric representation
   and signaling can be found in [I-D.liu-dyncast-gap-reqs].


2.  Definition of Terms

   This document makes use of the following terms:

   Computing-Aware Networking (CAN):  Aiming at computing and network
     resource optimization by steering traffic to appropriate computing
     resources considering not only routing metric but also computing
     resource metric and service affiliation.

   Service:  A monolithic functionality that is provided by an endpoint
     according to the specification for said service.  A composite
     service can be built by orchestrating monolithic services.

   Service instance:  Running environment (e.g., a node) that makes the
     functionality of a service available.  One service can have several
     instances running at different network locations.

   Service identifier:  Used to uniquely identify a service, at the same
     time identifying the whole set of service instances that each
     represent the same service behavior, no matter where those service
     instances are running.

   Computing capacity:  The ability of nodes with computing resources
     to produce a specific result through data processing, including
     computing, communication, memory, and storage capacity.

3.  Requirements of Computing Resource Representation and Signaling




3.1.  Requirements of Computing Resource Representation

   CAN needs to obtain the computing information of the computing
   resources for a service in order to steer traffic based on both
   network and computing status.  As described in
   [I-D.liu-dyncast-reqs], the representation and encoding of computing
   metrics is crucial, because these metrics are conveyed to the CAN
   system for the CAN components to act upon.  The representation needs
   to express the capabilities of computing resources accurately, and
   the CAN system must agree on the service-specific metrics and their
   representation among the service elements in the participating
   edges, so that the CAN components can act upon them.

   Moreover, the computing resource representation needs to take
   computing modeling into account, following the requirements
   described in [I-D.liu-can-computing-resource-modeling]:

   Support the representation of computing resources in multiple
   dimensions, including computing capacity, communication capacity,
   cache capacity and storage capacity.

   Support the representation of the computing capacity by chip
   category, such as CPU, GPU, FPGA, and ASIC, and by computing type,
   such as integer, floating-point, and hash calculation.

3.2.  Requirements of Computing Resource Signaling

   The representation results of computing resources need to be exposed
   in the network to support the efficient utilization of computing
   resources, or the joint utilization of both computing and network
   resources, as described in [I-D.liu-dyncast-reqs].  CAN targets
   dynamic scenarios in which the status of computing resources may
   vary frequently, e.g., changing with the number of sessions, CPU/GPU
   utilization, and memory space.  Distributing the real-time
   representation of computing resources more frequently, and
   synchronizing it more accurately, may result in more signaling
   overhead.  Thus, the signaling of computing resources needs to
   distribute and synchronize the real-time representation of computing
   resources efficiently, reducing unnecessary signaling while meeting
   the service requirements.  The requirements contain several aspects
   as described below.

   Support signaling of various messages based on the representation of
   computing resources.

   Support control of the signaling rate, for example by defining at
   what interval or upon which events the information of computing
   resources is signaled.




   Support signaling of updated information of computing resources.

   Support mechanisms for loop avoidance in signaling metrics, when
   necessary.

4.  Representation of Computing Information

   The main job of the network is to forward the packets of the users
   from the source to the destination, while the main job of computing
   is to complete the various tasks of the users.

   The network metrics include bandwidth, latency, jitter, etc.  They
   describe the capabilities of the network and are independent of the
   detailed realization of the underlying technologies, such as the
   mode of the optical fiber or the structure of a switch.

   The computing metrics are more complex and are hard to map to
   QoS/QoE.  For example, if the task is AI computing, such as image
   processing, the computing resource can be measured in FLOPS
   (Floating-point Operations Per Second) or TFLOPS (Tera FLOPS).
   However, it is more difficult to obtain the processing time, which
   is influenced by the current utilization rate of CPU, cache, and so
   on.  Even if a real-time OS or protocol is used, processing will
   sometimes fail because of deadlock or other OS mechanisms.  This
   does not mean that there is a problem with the OS; rather, it
   reflects the complex environment in which it runs.  So, the service
   metric needs to consider more factors to judge the performance, as
   well as how it can be used in another domain to guarantee the E2E
   service quality.

   [I-D.liu-can-computing-resource-modeling] proposes a basic
   architecture of computing resource modeling, which considers the
   computing hardware types, computing task types, communication, cache,
   and storage status, and uses a vector to represent the basic result
   of modeling.  The result could be:

   a group of multiple vectors, to represent the evaluated level of
   computing, communication, cache, and storage capacity.

   a single vector, to represent the single comprehensive level of
   overall capacity.

   How to use the vector depends on the specific application domain.
   For the network, weighted or fuzzy processing methods are usually
   used to preserve the metadata privacy of the computing domain.
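
   As a minimal sketch of such weighted and fuzzy processing, the
   Python fragment below folds a group of per-dimension levels into one
   comprehensive score and then coarsens it into a grade.  The
   dimension names follow the text above; the weights, the 0.0 to 1.0
   scale, the thresholds, and the function names are illustrative
   assumptions, not values taken from
   [I-D.liu-can-computing-resource-modeling].

```python
# Hypothetical sketch: weights, scale, and thresholds are assumptions.

def comprehensive_level(levels, weights):
    """Weighted average of per-dimension levels (each in 0.0..1.0)."""
    total_weight = sum(weights[name] for name in levels)
    weighted_sum = sum(levels[name] * weights[name] for name in levels)
    return weighted_sum / total_weight


def fuzzy_grade(score, low=0.3, high=0.7):
    """Coarsen a score so that raw measurements are not exposed."""
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"


# Evaluated levels of one edge site, one entry per modeled dimension.
levels = {"computing": 0.8, "communication": 0.6,
          "cache": 0.5, "storage": 0.9}
weights = {"computing": 0.4, "communication": 0.3,
           "cache": 0.1, "storage": 0.2}

score = comprehensive_level(levels, weights)
grade = fuzzy_grade(score)
print(round(score, 2), grade)
```

   Only the coarse grade would need to leave the computing domain in
   this sketch; the per-dimension levels stay local.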






4.1.  Representation of Computing Metric

   How to use the vector depends on the specific application demands.
   To preserve the metadata privacy of the computing domain, CAN usually
   uses weighted or fuzzy processing methods.

   Based on [I-D.liu-can-computing-resource-modeling], there are two
   general ways to represent the information of computing resources for
   the network.  One is to use a single vector to represent the level;
   the other is to use a group of vectors to represent more detailed
   information.

4.1.1.  Representing in a Single value

   On the one hand, general computing load information can be offered
   to the ingress nodes.  As an example, perhaps only three values are
   needed:

   one red value stands for the busy status,

   one yellow value stands for relatively busy status,

   one green value stands for free status.

   Therefore, the ingress node only needs to consider the yellow edge
   sites and green edge sites when steering traffic, in which the green
   ones are more preferred.
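
   The three-value scheme can be sketched as follows.  This is only an
   illustration: the numeric codes, the site names, and the
   candidate_sites() helper are hypothetical, since the text above only
   names the three colors and their meaning.

```python
from enum import IntEnum


class Status(IntEnum):
    # Numeric codes are assumptions; the document only names the colors.
    GREEN = 0   # free
    YELLOW = 1  # relatively busy
    RED = 2     # busy


def candidate_sites(site_status):
    """Return the non-red sites, green ones first (ties by site name)."""
    usable = [(status, site) for site, status in site_status.items()
              if status != Status.RED]
    return [site for status, site in sorted(usable)]


sites = {"edge-a": Status.YELLOW,
         "edge-b": Status.RED,
         "edge-c": Status.GREEN}
order = candidate_sites(sites)
print(order)
```

   In this sketch the red site is never returned, and the ingress node
   would try the sites in the returned order.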

4.1.2.  Representing in Multiple values

   On the other hand, more detailed computing-related information can
   also be offered, still expressed as weighted values as described in
   [I-D.liu-can-computing-resource-modeling].  Such information
   includes computing capacity information (covering chip category and
   computing task category), communication information, cache
   information, and storage information.

   Moreover, some additional information could also be represented if
   needed:

   the service information deployed on edge sites, for example, Service
   ID,

   the maximum session number that the edge sites can provide,

   the current session number at the edge sites,

   the available computing infrastructure of the server, etc.



   This information may be optional and encoded as TLVs.  A specific
   service may have a specific preferred set of TLVs.  For example, if
   multiple instances have the same free status, the additional TLVs
   could be used to describe their computing resources in more detail.
   The detailed decision algorithm is out of scope of this document.

   The informing of the TLVs should be service-specific and on-demand.
   Different services may care about, or have subscribed to, different
   sets of TLVs.  Besides, if an Ingress node receives any TLV that it
   does not support, the Ingress node can just ignore it.
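
   The "ignore what you do not support" behavior can be sketched in
   Python as below.  This document does not define the TLV layout or
   any type codes, so the one-octet Type, one-octet Length layout and
   the three type constants are assumptions made purely for
   illustration.

```python
import struct

# Hypothetical type codes; no code points are defined by this document.
TLV_SERVICE_ID = 1
TLV_MAX_SESSIONS = 2
TLV_CUR_SESSIONS = 3


def encode_tlv(tlv_type, value):
    """Encode one TLV with an assumed 1-octet type and 1-octet length."""
    if len(value) > 255:
        raise ValueError("value too long for a 1-octet length")
    return struct.pack("!BB", tlv_type, len(value)) + value


def decode_tlvs(data, supported):
    """Decode TLVs, silently skipping types not in 'supported'."""
    decoded = {}
    offset = 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type, length = struct.unpack_from("!BB", data, offset)
        offset += 2
        if offset + length > len(data):
            raise ValueError("truncated TLV value")
        if tlv_type in supported:
            decoded[tlv_type] = data[offset:offset + length]
        offset += length  # skip the value whether or not it was kept
    return decoded


payload = (encode_tlv(TLV_MAX_SESSIONS, struct.pack("!I", 1000))
           + encode_tlv(99, b"\x01\x02")   # type unknown to this node
           + encode_tlv(TLV_CUR_SESSIONS, struct.pack("!I", 250)))
decoded = decode_tlvs(payload, {TLV_MAX_SESSIONS, TLV_CUR_SESSIONS})
max_sessions = struct.unpack("!I", decoded[TLV_MAX_SESSIONS])[0]
cur_sessions = struct.unpack("!I", decoded[TLV_CUR_SESSIONS])[0]
print(max_sessions, cur_sessions)
```

   Because each TLV carries its own length, the decoder can step over
   an unknown type without understanding its contents.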

4.2.  Example Process of Computing Load Information

   For a specific service, we can offer both general computing load
   information and some more specific information about the computing.
   A general process is described below.

   Step1: The service instances are deployed in multiple edge sites.
   The ingress nodes of the network, working as load balancing points,
   need to obtain the computing information.  The service should have a
   specific service identifier in the network, for example ServiceID1,
   so that the ingress node can recognize the service request and treat
   it differently according to that identifier.

   Step2: After obtaining the computing information of a service related
   to ServiceID1 from multiple edge sites, the ingress nodes should
   record the computing information.  Meanwhile, an ingress node should
   also be able to obtain network status, for example, the latency to
   the egress of an edge site, and record it.

   Step3: An ingress node receives a packet targeted to ServiceID1.
   According to the service metrics and network metrics it has recorded,
   the ingress node decides which edge site to use and forwards the
   packet to the related egress.  The selection method may depend on
   the service.  For example, it may select the site with the lowest
   latency among the ones that can offer the service, or the one with
   the best computing resource among the ones that have a latency
   fulfilling the service requirements, or use a hybrid method.

   The purpose of the procedure is to find an edge site that is
   relatively near to the client and also has enough computing
   resources for the service.  However, the edge sites that provide the
   service may be diverse and may have different computing abilities.
   Therefore, a load balancing method considering the computing
   resource is useful in this scenario.
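
   The two selection methods mentioned in Step3 can be sketched as
   follows.  The record layout, the use of a 0.0 to 1.0 load value in
   which lower means more free computing resource, and all names are
   assumptions for illustration; the actual decision algorithm is out
   of scope of this document.

```python
from dataclasses import dataclass


@dataclass
class SiteRecord:
    egress: str
    latency_ms: float  # network metric recorded by the ingress
    load: float        # assumed computing load, 0.0 (free) to 1.0 (busy)


def lowest_latency(records):
    """Pick the egress with the lowest recorded latency."""
    return min(records, key=lambda r: r.latency_ms).egress


def best_computing_within(records, max_latency_ms):
    """Pick the least-loaded egress whose latency meets the bound."""
    eligible = [r for r in records if r.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda r: (r.load, r.latency_ms)).egress


# State recorded in Step2, keyed by service identifier.
table = {"ServiceID1": [SiteRecord("egress-1", 5.0, 0.9),
                        SiteRecord("egress-2", 12.0, 0.2),
                        SiteRecord("egress-3", 40.0, 0.1)]}

records = table["ServiceID1"]
by_latency = lowest_latency(records)
by_computing = best_computing_within(records, max_latency_ms=20.0)
print(by_latency, by_computing)
```

   With these sample records the two methods disagree, which is the
   situation in which a hybrid method would have to weigh the two
   kinds of metrics against each other.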






5.  Signaling of Computing Information

   The target of CAN is to steer traffic considering both network and
   computing resource status.  To meet the use case demands in
   [I-D.liu-dyncast-ps-usecases], an "on-path" decision is expected.
   For instance, the Ingress of the network works as the decision point
   to steer the traffic of the users.  In this situation, the Ingress
   needs to know the computing information of the service instance,
   which could be behind the Egress.  Some of the computing information
   is relatively static, and some is dynamic.  The two kinds may be
   delivered by different means and at different frequencies.

   Besides computing resource modeling and computing resource
   representation, CAN should also focus on how to deliver the computing
   information from the Egress to the Ingress.

5.1.  General Process of Informing

   For the signaling of the computing information, a general process is
   described below.

   Step1: The gateway of the edge site collects the computing status
   information of the specific service instance or a categorized
   service.  In some cases, there will be a controller in the edge
   site, which can help to collect the information and notify the
   gateway.

   Step2: The Egress of CAN receives the service status information from
   the gateway of the edge site and notifies the CAN ingress nodes.

   In the first step, the controller or the gateway could perhaps
   communicate using PCE or another controller-oriented protocol.  In
   the second step, a controller-based method can also be used;
   however, communications between the controller of the edge site and
   the controller of the network may be complicated and inefficient.

   The following sections describe some potential ways to notify
   computing information, including a BGP extension and other potential
   methods.  When notifying that an edge site has the service, i.e., a
   binding address for the service and the corresponding route to it,
   additional computing information can be added in its Extended
   Community.









5.2.  BGP Method in Informing

   As the informing of the computing information is for the edge network
   nodes, we can consider using BGP, specifically MP-BGP [RFC4760].  BGP
   is a gateway protocol that enables the exchange of routing
   information between Autonomous Systems (ASes).  MP-BGP allows VPN
   edge nodes to exchange client information via different underlay
   networks (e.g., MPLS).  As said before, we can add the computing
   information in the Extended Community.

   When we notify the route for a specific service (named ServiceID1),
   whose address is an anycast address, in a BGP UPDATE message, the
   route can include many Path Attributes.  The Extended Community
   [RFC4360] is one of these attributes.

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |             Type              |           Length              |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |    Flag       |    Status     |          Sub-tlvs             |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
             Figure 1. Format of the Computing Information in BGP

   Type: TBD, for example, 0x0314.

   Length: This refers to the total length in octets of the element
   excluding the Type and Length fields.

   Flag: all zero.

   Status: the first two bits are used.

   Sub-tlvs: the sub-tlvs related to computing information.
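
   To make the layout in Figure 1 concrete, the Python sketch below
   packs and unpacks the fields in network byte order.  It uses the
   example Type value 0x0314 from the text (the real value is TBD),
   and it assumes that "the first two bits" of Status are the two most
   significant bits of that octet.  The sub-TLV bytes are passed
   through opaquely because their format is not defined here.  The
   function names are hypothetical.

```python
import struct

EXAMPLE_TYPE = 0x0314  # example value from the text; actual value TBD


def pack_computing_info(status, sub_tlvs=b"", type_code=EXAMPLE_TYPE):
    """Build the element of Figure 1; status is a 2-bit value (0..3)."""
    if not 0 <= status <= 3:
        raise ValueError("status must fit in two bits")
    flag = 0                      # Flag: all zero
    status_octet = status << 6    # assumption: two most significant bits
    body = struct.pack("!BB", flag, status_octet) + sub_tlvs
    # Length counts the octets after the Type and Length fields.
    return struct.pack("!HH", type_code, len(body)) + body


def unpack_computing_info(data):
    """Return (type_code, flag, status, sub_tlvs) from a packed element."""
    type_code, length = struct.unpack_from("!HH", data, 0)
    body = data[4:4 + length]
    if length < 2 or len(body) != length:
        raise ValueError("malformed computing information element")
    flag, status_octet = struct.unpack_from("!BB", body, 0)
    return type_code, flag, status_octet >> 6, body[2:]


message = pack_computing_info(2, sub_tlvs=b"\x01\x02\xab\xcd")
print(message.hex())
parsed = unpack_computing_info(message)
```

   For this input the Length field is 6: one octet each for Flag and
   Status plus four octets of sub-TLV data.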

   One example of a Sub-tlv is the FLOPS value, which is widely used in
   AI analysis scenarios.  For some services that need a large amount of
   computing resources, we can also provide a general computing grade
   of a server, such as large, medium, or small.

   Besides the computing information, BGP can also be extended to
   exchange some other information for CAN.  While notifying the
   computing load, the network can also monitor the whole load
   balancing system.  If any service becomes heavily loaded, i.e., all
   the service instances for the service are busy, the network should
   be able to inform potential inactive service points to join the
   load balancing.  Similarly, if any service becomes lightly loaded,
   i.e., all the service instances for the service are idle, the
   network should also be able to inform an active service point to
   become inactive so as to release its resources to other services.

   The update frequency needs further consideration.  A BGP UPDATE
   message is sent when the network topology, a path, or other status
   changes, not periodically.  A mechanism is therefore needed to match
   notifications to the computing status changes of edge sites.  It
   should consider how long the information remains valid, and it
   should prevent the network from being overloaded by status update
   notifications, for instance, by applying a set threshold.
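
   One possible shape of such a threshold mechanism is sketched below
   in Python.  It is not a mechanism defined by this document: the
   class name, the combination of a change threshold with a minimum
   interval, and the sample values are all assumptions.

```python
class UpdateSuppressor:
    """Advertise only on a large enough change, and not too often."""

    def __init__(self, threshold, min_interval_s):
        self.threshold = threshold            # minimum change to report
        self.min_interval_s = min_interval_s  # minimum gap between reports
        self.last_value = None
        self.last_time = None

    def should_advertise(self, value, now_s):
        if self.last_value is None:
            advertise = True  # the first observation is always reported
        else:
            changed = abs(value - self.last_value) >= self.threshold
            waited = (now_s - self.last_time) >= self.min_interval_s
            advertise = changed and waited
        if advertise:
            self.last_value = value
            self.last_time = now_s
        return advertise


suppressor = UpdateSuppressor(threshold=0.2, min_interval_s=10.0)
samples = [(0.30, 0.0), (0.35, 5.0), (0.60, 8.0), (0.60, 12.0),
           (0.65, 30.0)]
decisions = [suppressor.should_advertise(load, t) for load, t in samples]
print(decisions)
```

   In this run only the first sample and the sample at 12 seconds
   trigger an advertisement; small changes and changes that arrive too
   soon after the previous advertisement are suppressed.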

5.3.  Other Methods in Informing

   The computing information can be treated similarly to the OAM
   (Operations, Administration and Maintenance) information in the
   network.  Therefore, it should also be possible to carry it in OAM
   messages with some proper extensions to current OAM mechanisms.  The
   load balancing point could then collect both network information and
   computing information via OAM mechanisms.

   Some network programming mechanisms such as SRv6 can also be
   considered here.  The computing information can be carried in the
   IPv6 extension headers.  For example, some data packets from the
   Egress to the Ingress can carry the computing information.  The
   insertion of the computing information can take place on the Egress,
   either on demand or periodically.

   Besides BGP, OAM, and network programming mechanisms, a CAN-specific
   method of computing information notification could also be defined
   if needed.

6.  Conclusion

   This document analyzes the requirements of computing resource
   representation and signaling, which are key functions of CAN, and
   proposes some potential methods to meet them.

7.  IANA Considerations

   TBD.

8.  Security Considerations

   TBD.





9.  Acknowledgements

   TBD.

10.  Contributors

   The following people have substantially contributed to this document:

   Linda Dunbar

11.  References

11.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC4360]  Sangli, S., Tappan, D., and Y. Rekhter, "BGP Extended
              Communities Attribute", RFC 4360, DOI 10.17487/RFC4360,
              February 2006, <https://www.rfc-editor.org/info/rfc4360>.

   [RFC4760]  Bates, T., Chandra, R., Katz, D., and Y. Rekhter,
              "Multiprotocol Extensions for BGP-4", RFC 4760,
              DOI 10.17487/RFC4760, January 2007,
              <https://www.rfc-editor.org/info/rfc4760>.

11.2.  Informative References

   [I-D.liu-can-computing-resource-modeling]
              Liu, P., Du, Z., Rui, L., Li, W., Li, C., and G. Huang,
              "Computing Resource Modeling for CAN", Work in Progress,
              Internet-Draft, draft-liu-can-computing-resource-modeling-
              00, 11 July 2022, <https://www.ietf.org/archive/id/draft-
              liu-can-computing-resource-modeling-00.txt>.

   [I-D.liu-dyncast-gap-reqs]
              Liu, P., Jiang, T., Eardley, P., Trossen, D., and C. Li,
              "Dynamic-Anycast (Dyncast) Gap analysis and Requirements",
              Work in Progress, Internet-Draft, draft-liu-dyncast-gap-
              reqs-00, 8 July 2022, <https://www.ietf.org/archive/id/
              draft-liu-dyncast-gap-reqs-00.txt>.

   [I-D.liu-dyncast-ps-usecases]
              Liu, P., Eardley, P., Trossen, D., Boucadair, M.,
              Contreras, L. M., and C. Li, "Dynamic-Anycast (Dyncast)
              Use Cases and Problem Statement", Work in Progress,
              Internet-Draft, draft-liu-dyncast-ps-usecases-03, 7 March
              2022, <https://www.ietf.org/archive/id/draft-liu-dyncast-
              ps-usecases-03.txt>.

   [I-D.liu-dyncast-reqs]
              Liu, P., Jiang, T., Eardley, P., Trossen, D., and C. Li,
              "Dynamic-Anycast (Dyncast) Requirements", Work in
              Progress, Internet-Draft, draft-liu-dyncast-reqs-02, 7
              March 2022, <https://www.ietf.org/archive/id/draft-liu-
              dyncast-reqs-02.txt>.

Authors' Addresses

   Zongpeng Du
   China Mobile
   No.32 XuanWuMen West Street
   Beijing
   100053
   China
   Email: duzongpeng@foxmail.com


   Yuexia Fu
   China Mobile
   No.32 XuanWuMen West Street
   Beijing
   100053
   China
   Email: fuyuexia@chinamobile.com





















