AUSNOG 2025

The Australian Network Operators' Group, AUSNOG, just held its 19th meeting. Here are my impressions of some of the presentations at that event.

Network Operations and Management

Perhaps unsurprisingly for a network operators' meeting, the topic of network management and operations has been a constant subject of presentations and discussions. From a suitably abstract perspective, packet-switched networks are simply a collection of processing engines and transmission elements, and the ability to pull data from these processing elements creates endless possibilities.

In looking over the history of this area, we started with a simple model where the network's processing systems had consoles, and network operators accessed the units to perform both configuration and data extraction using console commands. It's highly likely that there are more than a few network management systems in use today that operate in the same manner, still using expect scripts to perform these functions.

The SNMP model was an early entrant into the network management suite of tools. SNMP can be considered a remote memory access protocol, where each processor writes data to labelled memory locations that can be read by a remote party via SNMP. The protocol also allows for remote writing (to enable configuration operations). It's a very simple model, was widely adopted, and continues to enjoy widespread use.

There have been numerous efforts to "improve" on SNMP. There is Netconf with YANG, which appears largely to be a shuffling from ASN.1 to XML, and not an awful lot else. There has also been a change in the tooling models, moving from individual tasks to the orchestration of large pools of devices, with tools such as Ansible, NAPALM and Salt. However, there are still a few invariants in the network monitoring toolbox, and these include ping, traceroute and a collection of convenient BGP Looking Glasses.

We've moved further by using container-based toolkits to load the monitoring and diagnostic tools into the routers directly. Rather than rely on vendor-provided diagnostics, this approach allows the network operator to code their own diagnostics and load these tools into the routers within the network. It's a step forward from the previous practice of deploying a small computer beside every router to collect data on the view of the network from the perspective of the adjacent router. The saga of monitoring systems is like safety regulations, in that their revisions are written in the blood of prior victims! But such an incremental approach to the task runs the risk of rampant complexity bloat. Collecting megabytes of network device status information per second is quite a realistic scenario today. However, the question is who has the time and skills to shovel through all this data to rephrase it as a relevant and timely indicator of current network anomalies, and to identify options to rectify them? The current fashionable answer is "This is surely a task for AI!"

One of the dominant factors in the network operations and management domain is the pressure of scale, and the view from Amazon AWS is a perspective from one of the larger of the hyperscalers. Scaling for Amazon is unrelenting, with an 80% annual increase in network capacity. How to respond to such scaling pressures? For Amazon the answer is "automation!" They claim that some 96% of network "events" in their network are remediated or mitigated without any human operator intervention. That is a pretty intense level of automated response.
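The general shape of this kind of automated response is a closed loop: recognise a known failure signature, pull the device out of service, let the redundant fabric carry the traffic, apply a scripted fix, and return the device to service, escalating to a human only when the signature is unknown or the fix doesn't take. The following is a minimal sketch of that loop; the signatures, actions and device-control helpers are hypothetical placeholders of my own invention, not anything Amazon has described.

```python
"""Illustrative sketch of a "detect, drain, remediate, restore" loop.
The device-control helpers below are print-only placeholders, not any
real router or AWS API."""

# Map known event signatures to scripted remediation actions (made-up names).
KNOWN_SIGNATURES = {
    "CRC_ERRORS_RISING": "reseat-optics",
    "FABRIC_LINK_FLAP": "disable-link",
    "MEMORY_PARITY_ERROR": "reload-linecard",
}

def drain(device):
    # Take the device out of service; the redundant fabric carries the traffic.
    print(f"draining {device}")

def remediate(device, action):
    print(f"applying '{action}' to {device}")

def healthy(device):
    # Placeholder health check; a real system would re-run its diagnostics here.
    return True

def restore(device):
    print(f"returning {device} to service")

def escalate(device, signature):
    print(f"escalating {device}/{signature} to a human operator")

def handle_event(device, signature):
    """Return True if the event was handled without human intervention."""
    action = KNOWN_SIGNATURES.get(signature)
    if action is None:
        escalate(device, signature)      # the residual few percent of events
        return False
    drain(device)
    remediate(device, action)
    if healthy(device):
        restore(device)
        return True
    escalate(device, signature)          # remediation failed, hand it to a human
    return False

handle_event("leaf-12", "FABRIC_LINK_FLAP")
```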
This level of automation is accompanied by something that is shared across all of the hyperscalers, which is an intense focus on the absence of external dependencies. They take the mantra of "own our own destiny" very seriously. This, and the desire to automate everything, are the key factors behind their capability to manage a network with over 1M devices.

The trend over many years in the routing and switching world was to use ever larger units of equipment with larger port populations. More recently the industry has reversed this trend, and is now interested in single-chip switches. Such devices have fewer ports, and possibly less line buffer memory in total, but the overall switch architecture is simpler, and with up to 64MB of high-speed memory on the same chip as the switch fabric, such single-chip switches can provide improved performance at a lower price point. They have a lower port population, so there are potentially many more such switches, but they can be built into a very dense switching frame with a measure of redundant switch capacity. The typical building block is a 12.8T 1RU switch with 32 x 400G ports, a single chip, a single control plane and 64MB of high-speed local memory. 32 of these devices are assembled into a 100T rack, with a 3-tier Clos switching fabric, and 42 of these racks get you to petabit levels of capacity!

The automation approach is to react to a known signal from a switch device by taking the device into an out-of-service condition and letting the redundant framework take care of the associated traffic movement. What is perhaps unique here is the deliberate approach of designing and constructing an automated network from the start, as distinct from the more conventional approach of retro-fitting automation of network operations onto an existing network. The standard automated response is to pull the unit out of service, remediate the device and roll it back into service. It should also be noted that this approach of using a custom-designed switch chip and its own firmware bypasses the typical router function bloat (and related complexity issues) found in today's routers. The functions of the units can be tightly constrained, and the conditions that generate a management alarm are similarly constrained. This approach facilitates the automation of the environment.

Finally, on the topic of network management and operations, Nokia presented on AI data centres and their take on Ultra Ethernet. These data centres can be seen as an amalgam of a number of discrete networks, namely a GPU memory access network, storage access, in-band communication and out-of-band control. The most demanding of these is the GPU memory access application. This is based upon RoCEv2 (RDMA over Converged Ethernet v2), a networking protocol that enables high-throughput, low-latency data transfers over Ethernet by encapsulating InfiniBand transport packets within UDP/IPv4 or UDP/IPv6 packets. RoCEv2's layer 3 operation allows it to be routed across subnets, making it suitable for large-scale data centres, high-performance computing (HPC) and AI/ML workloads. It requires a lossless Ethernet network, achieved through Priority Flow Control (PFC) and Explicit Congestion Notification (ECN), to maintain its performance. (One could argue that another approach to lossless switching is to take advantage of redundancy and send the same packet via multiple paths, but it appears that Ultra Ethernet has headed down the path of attempting to avoid packet loss by signalling the modern equivalent of a 'source quench'.) The congestion response is layered: ECN marking is applied at a low queue occupancy threshold and WRED drops kick in at a higher threshold, while the second-level response is PFC, a signal for the lower-priority senders to pause sending until released.
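To make that layering concrete, here is a minimal sketch of a single queue's marking and drop decision under these rules. The threshold values are invented for illustration, and the model is far simpler than any real switch ASIC.

```python
"""Illustrative sketch of the layered congestion response described above:
ECN marking at a low queue threshold, WRED-style drops at a higher threshold,
and a PFC pause as the last resort. Thresholds are made-up numbers, not any
vendor's defaults."""

import random

ECN_THRESHOLD = 0.20    # start ECN-marking packets once the queue is 20% full
WRED_MIN      = 0.60    # start probabilistic drops at 60% occupancy
WRED_MAX      = 0.90    # drop probability ramps up to WRED_MAX_PROB at 90%
WRED_MAX_PROB = 0.5
PFC_THRESHOLD = 0.95    # beyond this, pause the senders on that priority

def enqueue_decision(queue_occupancy: float) -> str:
    """Return the action taken for a packet arriving at this queue depth."""
    if queue_occupancy >= PFC_THRESHOLD:
        return "send PFC pause frame upstream"
    if queue_occupancy >= WRED_MIN:
        # WRED: drop probability grows linearly between the min and max thresholds
        ramp = (queue_occupancy - WRED_MIN) / (WRED_MAX - WRED_MIN)
        drop_p = min(1.0, ramp) * WRED_MAX_PROB
        if random.random() < drop_p:
            return "drop (WRED)"
    if queue_occupancy >= ECN_THRESHOLD:
        return "forward, ECN-marked (CE)"
    return "forward"

for depth in (0.1, 0.3, 0.7, 0.97):
    print(f"{depth:.0%} full -> {enqueue_decision(depth)}")
```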
The fabrics come in two topologies: a "Rail-Only" fabric, in which every GPU is reachable within a single hop, and a "Rail-Optimized" fabric used to scale to larger GPU populations, while the in-band network remains a traditional Clos fabric. The fabric options are heading towards 800G single-mode fibre links. These are massive facilities, and the Ultra Ethernet Consortium's effort to build a modern RDMA transport layer for these high-scale, high-performance data centre applications reflects the way the specialised nature of these data centres is pushing scaling to new levels.

Submarine Cables

The traditional telco model for submarine cables was demand-driven, based on a half-circuit model. That changed with the entrance of the CDNs (or hyperscalers) and their fully owned cable systems, starting around 2010. Google, Meta, AWS and Microsoft now account for some 70% of global cable usage, and this growth has happened in the last decade. Their model is the integration of transmission, data centres and service/content. Only 52% of announced cable projects are completed, yet 100% of the hyperscalers' projects are, and the number of subsea cables has grown from some 130 to around 570 in 2025, with recent examples including SEA-ME-WE6 and SJC2. Australia has been active in investing in cables as a blocking move against Chinese investment. Most Pacific cables are not commercially viable, and the projects that do get up generally leverage such investment (see the ASPI report of September 2024). I preferred Brewer's talk.

There was also a presentation on IRR AS-SETs (see https://storage.googleapis.com/site-media-prod/meetings/NANOG94/5436/20250610_Madory_The_Scourge_Of_v1.pdf and https://benjojo.co.uk/talks/), and an observation drawn from NBN experience that TACs are useless!

Quantum Cryptography

I am happy to say that I am one of the few billion people on this planet who has absolutely no clear understanding of quantum physics and the magical properties of quantum entanglement and superposition. But the hype machine is working overtime and already we have "Quantum 2.0". I have yet to hear a clear explanation of qubits, state superposition and entanglement, and the collapse of the superposition state under measurement or observation. There is quantum networking and quantum computing, both of which have been high-cost, high-profile flagship technology projects of the past decade or so. Large sums of money have been spent already, and the early implementations of quantum computers are already out there. They are evidently not all that impressive to date: finding the prime factors of 21 is a quantum-computable problem, but performing the same function on a 40-digit integer is still some time away (see https://www.potaroo.net/ispcol/2024-11/pqc-fig1.png). Work progresses, and Google's Willow, Microsoft's Majorana and D-Wave's Advantage-2 are moving the needle. Is this a problem today? Well, no. But if the secret needs to remain a secret for, say, 20 years, then the issue is that we are not assessing the difficulty of cracking the cryptography using today's computers, but the feasibility of performing the same function 20 years from now.
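For a sense of what a quantum computer is actually being asked to do when it factors a number like 21, here is the classical scaffolding of Shor's method with the quantum order-finding step replaced by brute force. This toy obviously says nothing about the feasibility of factoring a real RSA modulus; it just shows where the quantum speedup would sit.

```python
"""Toy illustration of the classical scaffolding around Shor's factoring
algorithm, applied to N = 21. The quantum speedup lies entirely in the
order-finding step, which is replaced here by brute force."""

from math import gcd

def find_order(a, n):
    """Smallest r > 0 with a^r = 1 (mod n): the step a quantum computer accelerates."""
    r, value = 1, a % n
    while value != 1:
        value = (value * a) % n
        r += 1
    return r

def shor_classical(n, a=2):
    """Classical outline of Shor's factoring method for n, using base a."""
    g = gcd(a, n)
    if g != 1:                      # lucky case: a already shares a factor with n
        return g, n // g
    r = find_order(a, n)
    if r % 2 == 1:                  # need an even order to take a square root
        return None
    x = pow(a, r // 2, n)
    if x == n - 1:                  # trivial square root of 1; retry with another a
        return None
    p, q = gcd(x - 1, n), gcd(x + 1, n)
    if p * q == n and 1 not in (p, q):
        return p, q
    return None

# The order of 2 mod 21 is 6, so 2^3 = 8, gcd(7, 21) = 7 and gcd(9, 21) = 3.
print(shor_classical(21))           # (7, 3)
```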
There is a decent consensus in the area of quantum computing that many of today's commonly used crypto algorithms would be susceptible to cracking. For symmetric crypto algorithms the solution is simply to double the key size. For asymmetric crypto the problem is the cracking of the key exchange: asymmetric algorithms are computationally expensive, which is why they are used for key exchange to set up the symmetric session keys. For RSA the underlying hard problem is factoring a large composite number that is the product of two large primes, and that is exactly the problem Shor's algorithm attacks. The response has been the development of post-quantum cryptography: larger keys for symmetric algorithms, and new algorithms for asymmetric cryptography. NIST has been working on this and has come up with FIPS 203, 204 and 205 for key encapsulation and signatures. The work now is to push these crypto algorithms into the existing standards that use cryptography.

How real is this threat? The basic question is: who has 20-year secrets? DNS and DNSSEC? No, not really. TLS and authentication? No. TLS and session encryption? Yes. Should the network layer encrypt? And who pays?

(There was also a presentation posing the question: do MikroTiks suck?)

Transport Protocol Evolution

MPLS is not transport: it is network state versus stateless forwarding. "MPLS TE solved the IP TE scale problem." Really? Not really. Just get more bandwidth! The traffic engineering task comes from the 1970s, when transmission was scarce and expensive and we needed to ration it out and sell it for a fortune. Today transmission is abundant, and frankly we have no need to ration it out and no market left in which to sell it at a premium. This smacks of complexity merchants shoving complexity into the network for no enduring value. "There is no state left in the network." Source-routed virtual circuits? Really? Wow, what a clash of cultures! The asset of IP was pushing functionality out of the network and into the edges. Fast Reroute: why? Things are getting a little bit crazy. You can achieve all of this with a decent IGP, enough bandwidth in your network, and well-tuned end-to-end transport protocols. "If you want to think like an RSVP guy..." I had thought this network-centric view, where the network is the arbiter and maintainer of responses to failure and congestion while the application transport is treated as dumb, insensitive and intolerant, was a view that fell out of favour decades ago. We've moved the locus of control up the stack, and applications treat the network as an abundant commodity. There is no need for applications to negotiate with the network; at best, applications need to implicitly negotiate with each other to mutually balance the demands they place onto a common resource.

On the day-to-day monitoring toolbox, Batfish (https://batfish.org/) can test a device configuration to validate it before it gets fielded, no matter the "question", and the change response becomes: ship it and roll it out, remediate and roll it out again. "Ping all the things," including the next hop (and whatever happened to Route Views and the looking glasses?). The toolbox itself is familiar: Observium/LibreNMS, Zabbix and flow monitoring, RANCID, NetBox and Grafana, and before all of this it was Excel spreadsheets! And everyone pings Google's 8.8.8.8. Why not use AI? One approach presented here is not to attach a Raspberry Pi beside each unit, but to embed the scripts in Docker and load them onto the router itself, though the question remains whether the router is offering a biased view of the network. As far as I can see, the current response to the increasing level of detail and complexity in network monitoring and management is "Roll on AI!"
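And since "ping all the things" keeps coming up, a minimal reachability sweep of the kind every monitoring stack seems to reinvent might look like the sketch below. The target list is illustrative (two RFC 5737 documentation addresses plus the inevitable 8.8.8.8), and the flags assume a Linux iputils ping.

```python
"""A minimal "ping all the things" reachability sweep. Targets are
illustrative; using 8.8.8.8 as the canary is exactly the habit noted above."""

import subprocess

# RFC 5737 documentation addresses (which will not answer) plus the usual canary.
TARGETS = ["192.0.2.1", "198.51.100.1", "8.8.8.8"]

def reachable(addr: str, timeout_s: int = 2) -> bool:
    """One ICMP echo via the system ping binary (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for addr in TARGETS:
        status = "up" if reachable(addr) else "DOWN"
        print(f"{addr:<15} {status}")
```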
A couple of footnotes to the AWS "No Packet Left Behind" presentation (Lincoln): there is related IETF work on exporting classified packet-discard information via IPFIX (https://datatracker.ietf.org/doc/draft-evans-opsawg-ipfix-discard-class-ie/), and Amazon's CloudWatch Internet Monitor also rated a mention.
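As a back-of-envelope check on the fabric numbers quoted in the AWS material (32 x 400G ports per single-chip switch, 32 switches per rack, 42 racks), the arithmetic below is mine, not the presenter's. In particular, the assumption that half the switches act as leaves with half their ports facing outwards is just one plausible way to reconcile the raw port count with the quoted 100T rack figure.

```python
# Back-of-envelope check on the AWS fabric numbers quoted above.
# The Clos-usable-capacity assumption is my own, not a figure from the talk.

PORT_SPEED_GBPS  = 400    # per port
PORTS_PER_SWITCH = 32     # the single-chip 1RU building block
SWITCHES_PER_RACK = 32
RACKS = 42

per_switch_tbps = PORT_SPEED_GBPS * PORTS_PER_SWITCH / 1000
print(f"per switch: {per_switch_tbps:.1f} Tbps")             # 12.8 Tbps

# Raw port capacity of a rack of 32 such switches.
raw_rack_tbps = per_switch_tbps * SWITCHES_PER_RACK
print(f"raw rack capacity: {raw_rack_tbps:.1f} Tbps")        # 409.6 Tbps

# In a Clos arrangement only the edge-facing ports count as usable capacity.
# If, say, 16 of the 32 switches are leaves with half their ports facing out,
# the usable figure is 16 * 16 * 400G, close to the quoted "100T rack".
usable_rack_tbps = 16 * 16 * PORT_SPEED_GBPS / 1000
print(f"usable rack capacity: {usable_rack_tbps:.1f} Tbps")  # 102.4 Tbps

# 42 racks of roughly 100T lands in petabit-per-second territory.
print(f"{RACKS} racks: {RACKS * 100 / 1000:.1f} Pbps")       # 4.2 Pbps
```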