Plexxi’s View: Data Path Performance of Leaf-Spine Datacenter Fabrics

While doing some competitive analysis, I read this paper that was presented at the 21st Symposium on High Performance Interconnects. The paper discusses the data path performance of spine and leaf networks and was written by Insieme Networks’ Mohammad Alizadeh and CTO Tom Edsall. Mohammad has co-authored several research papers on fabrics and networks, all very worthwhile reading. The paper describes findings from a leaf and spine simulation, focused on how buffer space, fabric link capacity and oversubscription affect performance as the overall traffic load changes.

The paper describes how the authors created a leaf and spine model consisting of 5 racks with 20 servers each, connected with 10GbE. The 5 simulated ToR switches (the leafs) are then connected to modeled spine switches at various oversubscription rates. Unfortunately the paper does not specifically state whether each spine switch takes a single fabric link from each leaf switch (which would require a lot of spine switches) or multiple fabric links from each. If it is the latter, it is also unclear whether those links are aggregated using LAG or treated individually and offered to ECMP as separate paths. I believe each of those options would result in different observed behavior.
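To make the topology arithmetic concrete, here is a minimal sketch. The rack size and access speed come from the description above; the oversubscription ratios, the fabric link speeds in the loop and the helper name `leaf_uplinks` are my own illustrative assumptions, not values taken from the paper.

```python
import math

def leaf_uplinks(servers_per_rack, access_gbps, oversubscription, fabric_link_gbps):
    """Return (uplink_capacity_gbps, uplink_count) for a single leaf switch.

    oversubscription is the ratio of server-facing to fabric-facing bandwidth,
    so 1 means non-oversubscribed and 2 means 2:1.
    """
    downlink_gbps = servers_per_rack * access_gbps       # e.g. 20 * 10 = 200
    uplink_gbps = downlink_gbps / oversubscription       # fabric bandwidth needed
    uplink_count = math.ceil(uplink_gbps / fabric_link_gbps)
    return uplink_gbps, uplink_count

if __name__ == "__main__":
    # Hypothetical ratios and link speeds, chosen only for illustration.
    for ratio in (1, 2):
        for speed in (10, 40, 100):
            capacity, count = leaf_uplinks(20, 10, ratio, speed)
            print(f"{ratio}:1 at {speed}GbE -> {count} uplinks for {capacity:.0f} Gbps")
```

Even in this toy form the open question is visible: a non-oversubscribed leaf needs 20 uplinks at 10GbE, so either there are 20 spine switches or each spine takes several links from every leaf.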

Within this network they create dynamic flows that are tracked for performance, alongside background traffic that is tracked for large flows only and that creates the load and congestion that impacts the primary workload. The primary workload creates an average of 10 file transfer requests per second per server, created with normal arrival probability to remove synchronization. Each file transfer requests 1Mb files from n servers, where n is a random number between 1 and the number of servers, and each server provides 1/nth of the file when asked. The time taken for all portions of the file to be received is tracked; the paper calls this Query Completion Time (QCT). The large background flows that are being tracked are measured by Flow Completion Time (FCT).
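The query workload described above can be sketched in a few lines. This is only my reading of the description, not the authors' simulator: the function names are hypothetical, the file size is left unitless, and the per-server transfer rates are supplied by the caller rather than modeled.

```python
import random

def make_query(num_servers, file_size, rng):
    """Pick n responders at random and split the file evenly among them.

    Returns a dict mapping responder index -> chunk size (file_size / n).
    """
    n = rng.randint(1, num_servers)                 # 1 <= n <= num_servers
    responders = rng.sample(range(num_servers), n)  # n distinct servers
    return {server: file_size / n for server in responders}

def query_completion_time(chunks, rate_of):
    """QCT is the time at which the slowest chunk has fully arrived.

    rate_of is a caller-supplied function: server index -> transfer rate.
    """
    return max(size / rate_of(server) for server, size in chunks.items())

if __name__ == "__main__":
    rng = random.Random(42)
    chunks = make_query(num_servers=100, file_size=1.0, rng=rng)
    # Assumption: every responder gets the same uncongested rate.
    print(len(chunks), query_completion_time(chunks, lambda server: 1.0))
```

Because the QCT is a maximum over all responders, a single congested path is enough to delay the whole query, which is why this workload is sensitive to incast and uneven load.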

The traffic being generated is essentially an any-to-any pattern. The query traffic is requested by any server from any other server, and its random nature guarantees an even distribution across all servers over time. The background flows are also evenly distributed (although this is not explicitly stated), with flow sizes based on real world traffic pattern studies.

And this is where we at Plexxi would have a first difference of opinion. We do not believe that data centers create uniform data flows, in bursts or over time. We strongly believe there are patterns of communication in a data center that can be recognized. We also do not believe all traffic is equally important. Some workloads are very sensitive to delays, others not at all. I fully understand why the traffic distribution was chosen: it is extremely hard to find patterns of cause and effect in unbalanced environments. The simulation also forced all traffic to leave a rack; there was no traffic between servers inside the same rack (at least for the background flows). Many real-life solutions are engineered to make use of this locality, which provides cheap, 1 hop, non-oversubscribed bandwidth. Hadoop is a great example.

The authors found that changing link speeds from 10GbE between leaf and spine to 40GbE or 100GbE (without changing the overall bandwidth between spine and leaf) improves the FCT significantly. This is a good finding, but the way the conclusion is phrased leaves me with more questions. They state that “… ECMP load-balancing is inefficient when the fabric link speed is 10Gbps.” While true for the specific simulated environment, I believe the explanation is really that ECMP hashing does not create perfect distribution. When multiple background flows originating from 10GbE-based servers, with their TCP windows wide open, start blasting traffic at full speed, it only takes 2 of them landing on the same hashed fabric link to have a significant impact. With fewer uplinks of a higher speed, ECMP inefficiency is less pronounced because the link speed absorbs some of this burstiness. If you look at the primary query traffic for the same comparison, the delta between multiple 10GbEs and fewer 40GbEs is much less pronounced and barely noticeable under very heavy load. It’s a nice result to highlight, but I believe it is directly related to the type of traffic offered: TCP with a wide open window.
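A small Monte Carlo sketch illustrates the collision argument. It assumes an idealized ECMP hash that places each flow on a uniformly random uplink and assumes every flow runs at 10GbE line rate; both are simplifications of mine, and the function name and parameters are hypothetical rather than anything from the paper.

```python
import random
from collections import Counter

def overload_probability(num_flows, num_links, flows_per_link, trials=20000, seed=0):
    """Estimate the chance that some uplink is hashed more line-rate flows
    than it can carry. Requires num_flows >= 1.

    flows_per_link is how many 10GbE line-rate flows fit on one uplink:
    1 for a 10GbE uplink, 4 for a 40GbE uplink.
    """
    rng = random.Random(seed)
    overloaded = 0
    for _ in range(trials):
        # Idealized ECMP: each flow picks an uplink uniformly at random.
        counts = Counter(rng.randrange(num_links) for _ in range(num_flows))
        if max(counts.values()) > flows_per_link:
            overloaded += 1
    return overloaded / trials

if __name__ == "__main__":
    # Same 200 Gbps of fabric bandwidth, built two different ways.
    print("20 x 10GbE:", overload_probability(4, 20, 1))
    print(" 5 x 40GbE:", overload_probability(4, 5, 4))
```

With four concurrent line-rate flows, the 20 x 10GbE case sees at least two flows sharing a link in roughly 27% of trials (the exact birthday-style figure is 1 - (20·19·18·17)/20^4 ≈ 0.273), while the 5 x 40GbE case can never be overloaded by four such flows. The hash is no better in the second case; the fatter links simply tolerate its collisions.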

When comparing oversubscription levels, the paper finds that the oversubscribed versions of the network perform very similarly to the non-oversubscribed version when offered up to about 60% relative load. When the relative load is pushed to 70% or higher, oversubscribed spine and leaf networks degrade faster than non-oversubscribed versions. No explanation is given, but a combination of ECMP effectiveness and buffer effectiveness has to contribute to this degradation.

The challenge to the network architect is trying to understand what the right oversubscription ratio is. Once designed and deployed with a certain ratio, changing that ratio in a fixed spine and leaf network is extremely cumbersome. Would it not be nice if you could dynamically change connectivity and therefore oversubscription based on workload needs?

The previous two parameters are fully under the control of those that build a network. You can pick how much oversubscription you want or need (or can afford), and with many of the latest generation switches, 10GbE vs 40GbE has pretty much become a user configurable option. The last parameter examined in the simulation is the buffer space available at each of the switches. They picked 10Mb as the standard shared buffer, very similar to what today’s 1U data center switch will have. Not surprisingly, the simulation showed that more buffer made the network perform better. A minor surprise is that increasing buffer space on the leaf has more impact than doing the same on the spine. While the paper mentions even queue utilization due to the all-to-all traffic patterns in use, it does not explain why, with this traffic pattern, leaf buffer size is more valuable than its spine equivalent. The result suggests that incast issues in this simulation occur at the leaf egress ports rather than the spine egress ports. Unfortunately you as a buyer have little control over the amount of buffering in your leaf switch. Modular spine switches have always had more buffer memory; perhaps this paper is a reason to ask why.

In the end, the paper is a very interesting read and provides some insight into how leaf and spine networks could behave. There is, however, a challenge in translating this to any network. The traffic simulation was (not without reason) specifically designed to create an even any-to-any mix, with explicit burstiness due to open TCP windows and no traffic between ports on the same switch.

We believe there is no one size fits all network. Applications are different. Workloads are different. There is no such thing as uniformity: there is localization, there are hot spots, and different workloads want different things from the network. That is why we believe we should not be building uniform networks, or forwarding traffic based on uniform algorithms. By understanding what the workloads are and where large portions of traffic flow, would it not be nice to adjust topologies to create less oversubscription between those portions of the network, while allowing more oversubscription for other portions? By understanding workloads, hot spots can be avoided, and more links and associated buffer space can be applied where they are needed. There is no question that spine and leaf networks are a great improvement over the multi-tiered networks of the past, but why stop there?

[Today's Fun Fact: The fingerprints of Koala Bears are virtually indistinguishable from those of humans, so much so that they could be confused at a crime scene. Reasonable doubt anyone?]

The post Plexxi’s View: Data Path Performance of Leaf-Spine Datacenter Fabrics appeared first on Plexxi.

More Stories By Marten Terpstra

Marten Terpstra is a Product Management Director at Plexxi Inc. Marten has extensive knowledge of the architecture, design, deployment and management of enterprise and carrier networks.
