
Plexxi’s View: Data Path Performance of Leaf-Spine Datacenter Fabrics

While doing some competitive analysis, I read this paper that was presented at the 21st Symposium on High Performance Interconnects. The paper discusses the data path performance of leaf-spine networks and was written by Mohammad Alizadeh and CTO Tom Edsall of Insieme Networks. Mohammad has co-authored several research papers on fabrics and networks, all very worthwhile reading. The paper describes findings from a leaf-spine simulation, focused on the impact of buffer space, fabric link capacity, and oversubscription on performance under the offered traffic load.

The paper describes how the authors created a leaf-spine model consisting of 5 racks with 20 servers each, connected with 10GbE. The 5 simulated ToR switches (the leaves) are then connected to modeled spine switches at various oversubscription ratios. Unfortunately the paper does not specifically state whether each spine switch takes a single fabric link from each leaf switch (which would require a lot of spine switches) or multiple fabric links from each, and if the latter, whether those links are aggregated using LAG or treated individually and offered as separate paths to ECMP. I believe each of those options would result in different observed behavior.
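To make that ambiguity concrete, here is a small sketch of the two wiring options, with made-up port counts since the paper provides none:

```python
# Two ways to wire 5 leaves to the spine, using hypothetical numbers
# (the paper does not give port counts). Assume each leaf has 16
# fabric-facing 10GbE uplinks.
LEAVES = 5
UPLINKS_PER_LEAF = 16

# Option A: each spine takes exactly one link from every leaf.
# You then need one spine switch per leaf uplink.
spines_option_a = UPLINKS_PER_LEAF                      # 16 spines

# Option B: fewer spines, each taking several links from every leaf.
spines_option_b = 4
parallel_links = UPLINKS_PER_LEAF // spines_option_b    # 4 per leaf-spine pair
# Those 4 parallel links can be bundled into one LAG (a single ECMP
# next hop with a second hashing stage inside the LAG) or left as 4
# individual ECMP paths; the hash distribution differs between the two.

print(f"Option A: {spines_option_a} spines, 1 link per leaf-spine pair")
print(f"Option B: {spines_option_b} spines, {parallel_links} links per pair")
```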

Within this network they create dynamic flows whose performance is tracked, plus background traffic that is tracked for large flows only and that creates the load and congestion impacting the primary workload. The primary workload generates an average of 10 file transfer requests per second per server, with randomized arrival times to avoid synchronization. Each request fetches a 1Mb file from n servers, where n is a random number between 1 and the total number of servers, and each server provides 1/nth of the file when asked. The time taken for all portions of the file to be received is tracked; in the paper this is called the Query Completion Time (QCT). The large background flows that are tracked are measured by Flow Completion Time (FCT).
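As a rough illustration of that query workload, here is a minimal sketch in Python; the parameter values and the toy latency model are my own stand-ins, not the authors' simulator:

```python
# Minimal sketch of the paper's query workload as I read it. The
# fetch-time model below is a deliberately crude placeholder for the
# simulated network.
import random

NUM_SERVERS = 100            # 5 racks x 20 servers
FILE_SIZE_BYTES = 1_000_000  # the paper's "1Mb" file, assumed 1MB here

def issue_query(fetch_time):
    """One query: request 1/n of the file from each of n random servers.

    Query Completion Time (QCT) is the time until the LAST chunk lands,
    since all chunks are fetched in parallel.
    """
    n = random.randint(1, NUM_SERVERS)
    servers = random.sample(range(NUM_SERVERS), n)
    chunk = FILE_SIZE_BYTES / n
    return max(fetch_time(chunk, s) for s in servers)

def toy_fetch_time(chunk_bytes, server):
    """Base RTT plus serialization at 10Gb/s, with random congestion jitter."""
    rtt = 0.0001                                   # 100us base RTT
    xfer = chunk_bytes * 8 / 10e9                  # transfer at 10Gb/s
    return rtt + xfer * random.uniform(1.0, 3.0)   # jitter stand-in

if __name__ == "__main__":
    qcts = sorted(issue_query(toy_fetch_time) for _ in range(10_000))
    print(f"median QCT: {qcts[len(qcts) // 2] * 1e3:.3f} ms, "
          f"p99 QCT: {qcts[int(len(qcts) * 0.99)] * 1e3:.3f} ms")
```

Whatever latency model you plug in, the structure makes the point: QCT is gated by the slowest of n parallel chunks, which is exactly what makes it sensitive to a single congested link.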

Looking at the traffic being generated, this is essentially an any-to-any traffic pattern. The query traffic is requested by any server from any other server, its random nature guaranteeing an even distribution across all servers over time. The background flows are also evenly distributed (although this is not explicitly stated), with flow sizes based on real-world traffic pattern studies.

And this is where we at Plexxi would have a first difference of opinion. We do not believe that data centers create uniform data flows, in bursts or over time. We strongly believe there are patterns of communication in a data center that can be recognized. We also do not believe all traffic is equally important: some workloads are very sensitive to delays, some are not at all. I fully understand why this traffic distribution was chosen; it is extremely hard to find patterns of cause and effect in unbalanced environments. The simulation also forced all traffic to leave the rack; there was no traffic between servers inside the same rack (at least for the background flows). Many real-life solutions are engineered to exploit exactly this locality, which provides cheap one-hop, non-oversubscribed bandwidth, Hadoop being a great example.

The authors found that changing the link speed between leaf and spine from 10GbE to 40GbE or 100GbE (without changing the overall bandwidth between spine and leaf) improves the FCT significantly. This is a good finding, but the way the conclusion is phrased leaves me with more questions. They state that “… ECMP load-balancing is inefficient when the fabric link speed is 10Gbps.” While true for the specific simulated environment, I believe the real explanation is that ECMP hashing does not create a perfect distribution. When multiple background flows originating from 10GbE-attached servers, with their TCP windows wide open, start blasting traffic at full speed, it takes only two of them landing on the same hashed fabric link to have a significant impact. With fewer uplinks of a higher speed, ECMP inefficiency is less pronounced; the extra link speed absorbs some of this burstiness. If you look at the primary query traffic for the same comparison, the delta between many 10GbE links and fewer 40GbE links is much less pronounced and barely noticeable under very heavy load. It is a nice result to highlight, but I believe it is directly related to the type of traffic offered: TCP with a wide-open window.
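A quick back-of-the-envelope check of that argument (my own numbers, not the paper's): hash a handful of wide-open 10Gb/s elephant flows onto the uplinks and count how often some uplink is offered more than it can carry.

```python
# Monte Carlo sketch of the ECMP collision argument: same 80Gb/s of
# aggregate uplink capacity, but different link counts and speeds.
import random

def overload_prob(num_links, link_gbps, num_flows,
                  flow_gbps=10, trials=100_000):
    """P(at least one uplink is offered more load than its capacity)."""
    hits = 0
    for _ in range(trials):
        load = [0] * num_links
        for _ in range(num_flows):
            load[random.randrange(num_links)] += flow_gbps  # ECMP hash
        if max(load) > link_gbps:
            hits += 1
    return hits / trials

# 4 concurrent wide-open 10Gb/s flows:
print("8 x 10GbE:", overload_prob(8, 10, 4))  # any collision overloads a link
print("2 x 40GbE:", overload_prob(2, 40, 4))  # 4 x 10Gb/s can never overload
```

Note that hash collisions are actually more frequent with fewer links; the difference is that a 40GbE uplink can absorb several colliding 10GbE flows without congesting, which is the headroom argument made above.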

When comparing oversubscription levels, the paper finds that the oversubscribed versions of the network perform very similarly to the non-oversubscribed ones up to about 60% relative load. When the relative load is pushed to 70% or higher, oversubscribed leaf-spine networks degrade faster than non-oversubscribed versions. No explanation is given, but a combination of ECMP effectiveness and buffer effectiveness has to contribute to this degradation.

The challenge for the network architect is understanding what the right oversubscription ratio is. Once a fixed leaf-spine network is designed and deployed with a certain ratio, changing that ratio is extremely cumbersome. Would it not be nice if you could dynamically change connectivity, and therefore oversubscription, based on workload needs?
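For what it is worth, the ratio itself is simple arithmetic; the hard part is knowing your workload. A hypothetical leaf as an example:

```python
# Oversubscription at a leaf is downlink capacity offered to servers
# divided by uplink capacity toward the spine. Hypothetical numbers.
servers_per_rack = 20
server_gbps = 10
uplinks = 4
uplink_gbps = 40

downlink = servers_per_rack * server_gbps   # 200 Gb/s toward servers
uplink = uplinks * uplink_gbps              # 160 Gb/s toward spine
print(f"oversubscription = {downlink / uplink:.2f}:1")  # 1.25:1
```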

The previous two parameters are fully under the control of those that build a network. You can pick how much oversubscription you want or need (or can afford), and with many of the latest generation switches, the choice between 10GbE and 40GbE uplinks has become pretty much a user-configurable option. The last parameter examined in the simulation is the impact of the buffer space available at each of the switches. They picked 10MB as the standard shared buffer, very similar to what today's 1U data center switches have. Not surprisingly, the simulation showed that more buffer made the network perform better. A minor surprise is that increasing buffer space on the leaf is more impactful than doing the same on the spine. While the paper mentions even queue utilization due to the all-to-all traffic pattern in use, it does not explain why leaf buffer size is more valuable than its spine equivalent under this pattern, a suggestion that the incast issues in this simulation occur at the leaf egress rather than the spine egress ports. Unfortunately, as a buyer you have little control over the amount of buffering in your leaf switch. Modular spine switches have always had more buffer memory; perhaps this paper is a reason to ask why.
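Some rough arithmetic (mine, not the paper's) gives a feel for what 10MB of shared buffer buys when an egress port drains at 10Gb/s:

```python
# How quickly does an incast burst fill a 10MB shared buffer when the
# egress port drains at 10Gb/s and the fan-in offers a multiple of that?
BUFFER_BYTES = 10e6
DRAIN_GBPS = 10

for fan_in in (2, 4, 8):                          # concurrent 10Gb/s senders
    offered = fan_in * DRAIN_GBPS
    net_fill = (offered - DRAIN_GBPS) * 1e9 / 8   # backlog growth, bytes/s
    print(f"{fan_in}:1 incast fills 10MB in "
          f"{BUFFER_BYTES / net_fill * 1e3:.1f} ms")
```

At realistic incast fan-ins the buffer is exhausted within a few milliseconds, which is consistent with the suggestion above that leaf egress is where the simulation feels buffer pressure first.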

In the end, the paper is a very interesting read and provides some insight into how leaf-spine networks could behave. There is, however, a challenge in translating this to any given network. The traffic simulation was (not without reason) specifically designed to create an even any-to-any mix, with explicit burstiness due to wide-open TCP windows and no traffic between ports on the same switch.

We believe there is no one-size-fits-all network. Applications are different. Workloads are different. There is no such thing as uniformity; there is localization, there are hot spots, and different workloads want different things from the network. That is why we believe we should not be building uniform networks, or forwarding traffic based on uniform algorithms. If we understand what the workloads are and where the large portions of traffic flow, would it not be nice to adjust topologies to create less oversubscription between those portions of the network, while allowing more oversubscription elsewhere? By understanding workloads, hot spots can be avoided, and more links and their associated buffer space can be applied where they are needed. There is no question that leaf-spine networks are a great improvement over the multi-tiered networks of the past, but why stop there?

[Today's Fun Fact: The fingerprints of Koala Bears are virtually indistinguishable from those of humans, so much so that they could be confused at a crime scene. Reasonable doubt anyone?]



