Welcome!

@BigDataExpo Authors: Elizabeth White, Pat Romanski, Dana Gardner, Liz McMillan, Scott Allen

Blog Feed Post

Big Data Needs a Better Network

Earlier this week I had some interesting conversations with @davehusak. Where the conversation started early in the day with a discussion on overlay networks and what network functions are performed where and in what context, later in the afternoon the discussion moved to networking solutions (and specifically Plexxi solutions) for big data applications.

It’s easy to jump to Hadoop or similarly structured cluster computing applications (Spark, Storm, or a long list of others) as the definition of a big data application. With all its simplicity for the overall distribution of work, Hadoop is a fairly tough network problem to solve if you want to do anything more than “throw bandwidth at the problem”. And when you do throw bandwidth at the problem, the extreme burstiness of the traffic will still significantly drag down the performance of the overall solution. And for many, CPU cycles to reduce the data is not the biggest challenge, storage and movement of data throughout a big data cluster is the biggest pain point. Intel has done a fine job providing compute firepower that far outpaces the evolution of network capacity.

The network plays a significant factor in several stages of a Hadoop solution cycle. It starts with chopping the to-be-analyzed data into chunks and distributing it across the datanodes. Hadoop has a notion of a rack and it has some basic intelligence when placing data and jobs that work on that data. By default the data will be replicated 3 times across at least 2 racks, if racks have been defined. The data to be distributed is easily in the 100s of Gigabytes or even Terabytes, so triple that data is being moved throughout the Hadoop cluster to the datanodes.

Once distributed, the actual Map jobs are launched against that data, these are the tasks that take the data and perform a first pass mapping into (in its most basic form) <key, value> tuples. Again here there is an attempt to have jobs work on local data, where local can be defined as local to the server that has that chunk of data or local to the rack, in an attempt to avoid as much cross rack communication as possible. This is based on the assumption that cross rack communication is much more constrained and aggregated and therefore more prone to congestion and packet loss.

Once the mapper jobs complete their task, the results of the mapping exercise is sent to reducers. Reducers take the <key, value> information and essentially tally the results. This transfer is the most taxing part of a Hadoop cycle on the network. Since most Hadoop mapping jobs run the same function on a similar sized dataset, that first set of mappers will all complete their task at about the same time and will all start sending their results to the same set of reducers, creating 1) a lot of traffic and 2) a lot of traffic to the same set of destinations.  Depending on the amount of data, mappers and compute nodes, this cycle repeats (the next set of mapping jobs are fired off) and at the end of each cycle a very significant spike in network traffic appears. At the very end, all results are brought together for one last spike in traffic. Each one of these network events is a source of significant congestion.

Many variables contribute to the overall performance of the Hadoop solution. What is the relationship between the chunks of data and the amount of servers and jobs? How many reducers are used? Where are the reducers in relation to the mappers? Is the Map function compute heavy or I/O heavy? How aggressive is the speculative scheduling that allows the same data to be worked on by multiple mappers?

With that many variables that can be tuned, and with so many variables different from one analysis to the next, it is hard to imagine that a single network design or implementation provides the best supporting infrastructure. There are assumptions in Hadoop placement of data and jobs that can easily be altered. The basic concept of a rack can easily expanded into a multi layer locality definition. With the right tools in the network, the definition of a rack, or even the locality and closeness of nodes in a cluster or virtual cluster can be adjusted according to the analysis to be completed.

We tend to give our applications variables to tune its performance based on what the network provides. It is time that the network adjusts itself based on the application needs. In a clustered application like Hadoop, there is lots of knowledge and even some predictability of network traffic. Wouldn’t that make for a great opportunity to infuse the network with some of that knowledge and have it morph itself to provide the best possible service? And Hadoop is not unique, cluster compute framework almost all carefully track placement of data and compute jobs, which makes them all great candidates to share some of that information with a smart network.

There are far simpler big data needs and applications that can and should be supported by flexible networks. Storage networks are still often separated from data networks for performance reasons and fear of interference. If you could actually separate the various types of data with logically or even physically different paths in the same network, would you still?

 

[Today's fun fact: An MLB baseball lasts on average 7 pitches. A google search asking how many baseballs are used in a single MLB season returns several pages worth of different answers. They must all be correct, it's on the Internet afterall.]

The post Big Data Needs a Better Network appeared first on Plexxi.

Read the original blog entry...

More Stories By Marten Terpstra

Marten Terpstra is a Product Management Director at Plexxi Inc. Marten has extensive knowledge of the architecture, design, deployment and management of enterprise and carrier networks.

@BigDataExpo Stories
SaaS companies can greatly expand revenue potential by pushing beyond their own borders. The challenge is how to do this without degrading service quality. In his session at 18th Cloud Expo, Adam Rogers, Managing Director at Anexia, discussed how IaaS providers with a global presence and both virtual and dedicated infrastructure can help companies expand their service footprint with low “go-to-market” costs.
When it comes to cloud computing, the ability to turn massive amounts of compute cores on and off on demand sounds attractive to IT staff, who need to manage peaks and valleys in user activity. With cloud bursting, the majority of the data can stay on premises while tapping into compute from public cloud providers, reducing risk and minimizing need to move large files. In his session at 18th Cloud Expo, Scott Jeschonek, Director of Product Management at Avere Systems, discussed the IT and busin...
The Internet of Things will challenge the status quo of how IT and development organizations operate. Or will it? Certainly the fog layer of IoT requires special insights about data ontology, security and transactional integrity. But the developmental challenges are the same: People, Process and Platform and how we integrate our thinking to solve complicated problems. In his session at 19th Cloud Expo, Craig Sproule, CEO of Metavine, will demonstrate how to move beyond today's coding paradigm ...
The cloud market growth today is largely in public clouds. While there is a lot of spend in IT departments in virtualization, these aren’t yet translating into a true “cloud” experience within the enterprise. What is stopping the growth of the “private cloud” market? In his general session at 18th Cloud Expo, Nara Rajagopalan, CEO of Accelerite, explored the challenges in deploying, managing, and getting adoption for a private cloud within an enterprise. What are the key differences between wh...
"My role is working with customers, helping them go through this digital transformation. I spend a lot of time talking to banks, big industries, manufacturers working through how they are integrating and transforming their IT platforms and moving them forward," explained William Morrish, General Manager Product Sales at Interoute, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
The pace of innovation, vendor lock-in, production sustainability, cost-effectiveness, and managing risk… In his session at 18th Cloud Expo, Dan Choquette, Founder of RackN, discussed how CIOs are challenged finding the balance of finding the right tools, technology and operational model that serves the business the best. He also discussed how clouds, open source software and infrastructure solutions have benefits but also drawbacks and how workload and operational portability between vendors ...
SYS-CON Events has announced today that Roger Strukhoff has been named conference chair of Cloud Expo and @ThingsExpo 2016 Silicon Valley. The 19th Cloud Expo and 6th @ThingsExpo will take place on November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. "The Internet of Things brings trillions of dollars of opportunity to developers and enterprise IT, no matter how you measure it," stated Roger Strukhoff. "More importantly, it leverages the power of devices and the Interne...
"We work in the area of Big Data analytics and Big Data analytics is a very crowded space - you have Hadoop, ETL, warehousing, visualization and there's a lot of effort trying to get these tools to talk to each other," explained Mukund Deshpande, head of the Analytics practice at Accelerite, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
The idea of comparing data in motion (at the sensor level) to data at rest (in a Big Data server warehouse) with predictive analytics in the cloud is very appealing to the industrial IoT sector. The problem Big Data vendors have, however, is access to that data in motion at the sensor location. In his session at @ThingsExpo, Scott Allen, CMO of FreeWave, discussed how as IoT is increasingly adopted by industrial markets, there is going to be an increased demand for sensor data from the outermos...
The initial debate is over: Any enterprise with a serious commitment to IT is migrating to the cloud. But things are not so simple. There is a complex mix of on-premises, colocated, and public-cloud deployments. In this power panel at 18th Cloud Expo, moderated by Conference Chair Roger Strukhoff, Randy De Meno, Chief Technologist - Windows Products and Microsoft Partnerships at Commvault; Dave Landa, Chief Operating Officer at kintone; William Morrish, General Manager Product Sales at Interou...
UAS, drones or unmanned aircraft, no matter what you call them — this was their week. Our news stream was flooded with updates on the newly announced rules and regulations for commercial UAS from the FAA. So, naturally we have dedicated this week’s top news round up to highlight some of our favorite UAS stories.
Internet of @ThingsExpo has announced today that Chris Matthieu has been named tech chair of Internet of @ThingsExpo 2016 Silicon Valley. The 6thInternet of @ThingsExpo will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
As organizations shift towards IT-as-a-service models, the need for managing and protecting data residing across physical, virtual, and now cloud environments grows with it. Commvault can ensure protection, access and E-Discovery of your data – whether in a private cloud, a Service Provider delivered public cloud, or a hybrid cloud environment – across the heterogeneous enterprise. In his general session at 18th Cloud Expo, Randy De Meno, Chief Technologist - Windows Products and Microsoft Part...
In addition to all the benefits, IoT is also bringing new kind of customer experience challenges - cars that unlock themselves, thermostats turning houses into saunas and baby video monitors broadcasting over the internet. This list can only increase because while IoT services should be intuitive and simple to use, the delivery ecosystem is a myriad of potential problems as IoT explodes complexity. So finding a performance issue is like finding the proverbial needle in the haystack.
Digital Initiatives create new ways of conducting business, which drive the need for increasingly advanced security and regulatory compliance challenges with exponentially more damaging consequences. In the BMC and Forbes Insights Survey in 2016, 97% of executives said they expect a rise in data breach attempts in the next 12 months. Sixty percent said operations and security teams have only a general understanding of each other’s requirements, resulting in a “SecOps gap” leaving organizations u...
"A lot of times people will come to us and have a very diverse set of requirements or very customized need and we'll help them to implement it in a fashion that you can't just buy off of the shelf," explained Nick Rose, CTO of Enzu, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
The IoT is changing the way enterprises conduct business. In his session at @ThingsExpo, Eric Hoffman, Vice President at EastBanc Technologies, discussed how businesses can gain an edge over competitors by empowering consumers to take control through IoT. He cited examples such as a Washington, D.C.-based sports club that leveraged IoT and the cloud to develop a comprehensive booking system. He also highlighted how IoT can revitalize and restore outdated business models, making them profitable ...
Presidio has received the 2015 EMC Partner Services Quality Award from EMC Corporation for achieving outstanding service excellence and customer satisfaction as measured by the EMC Partner Services Quality (PSQ) program. Presidio was also honored as the 2015 EMC Americas Marketing Excellence Partner of the Year and 2015 Mid-Market East Partner of the Year. The EMC PSQ program is a project-specific survey program designed for partners with Service Partner designations to solicit customer feedbac...
Apixio Inc. has raised $19.3 million in Series D venture capital funding led by SSM Partners with participation from First Analysis, Bain Capital Ventures and Apixio’s largest angel investor. Apixio will dedicate the proceeds toward advancing and scaling products powered by its cognitive computing platform, further enabling insights for optimal patient care. The Series D funding comes as Apixio experiences strong momentum and increasing demand for its HCC Profiler solution, which mines unstruc...
Predictive analytics tools monitor, report, and troubleshoot in order to make proactive decisions about the health, performance, and utilization of storage. Most enterprises combine cloud and on-premise storage, resulting in blended environments of physical, virtual, cloud, and other platforms, which justifies more sophisticated storage analytics. In his session at 18th Cloud Expo, Peter McCallum, Vice President of Datacenter Solutions at FalconStor, discussed using predictive analytics to mon...