Extending and Augmenting Hadoop

Use the right tool for the job

In the last two years, the Apache Hadoop software library has emerged as a veritable Swiss Army knife of data management and analytical infrastructure. The Hadoop toolset has been positioned as a universal platform for all types of commercial and organizational analytical needs.

Hadoop is an ideal solution for use cases in which the data is easily partitioned and distributed. For example, consider keyword searches, a major component of SEO. Simply identifying and counting distinct words in text data is a central part of the process for keyword-based searches. No matter how many pieces of text you have, each document, article, blog post, or other piece of content is distinct from the others.

To enable keyword search, a program counts the occurrences of each word in every distinct document or item of text. Clearly, this can be done in isolation: to count the number of times a word occurs across a set of documents, you can count it within each document and then add up the per-document counts. Moreover, because the documents are distinct, each one can be counted at the same time as the others. An Internet-scale search engine like Google essentially leverages this concept to distribute such processing across a large number of simple machines (a cluster).
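As a rough sketch of why this problem partitions so cleanly (plain Python rather than the actual Hadoop MapReduce API, with made-up documents): each document is counted in isolation, and only the small partial counts are merged at the end.

    from collections import Counter

    def map_count(document):
        # "Map" step: count the words in a single document, in isolation.
        return Counter(document.lower().split())

    def reduce_counts(partial_counts):
        # "Reduce" step: merge the per-document counts into a global total.
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    documents = ["hadoop counts words", "graph analytics counts relationships"]
    print(reduce_counts(map_count(d) for d in documents))

Because each map_count call touches only its own document, the calls can run on different machines in a cluster, and only the partial counts need to travel over the network to be merged.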

Another example where Hadoop shines is when it comes to counting the number of times a specific user, as represented by an IP address, has visited a particular web page or website. Again, this can be broken up into a series of smaller problems and spread over multiple machines in a Hadoop cluster. Results from the smaller sets can then be aggregated to obtain the total count.

MapReduce, the programming model behind Hadoop, was designed to address problems in which an operation over a very large dataset can easily be broken into the same operation on smaller datasets. The promise of Hadoop is in the ability to use open source software relatively inexpensively to address this whole class of partitionable problems.

However, there are a number of analytical use cases for which Hadoop is inadequate. In such cases, the Hadoop toolset may need to be augmented and extended with other technologies to properly resolve these problems.

Understanding Graph Connections
In parallel with the emergence of Hadoop, the world of social media has exploded: as of 2012, the social media powerhouse Facebook had more than one billion registered users, according to CEO Mark Zuckerberg.

Social media networks such as Facebook and LinkedIn are driven by a fundamental focus on relationships and connections. For example, Facebook users can now use the service's Graph Search to find friends of friends who live in the same city or like the same baseball team, and the site frequently suggests "people you may know" based on the mutual connections that two unconnected individuals have established. LinkedIn focuses on helping business professionals grow their networks by surfacing key contacts or prospects who are connected to existing friends or colleagues, and by allowing users to leverage those existing relationships to form new connections. The use of such data connections is becoming ever more valuable to individuals in both their personal and business lives.

Likewise, the capacity to comprehend and assess such relationships is a key component driving the world of business analytics. For example, business managers frequently want to know the answers to questions such as:

  • What are all the ways in which a person of interest in a crime database may be related to another person of interest?
  • Based on known patterns of suspicious behavior in a corporate network, how can we identify malicious hacking attacks before they have a financial impact on our company?
  • Which of an organization's partners have a financial exposure to the failure of another company?

Take the question of how two people might be connected on social media. This may seem simple, but as soon as you look closely, it's not quite so clear. The simplest example of such a problem is in looking at how two people may be connected on Facebook. They can be friends - a direct connection that is hard to miss. Or they might be friends of friends, which starts getting a little murkier. The connections can be even more distant and difficult to immediately pinpoint. For instance, Person A may be married to someone whose brother is a friend of Person B. Or perhaps they have a shared affiliation, such as attending the same school, working at the same organization, or attending the same church.
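To make the idea concrete, here is a small illustrative sketch using the open source networkx library and invented names; each simple path through the graph is one distinct way the two people are connected.

    import networkx as nx

    g = nx.Graph()
    g.add_edge("Person A", "Spouse", relation="married to")
    g.add_edge("Spouse", "Brother", relation="sibling of")
    g.add_edge("Brother", "Person B", relation="friend of")
    g.add_edge("Person A", "Person B", relation="attended the same school")

    # Enumerate every distinct chain of connections between the two people.
    for path in nx.all_simple_paths(g, "Person A", "Person B"):
        print(" -> ".join(path))
    # e.g. Person A -> Person B
    #      Person A -> Spouse -> Brother -> Person B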

In some cases, two individuals' only connection may be sharing a few Likes. These shared affinities may be valuable information to a business if, for example, those Likes happen to be something your organization addresses. In that case, you may want to drill down to those specific people out of the entire billion users on Facebook, so that you can target your online advertising directly to them.

If you think of all the possible ways that one Facebook user can be connected to another user, it is a very different kind of Big Data problem. You cannot simply break up the problem into smaller segments because, by definition, it involves connections that require link analysis. This makes it a problem that Hadoop isn't ideally suited to address.

Link analysis problems occur in many domains beyond social networks. The network of neurons in the brain and the pathways between these neurons is an example. A group of suspicious people and their connections (as observed by their interactions) is another. The network of genes and proteins and their interactions is yet another.

What do you do to solve problems that involve complex relationship patterns and require detailed link analysis? Enter graph analytics.

Graph Analytics
Essentially, graphs provide a way of organizing data to specifically highlight relationships. On such a foundation, it is possible to apply analytical techniques ranging from the simple to the complex: understanding groups of similar, related entities, identifying the central influencer in a social network, or detecting complex patterns of behavior indicative of fraud.

In fact, the secret to Google's search engine success is the use of a specific graph analytics technique called PageRank. Rather than focusing on the prevalence of keywords in a web page, Google focused on the relationships between web pages on the World Wide Web and prioritized results from highly authoritative sites, resulting in astonishing accuracy in determining relevant results for keyword searches.
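A rough illustration of the idea, using networkx's built-in PageRank implementation on a tiny invented link graph: a page's score depends on who links to it, not on how often a keyword appears on it.

    import networkx as nx

    # Directed edges represent hyperlinks between (invented) pages.
    web = nx.DiGraph([
        ("blog", "authoritative-site"),
        ("forum", "authoritative-site"),
        ("news-portal", "authoritative-site"),
        ("authoritative-site", "blog"),
    ])

    for page, score in sorted(nx.pagerank(web).items(), key=lambda kv: -kv[1]):
        print(f"{page:20s} {score:.3f}")
    # The heavily linked-to site ranks highest, regardless of its keyword counts.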

A common, standard way of representing data in this relationship-oriented format is RDF, a W3C standard, which is accompanied by a query language called SPARQL designed specifically for analyzing such data. In the Life Sciences domain, companies and public consortia are increasingly representing data in this form, because it provides a more comprehensive view of the relationships in the data, whether gene/protein interactions or diseases and their genetic characteristics.
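A minimal sketch of the RDF/SPARQL style, using the Python rdflib library and invented example.org URIs: facts are stored as subject-predicate-object triples, and a SPARQL query describes the relationship pattern you want to find.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.GeneA, EX.interactsWith, EX.ProteinX))
    g.add((EX.ProteinX, EX.associatedWith, EX.DiseaseY))

    # Find every gene whose protein partner is associated with some disease.
    query = """
        PREFIX ex: <http://example.org/>
        SELECT ?gene ?disease WHERE {
            ?gene    ex:interactsWith  ?protein .
            ?protein ex:associatedWith ?disease .
        }
    """
    for gene, disease in g.query(query):
        print(gene, disease)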

Requires Secret Sauce
Since the nature of graphs makes them difficult to partition, Hadoop is not well suited to this class of analytical problems.

As a matter of fact, the problems are even deeper than that: Because of the unpredictable nature of data access while following and analyzing relationships, commodity hardware architectures are fundamentally challenged. Merely grouping machines together does not address these issues, because the challenges posed by graph analytics are at the network level and are not significantly addressed by the computing capacity of a single machine. What is the ideal approach for solving complex problems involving the analysis of relationships in data?

The secret sauce behind the best performing graph analytics tools is massive un-partitioned memory. One tool, for example, uses a memory pool of up to 512TB (half a petabyte) to perform continuous data and link analysis in real-time even as data continues to pour in. This eliminates latency problems and memory scalability issues while customized chips speed performance.

Comparison Table: Hadoop vs. Graph Analytics

                    Hadoop                          Graph Analytics
    Operation mode  Batch                           Real-time
    Language        MapReduce                       SPARQL
    Platform        Any commodity hardware          Specialized hardware
    Queries         Must be partitioned             Allows non-partitioned queries
    Query types     Seek specific data answers      Discover relationships, connections
    Results         Tables of entities              Relationships between entities

Graph Analytics Use Cases
Graph analytics is a new player in the Big Data game (which is itself quite new). Still, pioneers and early adopters are reporting promising results when applying graph analytics to a diverse range of problems. Examples include:

  • Actionable intelligence: QinetiQ North America (QNA) delivers "actionable intelligence" to government customers interested in identifying threats through the detection of non-obvious patterns of relationships in big data. Graph analytics was the obvious approach, for which QNA uses a purpose-built graph analytics appliance running graph-optimized hardware and a graph database. It interacts with the appliance through the industry-standard RDF/SPARQL interface, as defined by the World Wide Web Consortium (W3C).
  • Life sciences: Oak Ridge National Laboratory (ORNL) opted for a graph analytics appliance to conduct research in healthcare fraud and analytics for a leading healthcare payer. In addition to the healthcare fraud detection program, researchers and scientists at ORNL will also apply the capabilities of the graph-analytics appliance to other areas of research where data discovery is vital. These potential use cases include healthcare treatment efficacy and outcome analysis, analyzing drugs and side effects, and the analysis of proteins and gene pathways.
  • Higher education: The Pittsburgh Supercomputing Center (PSC) turned to a graph analytics appliance called Sherlock (no relation to IBM's Watson) to give researchers the ability to search extremely large and complex bodies of information using a straightforward command similar to ‘find something important.' Sherlock took advantage of specialized graph analytics hardware to run 128 threads per processor and to speed memory access across a terabyte of globally shared memory. The appliance helped PSC win public recognition for extending graph analytics techniques to a wide range of scientific research projects.

The potential uses of graph analytics are just beginning to be explored. Already, the technology is being applied across a broad array of industries, including manufacturing, energy and gas exploration, earth sciences and meteorology, and government and defense.

Advantages Offered by Graph Analytics
A key advantage of graphs is the ease with which new sources of data and new relationships can be added. Graph databases that use RDF to represent the graph can easily merge and unify diverse datasets without significant upfront investment in data modeling. Such an approach stands in stark contrast to ‘traditional' analytics, in which a great deal of time is spent organizing data, and the addition of new data sources requires time-consuming and error-prone effort by analysts.
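As a sketch of that flexibility (again with rdflib and invented URIs), a second data source is merged simply by pooling its triples into the same graph, and new relationship types can be introduced on the fly without reworking a schema.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")

    crm = Graph()
    crm.add((EX.Alice, EX.purchased, EX.ProductX))

    support = Graph()
    support.add((EX.Alice, EX.filedTicketAbout, EX.ProductX))

    # Merging is just pooling triples; no schema migration is required.
    combined = Graph()
    for source in (crm, support):
        for triple in source:
            combined.add(triple)

    # A brand-new kind of relationship can be added at any time.
    combined.add((EX.ProductX, EX.manufacturedBy, EX.AcmeCorp))
    print(len(combined))  # 3 triples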

The easy on-boarding of new data is particularly important when dealing with Big Data. Traditional analytics focus on finding answers to known questions. By contrast, many of the highest value applications, such as those identified above, are focused on discovery, where the questions to be answered are not known in advance. The ability to quickly and easily add new data sources or new relationships within the data when needed to support a new line of questioning is crucial for discovery, and graphs are uniquely well qualified to support these requirements.

Graph analytics also offer sophisticated capabilities for analyzing relationships, while traditional analytics focus on summarizing, aggregating, and reporting on data. Use the right tool for the job. Some common graph analytic techniques, each illustrated in the sketch that follows the list, include:

  1. Centrality analysis: To identify the most central entities in your network, a very useful capability for influencer marketing.
  2. Path analysis: To identify all the connections between a pair of entities, useful in understanding risks and exposure.
  3. Community detection: To identify clusters or communities, which is of great importance to understanding issues in sociology and biology.
  4. Sub-graph isomorphism: To search for a pattern of relationships, useful for validating hypotheses and searching for abnormal situations, such as hacker attacks.
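The following rough sketch, using networkx and its small bundled "karate club" social network, shows one plausible way each of the four techniques above might look in code:

    import networkx as nx
    from networkx.algorithms import community, isomorphism

    g = nx.karate_club_graph()  # a classic 34-member social network

    # 1. Centrality analysis: the most connected (most "influential") member.
    centrality = nx.degree_centrality(g)
    top_influencer = max(centrality, key=centrality.get)

    # 2. Path analysis: one chain of connections between two members.
    path = nx.shortest_path(g, source=5, target=25)

    # 3. Community detection: clusters of densely connected members.
    communities = community.greedy_modularity_communities(g)

    # 4. Sub-graph isomorphism: does a given pattern (here, a triangle) occur?
    matcher = isomorphism.GraphMatcher(g, nx.complete_graph(3))
    has_triangle = matcher.subgraph_is_isomorphic()

    print(top_influencer, path, len(communities), has_triangle)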

Complementary to Hadoop
Interestingly, Hadoop and graph analytics complement each other perfectly. Hadoop is a scale-out solution, allowing independent items of work to be parceled out to the computers in a cluster. Graph analytics, on the other hand, excel at looking at the "big picture," analyzing complex networks of relationships that cannot be partitioned.

For example, consider risk analysis within a financial institution. Many documents will need to be analyzed independently, and the relationships between organizations extracted from them. This is a perfect job for Hadoop, since each document is independent of the others. On the other hand, the complex network of relationships between organizations forms an un-partitionable graph, which is best analyzed as a single entity, in memory.
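A toy sketch of that division of labor, with plain Python standing in for the Hadoop stage and networkx for the graph stage (all organization names invented): the per-document extraction step is trivially partitionable, while the exposure question is answered on the whole graph at once.

    import networkx as nx

    documents = [
        "AcmeBank lends_to WidgetCo",
        "WidgetCo supplies PartsInc",
        "AcmeBank insures PartsInc",
    ]

    def extract_relationship(doc):
        # Stand-in for the per-document extraction job: each document is
        # processed in isolation, which is exactly what Hadoop parallelizes.
        source, relation, target = doc.split()
        return source, target, relation

    # The extracted relationships form one un-partitioned graph,
    # which is then analyzed as a whole.
    g = nx.DiGraph()
    for source, target, relation in map(extract_relationship, documents):
        g.add_edge(source, target, relation=relation)

    # Whole-graph question: who is exposed, directly or indirectly,
    # to a failure of PartsInc?
    print(nx.ancestors(g, "PartsInc"))  # {'AcmeBank', 'WidgetCo'}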

Relationships and Connections
Analysts today have a tabular, "row-and-column" mindset when it comes to data and analytics - probably a byproduct of the spreadsheet's decades of success.

But don't you often think about problems and data in different ways?

Graph analytics explicitly model and reason about the relationships between different entities, and graph tools also display those relationships visually. The analyst can see all the relationships in which an entity participates, and intuitively assess which elements are close or important.

When it comes to customers, relationships, rather than tabular data, may be the most important element: they are more predictive of whether you will retain or lose a customer. The more connections customers have to your organization, its products, and its people, the more likely they are to remain customers. Relationships, not tables, are also key to hacker and threat identification, risk and fraud analysis, influencer marketing, and many other high-value applications.

Graph analytics complement Hadoop and provide a level of immediate, deep insights that are not readily obtainable in any other way.

More Stories By Venkat Krishnamurthy

Venkat Krishnamurthy is the Product Management Director at YarcData, driving the direction and definition of YarcData products and solutions and working with customers to make them successful. Krishnamurthy has over a decade of experience in advanced analytics, including as a Director of Product Management at Oracle and as Vice President of Technology at Goldman Sachs. At Goldman, his data analysis work spanned risk controls across multiple trading desks and asset classes, algorithmic trading, market risk model validation, and prime brokerage.
