Extending and Augmenting Hadoop

Use the right tool for the job

In the last two years, the Apache Hadoop software library has emerged as a veritable Swiss Army knife of data management and analytical infrastructure. The Hadoop toolset has been positioned as a universal platform for all types of commercial and organizational analytical needs.

Hadoop is an ideal solution for use cases in which the data is easily partitioned and distributed. For example, consider keyword searches, a major component of SEO. Simply identifying and counting distinct words in text data is a central part of the process for keyword-based searches. No matter how many pieces of text you have, each document, article, blog post, or other piece of content is distinct from the others.

To enable keyword search, a program counts the occurrences of each word in each distinct document or item of text. Clearly, this can be done in isolation: to count the number of times a word occurs across a given set of documents, you can count it within each document and add up the counts across the documents. Moreover, any two documents can be counted at the same time, since they are distinct. An Internet-scale search engine like Google essentially leverages this concept to distribute such processing across a large number of simple machines (a cluster).
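To make the pattern concrete, here is a minimal Python sketch of the map/reduce word-count idea. It is illustrative only: Hadoop's actual API is Java-based, and the two-document corpus here is made up.

```python
from collections import Counter
from functools import reduce

# A made-up mini-corpus; in a Hadoop cluster, each document would live
# on a different node.
documents = [
    "big data needs big tools",
    "graph data is big data",
]

def map_count(doc):
    """Map step: count word occurrences within a single document."""
    return Counter(doc.split())

def reduce_counts(a, b):
    """Reduce step: merge per-document counts into a global total."""
    return a + b

# Each map_count call is independent of the others, which is exactly why
# Hadoop can spread the map work across a cluster; the reduce step then
# aggregates the partial results.
partial_counts = map(map_count, documents)
total = reduce(reduce_counts, partial_counts, Counter())
print(total.most_common(2))  # [('big', 3), ('data', 3)]
```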

Another example where Hadoop shines is when it comes to counting the number of times a specific user, as represented by an IP address, has visited a particular web page or website. Again, this can be broken up into a series of smaller problems and spread over multiple machines in a Hadoop cluster. Results from the smaller sets can then be aggregated to obtain the total count.

MapReduce, the programming model behind Hadoop, was designed to address problems in which an operation over a very large dataset can easily be broken into the same operation on smaller datasets. The promise of Hadoop is in the ability to use open source software relatively inexpensively to address this whole class of partitionable problems.

However, there are a number of analytical use cases for which Hadoop is inadequate. In such cases, the Hadoop toolset may need to be augmented and extended with other technologies to properly resolve these problems.

Understanding Graph Connections
In parallel with the emergence of Hadoop, the world of social media has exploded: as of 2012, the social media powerhouse Facebook had more than one billion registered users, according to CEO Mark Zuckerberg.

Social media networks such as Facebook and LinkedIn are driven by a fundamental focus on relationships and connections. For example, Facebook users can now use the service's Graph Search to find friends of friends who live in the same city or like the same baseball team, and the site frequently suggests "people you may know" based on the mutual connections that two unconnected individuals have established. LinkedIn focuses on helping business professionals grow their social networks by helping them find key contacts or prospects who are connected to existing friends or colleagues, and allowing users to leverage those existing relationships to form new connections. Such data connections are becoming ever more useful to individuals for enhancing their personal and business lives.

Likewise, the capacity to comprehend and assess such relationships is a key component driving the world of business analytics. For example, business managers frequently want to know the answers to questions such as:

  • What are all the ways in which a person of interest in a crime database may be related to another person of interest?
  • Based on known patterns of suspicious behavior in a corporate network, how can we identify malicious hacking attacks before they have a financial impact on our company?
  • Which of an organization's partners have a financial exposure to the failure of another company?

Take the question of how two people might be connected on social media. This may seem simple, but as soon as you look closely, it's not quite so clear. The simplest example of such a problem is in looking at how two people may be connected on Facebook. They can be friends - a direct connection that is hard to miss. Or they might be friends of friends, which starts getting a little murkier. The connections can be even more distant and difficult to immediately pinpoint. For instance, Person A may be married to someone whose brother is a friend of Person B. Or perhaps they have a shared affiliation, such as attending the same school, working at the same organization, or attending the same church.
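As a rough illustration of why this is a link-analysis problem, the sketch below uses the open-source networkx Python library (a tool chosen here for illustration, not one named in the article) to enumerate connection chains in a toy social graph:

```python
import networkx as nx

# A toy social graph: nodes are people or affiliations, edges are
# relationships. The names are invented.
G = nx.Graph()
G.add_edges_from([
    ("Person A", "Spouse"),
    ("Spouse", "Brother"),
    ("Brother", "Person B"),
    ("Person A", "Alumni Club"),   # shared affiliation
    ("Person B", "Alumni Club"),
])

# The shortest chain of connections between the two people.
print(nx.shortest_path(G, "Person A", "Person B"))
# ['Person A', 'Alumni Club', 'Person B']

# Every distinct connection path up to three hops: the murkier, more
# distant links that make this hard to partition across machines.
for path in nx.all_simple_paths(G, "Person A", "Person B", cutoff=3):
    print(path)
```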

In some cases, two individuals' only connection may be sharing a few Likes. These shared affinities may be valuable information to a business if, for example, those Likes happen to be something your organization addresses. In that case, you may want to drill down to those specific people out of the entire billion users on Facebook, so that you can target your online advertising directly to them.

If you think of all the possible ways that one Facebook user can be connected to another user, it is a very different kind of Big Data problem. You cannot simply break up the problem into smaller segments because, by definition, it involves connections that require link analysis. This makes it a problem that Hadoop isn't ideally suited to address.

Link analysis problems occur in many domains beyond social networks. The network of neurons in the brain and the pathways between these neurons is an example. A group of suspicious people and their connections (as observed by their interactions) is another. The network of genes and proteins and their interactions is yet another.

What do you do to solve problems that involve complex relationship patterns and require detailed link analysis? Enter graph analytics.

Graph Analytics
Essentially, graphs provide a way of organizing data to specifically highlight relationships. On such a foundation, it is possible to apply a range of analytical techniques, from simple to complex: to understand groups of similar related entities, to identify the central influencer in a social network, or to detect complex patterns of behavior indicative of fraud.

In fact, the secret to Google's search engine success is the use of a specific graph analytics technique called PageRank. Rather than focus on the prevalence of keywords in a web page, Google focused on the relationships between web pages on the World Wide Web and prioritized results from highly authoritative sites, achieving astonishing accuracy in determining relevant results for keyword searches.
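networkx also ships a PageRank implementation, which makes the idea easy to demonstrate on an invented four-page web (a sketch of the technique, not of Google's production system):

```python
import networkx as nx

# A tiny directed "web": an edge u -> v means page u links to page v.
web = nx.DiGraph([
    ("page_a", "page_b"),
    ("page_b", "page_c"),
    ("page_c", "page_a"),
    ("page_d", "page_c"),  # page_c earns an extra inbound link
])

# PageRank scores each page from the link structure alone: pages linked
# to by well-linked pages rank higher, regardless of keyword counts.
ranks = nx.pagerank(web, alpha=0.85)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```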

A common, standard way of representing data in this relationship-oriented format is RDF (Resource Description Framework), a W3C standard, which is accompanied by a query language called SPARQL designed specifically for analyzing such data. In the Life Sciences domain, companies and public consortia are increasingly representing data in this form, because it provides a more comprehensive view of the data relationships, whether gene/protein interactions or diseases and their genetic characteristics.
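A minimal sketch using the Python rdflib library, with an invented life-sciences vocabulary, shows how RDF triples assert relationships directly and how SPARQL traverses them:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # invented vocabulary for this sketch

g = Graph()
# Each triple asserts a relationship: (subject, predicate, object).
g.add((EX.BRCA1, EX.encodes, EX.ProteinBRCA1))
g.add((EX.ProteinBRCA1, EX.interactsWith, EX.ProteinBARD1))
g.add((EX.BRCA1, EX.associatedWith, EX.BreastCancer))

# SPARQL follows the relationships: which proteins interact with the
# protein encoded by BRCA1?
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?partner WHERE {
        ex:BRCA1 ex:encodes ?protein .
        ?protein ex:interactsWith ?partner .
    }
""")
for row in results:
    print(row.partner)  # http://example.org/ProteinBARD1
```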

Requires Secret Sauce
Since the nature of graphs makes them difficult to partition, Hadoop is not well suited to this class of analytical problems.

As a matter of fact, the problems are even deeper than that: Because of the unpredictable nature of data access while following and analyzing relationships, commodity hardware architectures are fundamentally challenged. Merely grouping machines together does not address these issues, because the challenges posed by graph analytics are at the network level and are not significantly addressed by the computing capacity of a single machine. What is the ideal approach for solving complex problems involving the analysis of relationships in data?

The secret sauce behind the best-performing graph analytics tools is massive un-partitioned memory. One tool, for example, uses a memory pool of up to 512TB (half a petabyte) to perform continuous data and link analysis in real time, even as data continues to pour in. This eliminates latency problems and memory scalability issues, while customized chips speed performance.

Comparison Table: Hadoop vs. Graph Analytics

|                | Hadoop                     | Graph Analytics                        |
|----------------|----------------------------|----------------------------------------|
| Operation mode | Batch                      | Real-time                              |
| Language       | MapReduce                  | SPARQL                                 |
| Platform       | Any commodity hardware     | Specialized hardware                   |
| Queries        | Must be partitioned        | Allows non-partitioned                 |
| Query types    | Seek specific data answers | Discover relationships and connections |
| Results        | Tables of entities         | Relationships between entities         |

Graph Analytics Use Cases
Graph analytics is a new player in the Big Data game (which, itself, is quite new). Still, the pioneers and early adopters are reporting promising results for graph analytics as an alternative for solving diverse types of problems. Several examples include:

  • Actionable intelligence: QinetiQ North America (QNA) delivers "actionable intelligence" to government customers interested in identifying threats through the detection of non-obvious patterns of relationships in big data. Graph analytics was the obvious approach, and QNA uses a purpose-built graph analytics appliance running graph-optimized hardware and a graph database. It interacts with the appliance through the industry-standard RDF/SPARQL interface, as defined by the World Wide Web Consortium (W3C).
  • Life sciences: Oak Ridge National Laboratory (ORNL) opted for a graph analytics appliance to conduct research in healthcare fraud and analytics for a leading healthcare payer. In addition to the healthcare fraud detection program, researchers and scientists at ORNL will also apply the capabilities of the graph-analytics appliance to other areas of research where data discovery is vital. These potential use cases include healthcare treatment efficacy and outcome analysis, analyzing drugs and side effects, and the analysis of proteins and gene pathways.
  • Higher education: The Pittsburgh Supercomputing Center (PSC) turned to a graph analytics appliance called Sherlock (no relation to IBM's Watson) to provide researchers with the ability to search extremely large and complex bodies of information using a straightforward command similar to 'find something important.' Sherlock took advantage of specialized graph analytics hardware to run 128 threads per processor and to speed memory access across a terabyte of global shared memory. The appliance helped PSC win public recognition for extending graph analytics techniques to a wide range of scientific research projects.

The potential uses of graph analytics are just beginning to be explored. Already, the technology is being applied across a broad array of industries, including manufacturing, energy and gas exploration, earth sciences and meteorology, and government and defense.

Advantages Offered by Graph Analytics
A key advantage of graphs is the ease with which new sources of data and new relationships can be added. Graph databases using RDF to represent the graph can easily merge and unify diverse datasets without significant upfront investment in data modeling. Such an approach lies in stark contrast to 'traditional' analytics, in which a great deal of time is spent organizing data, and the addition of new data sources requires time-consuming and error-prone effort by analysts.
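A short rdflib sketch (again with invented data) shows why on-boarding is cheap: an RDF dataset is just a set of triples, so unifying two independently produced sources is a union, and shared URIs line the datasets up automatically:

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # invented vocabulary

# Two datasets about the same customer, produced by different systems.
sales = Graph()
sales.add((EX.Alice, EX.purchased, EX.ProductX))

support = Graph()
support.add((EX.Alice, EX.filedTicket, EX.Ticket42))

# Merging requires no schema migration or upfront remodeling: just add
# the triples together. EX.Alice is the same URI in both sources, so the
# merged graph presents one unified view of her.
merged = Graph()
for source in (sales, support):
    for triple in source:
        merged.add(triple)

print(len(merged))  # 2 triples, now queryable as a single graph
```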

The easy on-boarding of new data is particularly important when dealing with Big Data. Traditional analytics focus on finding answers to known questions. By contrast, many of the highest value applications, such as those identified above, are focused on discovery, where the questions to be answered are not known in advance. The ability to quickly and easily add new data sources or new relationships within the data when needed to support a new line of questioning is crucial for discovery, and graphs are uniquely well qualified to support these requirements.

Graph analytics also offer sophisticated capabilities for analyzing relationships, while traditional analytics focus on summarizing, aggregating and reporting on data. Use the right tool for the job. Some common graph analytic techniques include (each is sketched in code after this list):

  1. Centrality analysis: To identify the most central entities in your network, a very useful capability for influencer marketing.
  2. Path analysis: To identify all the connections between a pair of entities, useful in understanding risks and exposure.
  3. Community detection: To identify clusters or communities, which is of great importance to understanding issues in sociology and biology.
  4. Sub-graph isomorphism: To search for a pattern of relationships, useful for validating hypotheses and searching for abnormal situations, such as hacker attacks.
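A compact networkx sketch of all four techniques, run on an invented six-person network (two triangles of friends bridged by a single relationship):

```python
import networkx as nx
from networkx.algorithms import community, isomorphism

# Two tight clusters joined by one bridging edge.
G = nx.Graph()
G.add_edges_from([
    ("ann", "bob"), ("bob", "cam"), ("ann", "cam"),   # cluster 1
    ("dee", "eve"), ("eve", "fay"), ("dee", "fay"),   # cluster 2
    ("cam", "dee"),                                   # the bridge
])

# 1. Centrality: 'cam' and 'dee' score highest, since every path
#    between the clusters runs through them.
print(nx.betweenness_centrality(G))

# 2. Path analysis: every way 'ann' is connected to 'fay' within 4 hops.
print(list(nx.all_simple_paths(G, "ann", "fay", cutoff=4)))

# 3. Community detection: should recover the two triangle clusters.
print(list(community.greedy_modularity_communities(G)))

# 4. Sub-graph isomorphism: does the network contain a triangle pattern?
triangle = nx.cycle_graph(3)
matcher = isomorphism.GraphMatcher(G, triangle)
print(matcher.subgraph_is_isomorphic())  # True
```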

Complementary to Hadoop
Interestingly, Hadoop and graph analytics complement each other perfectly. Hadoop is a scale-out solution, allowing independent items of work to be parceled out to the computers in a cluster. Graph analytics, on the other hand, excel at looking at the "big picture," analyzing complex networks of relationships that cannot be partitioned.

For example, consider risk analysis within a financial institution. Many documents will need to be independently analyzed, and the relationships between organizations extracted. This is a perfect job for Hadoop, since each document is independent of the others. On the other hand, the complex network of relationships between organizations forms an un-partitionable graph, which is best analyzed as a single entity, in memory.
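A toy end-to-end sketch (invented documents and a deliberately naive extractor) shows the division of labor: the per-document map step is embarrassingly parallel, while the merged relationship network is analyzed as one in-memory graph:

```python
import networkx as nx

documents = [
    "AcmeCorp guarantees a loan issued by BetaBank",
    "BetaBank holds bonds issued by GammaFund",
]

def extract_relationships(doc):
    """Map step (Hadoop-friendly): pull (org, org) pairs out of one
    document, independently of every other document."""
    orgs = [w for w in doc.split() if w[0].isupper()]
    return list(zip(orgs, orgs[1:]))

# Graph step: the merged network of relationships cannot be partitioned,
# so it is built and analyzed as a single directed graph.
G = nx.DiGraph()
for doc in documents:
    G.add_edges_from(extract_relationships(doc))

# Which organizations are exposed, directly or indirectly, to GammaFund?
print(sorted(nx.ancestors(G, "GammaFund")))  # ['AcmeCorp', 'BetaBank']
```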

Relationships and Connections
Analysts today have a tabular, "row-and-column" mindset when it comes to data and analytics - probably a byproduct of the spreadsheet's decades of success.

But don't you often think about problems and data in different ways?

Graph analytics explicitly model and reason about the relationships between different entities, and graph tools also display those relationships visually. The analyst can see all the relationships in which an entity participates, and intuitively assess which elements are close or important.

When it comes to customers, relationships, rather than tabular data, may be the most important element: they are more predictive of whether you will retain or lose a customer. The more connections customers have to your organization, its products, and its people, the more likely they are to remain customers. Relationships, not tables, are also key to hacker and threat identification, risk and fraud analysis, influencer marketing, and many other high-value applications.

Graph analytics complement Hadoop and provide a level of immediate, deep insights that are not readily obtainable in any other way.

More Stories By Venkat Krishnamurthy

Venkat Krishnamurthy is the Product Management Director at YarcData, driving the direction and definition of YarcData products and solutions and working with customers to make them successful. Krishnamurthy has over a decade of experience in advanced analytics, including as a Director of Product Management at Oracle and as Vice President of Technology at Goldman Sachs. At Goldman, he conducted data analysis to assess risk controls across multiple trading desks and asset classes, covering algorithmic trading, market risk model validation, and prime brokerage.
