Extending and Augmenting Hadoop

Use the right tool for the job

In the last two years, the Apache Hadoop software library has emerged as a veritable Swiss Army knife of data management and analytical infrastructure. The Hadoop toolset has been positioned as a universal platform for all types of commercial and organizational analytical needs.

Hadoop is an ideal solution for use cases in which the data is easily partitioned and distributed. For example, consider keyword searches, a major component of SEO. Simply identifying and counting distinct words in text data is a central part of the process for keyword-based searches. No matter how many pieces of text you have, each document, article, blog post, or other piece of content is distinct from the others.

To enable keyword search, a program counts the occurrences of each word in each distinct document or item of text. Clearly, this can be done in isolation: to count the number of times a word occurs across a set of documents, you can count it within each document and add up the counts. Moreover, each document can be counted at the same time as any other, since they are distinct. An Internet-scale search engine like Google essentially leverages this property to distribute such processing across a large number of simple machines (a cluster).

Another example where Hadoop shines is when it comes to counting the number of times a specific user, as represented by an IP address, has visited a particular web page or website. Again, this can be broken up into a series of smaller problems and spread over multiple machines in a Hadoop cluster. Results from the smaller sets can then be aggregated to obtain the total count.

MapReduce, the programming model behind Hadoop, was designed to address problems in which an operation over a very large dataset can easily be broken into the same operation on smaller datasets. The promise of Hadoop is in the ability to use open source software relatively inexpensively to address this whole class of partitionable problems.
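
Both counting examples follow the same two-step shape that MapReduce formalizes: a map step that processes each partition in isolation, and a reduce step that merges the partial results. Below is a minimal sketch of that pattern in plain Python - no Hadoop involved, and the function names are illustrative rather than part of any Hadoop API:

```python
# A minimal sketch of the MapReduce pattern behind word counting.
# Pure Python; the parallelism over documents stands in for a Hadoop cluster.
from collections import Counter
from multiprocessing import Pool

documents = [
    "hadoop counts words in parallel",
    "graph analytics counts relationships",
    "hadoop scales out on commodity machines",
]

def map_count(doc):
    """Map step: count words within a single document, in isolation."""
    return Counter(doc.split())

def reduce_counts(partials):
    """Reduce step: merge the per-document counts into a global total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    with Pool() as pool:  # each document can go to a different worker
        partial_counts = pool.map(map_count, documents)
    print(reduce_counts(partial_counts))  # Counter({'hadoop': 2, 'counts': 2, ...})
```

The per-IP visit count works the same way: map over log partitions to produce partial counts per IP address, then reduce by summing.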

However, there are a number of analytical use cases for which Hadoop is inadequate. In such cases, the Hadoop toolset may need to be augmented and extended with other technologies to properly resolve these problems.

Understanding Graph Connections
In parallel with the emergence of Hadoop, the world of social media has exploded: As of 2012, the social media powerhouse Facebook had more than one billion registered users, according to CEO Mark Zuckerberg.

Social media networks such as Facebook and LinkedIn are driven by a fundamental focus on relationships and connections. For example, Facebook users can now use the service's Graph Search to find friends of friends who live in the same city or like the same baseball team, and the site frequently suggests "people you may know" based on the mutual connections that two unconnected individuals have established. LinkedIn focuses on helping business professionals grow their social networks by helping them find key contacts or prospects who are connected to existing friends or colleagues, and allowing users to leverage those existing relationships to form new connections. The use of such data connections is becoming ever more useful to individuals for enhancing their personal and business lives.

Likewise, the capacity to comprehend and assess such relationships is a key component driving the world of business analytics. For example, business managers frequently want to know the answers to questions such as:

  • What are all the ways in which a person of interest in a crime database may be related to another person of interest?
  • Based on known patterns of suspicious behavior in a corporate network, how can we identify malicious hacking attacks before they have a financial impact on our company?
  • Which of an organization's partners have a financial exposure to the failure of another company?

Take the question of how two people might be connected on social media. This may seem simple, but as soon as you look closely, it's not quite so clear. The simplest example of such a problem is in looking at how two people may be connected on Facebook. They can be friends - a direct connection that is hard to miss. Or they might be friends of friends, which starts getting a little murkier. The connections can be even more distant and difficult to immediately pinpoint. For instance, Person A may be married to someone whose brother is a friend of Person B. Or perhaps they have a shared affiliation, such as attending the same school, working at the same organization, or attending the same church.
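
To make the combinatorics concrete, here is a small sketch using the networkx Python library; the people and affiliations form an invented toy graph. Even in this tiny network, enumerating the distinct chains between two people means exploring paths that can run through any edge:

```python
# A toy illustration of connection-finding as a graph problem.
# The people and affiliations are hypothetical.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Person A", "Spouse"), ("Spouse", "Brother"),
    ("Brother", "Person B"),                                  # in-law chain
    ("Person A", "Alma Mater"), ("Person B", "Alma Mater"),   # shared affiliation
])

# The shortest chain of connections between the two people
print(nx.shortest_path(G, "Person A", "Person B"))
# ['Person A', 'Alma Mater', 'Person B']

# Every distinct chain up to a given length -- the search space that resists
# partitioning, because any edge may matter to any path
for path in nx.all_simple_paths(G, "Person A", "Person B", cutoff=3):
    print(path)
```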

In some cases, two individuals' only connection may be sharing a few Likes. These shared affinities may be valuable information to a business if, for example, those Likes happen to be something your organization addresses. In that case, you may want to drill down to those specific people out of the entire billion users on Facebook, so that you can target your online advertising directly to them.

If you think of all the possible ways that one Facebook user can be connected to another user, it is a very different kind of Big Data problem. You cannot simply break up the problem into smaller segments because, by definition, it involves connections that require link analysis. This makes it a problem that Hadoop isn't ideally suited to address.

Link analysis problems occur in many domains beyond social networks. The network of neurons in the brain and the pathways between these neurons is an example. A group of suspicious people and their connections (as observed by their interactions) is another. The network of genes and proteins and their interactions is yet another.

What do you do to solve problems that involve complex relationship patterns and require detailed link analysis? Enter graph analytics.

Graph Analytics
Essentially, graphs provide a way of organizing data to specifically highlight relationships. On such a foundation, it is possible to apply analytical techniques, from simple to complex, to understand groups of similar, related entities, to identify the central influencer in a social network, or to identify complex patterns of behavior indicative of fraud.

In fact, the secret to Google's search engine success is the use of a specific graph analytics technique called PageRank. Rather than focusing on the prevalence of keywords in a page, Google focused on the relationships between web pages and prioritized results from highly authoritative sites - an approach that proved astonishingly accurate at determining relevance for keyword search.
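
As a toy illustration, here is PageRank run with networkx's built-in implementation on a hypothetical four-page web; Google's production system is, of course, vastly more elaborate:

```python
# PageRank on a tiny invented link graph: pages with more (and more
# authoritative) inbound links score higher.
import networkx as nx

web = nx.DiGraph()
web.add_edges_from([
    ("page_a", "page_b"), ("page_c", "page_b"),
    ("page_d", "page_b"), ("page_b", "page_a"),
])

for page, score in sorted(nx.pagerank(web).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")  # page_b ranks first, then page_a
```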

A common, standard way of representing data in this relationship-oriented format is RDF (Resource Description Framework), a W3C standard accompanied by SPARQL, a query language designed specifically for analyzing such data. In the Life Sciences domain, companies and public consortia are increasingly representing data in this form, because it provides a more comprehensive view of the data relationships - whether gene/protein interactions, or diseases and their genetic characteristics.
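
As a minimal sketch of what RDF data and a SPARQL query look like in practice, here is an example using the Python rdflib library; the gene/disease triples are invented purely for illustration:

```python
# RDF stores facts as subject-predicate-object triples; SPARQL matches
# patterns over them. Data and predicates here are hypothetical.
from rdflib import Graph

turtle_data = """
@prefix ex: <http://example.org/> .
ex:GeneX ex:associatedWith ex:DiseaseY .
ex:GeneX ex:interactsWith  ex:ProteinZ .
ex:GeneQ ex:associatedWith ex:DiseaseY .
"""

g = Graph()
g.parse(data=turtle_data, format="turtle")

# Find every gene associated with DiseaseY
query = """
PREFIX ex: <http://example.org/>
SELECT ?gene WHERE { ?gene ex:associatedWith ex:DiseaseY . }
"""
for row in g.query(query):
    print(row.gene)  # ex:GeneX, ex:GeneQ
```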

Requires Secret Sauce
Since the nature of graphs makes them difficult to partition, Hadoop is not well suited to this class of analytical problems.

As a matter of fact, the problems are even deeper than that: Because of the unpredictable nature of data access while following and analyzing relationships, commodity hardware architectures are fundamentally challenged. Merely grouping machines together does not address these issues, because the challenges posed by graph analytics are at the network level and are not significantly addressed by the computing capacity of a single machine. What is the ideal approach for solving complex problems involving the analysis of relationships in data?

The secret sauce behind the best-performing graph analytics tools is massive un-partitioned memory. One tool, for example, uses a memory pool of up to 512TB (half a petabyte) to perform continuous data and link analysis in real time, even as data continues to pour in. This eliminates latency and memory scalability issues, while customized chips speed performance.

Comparison Table: Hadoop vs. Graph Analytics

                   | Hadoop                     | Graph Analytics
  Operation mode   | Batch                      | Real-time
  Language         | MapReduce                  | SPARQL
  Platform         | Any commodity hardware     | Specialized hardware
  Queries          | Must be partitioned        | Allows non-partitioned
  Query types      | Seek specific data answers | Discover relationships, connections
  Results          | Tables of entities         | Relationships between entities

Graph Analytics Use Cases
Graph analytics is a new player in the Big Data game (which is itself quite new). Still, pioneers and early adopters are reporting promising results for graph analytics across diverse types of problems. Examples include:

  • Actionable intelligence: QinetiQ North America (QNA) delivers "actionable intelligence" to government customers interested in identifying threats through the detection of non-obvious patterns of relationships in big data. Graph analytics were the obvious approach, for which QNA uses a purpose-built graph analytics appliance running graph-optimized hardware and a graph database. It interacts with the appliance through the industry-standard RDF/SPARQL interface, as defined by the World Wide Web Consortium (W3C).
  • Life sciences: Oak Ridge National Laboratory (ORNL) opted for a graph analytics appliance to conduct research in healthcare fraud and analytics for a leading healthcare payer. In addition to the healthcare fraud detection program, researchers and scientists at ORNL will also apply the capabilities of the graph-analytics appliance to other areas of research where data discovery is vital. These potential use cases include healthcare treatment efficacy and outcome analysis, analyzing drugs and side effects, and the analysis of proteins and gene pathways.
  • Higher education: The Pittsburgh Supercomputing Center (PSC) turned to a graph analytics appliance called Sherlock (no relation to IBM's Watson) to give researchers the ability to search extremely large and complex bodies of information using a straightforward command similar to 'find something important.' Sherlock took advantage of specialized graph analytics hardware to run 128 threads per processor and sped up memory access across a terabyte of globally shared memory. The appliance helped PSC win public recognition for extending graph analytics techniques to a wide range of scientific research projects.

The potential uses of graph analytics are just beginning to be explored. Already, the technology is being applied across a broad array of industries, including manufacturing, energy and gas exploration, earth sciences and meteorology, and government and defense.

Advantages Offered by Graph Analytics
A key advantage of graphs is the ease with which new sources of data and new relationships can be added. Graph databases that use RDF to represent the graph can easily merge and unify diverse datasets without a significant upfront investment in data modeling. Such an approach stands in stark contrast to 'traditional' analytics, in which a great deal of time is spent organizing data, and the addition of new data sources requires time-consuming and error-prone effort by analysts.
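
For a concrete sense of that ease, here is a sketch of merging two independently produced RDF datasets with rdflib: on-boarding the second source is just another parse into the same graph, with no schema migration (the data and predicates are invented):

```python
# Merging RDF datasets: both sources end up in one queryable graph.
from rdflib import Graph

clinical = """
@prefix ex: <http://example.org/> .
ex:PatientA ex:diagnosedWith ex:DiseaseY .
"""
genomic = """
@prefix ex: <http://example.org/> .
ex:DiseaseY ex:linkedToGene ex:GeneX .
"""

g = Graph()
g.parse(data=clinical, format="turtle")  # first source
g.parse(data=genomic, format="turtle")   # second source merges in place

print(len(g))  # 2 -- facts from both sources are now queryable together
```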

The easy on-boarding of new data is particularly important when dealing with Big Data. Traditional analytics focus on finding answers to known questions. By contrast, many of the highest value applications, such as those identified above, are focused on discovery, where the questions to be answered are not known in advance. The ability to quickly and easily add new data sources or new relationships within the data when needed to support a new line of questioning is crucial for discovery, and graphs are uniquely well qualified to support these requirements.

Graph analytics also offer sophisticated capabilities for analyzing relationships, while traditional analytics focus on summarizing, aggregating, and reporting on data. Use the right tool for the job. Some common graph analytic techniques include (each illustrated in the sketch after this list):

  1. Centrality analysis: To identify the most central entities in your network, a very useful capability for influencer marketing.
  2. Path analysis: To identify all the connections between a pair of entities, useful in understanding risks and exposure.
  3. Community detection: To identify clusters or communities, which is of great importance to understanding issues in sociology and biology.
  4. Sub-graph isomorphism: To search for a pattern of relationships, useful for validating hypotheses and searching for abnormal situations, such as hacker attacks.
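
Here is the sketch promised above: one-line illustrations of the four techniques using networkx on its bundled karate-club social graph. Which algorithm fits a given production problem is, of course, a judgment call:

```python
# The four common graph analytic techniques, each in miniature.
import networkx as nx
from networkx.algorithms import community, isomorphism

G = nx.karate_club_graph()  # classic small social network bundled with networkx

# 1. Centrality analysis: the member most "between" all others (an influencer)
centrality = nx.betweenness_centrality(G)
influencer = max(centrality, key=centrality.get)

# 2. Path analysis: every shortest chain of ties between two members
paths = list(nx.all_shortest_paths(G, source=0, target=33))

# 3. Community detection: clusters of densely connected members
communities = community.greedy_modularity_communities(G)

# 4. Sub-graph isomorphism: search for a pattern of ties (here, a triangle)
pattern = nx.complete_graph(3)
has_triangle = isomorphism.GraphMatcher(G, pattern).subgraph_is_isomorphic()

print(influencer, len(paths), len(communities), has_triangle)
```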

Complementary to Hadoop
Interestingly, Hadoop and graph analytics complement each other perfectly. Hadoop is a scale-out solution, allowing independent items of work to be parceled out to the computers in a cluster. Graph analytics, on the other hand, excel at looking at the "big picture," analyzing complex networks of relationships that cannot be partitioned.

For example, consider risk analysis within a financial institution. Many documents will need to be independently analyzed, and the relationships between organizations extracted. This is a perfect job for Hadoop, since each document is independent of the others. On the other hand, the complex network of relationships between organizations forms an un-partitionable graph, which is best analyzed as a single entity, in memory.
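
As a sketch of this division of labor, the pipeline below simulates the scale-out extraction phase with a Python process pool and then hands the merged relationships to a single in-memory networkx graph for link analysis. The documents and the naive extract_relationships() parser are hypothetical stand-ins:

```python
# Stage 1 (Hadoop's role): independent, partitionable per-document extraction.
# Stage 2 (graph's role): one un-partitioned in-memory network, analyzed whole.
from multiprocessing import Pool
import networkx as nx

documents = [
    "AcmeCorp supplies BetaBank",
    "BetaBank lends-to GammaCo",
    "GammaCo supplies AcmeCorp",
]

def extract_relationships(doc):
    """Naive parse: treat the first and last tokens as the two organizations.
    Each document is handled in isolation, so this parallelizes freely."""
    tokens = doc.split()
    return [(tokens[0], tokens[-1])]

if __name__ == "__main__":
    with Pool() as pool:  # the scale-out phase
        edge_lists = pool.map(extract_relationships, documents)

    G = nx.DiGraph()      # the in-memory, un-partitioned phase
    for edges in edge_lists:
        G.add_edges_from(edges)

    # Link analysis over the whole network: circular financial exposure
    print(list(nx.simple_cycles(G)))  # e.g. [['AcmeCorp', 'BetaBank', 'GammaCo']]
```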

Relationships and Connections
Analysts today have a tabular, "row-and-column" mindset when it comes to data and analytics - probably a byproduct of the spreadsheet's decades of success.

But don't you often think about problems and data in different ways?

Graph analytics explicitly model and reason about the relationships between different entities, and graph tools also display those relationships visually. The analyst can see all the relationships in which an entity participates, and intuitively assess which elements are close or important.

When it comes to customers, relationships, rather than tabular data, may be the most important element: they are more predictive of your likelihood of retaining or losing customers. The more connections customers have to your organization, its products, and its people, the more likely they will remain customers. Relationships, not tables, are also key to hacker and threat identification, risk and fraud analysis, influencer marketing and many other high value applications.

Graph analytics complement Hadoop and provide a level of immediate, deep insights that are not readily obtainable in any other way.

More Stories By Venkat Krishnamurthy

Venkat Krishnamurthy is the Product Management Director at YarcData, driving the direction and definition of YarcData products and solutions and working with customers to make them successful. Krishnamurthy has over a decade of experience in advanced analytics, including as a Director of Product Management at Oracle and as Vice President of Technology at Goldman Sachs. At Goldman, he conducted data analysis to assess risk controls across multiple trading desks and asset classes, spanning algorithmic trading, market risk model validation, and prime brokerage.
