Welcome!

@BigDataExpo Authors: Liz McMillan, Kevin Benedict, Elizabeth White, Pat Romanski, Dana Gardner

Related Topics: @BigDataExpo, Java IoT, Linux Containers, Containers Expo Blog, @CloudExpo, Apache

@BigDataExpo: Blog Post

The Hadoop Business Case By @JGlesner | @BigDataExpo [#BigData]

Big Data is 'data that exceeds the processing capacity of conventional database systems'

By

Since the birth of Hadoop in 2005-06, the way we think about storing and processing information has evolved considerably. The term “Big Data” has become synonymous with this evolution. But still, many of our customers continue to ask, “What is Big Data?”, “What are its use cases?”, and “What is its business value?”. The Internet is overloaded with definitions, characteristics, and benefits; however, few discussions synthesize all three of these topics in one place. This paper answers these questions, and proposes a total cost calculation framework for CTOs and CIOs that are evaluating solutions for their organization’s use case(s). In the text below, I examine an on-premise Hadoop ecosystem as a general purpose Big Data solution in relation to alternative commercial purpose-built storage technologies (-e.g. Oracle, Teradata, IBM, SAP, Microsoft, EMC, etc). It may be difficult to determine the exact point at which you should leverage one over the other. It is my contention that when the total cost of using all your data exceeds what you are able to spend using purpose-built technologies, it is time to consider using a general purpose solution like Hadoop for process offloading.

What is Big Data?
According to Edd Dumbill, a well respected thought leader and VP of Strategy for Silicon Valley Data Science, a big data and data science consulting company, Big Data is “data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” This definition was published in an article entitled “What is Big Data?” in Big Data Now: 2012 Edition by O’Reilly Media, and touches on the three primary characteristics of data [1]:

  • Volume: The size of your data. Data has mass, and there is a cost to moving it around the network.
  • Velocity: The speed with which the data either arrives or is created, and how quickly it needs to be consumed in order to make use of it.
  • Variety: The differences in structure between all the types of data within an enterprise.

Scour the internet and you will find that there are other, less commonly discussed but relevant characteristics associated with Big Data to include Veracity and Volatility [2]. Veracity refers to the truthfulness of the data or your degree of trust in what it is conveying. Volatility is how often your existing data changes or is updated by the new data you are receiving/creating. There are certainly other characteristics as well. I recently began using the term Viscosity to describe the degree of data fragmentation in client environments, and the level of effort required to reassemble it into a coherent view. In this context, organizations with low viscosity have significant fragmentation, and duplication throughout their enterprise.

The term “Big Data” has come to focus on these characteristics and imply that traditional database architectures, such as on line transaction processing (OLTP) and on line analytic processing (OLAP) purpose-built technologies simply will not scale to meet your data capacity needs. However, massively parallel processing (MPP) database architectures are one example of purpose-built technologies that have been developed to support both OLTP and OLAP data structures at enormous scales up into the petabytes (PB). For example, there is a 50 PB MPP cluster at eBay [3]. Certainly this size conforms to any logical definition of Big Data.

Depending on your use case, it is possible that a purpose-built technology may suite your needs at scale. While MPP systems remain most effective with structured, tabular and transactional data sets, it is possible to store most everything except massive files in relational structures. However, this may not be the best fit for your use case(s) in terms of appropriateness or cost. There is limited published pricing data for commercial offerings, but MPP systems are notably expensive. When including the cost of software, hardware, and licensing/support, the cost per terabyte (TB) of an MPP system is estimated at tens of thousands of dollars [4]. At these prices, a one PB system can cost tens of millions of dollars. In contrast, the equivalent cost of Hadoop is roughly $2,000 per TB, leaving a one PB Hadoop cluster to cost roughly $2 million. That is a significant initial cost savings; however, use case(s) will always drive the total cost of any solution.

Coincidentally, eBay has also released information on their production 50 PB Hadoop cluster, one of the largest such clusters in the world [5]. The fact that eBay uses both types of systems demonstrates that there is a place for each, and that the difference may come down to price and purpose. Given the relative lower cost of Hadoop, I submit that it is easier to identify Big Data if we add cost to our definition. Therefore, Big Data is the result when (a) the sum of all your data’s characteristics coupled with (b) the resources required to achieve your use case exceeds (c) the cost you are willing/able to spend using traditional approaches. When that inflection point is reached, it is clearly time to consider other, non-traditional approaches for process offloading. Each unique situation warrants a cost/benefit analysis to determine if a general-purpose solution like Hadoop is right for your use case.

What are the Use Cases for Big Data?
Process offloading refers to the act of moving workloads from one implementation to another to achieve better suitability, performance, availability, etc., at a lower price point. Both traditional and non-traditional solutions have advantages and disadvantages given a particular workload, and they should be leveraged accordingly to maximize cost efficiencies. Let us examine Hadoop’s use cases for process offloading.

Hadoop is comprised of two major components: the Hadoop Distributed File System (HDFS) and MapReduce, a framework for writing applications to process large amounts of content over multiple nodes (servers). Hadoop is often referred to as a schemaless system because data is not forced into a schema upon ingest. Ultimately, there is a structure known as the key/value pair in which data is expressed as a collection of [key]->[value] tuples or records. This is the most fundamental data structure in computer science. Hadoop uses the key/value pair because nearly any data can be expressed, stored, processed and retrieved using this minimal structure. Because key/value is so rudimentary, a schema can be applied at query time based on the question being asked. This adds tremendous flexibility and differs significantly from traditional approaches like OLTP and OLAP, which require you to know/define the data model up front, and have an understanding of the questions you intend to ask. Figure 1 illustrates these different process flow models. Having to know what questions you intend to ask, and constructing a pre-defined schema will add artificial constraints to the answers you are able to get from the data.

Another issue with schema-based systems is scalability. Traditional relational architectures scale vertically with ease, but are difficult to design for horizontal scaling due to their rigid data structures (tables, table relationships, rows, columns, indices) which must be sharded or split across multiple nodes. The integrity of these structures must be maintained while offering near-real-time (on line) create, read, update and delete (CRUD) operations on data. This is not trivial, and it requires commercial companies to make significant financial investments to do it well, which drive up the cost of those solutions. As a schemaless system, the latest release of Hadoop (2.x) scales horizontally to 10,000+ nodes without the added complexity inherent to traditional MPP systems [6].

Many organizations have purpose-built solutions for asking business intelligence questions, providing disaster recovery/backup, etc., but scaling these solutions beyond an initial, narrowly defined usage for structured data usually involves significant cost increases. As a schemaless computational file system, Hadoop can be applied to an almost endless set of challenges at a lower cost. Below we walk through six higher-order use cases to illustrate how these savings can be realized:

1. Raw Storage/Data Lake: Backing up all the data your enterprise collects and creates daily, to include its historical holdings, for continuity of operations (COOP) and disaster recovery (DR) has previously been too expensive, and therefore unfeasible. Instead, businesses make difficult tradeoffs as to what will and will not be recoverable should disaster strike. Imagine the possibilities if you were able to economically store everything in your enterprise for the price of traditional commodity hard disks. Fortunately, Hadoop makes this dream a reality with its internally redundant data structure that by default makes three copies of all data written to HDFS. This scalable, schemaless raw storage lends itself conceptually to what is now being called a “data lake”. A data lake is based on the notion that data can be tagged with metadata about its source, contents, structure and other characteristics. These properties stay with the data as it is minimized into key-value pairs and written to the Hadoop file system. To process the data, all one needs to know is what data they wish to process leveraging these properties. This allows many different types of data to exist side-by-side within the simple structures of the Data Lake. The amount of pre-processing is minimal, as data is no longer fit into specific schemas up-front, making the data accessible to a wider variety of purposes. This would not be cost effective using traditional commercial systems.

2. Multi-Format Data Analysis: There are many different types of data beyond structured and unstructured text, to include audio, video, and images. Analyzing structured and unstructured text at scale can be an expensive and difficult challenge, but analyzing large collections of digital media is not even possible using traditional relational systems. Many businesses have previously been unable to unlock the potential of their data holdings due to an inability to process digital content, such as the ability to analyze and track objects in video, or to identify and extract biomarkers in healthcare images. HDFS accepts all these formats for analysis without the need for a schema. Hadoop’s ability to work with unstructured text and binary data (audio, video, imagery) extends well beyond the native capabilities offered by existing storage solutions, providing an enormous capability advantage.

3. Data Cleansing/Transformation Businesses often contend with multiple relational data models, unstructured text and streaming data. You likely need to correlate, cleanse, de-duplicate, synchronize and normalize/de-normalize these data sets as they move between databases and tools to create a complete, clean operating picture for downstream analysis. The vast majority of work in conducting analytics is often preparing the data for use. In addition, new initiatives to leverage autonomous self-reporting devices and sensors provide continuous streams of data, creating explosions in the amount of information if used in their raw form. Purpose-built technologies present challenges when attempting these types of tasks due to their reliance on schemas. General purpose solutions, like the Hadoop ecosystem, deliver an economical way of storing, pre-processing and/or summarizing these data sets and streams, thereby minimizing the unchecked growth in commercial licensing investments within your enterprise.

4. Data Exploration: When new questions arise, the relevant variables and their relationships must be identified from within your data before you can begin to calculate definitive answers. However, these elements are not always understood, nor are the best algorithms for analyzing the data. Exploration is often required in order to build a model that will answer the questions being asked. Traditional relational architectures with pre-defined schemas are not likely to provide a platform for discovery. In these cases, identifying key variables and useful analytic methods is a trial/error process. Hadoop provides a flexible, schemaless environment that reduces the friction associated with the iterative process of exploring and analyzing data when the model is unclear. Hadoop provides a sandbox for exploring data without having to increase commercial capacity or spend the time building new schemas.

5. Data Science & Personalization: Data science leverages tools and techniques from many different areas of study, to include statistics, machine learning, mathematics, probability/uncertainty modeling, etc., to surface meaning from data, and generate data-driven products. This is essentially the art of making data actionable, either by a user or a machine. Data science is not exclusive to Big Data, but there is tremendous knowledge potential in large data sets. One use of data science is for personalization, the act of exhaustively analyzing large quantities of related data, such as the online behaviors of millions of Internet users to in order to calculate recommendations for a specific individual. The results are then presented in the form of “you might also like” books, movies, and other targeted advertisements. These techniques are also being applied to healthcare where symptoms, genetics, treatments, and outcomes are being analyzed to optimize treatment for specific individuals to optimize treatments. Hadoop is a perfect platform for data collecting, synthesizing, munging, cleaning and joining disparate data sets for analysis to achieve decision-relevant insight.

6. Data Anonymization: Certain industries, perhaps healthcare more than any other, require anonymized data for research. Rules governing the release of such data to the public generally require the information contain no personally identifiable information (PII). In the case of healthcare specifically, the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule, released in 2003, governs the use and disclosure of protected health information (PHI). The Privacy Rule does not give a specific algorithm for achieving the level of de-identification required, and there are many ways to approach anonymization in general, which depend greatly on how the data will be used. If a portion of your business model relies on providing anonymized data to internal or external groups for analysis, you want that process to be as clear, efficient, and repeatable as possible. Hadoop provides a solution for codifying and institutionalizing these algorithms for your enterprise. This increases the speed and effectiveness of all groups depending on anonymized data, providing them with an approved, documented process, and an authoritative source from which to receive data.

This list is not meant to be exhaustive, and there are definitely other use cases. Each use case is applicable to a wide variety of domains, to include finance, cyber, healthcare, defense, and scientific research.

What is the Business Value of Big Data?
Our new definition of Big Data (when the cost of using all your data for your use case exceeds what you are able to spend using purpose-built technologies) lends itself to a cost/benefit analysis. Figure 2 establishes a rubric through which to express the decision calculus of Big Data for process offloading. This framework illustrates the components of cost, discussed below, that every CIO and CTO should take into account when evaluating solutions for their use cases.

Projects should always start with gathering and analyzing requirements. In an analytic context, these are the questions you want to ask of your data. Or more generally, how you intend to use the data you would like to store in Hadoop. These requirements have obvious implications for leveraging the relevant data assets.

The Data / Characteristics (AS-IS) corner of the triangle refers to all data related to your requirements, and all the attributes discussed earlier, to include the amount, how quickly it grows/changes, differences in type/structure, where it resides on the network, etc.

Once the associated data has been identified, a solution is identified and designed. In the case of data analysis, the solution often involves models and techniques to change and analyze the data to find answers. Overall, this step includes any processes, human or machine, that are necessary to get the results you are looking for.

The Purpose / Answers (TO-BE) corner is your end-state vision, which is sometimes expressed in terms of success criteria and/or key performance indicators. In the case of data science, this corner represents the answers you want from your data, in addition to how users should expect to access those answers, and how frequently the answers need to be updated (real-time, hourly, daily, monthly, etc).

Lastly, there are often numerous ways for this solution to be physically implemented. Each possible implementation requires specific people, intellect (expertise, experience), technology (licenses, support), time, and physical capital (power, space, cooling) to assemble and extend (write algorithms, or build solutions on top of) the desired end-state. There are many factors here to consider. For example, certain software licenses will charge by the number of users, which may limit your derived business value (in terms of productivity) if that cost prevents your entire team from leveraging the software. As well, the more data you have, the more physical or virtual compute resources you may need.

Together, these elements influence the total cost of the solution. Ultimately, cost is the tipping point that can cause you to change the scope of your requirements and timeline, the data you use, the models/techniques you employ, the answers you are able to achieve and the algorithms/technologies you implement. Often, it is necessary to find an affordable balance to achieve the organization’s goals and objectives. However, these trade-offs may cause you to compromise certain business objectives, and reduce the business value derived from the solution.

The business value of Hadoop is the result of overcoming the functional limitations established by the cost of scaling purpose-built technologies, and having to make fewer compromises to achieve your data-driven business objectives. This relationship between cost and business value is illustrated in Figure 2. By managing (containing or reducing) cost, it becomes possible to maintain or broaden your scope and implement the solution that is right for you. Hadoop may allow you to get more from your data, with a significantly lower cost investment, resulting in tangible economic value. If Hadoop is able to satisfy your use case, then it is likely you will benefit from cost containment (and possibly savings) by preventing or reducing the expansion of more expensive purpose-built technologies.

Conclusion
It is important to choose the right technology for your particular use case. Hadoop continues to mature as a widely supported open source solution nearing its ten year anniversary. It is also supported by several commercial vendors offering on-site support. Depending on your particular use case(s), Hadoop may or may not be the best solution. Some Big Data is consistent, known, structured, and aligns well to the use cases best served by purpose-built technologies. However, when you do not have that, or cost constraints limit your business value, it is time to consider using a general purpose solution like the Hadoop ecosystem for process offloading. The formula presented in this paper provides a lens for CIOs/CTOs to examine potential solutions, business objectives, and cost constraints. Hadoop’s low cost and broad applicability are definitely worth exploring. I recommend you conduct your own cost/benefit analysis to determine if Hadoop is right for you and your use case(s). You may find that relative to commercial products, Hadoop will allow you to achieve greater business value and substantial cost savings.


[1] O’Reilly Media, Inc., “What is Big Data?”. Big Data Now: 2012 Edition. Sebastopol, CA. October 2012. Found Online at: http://www.oreilly.com/data/free/big-data-now-2012.csp

[2] Normandeau, Kevin. “Beyond Volume, Variety and Velocity is the Issue of Big Data Veracity”. Inside BigData. September 2013. http://inside-bigdata.com/2013/09/12/beyond-volume-variety-velocity-issu...

[3] Harris, Derrick. “Teradata pluges 17% on Q3 warning: Is it economics or Hadoop?” Gigaom. October 2013. https://gigaom.com/2013/10/15/teradata-plunges-17-on-q3-warning-is-it-ec...

[4] Barth, Paul; Bean, Randy. “Get the Maximum Value Out of Your Big Data Initiative”. Harvard Business Review Blog Network. February 2013. http://blogs.hbr.org/2013/02/get-the-maximum-value-out-of-y/

[5] Ma, Ming. “Hadoop @ eBay Marketplaces”. Slideshare. June 2013. http://www.slideshare.net/Hadoop_Summit/ma-june27-140pmroom212v2

[6] Murthy, Arun. “Apache Hadoop YARN – Concepts and Applications”. Hortonworks. August 2012. http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/


About the author:

Jeremy Glesner is the Chief Technology Officer of Berico Technologies. Jeremy’s background is in information science and software engineering. Find him on Twitter at @jglesner (https://twitter.com/jglesner) and on Linkedin (http://www.linkedin.com/in/jeremyglesner/).

 

Read the original blog entry...

More Stories By Bob Gourley

Bob Gourley writes on enterprise IT. He is a founder and partner at Cognitio Corp and publsher of CTOvision.com

@BigDataExpo Stories
The increasing popularity of the Internet of Things necessitates that our physical and cognitive relationship with wearable technology will change rapidly in the near future. This advent means logging has become a thing of the past. Before, it was on us to track our own data, but now that data is automatically available. What does this mean for mHealth and the "connected" body? In her session at @ThingsExpo, Lisa Calkins, CEO and co-founder of Amadeus Consulting, will discuss the impact of wea...
Many private cloud projects were built to deliver self-service access to development and test resources. While those clouds delivered faster access to resources, they lacked visibility, control and security needed for production deployments. In their session at 18th Cloud Expo, Steve Anderson, Product Manager at BMC Software, and Rick Lefort, Principal Technical Marketing Consultant at BMC Software, will discuss how a cloud designed for production operations not only helps accelerate developer...
SYS-CON Events announced today that Enzu, a leading provider of cloud hosting solutions, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Enzu’s mission is to be the leading provider of enterprise cloud solutions worldwide. Enzu enables online businesses to use its IT infrastructure to their competitive advantage. By offering a suite of proven hosting and management services, Enzu wants companies to foc...
SYS-CON Events announced today that Ericsson has been named “Gold Sponsor” of SYS-CON's @ThingsExpo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. Ericsson is a world leader in the rapidly changing environment of communications technology – providing equipment, software and services to enable transformation through mobility. Some 40 percent of global mobile traffic runs through networks we have supplied. More than 1 billion subscribers around the world re...
SYS-CON Events announced today that Peak 10, Inc., a national IT infrastructure and cloud services provider, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Peak 10 provides reliable, tailored data center and network services, cloud and managed services. Its solutions are designed to scale and adapt to customers’ changing business needs, enabling them to lower costs, improve performance and focus inter...
In his session at @ThingsExpo, Chris Klein, CEO and Co-founder of Rachio, will discuss next generation communities that are using IoT to create more sustainable, intelligent communities. One example is Sterling Ranch, a 10,000 home development that – with the help of Siemens – will integrate IoT technology into the community to provide residents with energy and water savings as well as intelligent security. Everything from stop lights to sprinkler systems to building infrastructures will run ef...
The IoTs will challenge the status quo of how IT and development organizations operate. Or will it? Certainly the fog layer of IoT requires special insights about data ontology, security and transactional integrity. But the developmental challenges are the same: People, Process and Platform. In his session at @ThingsExpo, Craig Sproule, CEO of Metavine, will demonstrate how to move beyond today's coding paradigm and share the must-have mindsets for removing complexity from the development proc...
A critical component of any IoT project is the back-end systems that capture data from remote IoT devices and structure it in a way to answer useful questions. Traditional data warehouse and analytical systems are mature technologies that can be used to handle large data sets, but they are not well suited to many IoT-scale products and the need for real-time insights. At Fuze, we have developed a backend platform as part of our mobility-oriented cloud service that uses Big Data-based approache...
Artificial Intelligence has the potential to massively disrupt IoT. In his session at 18th Cloud Expo, AJ Abdallat, CEO of Beyond AI, will discuss what the five main drivers are in Artificial Intelligence that could shape the future of the Internet of Things. AJ Abdallat is CEO of Beyond AI. He has over 20 years of management experience in the fields of artificial intelligence, sensors, instruments, devices and software for telecommunications, life sciences, environmental monitoring, process...
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
There is an ever-growing explosion of new devices that are connected to the Internet using “cloud” solutions. This rapid growth is creating a massive new demand for efficient access to data. And it’s not just about connecting to that data anymore. This new demand is bringing new issues and challenges and it is important for companies to scale for the coming growth. And with that scaling comes the need for greater security, gathering and data analysis, storage, connectivity and, of course, the...
We’ve worked with dozens of early adopters across numerous industries and will debunk common misperceptions, which starts with understanding that many of the connected products we’ll use over the next 5 years are already products, they’re just not yet connected. With an IoT product, time-in-market provides much more essential feedback than ever before. Innovation comes from what you do with the data that the connected product provides in order to enhance the customer experience and optimize busi...
Increasing IoT connectivity is forcing enterprises to find elegant solutions to organize and visualize all incoming data from these connected devices with re-configurable dashboard widgets to effectively allow rapid decision-making for everything from immediate actions in tactical situations to strategic analysis and reporting. In his session at 18th Cloud Expo, Shikhir Singh, Senior Developer Relations Manager at Sencha, will discuss how to create HTML5 dashboards that interact with IoT devic...
So, you bought into the current machine learning craze and went on to collect millions/billions of records from this promising new data source. Now, what do you do with them? Too often, the abundance of data quickly turns into an abundance of problems. How do you extract that "magic essence" from your data without falling into the common pitfalls? In her session at @ThingsExpo, Natalia Ponomareva, Software Engineer at Google, will provide tips on how to be successful in large scale machine lear...
The IETF draft standard for M2M certificates is a security solution specifically designed for the demanding needs of IoT/M2M applications. In his session at @ThingsExpo, Brian Romansky, VP of Strategic Technology at TrustPoint Innovation, will explain how M2M certificates can efficiently enable confidentiality, integrity, and authenticity on highly constrained devices.
You think you know what’s in your data. But do you? Most organizations are now aware of the business intelligence represented by their data. Data science stands to take this to a level you never thought of – literally. The techniques of data science, when used with the capabilities of Big Data technologies, can make connections you had not yet imagined, helping you discover new insights and ask new questions of your data. In his session at @ThingsExpo, Sarbjit Sarkaria, data science team lead ...
In his session at 18th Cloud Expo, Sagi Brody, Chief Technology Officer at Webair Internet Development Inc., will focus on real world deployments of DDoS mitigation strategies in every layer of the network. He will give an overview of methods to prevent these attacks and best practices on how to provide protection in complex cloud platforms. He will also outline what we have found in our experience managing and running thousands of Linux and Unix managed service platforms and what specifically c...
Manufacturers are embracing the Industrial Internet the same way consumers are leveraging Fitbits – to improve overall health and wellness. Both can provide consistent measurement, visibility, and suggest performance improvements customized to help reach goals. Fitbit users can view real-time data and make adjustments to increase their activity. In his session at @ThingsExpo, Mark Bernardo Professional Services Leader, Americas, at GE Digital, will discuss how leveraging the Industrial Interne...
Up until last year, enterprises that were looking into cloud services usually undertook a long-term pilot with one of the large cloud providers, running test and dev workloads in the cloud. With cloud’s transition to mainstream adoption in 2015, and with enterprises migrating more and more workloads into the cloud and in between public and private environments, the single-provider approach must be revisited. In his session at 18th Cloud Expo, Yoav Mor, multi-cloud solution evangelist at Cloudy...
You deployed your app with the Bluemix PaaS and it's gaining some serious traction, so it's time to make some tweaks. Did you design your application in a way that it can scale in the cloud? Were you even thinking about the cloud when you built the app? If not, chances are your app is going to break. Check out this webcast to learn various techniques for designing applications that will scale successfully in Bluemix, for the confidence you need to take your apps to the next level and beyond.