Click here to close now.


@BigDataExpo Authors: Carmen Gonzalez, Yeshim Deniz, Liz McMillan, Sanjeev Sharma, William Schmarzo

Related Topics: @BigDataExpo, Java IoT, Open Source Cloud, Containers Expo Blog, Server Monitoring, @CloudExpo

@BigDataExpo: Blog Feed Post

Big Data – Storage Mediums and Data Structures

Big Data presents something of a storage dilemma. There is no one data store to rule them all.

My working title was Big Data, Storage Dilemma.

They say dilemma. I say dilemna. I'm serious. I spell it dilemna.

Big Data presents something of a storage dilemma. There is no one data store to rule them all.

Should different data structures be persisted to different storage mediums?

Storage Medium
Identifying the appropriate medium is a function of performance, cost, and capacity.

Random Access Memory
It's fast. Very. It's expense. Very.

If we are configuring an HP DL980 G7 server, it will cost $3,672 for 128GB of memory with 8x 16GB modules or $9,996 for 128GB of memory with 4x 32GB modules. That is $28-78 / GB.

We can configure an HP DL980 G7 server with up to 4TB of memory.

Solid State Drive
It's not as fast or as expensive as random access memory. It's faster than a hard disk drive. A lot faster. It's more expensive than a hard disk drive. A lot more expensive.

If we are configuring an HP DL980 G7 server, it will cost $4,199 for a 400GB SAS MLC drive or $7,419 for a 400GB SAS SLC drive. That is $10-18 / GB. It's a performance / size trade-off. While SLC drives often perform better, MLC drives are often available in larger sizes. Further, the read / write performance is either symmetric or asymmetric.

That, and SLC drives often have greater endurance.

Intel 520 (480GB) 550 520 50,000 50,000
Intel 520 (240GB) 550 520 50,000 80,000
Intel DC S3700 500 460 75,000 36,000

The performance of solid state drives can vary:

  • consumer / enterprise
  • MLC / eMLC / SLC
  • SATA / SAS / PCIe

The capacity of enterprise solid state drives is often less than that of enterprise hard disk drives (4TB). However, the capacity of PCIe drives such as the Fusion-io ioDrive Octal (5.12TB / 10.24TB) is greater than that of enterprise hard disk drives (4TB). In addition, the performance of PCIe drives is greater than that of SAS or SATA drives.

Intel 910 (800GB) 2,000 1,000 180,000 75,000

Hard Disk Drive

It may not be fast, but it is inexpensive.

If we are configuring an HP DL980 G7 server, it will cost $309 for a 300GB 10K drive or $649 for a 300GB 15K drive. That is $1-2 / GB. It's a performance / size trade-off. While a 15K drive will perform better than a 10K drive, it will often have less capacity. The sequential read / write performance of hard disk drives is often between 150-200MB/s. As such, a RAID configuration may be a cost effective alternative to a single solid state drive for sequential read / write access.

The capacity of enterprise hard disk drives (4TB) is often greater than that of enterprise solid state drives.

Data Structure
Hash Table

JBoss Data Grid is an in-memory data grid with the data stored in a hash table. However, JBoss Data Grid supports persistence via write-behind or write-through. For all intents and purposes (e.g. map / reduce), all of the data should fit in memory.


  • Random Reads, Random Writes

Riak is a key / value store. If persistence has been configured with Bitcast, the data is stored in a log structured hash table. An in-memory hash table contains the key / value pointers. The value is a pointer to the data. As such, all of the key / value pointers must fit in memory. The data is persisted via append only log files.


  • Complex Index in Memory
  • Random Reads, Sequential Writes

Point queries perform well in hash tables. Range queries do not. Though there is always map / reduce.

MongoDB is a document database with the data stored in a B-Tree. The data is persisted via memory mapped files.


  • Partial Index (Internal Nodes) in Memory
  • Random Reads, Sequential Reads, Random Writes

CouchDB is a document database with the data stored in a B+Tree. The data is persisted via append only log files.


  • Partial Index in Memory
  • Random Reads, Sequential Reads, Sequential Writes

In a B+ Tree, the data is only stored in the leaf nodes. In a B-Tree, the data is stored in both the internal nodes and the leaf nodes. The advantage of a B+ Tree is that the leaf nodes are linked. As a result, range queries perform better with a B+ Tree. However, point queries perform better with a B-Tree.

CouchDB has implemented an append only B+ Tree. An alternative is the copy-on-write (CoW) B+ Tree. A write optimization for a B-Tree is buffering. First, the data written to an internal buffer in an internal node. Second, the buffer is flushed to a leaf node. As a result, random writes are turned in to sequential writes. The cost of random writes is thus amortized.  However, I am not aware of any open source data stores that have implemented a CoW B+ Tree or have implemented buffering with a B-Tree.

Range queries perform well with a B/B+ Tree. Point queries, not as well as hash tables.

Log Structured Merge Tree
Apache HBase and Apache Cassandra are both column oriented data stores, and they have both store data in an LSM-Tree. Apache HBase has implemented a cache oblivious lookahead array (COLA). Apache Cassandra has implemented a sorted array merge tree (SAMT).

In both implementations, a write is first written to a write-ahead log. Next, it is written to a memtable. A memtable is an in-memory sorted string table (SSTable). Later, the memtable is flushed and persisted as an SSTable. As result, random writes are turned in to sequential writes. However, a point query with an LSM-Tree may require multiple random reads.

Apache HBase implemented a single-level index with HFile version 1. However, because every index was cached, it resulted in high memory usage.

Apache HBase implemented a multi-level index a la a B+ Tree with HFile (SSTable) version 2. The SSTable contains a root index, a root index with leaf indexes, or a root index with an intermediate index and leaf indexes. The root index is stored in memory. The intermediate and leaf indexes may be stored in the block cache. In addition, the SSTable contains a compound (block level) bloom filter. The bloom filter is used to determine if the data is not in the data block.

Recommendation / Access

  • Partial Index / Bloom Filter in Memory
  • Random Reads, Sequential Reads, Sequential Writes

A read optimization for an LSM-Tree is fractional cascading. The idea is that each level contains both data and pointers to data in the next level. However, I am not aware of any open source data stores that have implemented fractional cascading.

I like the idea of a hybrid / tiered storage solution for an LSM-Tree implementation with the first level in memory, second level on solid state drives, and third level on hard disk drives. I've seen this solution described in academia, but I am not aware of any open source implementations.

I find it interesting that key / value stores often store data in a hash table, that document stores often store data in a B+/B-Tree, and that column oriented stores often store data in an LSM-Tree. Perhaps it is because key / value stores focus on the performance of point queries, document stores support secondary indexes (MongoDB) / views  (CouchDB), and column oriented stores support range queries. Perhaps it because document stores were not created with distribution (shards / partitions) in mind and thus assume that the complete index can not be stored in memory.

I found this paper to be very helpful in examining data structures as it covers most of the ones highlighted in this post:

Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers (link)

Read the original blog entry...

More Stories By Daniel Thompson

I curate the content on this page, but the credit goes to my talented colleagues for the posts that you see here. Much of what you read on this page is the work of friends at How to JBoss, and I encourage you to drop by the site at for some of the best JBoss technical and non-technical content for developers, architects and technology executives on the Web.

@BigDataExpo Stories
SYS-CON Events announced today the Containers & Microservices Bootcamp, being held November 3-4, 2015, in conjunction with 17th Cloud Expo, @ThingsExpo, and @DevOpsSummit at the Santa Clara Convention Center in Santa Clara, CA. This is your chance to get started with the latest technology in the industry. Combined with real-world scenarios and use cases, the Containers and Microservices Bootcamp, led by Janakiram MSV, a Microsoft Regional Director, will include presentations as well as hands-on...
As more intelligent IoT applications shift into gear, they’re merging into the ever-increasing traffic flow of the Internet. It won’t be long before we experience bottlenecks, as IoT traffic peaks during rush hours. Organizations that are unprepared will find themselves by the side of the road unable to cross back into the fast lane. As billions of new devices begin to communicate and exchange data – will your infrastructure be scalable enough to handle this new interconnected world?
SYS-CON Events announced today that Super Micro Computer, Inc., a global leader in high-performance, high-efficiency server, storage technology and green computing, will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Supermicro (NASDAQ: SMCI), the leading innovator in high-performance, high-efficiency server technology is a premier provider of advanced server Building Block Solutions® for Data ...
Nowadays, a large number of sensors and devices are connected to the network. Leading-edge IoT technologies integrate various types of sensor data to create a new value for several business decision scenarios. The transparent cloud is a model of a new IoT emergence service platform. Many service providers store and access various types of sensor data in order to create and find out new business values by integrating such data.
Too often with compelling new technologies market participants become overly enamored with that attractiveness of the technology and neglect underlying business drivers. This tendency, what some call the “newest shiny object syndrome,” is understandable given that virtually all of us are heavily engaged in technology. But it is also mistaken. Without concrete business cases driving its deployment, IoT, like many other technologies before it, will fade into obscurity.
There are so many tools and techniques for data analytics that even for a data scientist the choices, possible systems, and even the types of data can be daunting. In his session at @ThingsExpo, Chris Harrold, Global CTO for Big Data Solutions for EMC Corporation, will show how to perform a simple, but meaningful analysis of social sentiment data using freely available tools that take only minutes to download and install. Participants will get the download information, scripts, and complete en...
Achim Weiss is Chief Executive Officer and co-founder of ProfitBricks. In 1995, he broke off his studies to co-found the web hosting company "Schlund+Partner." The company "Schlund+Partner" later became the 1&1 web hosting product line. From 1995 to 2008, he was the technical director for several important projects: the largest web hosting platform in the world, the second largest DSL platform, a video on-demand delivery network, the largest eMail backend in Europe, and a universal billing syste...
Cloud computing delivers on-demand resources that provide businesses with flexibility and cost-savings. The challenge in moving workloads to the cloud has been the cost and complexity of ensuring the initial and ongoing security and regulatory (PCI, HIPAA, FFIEC) compliance across private and public clouds. Manual security compliance is slow, prone to human error, and represents over 50% of the cost of managing cloud applications. Determining how to automate cloud security compliance is critical...
Electric power utilities face relentless pressure on their financial performance, and reducing distribution grid losses is one of the last untapped opportunities to meet their business goals. Combining IoT-enabled sensors and cloud-based data analytics, utilities now are able to find, quantify and reduce losses faster – and with a smaller IT footprint. Solutions exist using Internet-enabled sensors deployed temporarily at strategic locations within the distribution grid to measure actual line lo...
The Internet of Everything is re-shaping technology trends–moving away from “request/response” architecture to an “always-on” Streaming Web where data is in constant motion and secure, reliable communication is an absolute necessity. As more and more THINGS go online, the challenges that developers will need to address will only increase exponentially. In his session at @ThingsExpo, Todd Greene, Founder & CEO of PubNub, will explore the current state of IoT connectivity and review key trends an...
There will be 20 billion IoT devices connected to the Internet soon. What if we could control these devices with our voice, mind, or gestures? What if we could teach these devices how to talk to each other? What if these devices could learn how to interact with us (and each other) to make our lives better? What if Jarvis was real? How can I gain these super powers? In his session at 17th Cloud Expo, Chris Matthieu, co-founder and CTO of Octoblu, will show you!
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo in Silicon Valley. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 17th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading in...
SYS-CON Events announced today that Sandy Carter, IBM General Manager Cloud Ecosystem and Developers, and a Social Business Evangelist, will keynote at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA.
The Internet of Things (IoT) is growing rapidly by extending current technologies, products and networks. By 2020, Cisco estimates there will be 50 billion connected devices. Gartner has forecast revenues of over $300 billion, just to IoT suppliers. Now is the time to figure out how you’ll make money – not just create innovative products. With hundreds of new products and companies jumping into the IoT fray every month, there’s no shortage of innovation. Despite this, McKinsey/VisionMobile data...
The IoT market is on track to hit $7.1 trillion in 2020. The reality is that only a handful of companies are ready for this massive demand. There are a lot of barriers, paint points, traps, and hidden roadblocks. How can we deal with these issues and challenges? The paradigm has changed. Old-style ad-hoc trial-and-error ways will certainly lead you to the dead end. What is mandatory is an overarching and adaptive approach to effectively handle the rapid changes and exponential growth.
Today air travel is a minefield of delays, hassles and customer disappointment. Airlines struggle to revitalize the experience. GE and M2Mi will demonstrate practical examples of how IoT solutions are helping airlines bring back personalization, reduce trip time and improve reliability. In their session at @ThingsExpo, Shyam Varan Nath, Principal Architect with GE, and Dr. Sarah Cooper, M2Mi's VP Business Development and Engineering, will explore the IoT cloud-based platform technologies driv...
The IoT is upon us, but today’s databases, built on 30-year-old math, require multiple platforms to create a single solution. Data demands of the IoT require Big Data systems that can handle ingest, transactions and analytics concurrently adapting to varied situations as they occur, with speed at scale. In his session at @ThingsExpo, Chad Jones, chief strategy officer at Deep Information Sciences, will look differently at IoT data so enterprises can fully leverage their IoT potential. He’ll sha...
Redis is not only the fastest database, but it has become the most popular among the new wave of applications running in containers. Redis speeds up just about every data interaction between your users or operational systems. In his session at 17th Cloud Expo, Dave Nielsen, Developer Relations at Redis Labs, will share the functions and data structures used to solve everyday use cases that are driving Redis' popularity
The enterprise is being consumerized, and the consumer is being enterprised. Moore's Law does not matter anymore, the future belongs to business virtualization powered by invisible service architecture, powered by hyperscale and hyperconvergence, and facilitated by vertical streaming and horizontal scaling and consolidation. Both buyers and sellers want instant results, and from paperwork to paperless to mindless is the ultimate goal for any seamless transaction. The sweetest sweet spot in innov...
SYS-CON Events announced today that DataClear Inc. will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. The DataClear ‘BlackBox’ is the only solution that moves your PC, browsing and data out of the United States and away from prying (and spying) eyes. Its solution automatically builds you a clean, on-demand, virus free, new virtual cloud based PC outside of the United States, and wipes it clean...

Tweets by @BigDataExpo

@BigDataExpo Blogs
The potential of big data is only limited by the creative thinking of your business stakeholders, and that may be the most important concept in the “thinking like a data scientist” process. The “thinking like a data scientist” process guides the business stakeholders into envisioning how big data can optimize their key business processes, create a more compelling customer engagement and uncover new monetization opportunities. But neither the business stakeholders, nor the data scientists, can likely do that envisioning entirely by themselves.
Today’s modern day industrial revolution is being shaped by ubiquitous connectivity, machine to machine (M2M) communications, the Internet of Things (IoT), open APIs leading to a surge in new applications and services, partnerships and eventual marketplaces. IoT has the potential to transform industry and society much like advances in steam technology, transportation, mass production and communications ushered in the industrial revolution in the 18th and 19th centuries.
Disaster recovery (DR) has traditionally been a major challenge for IT departments. Even with the advent of server virtualization and other technologies that have simplified DR implementation and some aspects of on-going management, it is still a complex and (often extremely) costly undertaking. For those applications that do not require high availability, but are still mission- and business-critical, the decision as to which [applications] to spend money on for true disaster recovery can be a struggle.
SCOPE is an acronym for Structured Computations Optimized for Parallel Execution, a declarative language for working with large-scale data. It is still under development at Microsoft. If you know SQL then working with SCOPE will be quite easy as SCOPE builds on SQL. The execution environment is different from that RDBMS oriented data. Data is still modeled as rows. Every row has typed columns and eveyr rowset has a well-defined schema. There is a SCOPe compiler that comes up with optimized execution plan and a runtime execution plan.
If you’re running Big Data applications, you’re going to want to look at some kind of distributed processing system. Hadoop is one of the best-known clustering systems, but how are you going to process all your data in a reasonable time frame? MapReduce has become a standard, perhaps the standard, for distributed file systems. While it’s a great system already, it’s really geared toward batch use, with jobs needing to queue for later output. This can severely hamper your flexibility. What if you want to explore some of your data? If it’s going to take all night, forget about it.
Too many multinational corporations delete little, if any, data even though at its creation, more than 70 percent of this data is useless for business, regulatory or legal reasons.[1] The problem is hoarding, and what businesses need is their own “Hoarders” reality show about people whose lives are driven by their stuff[2] (corporations are legally people, after all). The goal of such an intervention (and this article)? Turning hoarders into collectors.
Today’s connected world is moving from devices towards things, what this means is that by using increasingly low cost sensors embedded in devices we can create many new use cases. These span across use cases in cities, vehicles, home, offices, factories, retail environments, worksites, health, logistics, and health. These use cases rely on ubiquitous connectivity and generate massive amounts of data at scale. These technologies enable new business opportunities, ways to optimize and automate, along with new ways to engage with users.
“Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management.” While this definition is broadly accepted and has, in fact, been my adopted standard for years, it only describes technical aspects of cloud computing. The amalgamation of technologies used to deliver cloud services is not even half the story. Above all else, the successful employment requires a tight linkage to the econ...
DevOps Summit at Cloud Expo 2014 Silicon Valley was a terrific event for us. The Qubell booth was crowded on all three days. We ran demos every 30 minutes with folks lining up to get a seat and usually standing around. It was great to meet and talk to over 500 people! My keynote was well received and so was Stan's joint presentation with RingCentral on Devops for BigData. I also participated in two Power Panels – ‘Women in Technology’ and ‘Why DevOps Is Even More Important than You Think,’ both featuring brilliant colleagues and moderators and it was a blast to be a part of.
Today air travel is a minefield of delays, hassles and customer disappointment. Airlines struggle to revitalize the experience. GE and M2Mi will demonstrate practical examples of how IoT solutions are helping airlines bring back personalization, reduce trip time and improve reliability. In their session at @ThingsExpo, Shyam Varan Nath, Principal Architect with GE, and Dr. Sarah Cooper, M2Mi's VP Business Development and Engineering, will explore the IoT cloud-based platform technologies driving this change including privacy controls, data transparency and integration of real time context w...
Developing software for the Internet of Things (IoT) comes with its own set of challenges. Security, privacy, and unified standards are a few key issues. In addition, each IoT product is comprised of at least three separate application components: the software embedded in the device, the backend big-data service, and the mobile application for the end user's controls. Each component is developed by a different team, using different technologies and practices, and deployed to a different stack/target - this makes the integration of these separate pipelines and the coordination of software upd...
All we need to do is have our teams self-organize, and behold! Emergent design and/or architecture springs up out of the nothingness! If only it were that easy, right? I follow in the footsteps of so many people who have long wondered at the meanings of such simple words, as though they were dogma from on high. Emerge? Self-organizing? Profound, to be sure. But what do we really make of this sentence?
It’s not hard to find technology trade press commentary on the subject of Big Data. Variously defined (in non-technical terms) as the cluttered old shoebox of all data – and again (in more technical terms) as that amount of data that does not comfortably fit into a standard relational database for storage, processing and analytics within the normal constraints of processing, memory and data transport technologies – we can say that Big Data is an oft mentioned and sometimes misunderstood subject.
Recently announced Azure Data Lake addresses the big data 3V challenges; volume, velocity and variety. It is one more storage feature in addition to blobs and SQL Azure database. Azure Data Lake (should have been Azure Data Ocean IMHO) is really omnipotent. Just look at the key capabilities of Azure Data Lake:
I was recently watching one of my favorite science fiction TV shows (I’ll confess, ‘Dr. Who’). In classic dystopian fashion, there was a scene in which a young boy is running for his life across some barren ground in a war-ravaged world. One of his compatriots calls out to him to freeze, not to move another inch. The compatriot warns the young boy that he’s in a field of hand mines (no, that is not a typo, he did say hand mines). Slowly, dull gray hands with eyes in the palm start emerging from the ground around the boy and the compatriot. Suddenly, one of the hands grabs the compatriot and pu...

About @BigDataExpo
Big Data focuses on how to use your own enterprise data – processed in the Cloud – most effectively to drive value for your business.