Welcome!

@BigDataExpo Authors: Elizabeth White, Carmen Gonzalez, Liz McMillan, Peter Silva, William Schmarzo

Related Topics: @BigDataExpo, Java IoT, Open Source Cloud, Containers Expo Blog, Server Monitoring, @CloudExpo

@BigDataExpo: Blog Feed Post

Big Data – Storage Mediums and Data Structures

Big Data presents something of a storage dilemma. There is no one data store to rule them all.

My working title was Big Data, Storage Dilemma.

They say dilemma. I say dilemna. I'm serious. I spell it dilemna.

Big Data presents something of a storage dilemma. There is no one data store to rule them all.

Should different data structures be persisted to different storage mediums?

Storage Medium
Identifying the appropriate medium is a function of performance, cost, and capacity.

Random Access Memory
It's fast. Very. It's expense. Very.

If we are configuring an HP DL980 G7 server, it will cost $3,672 for 128GB of memory with 8x 16GB modules or $9,996 for 128GB of memory with 4x 32GB modules. That is $28-78 / GB.

We can configure an HP DL980 G7 server with up to 4TB of memory.

Solid State Drive
It's not as fast or as expensive as random access memory. It's faster than a hard disk drive. A lot faster. It's more expensive than a hard disk drive. A lot more expensive.

If we are configuring an HP DL980 G7 server, it will cost $4,199 for a 400GB SAS MLC drive or $7,419 for a 400GB SAS SLC drive. That is $10-18 / GB. It's a performance / size trade-off. While SLC drives often perform better, MLC drives are often available in larger sizes. Further, the read / write performance is either symmetric or asymmetric.

That, and SLC drives often have greater endurance.


Sequential
Read
(MB/s)
Sequential
Write
(MB/s)
Random
Read
(IOPS)
Random
Write
(IOPS)
Intel 520 (480GB) 550 520 50,000 50,000
Intel 520 (240GB) 550 520 50,000 80,000
Intel DC S3700 500 460 75,000 36,000

The performance of solid state drives can vary:

  • consumer / enterprise
  • MLC / eMLC / SLC
  • SATA / SAS / PCIe

The capacity of enterprise solid state drives is often less than that of enterprise hard disk drives (4TB). However, the capacity of PCIe drives such as the Fusion-io ioDrive Octal (5.12TB / 10.24TB) is greater than that of enterprise hard disk drives (4TB). In addition, the performance of PCIe drives is greater than that of SAS or SATA drives.


Sequential
Read
(MB/s)
Sequential
Write
(MB/s)
Random
Read
(IOPS)
Random
Write
(IOPS)
Intel 910 (800GB) 2,000 1,000 180,000 75,000

Hard Disk Drive

It may not be fast, but it is inexpensive.

If we are configuring an HP DL980 G7 server, it will cost $309 for a 300GB 10K drive or $649 for a 300GB 15K drive. That is $1-2 / GB. It's a performance / size trade-off. While a 15K drive will perform better than a 10K drive, it will often have less capacity. The sequential read / write performance of hard disk drives is often between 150-200MB/s. As such, a RAID configuration may be a cost effective alternative to a single solid state drive for sequential read / write access.

The capacity of enterprise hard disk drives (4TB) is often greater than that of enterprise solid state drives.

Data Structure
Hash Table

JBoss Data Grid is an in-memory data grid with the data stored in a hash table. However, JBoss Data Grid supports persistence via write-behind or write-through. For all intents and purposes (e.g. map / reduce), all of the data should fit in memory.

Access

  • Random Reads, Random Writes

Riak is a key / value store. If persistence has been configured with Bitcast, the data is stored in a log structured hash table. An in-memory hash table contains the key / value pointers. The value is a pointer to the data. As such, all of the key / value pointers must fit in memory. The data is persisted via append only log files.

Access

  • Complex Index in Memory
  • Random Reads, Sequential Writes

Point queries perform well in hash tables. Range queries do not. Though there is always map / reduce.

B-Tree
MongoDB is a document database with the data stored in a B-Tree. The data is persisted via memory mapped files.

Access

  • Partial Index (Internal Nodes) in Memory
  • Random Reads, Sequential Reads, Random Writes

CouchDB is a document database with the data stored in a B+Tree. The data is persisted via append only log files.

Access

  • Partial Index in Memory
  • Random Reads, Sequential Reads, Sequential Writes

In a B+ Tree, the data is only stored in the leaf nodes. In a B-Tree, the data is stored in both the internal nodes and the leaf nodes. The advantage of a B+ Tree is that the leaf nodes are linked. As a result, range queries perform better with a B+ Tree. However, point queries perform better with a B-Tree.

CouchDB has implemented an append only B+ Tree. An alternative is the copy-on-write (CoW) B+ Tree. A write optimization for a B-Tree is buffering. First, the data written to an internal buffer in an internal node. Second, the buffer is flushed to a leaf node. As a result, random writes are turned in to sequential writes. The cost of random writes is thus amortized.  However, I am not aware of any open source data stores that have implemented a CoW B+ Tree or have implemented buffering with a B-Tree.

Range queries perform well with a B/B+ Tree. Point queries, not as well as hash tables.

Log Structured Merge Tree
Apache HBase and Apache Cassandra are both column oriented data stores, and they have both store data in an LSM-Tree. Apache HBase has implemented a cache oblivious lookahead array (COLA). Apache Cassandra has implemented a sorted array merge tree (SAMT).

In both implementations, a write is first written to a write-ahead log. Next, it is written to a memtable. A memtable is an in-memory sorted string table (SSTable). Later, the memtable is flushed and persisted as an SSTable. As result, random writes are turned in to sequential writes. However, a point query with an LSM-Tree may require multiple random reads.

Apache HBase implemented a single-level index with HFile version 1. However, because every index was cached, it resulted in high memory usage.

Apache HBase implemented a multi-level index a la a B+ Tree with HFile (SSTable) version 2. The SSTable contains a root index, a root index with leaf indexes, or a root index with an intermediate index and leaf indexes. The root index is stored in memory. The intermediate and leaf indexes may be stored in the block cache. In addition, the SSTable contains a compound (block level) bloom filter. The bloom filter is used to determine if the data is not in the data block.

Recommendation / Access

  • Partial Index / Bloom Filter in Memory
  • Random Reads, Sequential Reads, Sequential Writes

A read optimization for an LSM-Tree is fractional cascading. The idea is that each level contains both data and pointers to data in the next level. However, I am not aware of any open source data stores that have implemented fractional cascading.

I like the idea of a hybrid / tiered storage solution for an LSM-Tree implementation with the first level in memory, second level on solid state drives, and third level on hard disk drives. I've seen this solution described in academia, but I am not aware of any open source implementations.

Conclusion
I find it interesting that key / value stores often store data in a hash table, that document stores often store data in a B+/B-Tree, and that column oriented stores often store data in an LSM-Tree. Perhaps it is because key / value stores focus on the performance of point queries, document stores support secondary indexes (MongoDB) / views  (CouchDB), and column oriented stores support range queries. Perhaps it because document stores were not created with distribution (shards / partitions) in mind and thus assume that the complete index can not be stored in memory.

Notes
I found this paper to be very helpful in examining data structures as it covers most of the ones highlighted in this post:

Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers (link)

Read the original blog entry...

More Stories By Daniel Thompson

I curate the content on this page, but the credit goes to my talented colleagues for the posts that you see here. Much of what you read on this page is the work of friends at How to JBoss, and I encourage you to drop by the site at http://www.howtojboss.com for some of the best JBoss technical and non-technical content for developers, architects and technology executives on the Web.

@BigDataExpo Stories
What happens when the different parts of a vehicle become smarter than the vehicle itself? As we move toward the era of smart everything, hundreds of entities in a vehicle that communicate with each other, the vehicle and external systems create a need for identity orchestration so that all entities work as a conglomerate. Much like an orchestra without a conductor, without the ability to secure, control, and connect the link between a vehicle’s head unit, devices, and systems and to manage the ...
Financial Technology has become a topic of intense interest throughout the cloud developer and enterprise IT communities. Accordingly, attendees at the upcoming 20th Cloud Expo at the Javits Center in New York, June 6-8, 2017, will find fresh new content in a new track called FinTech.
The Internet of Things will challenge the status quo of how IT and development organizations operate. Or will it? Certainly the fog layer of IoT requires special insights about data ontology, security and transactional integrity. But the developmental challenges are the same: People, Process and Platform and how we integrate our thinking to solve complicated problems. In his session at 19th Cloud Expo, Craig Sproule, CEO of Metavine, demonstrated how to move beyond today's coding paradigm and sh...
All clouds are not equal. To succeed in a DevOps context, organizations should plan to develop/deploy apps across a choice of on-premise and public clouds simultaneously depending on the business needs. This is where the concept of the Lean Cloud comes in - resting on the idea that you often need to relocate your app modules over their life cycles for both innovation and operational efficiency in the cloud. In his session at @DevOpsSummit at19th Cloud Expo, Valentin (Val) Bercovici, CTO of Soli...
"Once customers get a year into their IoT deployments, they start to realize that they may have been shortsighted in the ways they built out their deployment and the key thing I see a lot of people looking at is - how can I take equipment data, pull it back in an IoT solution and show it in a dashboard," stated Dave McCarthy, Director of Products at Bsquare Corporation, in this SYS-CON.tv interview at @ThingsExpo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Whether your IoT service is connecting cars, homes, appliances, wearable, cameras or other devices, one question hangs in the balance – how do you actually make money from this service? The ability to turn your IoT service into profit requires the ability to create a monetization strategy that is flexible, scalable and working for you in real-time. It must be a transparent, smoothly implemented strategy that all stakeholders – from customers to the board – will be able to understand and comprehe...
You have great SaaS business app ideas. You want to turn your idea quickly into a functional and engaging proof of concept. You need to be able to modify it to meet customers' needs, and you need to deliver a complete and secure SaaS application. How could you achieve all the above and yet avoid unforeseen IT requirements that add unnecessary cost and complexity? You also want your app to be responsive in any device at any time. In his session at 19th Cloud Expo, Mark Allen, General Manager of...
SYS-CON Events has announced today that Roger Strukhoff has been named conference chair of Cloud Expo and @ThingsExpo 2017 New York. The 20th Cloud Expo and 7th @ThingsExpo will take place on June 6-8, 2017, at the Javits Center in New York City, NY. "The Internet of Things brings trillions of dollars of opportunity to developers and enterprise IT, no matter how you measure it," stated Roger Strukhoff. "More importantly, it leverages the power of devices and the Internet to enable us all to im...
More and more brands have jumped on the IoT bandwagon. We have an excess of wearables – activity trackers, smartwatches, smart glasses and sneakers, and more that track seemingly endless datapoints. However, most consumers have no idea what “IoT” means. Creating more wearables that track data shouldn't be the aim of brands; delivering meaningful, tangible relevance to their users should be. We're in a period in which the IoT pendulum is still swinging. Initially, it swung toward "smart for smar...
Bert Loomis was a visionary. This general session will highlight how Bert Loomis and people like him inspire us to build great things with small inventions. In their general session at 19th Cloud Expo, Harold Hannon, Architect at IBM Bluemix, and Michael O'Neill, Strategic Business Development at Nvidia, discussed the accelerating pace of AI development and how IBM Cloud and NVIDIA are partnering to bring AI capabilities to "every day," on-demand. They also reviewed two "free infrastructure" pr...
As data explodes in quantity, importance and from new sources, the need for managing and protecting data residing across physical, virtual, and cloud environments grow with it. Managing data includes protecting it, indexing and classifying it for true, long-term management, compliance and E-Discovery. Commvault can ensure this with a single pane of glass solution – whether in a private cloud, a Service Provider delivered public cloud or a hybrid cloud environment – across the heterogeneous enter...
SYS-CON Events announced today that Fusion, a leading provider of cloud services, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Fusion, a leading provider of integrated cloud solutions to small, medium and large businesses, is the industry’s single source for the cloud. Fusion’s advanced, proprietary cloud service platform enables the integration of leading edge solutions in the cloud, including cloud...
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place June 6-8, 2017, at the Javits Center in New York City, New York, is co-located with 20th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry p...
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life sett...
SYS-CON Events announced today that Dataloop.IO, an innovator in cloud IT-monitoring whose products help organizations save time and money, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Dataloop.IO is an emerging software company on the cutting edge of major IT-infrastructure trends including cloud computing and microservices. The company, founded in the UK but now based in San Fran...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
DevOps is being widely accepted (if not fully adopted) as essential in enterprise IT. But as Enterprise DevOps gains maturity, expands scope, and increases velocity, the need for data-driven decisions across teams becomes more acute. DevOps teams in any modern business must wrangle the ‘digital exhaust’ from the delivery toolchain, "pervasive" and "cognitive" computing, APIs and services, mobile devices and applications, the Internet of Things, and now even blockchain. In this power panel at @...
"MathFreeOn.com is a line coding platform for engineers and scientists. When they want to solve an engineering problem and they have to use software - they have to pay a lot of money for licenses - but with MathFreeOn you don't have to pay a lot of money. Just go to our site and write the code and you can check the result right away," explained Simon Lee, CMO of MathFreeOn, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Cla...
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.