By Daniel Thompson
February 15, 2013 10:00 AM EST
My working title was Big Data, Storage Dilemma.
They say dilemma. I say dilemna. I'm serious. I spell it dilemna.
Big Data presents something of a storage dilemma. There is no one data store to rule them all.
Should different data structures be persisted to different storage mediums?
Identifying the appropriate medium is a function of performance, cost, and capacity.
Random Access Memory
It's fast. Very. It's expensive. Very.
If we are configuring an HP DL980 G7 server, it will cost $3,672 for 128GB of memory with 8x 16GB modules or $9,996 for 128GB of memory with 4x 32GB modules. That is $28-78 / GB.
We can configure an HP DL980 G7 server with up to 4TB of memory.
Solid State Drive
It's not as fast or as expensive as random access memory. It's faster than a hard disk drive. A lot faster. It's more expensive than a hard disk drive. A lot more expensive.
If we are configuring an HP DL980 G7 server, it will cost $4,199 for a 400GB SAS MLC drive or $7,419 for a 400GB SAS SLC drive. That is $10-18 / GB. It's a performance / size trade-off. While SLC drives often perform better, MLC drives are often available in larger sizes. Further, the read / write performance is either symmetric or asymmetric.
That, and SLC drives often have greater endurance.
|Drive||Sequential Read (MB/s)||Sequential Write (MB/s)||Random Read (IOPS)||Random Write (IOPS)|
|Intel 520 (480GB)||550||520||50,000||50,000|
|Intel 520 (240GB)||550||520||50,000||80,000|
|Intel DC S3700||500||460||75,000||36,000|
The performance of solid state drives can vary:
- consumer / enterprise
- MLC / eMLC / SLC
- SATA / SAS / PCIe
The capacity of enterprise solid state drives is often less than that of enterprise hard disk drives (4TB). However, the capacity of PCIe drives such as the Fusion-io ioDrive Octal (5.12TB / 10.24TB) is greater than that of enterprise hard disk drives (4TB). In addition, the performance of PCIe drives is greater than that of SAS or SATA drives.
|Drive||Sequential Read (MB/s)||Sequential Write (MB/s)||Random Read (IOPS)||Random Write (IOPS)|
|Intel 910 (800GB)||2,000||1,000||180,000||75,000|
Hard Disk Drive
It may not be fast, but it is inexpensive.
If we are configuring an HP DL980 G7 server, it will cost $309 for a 300GB 10K drive or $649 for a 300GB 15K drive. That is $1-2 / GB. It's a performance / size trade-off. While a 15K drive will perform better than a 10K drive, it will often have less capacity. The sequential read / write performance of hard disk drives is often between 150 and 200MB/s. As such, a RAID configuration may be a cost-effective alternative to a single solid state drive for sequential read / write access.
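The RAID comparison can be roughed out with a ceiling division. This is only a sizing sketch: it assumes sequential throughput scales linearly across striped drives (controller and bus limits are ignored), the function name is made up for illustration, and the 550MB/s target is an assumed figure for a single SATA solid state drive.

```python
import math


def drives_to_match(target_mb_s: float, per_drive_mb_s: float) -> int:
    """Striped drives needed to reach a target sequential throughput (ideal scaling)."""
    return math.ceil(target_mb_s / per_drive_mb_s)


# 150-200MB/s per hard disk drive, as quoted above; 550MB/s is an assumed SSD target.
low = drives_to_match(550, 200)   # each drive sustains 200MB/s
high = drives_to_match(550, 150)  # each drive sustains 150MB/s
print(f"{low}-{high} striped hard disk drives")
```

At the $309 quoted for a 300GB 10K drive, four drives would come to $1,236, against $4,199 for the 400GB SAS MLC drive. The comparison only holds for sequential access; random access is a different story.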
The capacity of enterprise hard disk drives (4TB) is often greater than that of enterprise solid state drives.
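To put the three media side by side, the per-gigabyte figures can be recomputed from the HP DL980 G7 prices quoted above. The dictionary layout and the `cost_per_gb` helper are just illustrative names; the prices and capacities are the ones from this post.

```python
# (price in USD, capacity in GB) for each option quoted in this post.
options = {
    "RAM, 8x 16GB": (3672, 128),
    "RAM, 4x 32GB": (9996, 128),
    "SSD, 400GB SAS MLC": (4199, 400),
    "SSD, 400GB SAS SLC": (7419, 400),
    "HDD, 300GB 10K": (309, 300),
    "HDD, 300GB 15K": (649, 300),
}


def cost_per_gb(price_usd: float, capacity_gb: float) -> float:
    """Cost per gigabyte in USD."""
    return price_usd / capacity_gb


for name, (price, capacity) in options.items():
    print(f"{name:<20} ${cost_per_gb(price, capacity):6.2f} / GB")
```

Each step down the list is roughly an order of magnitude cheaper per gigabyte, which is the cost side of the performance, cost, and capacity trade-off.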
Hash Table
JBoss Data Grid is an in-memory data grid with the data stored in a hash table. However, JBoss Data Grid supports persistence via write-behind or write-through. For all intents and purposes (e.g. map / reduce), all of the data should fit in memory.
- Random Reads, Random Writes
Riak is a key / value store. If persistence has been configured with Bitcask, the data is stored in a log structured hash table. An in-memory hash table contains the key / value pointers. The value is a pointer to the data. As such, all of the key / value pointers must fit in memory. The data is persisted via append-only log files.
- Complete Index in Memory
- Random Reads, Sequential Writes
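The log structured hash table idea can be sketched in a few lines. This is a toy illustration of the general technique, not Riak's or Bitcask's actual code or file format: the class name, the record layout (two length fields, then key, then value), and the `keydir` attribute are all assumptions made for the example.

```python
import os
import struct
import tempfile
from typing import Optional


class LogStructuredHashTable:
    """Toy store: an append-only data file plus an in-memory index of every key."""

    HEADER = struct.Struct(">II")  # key length, value length (hypothetical layout)

    def __init__(self, path: str) -> None:
        # key -> (offset of the value in the log, size of the value)
        self.keydir: dict = {}
        self.log = open(path, "a+b")

    def put(self, key: bytes, value: bytes) -> None:
        # Sequential write: every record is appended to the end of the log.
        self.log.seek(0, os.SEEK_END)
        offset = self.log.tell()
        self.log.write(self.HEADER.pack(len(key), len(value)) + key + value)
        self.log.flush()
        self.keydir[key] = (offset + self.HEADER.size + len(key), len(value))

    def get(self, key: bytes) -> Optional[bytes]:
        # Random read: one in-memory lookup, then one seek into the log.
        entry = self.keydir.get(key)
        if entry is None:
            return None
        offset, size = entry
        self.log.seek(offset)
        return self.log.read(size)

    def close(self) -> None:
        self.log.close()


with tempfile.TemporaryDirectory() as tmp:
    store = LogStructuredHashTable(os.path.join(tmp, "data.log"))
    store.put(b"user:1", b"alice")
    store.put(b"user:1", b"bob")  # an overwrite appends; the old record stays in the log
    print(store.get(b"user:1"))
    store.close()
```

The sketch leaves out everything a real store needs beyond the basic idea, such as rebuilding the index on startup and compacting stale records, but it shows why every key must fit in memory while the values do not.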
Point queries perform well in hash tables. Range queries do not. Though there is always map / reduce.
B-Tree / B+ Tree
MongoDB is a document database with the data stored in a B-Tree. The data is persisted via memory mapped files.
- Partial Index (Internal Nodes) in Memory
- Random Reads, Sequential Reads, Random Writes
CouchDB is a document database with the data stored in a B+Tree. The data is persisted via append only log files.
- Partial Index in Memory
- Random Reads, Sequential Reads, Sequential Writes
In a B+ Tree, the data is only stored in the leaf nodes. In a B-Tree, the data is stored in both the internal nodes and the leaf nodes. The advantage of a B+ Tree is that the leaf nodes are linked. As a result, range queries perform better with a B+ Tree. However, point queries perform better with a B-Tree.
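A small sketch shows why linked leaf nodes help range queries. It is not a full B+ Tree: a flat list of separator keys stands in for the internal nodes, there is no insertion or node splitting, and the names (`Leaf`, `build_leaves`, `range_query`) are invented for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Leaf:
    keys: List[int]
    values: List[str]
    next: Optional["Leaf"] = None  # link to the right sibling leaf


def build_leaves(items: List[Tuple[int, str]], fanout: int = 2):
    """Pack sorted (key, value) pairs into linked leaves; return (leaves, separators)."""
    leaves = []
    for i in range(0, len(items), fanout):
        chunk = items[i:i + fanout]
        leaves.append(Leaf([k for k, _ in chunk], [v for _, v in chunk]))
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    # First key of every leaf but the first; a stand-in for the internal nodes.
    separators = [leaf.keys[0] for leaf in leaves[1:]]
    return leaves, separators


def range_query(leaves, separators, lo: int, hi: int) -> Iterator[Tuple[int, str]]:
    """Yield pairs with lo <= key <= hi: one descent, then a walk along the leaf links."""
    if not leaves:
        return
    leaf = leaves[bisect_right(separators, lo)]
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > hi:
                return
            if k >= lo:
                yield k, v
        leaf = leaf.next


leaves, separators = build_leaves([(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")])
print(list(range_query(leaves, separators, 2, 5)))
```

After the first leaf is located, the scan never goes back up the tree. In a B-Tree, where data also lives in the internal nodes, a range scan has to move up and down between levels instead.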
CouchDB has implemented an append only B+ Tree. An alternative is the copy-on-write (CoW) B+ Tree. A write optimization for a B-Tree is buffering. First, the data is written to an internal buffer in an internal node. Second, the buffer is flushed to a leaf node. As a result, random writes are turned into sequential writes. The cost of random writes is thus amortized. However, I am not aware of any open source data stores that have implemented a CoW B+ Tree or have implemented buffering with a B-Tree.
Range queries perform well with a B/B+ Tree. Point queries, not as well as hash tables.
Log Structured Merge Tree
Apache HBase and Apache Cassandra are both column oriented data stores, and they both store data in an LSM-Tree. Apache HBase has implemented a cache oblivious lookahead array (COLA). Apache Cassandra has implemented a sorted array merge tree (SAMT).
In both implementations, a write is first written to a write-ahead log. Next, it is written to a memtable. A memtable is an in-memory sorted string table (SSTable). Later, the memtable is flushed and persisted as an SSTable. As a result, random writes are turned into sequential writes. However, a point query with an LSM-Tree may require multiple random reads.
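The write path described above can be sketched as a toy in-memory model. It is not how HBase or Cassandra are implemented: the class name and flush threshold are invented, Python lists stand in for the on-disk log and SSTables, the memtable is a plain dict that is sorted only at flush time, and compaction is left out entirely.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple


class TinyLSM:
    """Toy LSM-Tree: write-ahead log, memtable, and a list of immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.wal: List[Tuple[str, str]] = []             # stand-in for the on-disk log
        self.memtable: Dict[str, str] = {}
        self.sstables: List[List[Tuple[str, str]]] = []  # oldest first, each sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.wal.append((key, value))  # 1. sequential append to the write-ahead log
        self.memtable[key] = value     # 2. update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # 3. persist the memtable as one sorted, immutable run (a sequential write)
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.wal.clear()  # flushed entries no longer need to be replayed

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        # A point query may have to probe several SSTables, newest first.
        for table in reversed(self.sstables):
            i = bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")  # reaches the limit and triggers a flush
db.put("a", "3")  # newer value shadows the flushed one
print(db.get("a"), db.get("b"), len(db.sstables))
```

The loop in `get` is where the multiple random reads come from: each SSTable that has to be probed is, on disk, a separate seek.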
Apache HBase implemented a single-level index with HFile version 1. However, because every index was cached, it resulted in high memory usage.
Apache HBase implemented a multi-level index a la a B+ Tree with HFile (SSTable) version 2. The SSTable contains a root index, a root index with leaf indexes, or a root index with an intermediate index and leaf indexes. The root index is stored in memory. The intermediate and leaf indexes may be stored in the block cache. In addition, the SSTable contains a compound (block level) bloom filter. The bloom filter is used to determine if the data is not in the data block.
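A generic Bloom filter sketch shows how such a filter can rule out a block without reading it. This is not HBase's compound Bloom filter: the bit array size, the number of hashes, and the choice of seeded SHA-256 as the hash function are all assumptions picked to keep the example short.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Answers "definitely not present" or "possibly present" for a key."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes) -> Iterator[int]:
        # Derive several bit positions by hashing the key with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


block_filter = BloomFilter()
block_filter.add(b"row-1")
if block_filter.might_contain(b"row-1"):
    print("row-1 may be in this block; read it")
```

A negative answer is always correct, so the read can skip the block. A positive answer still requires reading the block to confirm.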
Recommendation / Access
- Partial Index / Bloom Filter in Memory
- Random Reads, Sequential Reads, Sequential Writes
A read optimization for an LSM-Tree is fractional cascading. The idea is that each level contains both data and pointers to data in the next level. However, I am not aware of any open source data stores that have implemented fractional cascading.
I like the idea of a hybrid / tiered storage solution for an LSM-Tree implementation with the first level in memory, second level on solid state drives, and third level on hard disk drives. I've seen this solution described in academia, but I am not aware of any open source implementations.
I find it interesting that key / value stores often store data in a hash table, that document stores often store data in a B+/B-Tree, and that column oriented stores often store data in an LSM-Tree. Perhaps it is because key / value stores focus on the performance of point queries, document stores support secondary indexes (MongoDB) / views (CouchDB), and column oriented stores support range queries. Perhaps it is because document stores were not created with distribution (shards / partitions) in mind and thus assume that the complete index cannot be stored in memory.
I found this paper to be very helpful in examining data structures as it covers most of the ones highlighted in this post:
Efficient, Scalable, and Versatile Application and System Transaction Management for Direct Storage Layers