Welcome!

@DXWorldExpo Authors: Zakia Bouachraoui, Yeshim Deniz, Liz McMillan, Elizabeth White, Pat Romanski

Related Topics: @DXWorldExpo, Java IoT, Microservices Expo, Containers Expo Blog, @CloudExpo, SDN Journal

@DXWorldExpo: Article

Columnar vs. Key-Value Storage Models

Pay attention to specific configuration and tuning around three points

What are the performance differences between in-memory columnar databases like SAP HANA and GridGain's In-Memory Database (IMDB) utilizing distributed key-value storage? This questions comes up regularly in conversations with our customers and the answer is not very obvious.

Storage Models
First off, let's clearly state that we are talking about storage model only and its implications on performance for various use cases. It's important to note that:

  • Storage model doesn't dictate of preclude a particular transactionality or consistency guarantees; there are columnar databases that support ACID (HANA) and those that don't (HBase); there are distributed key-value databases that support ACID (GridGain) and those that don't (for example, Riak and memcached).
  • Storage model doesn't dictate specific query language; using above examples - GridGain and HANA support SQL - HBase, for example, doesn't.

Unlike transactionality and query language - performance considerations, however, are not that straightforward.

Note also: SAP HANA has pluggable storage model and experimental row-based storage implementation. We'll concentrate on columnar storage that apparently accounts for all HANA usage at this point.

HANA's Columnar Storage Model
Let's recall what columnar storage model entails in general and note its HANA specifics.

Some of its stand out characteristics include:

  • Data in columnar model is kept in column (vs. rows as in row storage models).
  • Since data in a single column is almost always homogeneous it's frequently compressed for storage (especially in in-memory systems like HANA).
  • Aggregate functions (i.e. column functions) are very fast on columnar data model since the entire column can be fetched very quickly and effectively indexed.
  • Inserts, updates and row functions, however, are significantly slower than their row-based counterparts as a trade-off of columnar approach (inserting a row leads to multiple columns inserts). Because of this characteristic - columnar databased typically used in R/OLAP scenario (where data doesn't change) and very rarely in OLTP use cases (where data changes frequently).
  • Since columnar storage is fairly compact it doesn't generally require distribution (i.e. data partitioning) to store large datasets - the entire database can often be logically stored in memory of a single server. HANA, however, provides comprehensive support for data partitioning.

It is important to emphasize that columnar storage model is ideally suited for very compact memory utilization for the two main reasons:

  • Columnar model is a naturally fit for compression which often provides for dramatic reduction in memory consumption.
  • Since column-based functions are very fast - there is no need for materialized views for aggregated values in exchange for simply computing necessary values on the fly; this leads to significantly reduced memory footprint as well.

GridGain's IMDB Key-Value Storage Model
Key-value (KV) storage model is less defined than its columnar counterpart and usually involves a fair amount of vendor specifics.

Historically, there are two schools of KV storage models:

  • Traditional (examples include Riak, memcached, Redis). The common characteristic of these systems is a raw, language independent storage format for the keys and values.
  • Data Grid (examples include GridGain IMDB, GigaSpaces, Coherence). The common trait of these systems is the reliance on JVM as underlying runtime platform, and treating keys and values as user-defined JVM objects.

GridGain's IMDB belongs to Data Grid branch of KV storage models. Some of its key characteristics are:

  • Data is stored in a set of distributed maps (a.k.a. dictionaries or caches); in a simple approximation you can think of a value as a row in row-based model, and a key as that row's primary key. Following this analogy a single KV map can be approximated as row-based table with automatic primary key index.
  • Keys and values are represented as user-defined JVM objects and therefore no automatic compression can be performed.
  • Data distribution is designed from the ground up. Data is partitioned across the cluster mitigating, in part, lack of compression. Unlike HANA - data partitioning is mandatory.
  • MapReduce is the main API for data processing (SQL is supported as well).
  • Strong affinity and co-location semantics provided by default.
  • No bias towards aggregate or row-based processing performance and therefore no bias towards either OLAP or OLTP applicability.

Performance Considerations
It is somewhat expected that for heavy transactional processing GridGain will provide overall better performance in most cases:

  • Columnar model is rather inefficient in updating or inserting values in multiple columns.
  • Transactional locking is also less efficient in columnar model.
  • Required de-compression and re-compression further degrades performance.
  • KV storage model, on the other hand, provides an ideal model for individual updates as individual objects can be accessed, locked and updated very effectively.
  • Lack of compression in GridGain IMDB makes updates go even faster than in columnar model with compression.

As an example, GridGain just won a public tender for one of the biggest financial institutions in the world achieving 1 billion transactional updates per second on 10 commodity blades costing less than $25K all together. That transactional performance and associated TCO is clearly not the territory any columnar database can approach.

For OLAP workloads the picture is less obvious. HANA is heavily biased towards OLAP processing, and GridGain IMDB is neutral towards it. Both GridGain IMDB and SAP HANA provides comprehensive data partitioning capabilities and allow for processing parallelization - MPP traits necessary for scale out OLAP processing. I believe the actual difference observed by the customers will be driven primarily by three factors rooted deeply in differences between columnar and KV implementations in respective products:

  • Optimizations around data affinity and co-location.
  • Optimizations around the distribution overhead.
  • Optimizations around indexing of partitioned data.

Unfortunately - there's no way to provide any generalized guidance on performance difference here... We always recommend to try both in your particular scenario, pay attention to specific configuration and tuning around three points mentioned above - and see what results you'll get. It does take time and resources - but you may be surprised by your findings!

More Stories By Nikita Ivanov

Nikita Ivanov is founder and CEO of GridGain Systems, started in 2007 and funded by RTP Ventures and Almaz Capital. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory computing platform starting every 10 seconds around the world today.

Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. Nikita was one of the pioneers in using Java technology for server side middleware development while working for one of Europe’s largest system integrators in 1996.

He is an active member of Java middleware community, contributor to the Java specification, and holds a Master’s degree in Electro Mechanics from Baltic State Technical University, Saint Petersburg, Russia.

DXWorldEXPO Digital Transformation Stories
The deluge of IoT sensor data collected from connected devices and the powerful AI required to make that data actionable are giving rise to a hybrid ecosystem in which cloud, on-prem and edge processes become interweaved. Attendees will learn how emerging composable infrastructure solutions deliver the adaptive architecture needed to manage this new data reality. Machine learning algorithms can better anticipate data storms and automate resources to support surges, including fully scalable GPU-c...
Machine learning has taken residence at our cities' cores and now we can finally have "smart cities." Cities are a collection of buildings made to provide the structure and safety necessary for people to function, create and survive. Buildings are a pool of ever-changing performance data from large automated systems such as heating and cooling to the people that live and work within them. Through machine learning, buildings can optimize performance, reduce costs, and improve occupant comfort by ...
As Cybric's Chief Technology Officer, Mike D. Kail is responsible for the strategic vision and technical direction of the platform. Prior to founding Cybric, Mike was Yahoo's CIO and SVP of Infrastructure, where he led the IT and Data Center functions for the company. He has more than 24 years of IT Operations experience with a focus on highly-scalable architectures.
The explosion of new web/cloud/IoT-based applications and the data they generate are transforming our world right before our eyes. In this rush to adopt these new technologies, organizations are often ignoring fundamental questions concerning who owns the data and failing to ask for permission to conduct invasive surveillance of their customers. Organizations that are not transparent about how their systems gather data telemetry without offering shared data ownership risk product rejection, regu...
René Bostic is the Technical VP of the IBM Cloud Unit in North America. Enjoying her career with IBM during the modern millennial technological era, she is an expert in cloud computing, DevOps and emerging cloud technologies such as Blockchain. Her strengths and core competencies include a proven record of accomplishments in consensus building at all levels to assess, plan, and implement enterprise and cloud computing solutions. René is a member of the Society of Women Engineers (SWE) and a m...
Enterprises are striving to become digital businesses for differentiated innovation and customer-centricity. Traditionally, they focused on digitizing processes and paper workflow. To be a disruptor and compete against new players, they need to gain insight into business data and innovate at scale. Cloud and cognitive technologies can help them leverage hidden data in SAP/ERP systems to fuel their businesses to accelerate digital transformation success.
Poor data quality and analytics drive down business value. In fact, Gartner estimated that the average financial impact of poor data quality on organizations is $9.7 million per year. But bad data is much more than a cost center. By eroding trust in information, analytics and the business decisions based on these, it is a serious impediment to digital transformation.
Digital Transformation: Preparing Cloud & IoT Security for the Age of Artificial Intelligence. As automation and artificial intelligence (AI) power solution development and delivery, many businesses need to build backend cloud capabilities. Well-poised organizations, marketing smart devices with AI and BlockChain capabilities prepare to refine compliance and regulatory capabilities in 2018. Volumes of health, financial, technical and privacy data, along with tightening compliance requirements by...
Predicting the future has never been more challenging - not because of the lack of data but because of the flood of ungoverned and risk laden information. Microsoft states that 2.5 exabytes of data are created every day. Expectations and reliance on data are being pushed to the limits, as demands around hybrid options continue to grow.
Digital Transformation and Disruption, Amazon Style - What You Can Learn. Chris Kocher is a co-founder of Grey Heron, a management and strategic marketing consulting firm. He has 25+ years in both strategic and hands-on operating experience helping executives and investors build revenues and shareholder value. He has consulted with over 130 companies on innovating with new business models, product strategies and monetization. Chris has held management positions at HP and Symantec in addition to ...