Welcome!

@BigDataExpo Authors: Alberto Pan, Elizabeth White, Pat Romanski, ManageEngine IT Matters, Peter Silva

Blog Feed Post

How to find a needle in a haystack?

 Needle In A Haystack Loupe DrawingThe poster child scenario for big data – you need to sift through a large amount of data to extract a tiny “nugget” of information. Also you need to do it in as short a amount of time as possible, your business depends on it. Historically using traditional RDBMS technology this sort of scenario has required a large team and a large investment of time and money. Most traditional RDBMS’s only scale vertically, so you have to keep buying larger and larger machines to reduce your turnaround time. The advent of public clouds and NoSQL databases like MongoDB has completely disrupted how teams are thinking about this scenario.

Recently once of our customers came to us with an interesting problem. Periodically they needed to run a really complex query that scanned their entire data set. This query was pretty much a collection scan – it touched every document in the collection. Here are more details

  • Total data was about 100GB
  • Data safety was not a issue since the master copy of the data resided elsewhere
  • Query speed was extremely important. The goal was to be able to run the entire query within 10-15 mnts
  • The system needed to be up only when the query is running (minimize cost)

Due to the last requirement it made sense to run the entire system on a public cloud. The machines get turned on for only a few hours every week for the data to get updated and the query to get run. The customer was already comfortable with Amazon EC2, so the decision was made to prototype the system in AWS.

The best configuration to achieve this goal was a “Sharded” mongod deployment. Here is the configuration we settled on

  • 3 shards – each shard has a standalone instance (r3.xlarge) with 30 GB of RAM
  • 1 config server
  • 1 shard router (m3.xlarge) with 15 GB of RAM

A couple of things I would like to point out about our choices

  • Standalone vs replica set – Data safety is not an important requirement here since the master data is stored in a separate system. Hence we went with standalone servers instead of a replica set to save on cost.
  • 3 config servers vs 1 config server – Same reason as above. Data safety is not an important issue. In a typical production environment we would have gone with three config servers.

The real beauty of this configuration is that due to the sharded configuration almost the entire 100GB of data is stored completely in memory. So essentially what you are running is an “in memory” scan. This dramatically reduced the run time of the query from a few hours to less than 10 mnts. The use of the public cloud also dramatically reduced the capital investment since you only pay for the machines when they are running.

This is a fairly dramatic change to how teams have been handling this scenario over the past decade. So if you are in the “Finding a needle in a haystack” business think Cloud + NoSQL!

As always if you have any questions you can reach us at [email protected]

Read the original blog entry...

More Stories By Dharshan Rangegowda

Dharshan is the founder of MongoDirector.com. Previous to MongoDirector Dharshan worked in the Virtualization and Data management groups in Microsoft.

@BigDataExpo Stories
Infrastructure is widely available, but who’s managing inbound/outbound traffic? Data is created, stored, and managed online – who is protecting it and how? In his session at 19th Cloud Expo, Jaeson Yoo, SVP of Business Development at Penta Security Systems Inc., discussed how to keep any and all infrastructure clean, safe, and efficient by monitoring and filtering all malicious HTTP/HTTPS traffic at the OSI Layer 7. Stop attacks and web intruders before they can enter your network.
The many IoT deployments around the world are busy integrating smart devices and sensors into their enterprise IT infrastructures. Yet all of this technology – and there are an amazing number of choices – is of no use without the software to gather, communicate, and analyze the new data flows. Without software, there is no IT. In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, Dave McCarthy, Director of Products at Bsquare Corporation; Alan Williamson, Principal...
In his session at Cloud Expo, Robert Cohen, an economist and senior fellow at the Economic Strategy Institute, provideed economic scenarios that describe how the rapid adoption of software-defined everything including cloud services, SDDC and open networking will change GDP, industry growth, productivity and jobs. This session also included a drill down for several industries such as finance, social media, cloud service providers and pharmaceuticals.
Extracting business value from Internet of Things (IoT) data doesn’t happen overnight. There are several requirements that must be satisfied, including IoT device enablement, data analysis, real-time detection of complex events and automated orchestration of actions. Unfortunately, too many companies fall short in achieving their business goals by implementing incomplete solutions or not focusing on tangible use cases. In his general session at @ThingsExpo, Dave McCarthy, Director of Products...
Internet of @ThingsExpo has announced today that Chris Matthieu has been named tech chair of Internet of @ThingsExpo 2017 New York The 7th Internet of @ThingsExpo will take place on June 6-8, 2017, at the Javits Center in New York City, New York. Chris Matthieu is the co-founder and CTO of Octoblu, a revolutionary real-time IoT platform recently acquired by Citrix. Octoblu connects things, systems, people and clouds to a global mesh network allowing users to automate and control design flo...
Unsecured IoT devices were used to launch crippling DDOS attacks in October 2016, targeting services such as Twitter, Spotify, and GitHub. Subsequent testimony to Congress about potential attacks on office buildings, schools, and hospitals raised the possibility for the IoT to harm and even kill people. What should be done? Does the government need to intervene? This panel at @ThingExpo New York brings together leading IoT and security experts to discuss this very serious topic.
Businesses and business units of all sizes can benefit from cloud computing, but many don't want the cost, performance and security concerns of public cloud nor the complexity of building their own private clouds. Today, some cloud vendors are using artificial intelligence (AI) to simplify cloud deployment and management. In his session at 20th Cloud Expo, Ajay Gulati, Co-founder and CEO of ZeroStack, will discuss how AI can simplify cloud operations. He will cover the following topics: why clou...
In this strange new world where more and more power is drawn from business technology, companies are effectively straddling two paths on the road to innovation and transformation into digital enterprises. The first path is the heritage trail – with “legacy” technology forming the background. Here, extant technologies are transformed by core IT teams to provide more API-driven approaches. Legacy systems can restrict companies that are transitioning into digital enterprises. To truly become a lead...
Internet-of-Things discussions can end up either going down the consumer gadget rabbit hole or focused on the sort of data logging that industrial manufacturers have been doing forever. However, in fact, companies today are already using IoT data both to optimize their operational technology and to improve the experience of customer interactions in novel ways. In his session at @ThingsExpo, Gordon Haff, Red Hat Technology Evangelist, will share examples from a wide range of industries – includin...
"We analyze the video streaming experience. We are gathering the user behavior in real time from the user devices and we analyze how users experience the video streaming," explained Eric Kim, Founder and CEO at Streamlyzer, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Enterprise IT has been in the era of Hybrid Cloud for some time now. But it seems most conversations about Hybrid are focused on integrating AWS, Microsoft Azure, or Google ECM into existing on-premises systems. Where is all the Private Cloud? What do technology providers need to do to make their offerings more compelling? How should enterprise IT executives and buyers define their focus, needs, and roadmap, and communicate that clearly to the providers?
"We are a leader in the market space called network visibility solutions - it enables monitoring tools and Big Data analysis to access the data and be able to see the performance," explained Shay Morag, VP of Sales and Marketing at Niagara Networks, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
According to Forrester Research, every business will become either a digital predator or digital prey by 2020. To avoid demise, organizations must rapidly create new sources of value in their end-to-end customer experiences. True digital predators also must break down information and process silos and extend digital transformation initiatives to empower employees with the digital resources needed to win, serve, and retain customers.
Amazon has gradually rolled out parts of its IoT offerings in the last year, but these are just the tip of the iceberg. In addition to optimizing their back-end AWS offerings, Amazon is laying the ground work to be a major force in IoT – especially in the connected home and office. Amazon is extending its reach by building on its dominant Cloud IoT platform, its Dash Button strategy, recently announced Replenishment Services, the Echo/Alexa voice recognition control platform, the 6-7 strategic...
Organizations planning enterprise data center consolidation and modernization projects are faced with a challenging, costly reality. Requirements to deploy modern, cloud-native applications simultaneously with traditional client/server applications are almost impossible to achieve with hardware-centric enterprise infrastructure. Compute and network infrastructure are fast moving down a software-defined path, but storage has been a laggard. Until now.
We're entering the post-smartphone era, where wearable gadgets from watches and fitness bands to glasses and health aids will power the next technological revolution. With mass adoption of wearable devices comes a new data ecosystem that must be protected. Wearables open new pathways that facilitate the tracking, sharing and storing of consumers’ personal health, location and daily activity data. Consumers have some idea of the data these devices capture, but most don’t realize how revealing and...
IoT solutions exploit operational data generated by Internet-connected smart “things” for the purpose of gaining operational insight and producing “better outcomes” (for example, create new business models, eliminate unscheduled maintenance, etc.). The explosive proliferation of IoT solutions will result in an exponential growth in the volume of IoT data, precipitating significant Information Governance issues: who owns the IoT data, what are the rights/duties of IoT solutions adopters towards t...
Whether your IoT service is connecting cars, homes, appliances, wearable, cameras or other devices, one question hangs in the balance – how do you actually make money from this service? The ability to turn your IoT service into profit requires the ability to create a monetization strategy that is flexible, scalable and working for you in real-time. It must be a transparent, smoothly implemented strategy that all stakeholders – from customers to the board – will be able to understand and comprehe...
Between 2005 and 2020, data volumes will grow by a factor of 300 – enough data to stack CDs from the earth to the moon 162 times. This has come to be known as the ‘big data’ phenomenon. Unfortunately, traditional approaches to handling, storing and analyzing data aren’t adequate at this scale: they’re too costly, slow and physically cumbersome to keep up. Fortunately, in response a new breed of technology has emerged that is cheaper, faster and more scalable. Yet, in meeting these new needs they...
Complete Internet of Things (IoT) embedded device security is not just about the device but involves the entire product’s identity, data and control integrity, and services traversing the cloud. A device can no longer be looked at as an island; it is a part of a system. In fact, given the cross-domain interactions enabled by IoT it could be a part of many systems. Also, depending on where the device is deployed, for example, in the office building versus a factory floor or oil field, security ha...