


How to find a needle in a haystack?

This is the poster-child scenario for big data: you need to sift through a large amount of data to extract a tiny “nugget” of information. You also need to do it in as short a time as possible, because your business depends on it. Historically, with traditional RDBMS technology, this sort of scenario has required a large team and a large investment of time and money. Most traditional RDBMSs scale only vertically, so you have to keep buying larger and larger machines to reduce your turnaround time. The advent of public clouds and NoSQL databases like MongoDB has completely changed how teams think about this scenario.

Recently one of our customers came to us with an interesting problem. Periodically they needed to run a really complex query that scanned their entire data set. This query was essentially a collection scan: it touched every document in the collection. Here are more details:

  • Total data was about 100GB
  • Data safety was not an issue since the master copy of the data resided elsewhere
  • Query speed was extremely important. The goal was to be able to run the entire query within 10-15 minutes
  • The system needed to be up only while the query was running (to minimize cost)
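
To put the speed requirement in perspective, here is a rough back-of-the-envelope calculation in Python of the aggregate scan rate that 100GB in 10-15 minutes implies. It is only a sketch: the function name is ours, and it assumes 1 GB = 1024 MB and that the whole time budget is spent reading data.

```python
def required_scan_rate_mb_per_s(data_gb: float, minutes: float) -> float:
    """Aggregate scan rate (MB/s) needed to read data_gb within the given minutes."""
    megabytes = data_gb * 1024  # assumption: 1 GB = 1024 MB
    seconds = minutes * 60
    return megabytes / seconds


for budget_minutes in (10, 15):
    rate = required_scan_rate_mb_per_s(100, budget_minutes)
    print(f"{budget_minutes} min budget: {rate:.0f} MB/s across the whole system")
# 10 min budget: 171 MB/s across the whole system
# 15 min budget: 114 MB/s across the whole system
```

That is the rate the system as a whole has to sustain for the full scan; splitting the data across several machines divides the work between them.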

Given the last requirement, it made sense to run the entire system on a public cloud. The machines are turned on for only a few hours each week, long enough to update the data and run the query. The customer was already comfortable with Amazon EC2, so the decision was made to prototype the system in AWS.

The configuration that best met this goal was a sharded MongoDB deployment. Here is the configuration we settled on:

  • 3 shards – each shard has a standalone instance (r3.xlarge) with 30 GB of RAM
  • 1 config server
  • 1 shard router (m3.xlarge) with 15 GB of RAM

A couple of things I would like to point out about our choices:

  • Standalone vs replica set – Data safety is not an important requirement here since the master data is stored in a separate system. Hence we went with standalone servers instead of a replica set to save on cost.
  • 3 config servers vs 1 config server – Same reason as above. Data safety is not an important issue. In a typical production environment we would have gone with three config servers.
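
For readers who have not set up sharding before, the sketch below builds the admin commands that would typically be sent to the shard router to register three standalone shards and shard a collection. It is an illustration under assumptions, not our actual setup script: the host names, the analytics.events namespace and the hashed _id shard key are all hypothetical. addShard, enableSharding and shardCollection are standard MongoDB admin commands, but newer MongoDB releases require each shard to be a replica set, so standalone shards like ours only work on older versions.

```python
def sharding_commands(shard_hosts, namespace, shard_key):
    """Build, in order, the admin commands to send to the shard router (mongos)."""
    database = namespace.split(".", 1)[0]
    # One addShard command per standalone shard host ("host:port").
    commands = [{"addShard": host} for host in shard_hosts]
    # Allow sharding on the database, then shard the collection itself.
    commands.append({"enableSharding": database})
    commands.append({"shardCollection": namespace, "key": shard_key})
    return commands


# Hypothetical hosts, namespace and shard key, for illustration only.
SHARD_HOSTS = [
    "shard1.example.internal:27017",
    "shard2.example.internal:27017",
    "shard3.example.internal:27017",
]
commands = sharding_commands(SHARD_HOSTS, "analytics.events", {"_id": "hashed"})
for command in commands:
    print(command)

# With pymongo installed and a mongos reachable, each command could be sent as:
#   from pymongo import MongoClient
#   admin = MongoClient("mongodb://router.example.internal:27017").admin
#   for command in commands:
#       admin.command(command)
```

A hashed shard key is assumed here because a full collection scan benefits from documents being spread evenly across the shards; the right key for a real data set depends on the data.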

The real beauty of this configuration is that, because the data is spread across three shards with 30 GB of RAM each, almost the entire 100GB data set fits in memory. So essentially what you are running is an “in-memory” scan. This reduced the run time of the query from a few hours to less than 10 minutes. The use of the public cloud also dramatically reduced the capital investment, since you pay for the machines only while they are running.
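
The “almost entirely in memory” claim is simple arithmetic, sketched below in Python. The function and its usable_fraction parameter are our own illustration; in practice the operating system and mongod itself need some RAM, so the default of 1.0 is an optimistic upper bound.

```python
def in_memory_fraction(data_gb, shards, ram_per_shard_gb, usable_fraction=1.0):
    """Upper bound on the share of the data set that fits in the shards' combined RAM."""
    # usable_fraction is an assumption: the share of RAM actually available for data.
    cache_gb = shards * ram_per_shard_gb * usable_fraction
    return min(1.0, cache_gb / data_gb)


# Numbers from this post: 100GB of data, 3 shards, 30 GB of RAM per shard.
print(f"{in_memory_fraction(100, 3, 30):.0%}")  # 90%
# A fourth shard of the same size would cover the whole data set.
print(f"{in_memory_fraction(100, 4, 30):.0%}")  # 100%
```

Adding shards raises both the memory ceiling and the cost, so the shard count is a trade-off between how much of the scan is served from RAM and what you pay per run.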

This is a fairly dramatic change from how teams have handled this scenario over the past decade. So if you are in the “finding a needle in a haystack” business, think Cloud + NoSQL!

As always, if you have any questions, you can reach us at [email protected]


More Stories By Dharshan Rangegowda

Dharshan is the founder of MongoDirector.com. Before MongoDirector, Dharshan worked in the Virtualization and Data Management groups at Microsoft.
