Quick MapReduce with beanstalkd

At ProjectLocker, we operate a polyglot environment with a heavy Ruby bias. While we love Ruby and Rails, one of the drawbacks of Ruby is its Global VM Lock (GVL). In a nutshell, the GVL lets only one thread execute Ruby code at a time, which makes it harder to write Ruby code that can fully utilize a modern multi-core server. For web applications, this isn’t a problem because the web server manages multiple processes for you (e.g., via Passenger). However, for offline processes, parallelism doesn’t come for free.

I was recently working on a project that involved the offline batch processing of lots of data. This project had been operating successfully for some time, but the data set had grown, causing the process to need more time to complete than we’d like. So I dove in to see what we could do to speed it up. Fortunately, the process was still single-threaded, so we knew we’d be able to introduce concurrency to increase throughput without adding hardware.

The job in question runs on a fairly well-equipped server, but the server was underutilized because the process ran serially. Here’s an outline of the initial code:

def main_job
  # Process every item one after another in a single thread.
  retrieve_giant_dataset.each do |item|
    long_process(item)
  end

  summarize_results(retrieve_all_results)
end

def long_process(item)
  # Do some work on item that uses a lot of CPU time.
  item.save
end

That approach gets the job done, but I wanted to parallelize it. Conceptually, I wanted to transform the main_job method so that it looked something like this:

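def main_job
  threads = []
  retrieve_giant_dataset.each do |item|
    # Pass item in as a thread argument so each thread works on its own item.
    threads << Thread.new(item) do |thread_item|
      long_process(thread_item)
    end
  end

  # Wait for every thread to finish before summarizing.
  threads.each { |t| t.join }

  summarize_results(retrieve_all_results)
end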

Unfortunately, it’s not that easy due to the aforementioned Global VM Lock: threads inside a single Ruby process can’t run CPU-bound Ruby code on several cores at once. What I needed was a way to spread the work across a bunch of independent processes. This is a problem tailor-made for a job queueing system. Enter beanstalkd, a simple and fast work queue. We paired beanstalkd with Stalker, a DSL that makes it easy to queue and process jobs from Ruby. Integrating these two was a cinch. Here’s what the restructured code looks like now:

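def main_job
  # Queue one job per item instead of processing it inline.
  retrieve_giant_dataset.each do |item|
    Stalker.enqueue(JOB_NAME, :id => item.id)
  end

  # Poll the tube until no ready jobs remain.
  # Note: peek(:ready) does not see jobs a worker has already reserved,
  # so the last few jobs may still be in progress when this loop exits.
  beaneater = Beaneater::Pool.new(['localhost:11300'])
  tube = beaneater.tubes.find TUBE_NAME
  while tube.peek(:ready)
    sleep(5)
  end

  summarize_results(retrieve_all_results)
end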

So instead of processing each item during the loop, now we just add each to the beanstalkd queue. Once we finish queueing all of the items, we wait until all of our entries have been processed by the worker processes. The workers are initiated via a jobs.rb file that looks something like this:

require 'stalker'
include Stalker

# Each worker looks the item up by id and does the CPU-heavy work.
job JOB_NAME do |args|
  item = ItemClass.find(args['id'])
  Worker.long_process(item)
end
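One detail of the wait loop in main_job deserves a closer look: a check for ready jobs says nothing about jobs that a worker has reserved but not yet finished. A minimal sketch of a more general wait helper follows, assuming the caller can supply a block that returns the number of outstanding jobs. The names `wait_until_drained`, `interval`, and `sleeper` are hypothetical and are not part of Stalker or Beaneater; `sleeper` exists only so the loop can be exercised without real waiting.

```ruby
# Polls until the supplied block reports zero outstanding jobs.
# Returns the number of times it had to sleep.
# (Hypothetical helper for illustration; not a Stalker or Beaneater API.)
def wait_until_drained(interval: 5, sleeper: Kernel.method(:sleep))
  polls = 0
  until yield.zero?
    sleeper.call(interval)
    polls += 1
  end
  polls
end

# Example with a fake job counter standing in for the queue.
remaining = [3, 2, 1, 0]
polls = wait_until_drained(interval: 1, sleeper: ->(_seconds) {}) { remaining.shift }
puts "slept #{polls} times"
```

In a real setup the block would count both ready and reserved jobs. beanstalkd reports these per tube as `current-jobs-ready` and `current-jobs-reserved` in its `stats-tube` output; how you read them depends on your client library and version.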

We then start beanstalkd and a few worker processes and we’re off to the races. Now our job runs in parallel via multiple processes, and we can tune the number of worker processes we run to consume as much of the machine’s resources as we like. As a bonus, we can also run Stalker workers on other machines in our cluster for added parallelism. With just a few minor tweaks to our code, we’ve gone from single-threaded to a solution that is limited only by the capacity of the shared database used. Sweet!
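To see why separate processes sidestep the lock, here is a minimal single-machine sketch that uses `fork` and pipes instead of a queue. It is not what our setup does (beanstalkd and the Stalker workers replace this hand-rolled plumbing), and `cpu_heavy` and `parallel_map_with_fork` are hypothetical names invented for the illustration. It also assumes a platform where `fork` is available, which excludes Windows.

```ruby
# Hypothetical stand-in for the CPU-heavy long_process.
def cpu_heavy(n)
  (1..n).reduce(0) { |sum, i| sum + i * i }
end

# Runs cpu_heavy for each input in its own child process and
# returns the results in input order.
def parallel_map_with_fork(inputs)
  children = inputs.map do |n|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.write(Marshal.dump(cpu_heavy(n)))
      writer.close
      exit!(0) # leave the child immediately, skipping at_exit handlers
    end
    writer.close # the parent only reads
    [pid, reader]
  end

  children.map do |pid, reader|
    result = Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
    result
  end
end

p parallel_map_with_fork([1_000, 2_000, 3_000])
```

Each child has its own interpreter and its own lock, so the operating system can schedule the children on different cores. The queue-based version gives the same effect while also letting workers live on other machines.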

What about the MapReduce reference in the title of this post? The MapReduce algorithm basically has two steps. In the Map step, you divide the work and assign it to worker nodes. The Reduce step simply combines the results of each individual node’s computation into an aggregate result. In our solution here, the Map step is done by us enqueuing our jobs into beanstalkd and then beanstalkd making the jobs available for consumption by our nodes. Our database serves to communicate the details of the jobs, and stands in for a shared filesystem like the HDFS used by Hadoop. I didn’t go into detail about this step, but our Reduce is also assisted by database aggregates; we’re able to construct a few simple queries that get us what we want from the database.
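The shape of that flow can be sketched in a few lines of plain Ruby. Everything here is a stand-in chosen for illustration: a `Queue` plays the role of beanstalkd, threads play the worker processes, and a Hash plays the shared database. The name `tiny_map_reduce` and the squaring step are hypothetical. Because the workers are threads, this shows only the structure and not the multi-core speedup.

```ruby
# Toy model of the pipeline: enqueue jobs, let workers process them into a
# shared store, then aggregate the stored results.
def tiny_map_reduce(items, worker_count: 3)
  queue = Queue.new   # stand-in for beanstalkd
  results = {}        # stand-in for the shared database
  lock = Mutex.new

  # Map: enqueue one job per item, then one stop marker per worker.
  items.each { |item| queue << item }
  worker_count.times { queue << :done }

  workers = Array.new(worker_count) do
    Thread.new do
      while (item = queue.pop) != :done
        value = item * item # stand-in for long_process
        lock.synchronize { results[item] = value }
      end
    end
  end
  workers.each(&:join)

  # Reduce: aggregate what the workers stored.
  results.values.reduce(0, :+)
end

puts tiny_map_reduce([1, 2, 3, 4])
```

The Hash is keyed by item, so the sketch assumes distinct items, much as the real jobs are keyed by a unique database id.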

So there it is, distributed MapReduce for Ruby using beanstalkd, Stalker, and a healthy database. This is probably not the best solution if you need to scale to thousands or tens of thousands of workers. But if you just need to get tens of workers running in parallel quickly, you may be able to adapt this approach to fit your needs.


More Stories By Damon Young

Damon Young is Director of Sales at ProjectLocker.com. ProjectLocker was founded in 2003 to provide on-demand tools for software developers. Guided by the simple mission of helping companies build better software, ProjectLocker's services have expanded to include services for the complete lifecycle of software projects, from requirements documentation to build and test automation. ProjectLocker serves companies from startups to Fortune 1000 multinationals.
