Click here to close now.




















Welcome!

@BigDataExpo Authors: Pat Romanski, Carmen Gonzalez, Kaazing Blog, Pete Waterhouse, Elizabeth White

Blog Feed Post

Quick MapReduce with beanstalkd

At ProjectLocker, we operate a polyglot environment with a heavy Ruby bias. While we love Ruby and Rails, one of the drawbacks of Ruby is its Global VM Lock. In a nutshell, the Global VM Lock makes it harder to write Ruby code that can fully utilize a modern multi-core server. For Web applications, this isn’t a problem because the web server manages multiple processes for you (e.g. via Passenger). However, for offline processes, parallelism doesn’t come for free.

I was recently working on a project that involved the offline batch processing of lots of data. This project has been operating successfully for some time, but the data set has grown, causing the process to need more more time to complete than we’d like. So I dove in to see what we could do to speed it up. Fortunately, the process was still single-threaded, so we knew we’d be able to inject concurrency to increase throughput without adding hardware.

The job in question runs on a fairly well-equipped server, but the server was underutilized due to the process being serial. Here’s an outline of the initial code:

def main_job
  for retrieve_giant_dataset().each do |item|
    long_process(item)
  end

  summarize_results(retrieve_all_results()) 
end

def long_process(item)
  # Do some work on item that uses a lot of CPU time.
  item.save
end

That approach gets the job done, but I wanted to parallelize it. Conceptually, I wanted to transform the main_job method so that it looked something like this:

def main_job
  threads = []
  for retrieve_giant_dataset().each do |item|
    threads << Thread.new(item) do
      long_process(item)
    end
  end

  threads.each { |t| t.join }

  summarize_results(retrieve_all_results()) 
end

Unfortunately, it’s not that easy due to the aforementioned Global VM Lock. What I needed was a way to get my threads running on a bunch of independent processes. This is a problem tailor-made for a job queueing system. Enter beanstalkd, a simple & fast work queue. We paired beanstalkd with Stalker, a DSL that makes it easy to queue and process jobs from Ruby. Integrating these two was a cinch. Here’s what the restructured code looks like now:

def main_job
  for retrieve_giant_dataset().each do |item|
    Stalker.enqueue(JOB_NAME, :id => item.id)
  end

  beaneater = Beaneater::Pool.new(['localhost:11300'])
  tube = beaneater.tubes.find TUBE_NAME
  while tube.peek(:ready)
    sleep(5)
  end

  summarize_results(retrieve_all_results()) 
end

So instead of processing each item during the loop, now we just add each to the beanstalkd queue. Once we finish queueing all of the items, we wait until all of our entries have been processed by the worker processes. The workers are initiated via a jobs.rb file that looks something like this:

include Stalker
  
job JOB_NAME do |args|
  item = ItemClass.find(args['id'])
  Worker::long_process(item) 
end

We then start beanstalkd and a few worker processes and we’re off to the races. Now our job runs in parallel via multiple processes, and we can tune the number of worker processes we run to consume as much of the machine’s resources as we like. As a bonus, we can also run Stalker workers on other machines in our cluster for added parallelism. With just a few minor tweaks to our code, we’ve gone from single-threaded to a solution that is limited only by the capacity of the shared database used. Sweet!

What about the MapReduce reference in the title of this post? The MapReduce algorithm basically has two steps. In the Map step, you divide the work and assign it to worker nodes. The Reduce step simply combines the results of each individual node’s computation into an aggregate result. In our solution here, the Map step is done by us enqueuing our jobs into beanstalkd and then beanstalkd making the jobs available for consumption by our nodes. Our database serves to communicate the details of the jobs, and stands in for a shared filesystem like the HDFS used by Hadoop. I didn’t go into detail about this step, but our Reduce is also assisted by database aggregates; we’re able to construct a few simple queries that get us what we want from the database.

So there it is, distributed MapReduce for Ruby using beanstalkd, Stalker, and a healthy database. This is probably not the best solution if you need to scale to thousands or tens of thousands of workers. But if you just need to get tens of workers running in parallel quickly, you may be able to adapt this approach to fit your needs.

Read the original blog entry...

More Stories By Damon Young

Damon Young is Director of Sales at ProjectLocker.com. ProjectLocker was founded in 2003 to provide on-demand tools for software developers. Guided by the simple mission of helping companies build better software, ProjectLocker's services have expanded to include services for the complete lifecycle of software projects, from requirements documentation to build and test automation. ProjectLocker serves companies from startups to Fortune 1000 multinationals.

@BigDataExpo Stories
SYS-CON Events announced today that Pythian, a global IT services company specializing in helping companies leverage disruptive technologies to optimize revenue-generating systems, has been named “Bronze Sponsor” of SYS-CON's 17th Cloud Expo, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Founded in 1997, Pythian is a global IT services company that helps companies compete by adopting disruptive technologies such as cloud, Big Data, advance...
Containers are not new, but renewed commitments to performance, flexibility, and agility have propelled them to the top of the agenda today. By working without the need for virtualization and its overhead, containers are seen as the perfect way to deploy apps and services across multiple clouds. Containers can handle anything from file types to operating systems and services, including microservices. What are microservices? Unlike what the name implies, microservices are not necessarily small,...
SYS-CON Events announced today that IceWarp will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. IceWarp, the leader of cloud and on-premise messaging, delivers secured email, chat, documents, conferencing and collaboration to today's mobile workforce, all in one unified interface
All major researchers estimate there will be tens of billions devices - computers, smartphones, tablets, and sensors - connected to the Internet by 2020. This number will continue to grow at a rapid pace for the next several decades. With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo, November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Learn what is going on, contribute to the discussions, and e...
Too often with compelling new technologies market participants become overly enamored with that attractiveness of the technology and neglect underlying business drivers. This tendency, what some call the “newest shiny object syndrome,” is understandable given that virtually all of us are heavily engaged in technology. But it is also mistaken. Without concrete business cases driving its deployment, IoT, like many other technologies before it, will fade into obscurity.
As more and more data is generated from a variety of connected devices, the need to get insights from this data and predict future behavior and trends is increasingly essential for businesses. Real-time stream processing is needed in a variety of different industries such as Manufacturing, Oil and Gas, Automobile, Finance, Online Retail, Smart Grids, and Healthcare. Azure Stream Analytics is a fully managed distributed stream computation service that provides low latency, scalable processing of ...
In 2014, the market witnessed a massive migration to the cloud as enterprises finally overcame their fears of the cloud’s viability, security, etc. Over the past 18 months, AWS, Google and Microsoft have waged an ongoing battle through a wave of price cuts and new features. For IT executives, sorting through all the noise to make the best cloud investment decisions has become daunting. Enterprises can and are moving away from a "one size fits all" cloud approach. The new competitive field has ...
SYS-CON Events announced today that DataClear Inc. will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. The DataClear ‘BlackBox’ is the only solution that moves your PC, browsing and data out of the United States and away from prying (and spying) eyes. Its solution automatically builds you a clean, on-demand, virus free, new virtual cloud based PC outside of the United States, and wipes it clean...
SYS-CON Events announced today that HPM Networks will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. For 20 years, HPM Networks has been integrating technology solutions that solve complex business challenges. HPM Networks has designed solutions for both SMB and enterprise customers throughout the San Francisco Bay Area.
SYS-CON Events announced today the Containers & Microservices Bootcamp, being held November 3-4, 2015, in conjunction with 17th Cloud Expo, @ThingsExpo, and @DevOpsSummit at the Santa Clara Convention Center in Santa Clara, CA. This is your chance to get started with the latest technology in the industry. Combined with real-world scenarios and use cases, the Containers and Microservices Bootcamp, led by Janakiram MSV, a Microsoft Regional Director, will include presentations as well as hands-on...
Moving an existing on-premise infrastructure into the cloud can be a complex and daunting proposition. It is critical to understand the benefits as well as the challenges associated with either a full or hybrid approach. In his session at 17th Cloud Expo, Richard Weiss, Principal Consultant at Pythian, will present a roadmap that can be leveraged by any organization to plan, analyze, evaluate and execute on a cloud migration solution. He will review the five major cloud transformation phases a...
Contrary to mainstream media attention, the multiple possibilities of how consumer IoT will transform our everyday lives aren’t the only angle of this headline-gaining trend. There’s a huge opportunity for “industrial IoT” and “Smart Cities” to impact the world in the same capacity – especially during critical situations. For example, a community water dam that needs to release water can leverage embedded critical communications logic to alert the appropriate individuals, on the right device, as...
Consumer IoT applications provide data about the user that just doesn’t exist in traditional PC or mobile web applications. This rich data, or “context,” enables the highly personalized consumer experiences that characterize many consumer IoT apps. This same data is also providing brands with unprecedented insight into how their connected products are being used, while, at the same time, powering highly targeted engagement and marketing opportunities. In his session at @ThingsExpo, Nathan Trel...
In his session at @ThingsExpo, Lee Williams, a producer of the first smartphones and tablets, will talk about how he is now applying his experience in mobile technology to the design and development of the next generation of Environmental and Sustainability Services at ETwater. He will explain how M2M controllers work through wirelessly connected remote controls; and specifically delve into a retrofit option that reverse-engineers control codes of existing conventional controller systems so the...
The web app is agile. The REST API is agile. The testing and planning are agile. But alas, data infrastructures certainly are not. Once an application matures, changing the shape or indexing scheme of data often forces at best a top down planning exercise and at worst includes schema changes that force downtime. The time has come for a new approach that fundamentally advances the agility of distributed data infrastructures. Come learn about a new solution to the problems faced by software organ...
With the Apple Watch making its way onto wrists all over the world, it’s only a matter of time before it becomes a staple in the workplace. In fact, Forrester reported that 68 percent of technology and business decision-makers characterize wearables as a top priority for 2015. Recognizing their business value early on, FinancialForce.com was the first to bring ERP to wearables, helping streamline communication across front and back office functions. In his session at @ThingsExpo, Kevin Roberts...
U.S. companies are desperately trying to recruit and hire skilled software engineers and developers, but there is simply not enough quality talent to go around. Tiempo Development is a nearshore software development company. Our headquarters are in AZ, but we are a pioneer and leader in outsourcing to Mexico, based on our three software development centers there. We have a proven process and we are experts at providing our customers with powerful solutions. We transform ideas into reality.
SYS-CON Events announced today that Micron Technology, Inc., a global leader in advanced semiconductor systems, will exhibit at the 17th International Cloud Expo®, which will take place on November 3–5, 2015, at the Santa Clara Convention Center in Santa Clara, CA. Micron’s broad portfolio of high-performance memory technologies – including DRAM, NAND and NOR Flash – is the basis for solid state drives, modules, multichip packages and other system solutions. Backed by more than 35 years of tech...
17th Cloud Expo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterprises ar...
While many app developers are comfortable building apps for the smartphone, there is a whole new world out there. In his session at @ThingsExpo, Narayan Sainaney, Co-founder and CTO of Mojio, will discuss how the business case for connected car apps is growing and, with open platform companies having already done the heavy lifting, there really is no barrier to entry.