Welcome!

Big Data Journal Authors: Roger Strukhoff, Gilad Parann-Nissany, Carmen Gonzalez, Jayaram Krishnaswamy, Tom Leyden

Blog Feed Post

A Thumbnail History of Ensemble Methods

By Mike Bowles Ensemble methods are the backbone of machine learning techniques. However, it can be a daunting subject for someone approaching it for the first time, so we asked Mike Bowles, machine learning expert and serial entrepreneur to provide some context. Ensemble Methods are among the most powerful and easiest to use of predictive analytics algorithms and R programming language has an outstanding collection that includes the best performers – Random Forest, Gradient Boosting and Bagging as well as big data versions that are available through Revolution Analytics.  The phrase “Ensemble Methods” generally refers to building a large number of somewhat independent predictive models and then combining them by voting or averaging to yield very high performance.  Ensemble methods have been called crowd sourcing for machines.  Bagging, Boosting and Random Forest all have the objective of improving performance beyond what’s achievable with a binary decision tree, but the algorithms take different approaches to improving performance.  Bagging and Random Forests were developed to overcome variance and stability issues with binary decision trees.  The term “Bagging” was coined by the late Professor Leo Breiman of Berkeley.  Professor Breiman was instrumental in the development of decision trees for statistical learning and recognized that training and averaging a multitude of trees on different random subsets of data would reduce variance and improve stability.  The term comes from a shortening of “Bootstrap Aggregating” and the relation to bootstrap sampling is obvious.  Tin Kam Ho of Bell Labs developed Random Decision Forests as an example of a random subspace method.  The idea with Random Decision Forests was to train binary decision trees on random subsets of attributes (random subsets of columns of the training data).  Breiman and Cutler’s Random Forests method combined random subsampling of rows (Bagging) with random subsampling of columns.  The randomForest package in R was written by Professor Breiman and Adele Cutler.  Boosting methods grew out of work on computational learning theory.  The first algorithm of this type was called AdaBoost by its authors Freund and Shapire.  In the introduction to their paper they use the example of friends going to the race track regularly and betting on the horses.  One of the friends decides to devise a method of betting a fraction of his money with each of his friends and adjusting the fractions based on results so that his performance over time approaches the performance of his most winning friend.  The goal with Boosting is maximum predictive performance.  AdaBoost stood for a long time as the best example of a black box algorithm.  A practitioner could apply it without much parameter tweaking and it would yield superior performer while almost never overfitting.  It was a little mysterious.  In some of Professor Breiman’s papers on Random Forests, he compares performance with AdaBoost.  Professor Jerome Friedman and his Stanford colleagues Professors Hastie and Tibshirani authored a paper in 2000 that attempted to understand why AdaBoost was so successful.  The paper caused a storm of controversy.  The comments on the paper were longer than the paper itself.  Most of the comments centered around whether boosting was just another way of reducing variance or was doing something different by focusing on error reduction.  Professor Friedman offered several arguments and examples to demonstrate that boosting is more than just another variance reduction technique, but commenters did not reach a consensus. The understanding that Professor Friedman and his colleagues developed from analyzing AdaBoost led him to formulate the boosting method more directly.  That led to a number of several valuable extensions and improvements beyond AdaBoost – the ability to handle regression and multiclass problems, other performance measures besides squared error etc.  These features (and new ones being developed) are all included in the excellent R package gbm by Greg Ridgeway. Today, ensemble methods form the backbone of many data science applications. Random Forests has become particularly popular with modelers competing in Kaggle competitions and according to google trends Random Forests has surpassed AdaBoost in popularity. In a future post we will explore several of these algorithms in R. References Breiman  Leo, Bagging Predictors, Technical Report No. 421, Sept 1994, Dept of Statistics University of California, Berkeley. Ho, T., Random Decision Forests, Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278-282, 1995 Breiman, Leo Random Forest – Random Features, Technical Report No. 567, Sept 1999, Dept of Statistics University of California, Berkeley. Freund, Yoav and Schapire , Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55. 1997. Friedman, Jerome, Hastie, Trevor, Tibshirani, Robert Additive Logistic Regression: A Statistical View of Boosting, Ann Stat, Vol 28, Number 2, (2000), 337-655  

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@BigDataExpo Stories
In high-production environments where release cycles are measured in hours or minutes — not days or weeks — there's little room for mistakes and no room for confusion. Everyone has to understand what's happening, in real time, and have the means to do whatever is necessary to keep applications up and running optimally. DevOps is a high-stakes world, but done well, it delivers the agility and performance to significantly impact business competitiveness.
ScriptRock makes GuardRail, a DevOps-ready platform for configuration monitoring. Realizing we were spending way too much time digging up, cataloguing, and tracking machine configurations, we began writing our own scripts and tools to handle what is normally an enormous chore. Then we took the concept a step further, giving it a beautiful interface and making it simple enough for our bosses to understand. We named it GuardRail after its function - to allow businesses to move fast and stay sa...
"BSQUARE is in the business of selling software solutions for smart connected devices. It's obvious that IoT has moved from being a technology to being a fundamental part of business, and in the last 18 months people have said let's figure out how to do it and let's put some focus on it, " explained Dave Wagstaff, VP & Chief Architect, at BSQUARE Corporation, in this SYS-CON.tv interview at @ThingsExpo, held Nov 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
The major cloud platforms defy a simple, side-by-side analysis. Each of the major IaaS public-cloud platforms offers their own unique strengths and functionality. Options for on-site private cloud are diverse as well, and must be designed and deployed while taking existing legacy architecture and infrastructure into account. Then the reality is that most enterprises are embarking on a hybrid cloud strategy and programs. In this Power Panel at 15th Cloud Expo (http://www.CloudComputingExpo.com...
SYS-CON Events announced today that Windstream, a leading provider of advanced network and cloud communications, has been named “Silver Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9–11, 2015, at the Javits Center in New York, NY. Windstream (Nasdaq: WIN), a FORTUNE 500 and S&P 500 company, is a leading provider of advanced network communications, including cloud computing and managed services, to businesses nationwide. The company also offers broadband, p...

ARMONK, N.Y., Nov. 20, 2014 /PRNewswire/ --  IBM (NYSE: IBM) today announced that it is bringing a greater level of control, security and flexibility to cloud-based application development and delivery with a single-tenant version of Bluemix, IBM's

“We help people build clusters, in the classical sense of the cluster. We help people put a full stack on top of every single one of those machines. We do the full bare metal install," explained Greg Bruno, Vice President of Engineering and co-founder of StackIQ, in this SYS-CON.tv interview at 15th Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
DevOps Summit 2015 New York, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that it is now accepting Keynote Proposals. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long development cycles that produce software that is obsolete...
"People are a lot more knowledgeable about APIs now. There are two types of people who work with APIs - IT people who want to use APIs for something internal and the product managers who want to do something outside APIs for people to connect to them," explained Roberto Medrano, Executive Vice President at SOA Software, in this SYS-CON.tv interview at Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
“We are a managed services company. We have taken the key aspects of the cloud and the purposed data center and merged the two together and launched the Purposed Cloud about 18–24 months ago," explained Chetan Patwardhan, CEO of Stratogent, in this SYS-CON.tv interview at 15th Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
The Internet of Things is a misnomer. That implies that everything is on the Internet, and that simply should not be - especially for things that are blurring the line between medical devices that stimulate like a pacemaker and quantified self-sensors like a pedometer or pulse tracker. The mesh of things that we manage must be segmented into zones of trust for sensing data, transmitting data, receiving command and control administrative changes, and peer-to-peer mesh messaging. In his session a...
"At our booth we are showing how to provide trust in the Internet of Things. Trust is where everything starts to become secure and trustworthy. Now with the scaling of the Internet of Things it becomes an interesting question – I've heard numbers from 200 billion devices next year up to a trillion in the next 10 to 15 years," explained Johannes Lintzen, Vice President of Sales at Utimaco, in this SYS-CON.tv interview at @ThingsExpo, held Nov 4–6, 2014, at the Santa Clara Convention Center in San...
As enterprises engage with Big Data technologies to develop applications needed to meet operational demands, new computation fabrics are continually being introduced. To leverage these new innovations, organizations are sacrificing market opportunities to gain expertise in learning new systems. In his session at Big Data Expo, Supreet Oberoi, Vice President of Field Engineering at Concurrent, Inc., discussed how to leverage existing infrastructure and investments and future-proof them against e...
"Desktop as a Service is emerging as a very big trend. One of the big influencers of this – for Esri – is that we have a large user base that uses virtualization and they are looking at Desktop as a Service right now," explained John Meza, Product Engineer at Esri, in this SYS-CON.tv interview at Cloud Expo, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that Gridstore™, the leader in hyper-converged infrastructure purpose-built to optimize Microsoft workloads, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Gridstore™ is the leader in hyper-converged infrastructure purpose-built for Microsoft workloads and designed to accelerate applications in virtualized environments. Gridstore’s hyper-converged infrastructure is the ...
We certainly live in interesting technological times. And no more interesting than the current competing IoT standards for connectivity. Various standards bodies, approaches, and ecosystems are vying for mindshare and positioning for a competitive edge. It is clear that when the dust settles, we will have new protocols, evolved protocols, that will change the way we interact with devices and infrastructure. We will also have evolved web protocols, like HTTP/2, that will be changing the very core...
SYS-CON Events announced today that Creative Business Solutions will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Creative Business Solutions is the top stocking authorized HP Renew Distributor in the U.S. Based out of Long Island, NY, Creative Business Solutions offers a one-stop shop for a diverse range of products including Proliant, Blade and Industry Standard Servers, Networking, Server Options and...
You use an agile process; your goal is to make your organization more agile. But what about your data infrastructure? The truth is, today's databases are anything but agile - they are effectively static repositories that are cumbersome to work with, difficult to change, and cannot keep pace with application demands. Performance suffers as a result, and it takes far longer than it should to deliver new features and capabilities needed to make your organization competitive. As your application an...
An effective way of thinking in Big Data is composed of a methodical framework for dealing with the predicted shortage of 50-60% of the qualified Big Data resources in the U.S. This holistic model comprises the scientific and engineering steps that are involved in accelerating Big Data solutions: problem, diagnosis, facts, analysis, hypothesis, solution, prototype and implementation. In his session at Big Data Expo®, Tony Shan focused on the concept, importance, and considerations for each of t...
The 3rd International Internet of @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that its Call for Papers is now open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.