Welcome!

@BigDataExpo Authors: Liz McMillan, Rajesh Ramchandani, Kevin Benedict, Kevin Jackson, Elizabeth White

Blog Feed Post

A Thumbnail History of Ensemble Methods

By Mike Bowles Ensemble methods are the backbone of machine learning techniques. However, it can be a daunting subject for someone approaching it for the first time, so we asked Mike Bowles, machine learning expert and serial entrepreneur to provide some context. Ensemble Methods are among the most powerful and easiest to use of predictive analytics algorithms and R programming language has an outstanding collection that includes the best performers – Random Forest, Gradient Boosting and Bagging as well as big data versions that are available through Revolution Analytics.  The phrase “Ensemble Methods” generally refers to building a large number of somewhat independent predictive models and then combining them by voting or averaging to yield very high performance.  Ensemble methods have been called crowd sourcing for machines.  Bagging, Boosting and Random Forest all have the objective of improving performance beyond what’s achievable with a binary decision tree, but the algorithms take different approaches to improving performance.  Bagging and Random Forests were developed to overcome variance and stability issues with binary decision trees.  The term “Bagging” was coined by the late Professor Leo Breiman of Berkeley.  Professor Breiman was instrumental in the development of decision trees for statistical learning and recognized that training and averaging a multitude of trees on different random subsets of data would reduce variance and improve stability.  The term comes from a shortening of “Bootstrap Aggregating” and the relation to bootstrap sampling is obvious.  Tin Kam Ho of Bell Labs developed Random Decision Forests as an example of a random subspace method.  The idea with Random Decision Forests was to train binary decision trees on random subsets of attributes (random subsets of columns of the training data).  Breiman and Cutler’s Random Forests method combined random subsampling of rows (Bagging) with random subsampling of columns.  The randomForest package in R was written by Professor Breiman and Adele Cutler.  Boosting methods grew out of work on computational learning theory.  The first algorithm of this type was called AdaBoost by its authors Freund and Shapire.  In the introduction to their paper they use the example of friends going to the race track regularly and betting on the horses.  One of the friends decides to devise a method of betting a fraction of his money with each of his friends and adjusting the fractions based on results so that his performance over time approaches the performance of his most winning friend.  The goal with Boosting is maximum predictive performance.  AdaBoost stood for a long time as the best example of a black box algorithm.  A practitioner could apply it without much parameter tweaking and it would yield superior performer while almost never overfitting.  It was a little mysterious.  In some of Professor Breiman’s papers on Random Forests, he compares performance with AdaBoost.  Professor Jerome Friedman and his Stanford colleagues Professors Hastie and Tibshirani authored a paper in 2000 that attempted to understand why AdaBoost was so successful.  The paper caused a storm of controversy.  The comments on the paper were longer than the paper itself.  Most of the comments centered around whether boosting was just another way of reducing variance or was doing something different by focusing on error reduction.  Professor Friedman offered several arguments and examples to demonstrate that boosting is more than just another variance reduction technique, but commenters did not reach a consensus. The understanding that Professor Friedman and his colleagues developed from analyzing AdaBoost led him to formulate the boosting method more directly.  That led to a number of several valuable extensions and improvements beyond AdaBoost – the ability to handle regression and multiclass problems, other performance measures besides squared error etc.  These features (and new ones being developed) are all included in the excellent R package gbm by Greg Ridgeway. Today, ensemble methods form the backbone of many data science applications. Random Forests has become particularly popular with modelers competing in Kaggle competitions and according to google trends Random Forests has surpassed AdaBoost in popularity. In a future post we will explore several of these algorithms in R. References Breiman  Leo, Bagging Predictors, Technical Report No. 421, Sept 1994, Dept of Statistics University of California, Berkeley. Ho, T., Random Decision Forests, Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278-282, 1995 Breiman, Leo Random Forest – Random Features, Technical Report No. 567, Sept 1999, Dept of Statistics University of California, Berkeley. Freund, Yoav and Schapire , Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55. 1997. Friedman, Jerome, Hastie, Trevor, Tibshirani, Robert Additive Logistic Regression: A Statistical View of Boosting, Ann Stat, Vol 28, Number 2, (2000), 337-655  

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@BigDataExpo Stories
IoT is at the core or many Digital Transformation initiatives with the goal of re-inventing a company's business model. We all agree that collecting relevant IoT data will result in massive amounts of data needing to be stored. However, with the rapid development of IoT devices and ongoing business model transformation, we are not able to predict the volume and growth of IoT data. And with the lack of IoT history, traditional methods of IT and infrastructure planning based on the past do not app...
WebRTC is bringing significant change to the communications landscape that will bridge the worlds of web and telephony, making the Internet the new standard for communications. Cloud9 took the road less traveled and used WebRTC to create a downloadable enterprise-grade communications platform that is changing the communication dynamic in the financial sector. In his session at @ThingsExpo, Leo Papadopoulos, CTO of Cloud9, discussed the importance of WebRTC and how it enables companies to focus o...
"We're bringing out a new application monitoring system to the DevOps space. It manages large enterprise applications that are distributed throughout a node in many enterprises and we manage them as one collective," explained Kevin Barnes, President of eCube Systems, in this SYS-CON.tv interview at DevOps at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
The Internet of Things can drive efficiency for airlines and airports. In their session at @ThingsExpo, Shyam Varan Nath, Principal Architect with GE, and Sudip Majumder, senior director of development at Oracle, discussed the technical details of the connected airline baggage and related social media solutions. These IoT applications will enhance travelers' journey experience and drive efficiency for the airlines and the airports.
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place June 6-8, 2017, at the Javits Center in New York City, New York, is co-located with 20th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry p...
A look across the tech landscape at the disruptive technologies that are increasing in prominence and speculate as to which will be most impactful for communications – namely, AI and Cloud Computing. In his session at 20th Cloud Expo, Curtis Peterson, VP of Operations at RingCentral, will highlight the current challenges of these transformative technologies and share strategies for preparing your organization for these changes. This “view from the top” will outline the latest trends and developm...
Things are changing so quickly in IoT that it would take a wizard to predict which ecosystem will gain the most traction. In order for IoT to reach its potential, smart devices must be able to work together. Today, there are a slew of interoperability standards being promoted by big names to make this happen: HomeKit, Brillo and Alljoyn. In his session at @ThingsExpo, Adam Justice, vice president and general manager of Grid Connect, will review what happens when smart devices don’t work togethe...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
"There's a growing demand from users for things to be faster. When you think about all the transactions or interactions users will have with your product and everything that is between those transactions and interactions - what drives us at Catchpoint Systems is the idea to measure that and to analyze it," explained Leo Vasiliou, Director of Web Performance Engineering at Catchpoint Systems, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York Ci...
SYS-CON Events announced today that Dataloop.IO, an innovator in cloud IT-monitoring whose products help organizations save time and money, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Dataloop.IO is an emerging software company on the cutting edge of major IT-infrastructure trends including cloud computing and microservices. The company, founded in the UK but now based in San Fran...
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
Discover top technologies and tools all under one roof at April 24–28, 2017, at the Westin San Diego in San Diego, CA. Explore the Mobile Dev + Test and IoT Dev + Test Expo and enjoy all of these unique opportunities: The latest solutions, technologies, and tools in mobile or IoT software development and testing. Meet one-on-one with representatives from some of today's most innovative organizations
SYS-CON Events announced today that Super Micro Computer, Inc., a global leader in Embedded and IoT solutions, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 7-9, 2017, at the Javits Center in New York City, NY. Supermicro (NASDAQ: SMCI), the leading innovator in high-performance, high-efficiency server technology, is a premier provider of advanced server Building Block Solutions® for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, HPC and E...
Internet of @ThingsExpo, taking place June 6-8, 2017 at the Javits Center in New York City, New York, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. @ThingsExpo New York Call for Papers is now open.
DevOps is being widely accepted (if not fully adopted) as essential in enterprise IT. But as Enterprise DevOps gains maturity, expands scope, and increases velocity, the need for data-driven decisions across teams becomes more acute. DevOps teams in any modern business must wrangle the ‘digital exhaust’ from the delivery toolchain, "pervasive" and "cognitive" computing, APIs and services, mobile devices and applications, the Internet of Things, and now even blockchain. In this power panel at @...
"This week we're really focusing on scalability, asset preservation and how do you back up to the cloud and in the cloud with object storage, which is really a new way of attacking dealing with your file, your blocked data, where you put it and how you access it," stated Jeff Greenwald, Senior Director of Market Development at HGST, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
SYS-CON Events announced today that Catchpoint Systems, Inc., a provider of innovative web and infrastructure monitoring solutions, has been named “Silver Sponsor” of SYS-CON's DevOps Summit at 18th Cloud Expo New York, which will take place June 7-9, 2016, at the Javits Center in New York City, NY. Catchpoint is a leading Digital Performance Analytics company that provides unparalleled insight into customer-critical services to help consistently deliver an amazing customer experience. Designed ...
"A lot of times people will come to us and have a very diverse set of requirements or very customized need and we'll help them to implement it in a fashion that you can't just buy off of the shelf," explained Nick Rose, CTO of Enzu, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
In his keynote at @ThingsExpo, Chris Matthieu, Director of IoT Engineering at Citrix and co-founder and CTO of Octoblu, focused on building an IoT platform and company. He provided a behind-the-scenes look at Octoblu’s platform, business, and pivots along the way (including the Citrix acquisition of Octoblu).
SYS-CON Events announced today that Enzu will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Enzu’s mission is to be the leading provider of enterprise cloud solutions worldwide. Enzu enables online businesses to use its IT infrastructure to their competitive ad...