A Thumbnail History of Ensemble Methods

By Mike Bowles

Ensemble methods are the backbone of machine learning techniques. However, they can be a daunting subject for someone approaching them for the first time, so we asked Mike Bowles, machine learning expert and serial entrepreneur, to provide some context.

Ensemble methods are among the most powerful and easiest to use of predictive analytics algorithms, and the R programming language has an outstanding collection that includes the best performers (Random Forest, Gradient Boosting and Bagging) as well as big data versions that are available through Revolution Analytics.

The phrase “Ensemble Methods” generally refers to building a large number of somewhat independent predictive models and then combining them by voting or averaging to yield very high performance. Ensemble methods have been called crowd sourcing for machines.

Bagging, Boosting and Random Forest all have the objective of improving performance beyond what’s achievable with a single binary decision tree, but the algorithms take different approaches to improving performance.

Bagging and Random Forests were developed to overcome variance and stability issues with binary decision trees. The term “Bagging” was coined by the late Professor Leo Breiman of Berkeley. Professor Breiman was instrumental in the development of decision trees for statistical learning and recognized that training a multitude of trees on different random subsets of the data, and then averaging them, would reduce variance and improve stability. The term is a shortening of “Bootstrap Aggregating”: each tree is trained on a bootstrap sample, that is, rows drawn at random with replacement from the training data.

Tin Kam Ho of Bell Labs developed Random Decision Forests as an example of a random subspace method. The idea with Random Decision Forests was to train binary decision trees on random subsets of attributes (random subsets of columns of the training data). Breiman and Cutler’s Random Forests method combined random subsampling of rows (Bagging) with random subsampling of columns.
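To make the row and column subsampling concrete, here is a minimal sketch in plain Python (chosen only for illustration; it is not the code of any R package mentioned in this post). It assumes a toy two-class problem and uses a one-split "stump" as a stand-in for a full binary decision tree. All function names are hypothetical. One simplification to note: this sketch draws one column subset per model, in the spirit of the random subspace method, whereas Breiman's Random Forests draw candidate columns afresh at each split.

```python
import random
from collections import Counter


def bootstrap_rows(n_rows, rng):
    """Bagging step: draw n_rows row indices at random WITH replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_columns(n_cols, k, rng):
    """Random subspace step: pick k distinct column indices."""
    return sorted(rng.sample(range(n_cols), k))


def fit_stump(X, y, rows, cols):
    """Stand-in for a decision tree (assumption: labels are 0 or 1).

    Searches the allowed columns for the single threshold split that
    makes the fewest errors on the sampled rows.
    """
    best = None
    for c in cols:
        for t in sorted({X[r][c] for r in rows}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum(
                    (left if X[r][c] <= t else right) != y[r] for r in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, c, t, left, right)
    return best[1:]  # (column, threshold, left_label, right_label)


def predict_stump(stump, x):
    c, t, left, right = stump
    return left if x[c] <= t else right


def fit_ensemble(X, y, n_models=25, n_sub_cols=1, seed=0):
    """Train each model on its own bootstrap sample and column subset."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    models = []
    for _ in range(n_models):
        rows = bootstrap_rows(n_rows, rng)
        cols = random_columns(n_cols, n_sub_cols, rng)
        models.append(fit_stump(X, y, rows, cols))
    return models


def predict_ensemble(models, x):
    """Combine the individual models by majority vote."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data (made up for this sketch): small values are class 0, large are class 1.
X = [[0, 0], [1, 1], [2, 2], [3, 3], [7, 7], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_ensemble(X, y)
print(predict_ensemble(models, [0, 0]), predict_ensemble(models, [10, 10]))
```

Any single stump depends heavily on which rows it happened to see, but the vote across 25 of them is much more stable, which is the variance reduction the paragraph above describes.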
The randomForest package in R is a port, by Andy Liaw and Matthew Wiener, of the original Fortran code written by Professor Breiman and Adele Cutler.

Boosting methods grew out of work on computational learning theory. The first algorithm of this type was called AdaBoost by its authors, Freund and Schapire. In the introduction to their paper they use the example of friends going to the race track regularly and betting on the horses. One of the friends decides to devise a method of betting a fraction of his money with each of his friends and adjusting the fractions based on results so that his performance over time approaches the performance of his most winning friend.

The goal with Boosting is maximum predictive performance. AdaBoost stood for a long time as the best example of a black box algorithm. A practitioner could apply it without much parameter tweaking and it would yield superior performance while almost never overfitting. It was a little mysterious. In some of Professor Breiman’s papers on Random Forests, he compares performance with AdaBoost.

Professor Jerome Friedman and his Stanford colleagues Professors Hastie and Tibshirani authored a paper in 2000 that attempted to understand why AdaBoost was so successful. The paper caused a storm of controversy. The comments on the paper were longer than the paper itself. Most of the comments centered on whether boosting was just another way of reducing variance or was doing something different by focusing on error reduction. Professor Friedman offered several arguments and examples to demonstrate that boosting is more than just another variance reduction technique, but commenters did not reach a consensus.

The understanding that Professor Friedman and his colleagues developed from analyzing AdaBoost led him to formulate the boosting method more directly. That led to several valuable extensions and improvements beyond AdaBoost: the ability to handle regression and multiclass problems, performance measures other than squared error, and so on.
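The race-track betting scheme from Freund and Schapire's introduction, described earlier, can be sketched as a multiplicative weight update. The sketch below is in plain Python and follows the general shape of the Hedge procedure from their paper: each friend's weight is multiplied by beta raised to that friend's loss for the round. The function name, the choice of beta = 0.5, and the made-up loss data are assumptions for illustration only.

```python
def hedge(losses_per_round, beta=0.5):
    """Allocate money across advisors and reweight them after each round.

    losses_per_round[t][i] is the loss (between 0 and 1) suffered in
    round t by following friend i. Returns the final betting fractions
    and the bettor's accumulated loss.
    """
    n = len(losses_per_round[0])
    weights = [1.0] * n  # start by trusting every friend equally
    total_loss = 0.0
    for losses in losses_per_round:
        s = sum(weights)
        fractions = [w / s for w in weights]  # share of money per friend
        # The bettor's loss is the fraction-weighted average of the friends' losses.
        total_loss += sum(p * l for p, l in zip(fractions, losses))
        # Shrink the weight of each friend in proportion to how badly they did.
        weights = [w * beta ** l for w, l in zip(weights, losses)]
    s = sum(weights)
    return [w / s for w in weights], total_loss


# Made-up example: friend 0 never loses, friend 1 loses every other
# round, friend 2 loses every round.
rounds = [[0.0, float(t % 2), 1.0] for t in range(20)]
fractions, total_loss = hedge(rounds)
print([round(f, 4) for f in fractions], round(total_loss, 3))
```

After a handful of rounds nearly all of the money follows the most successful friend, so the bettor's cumulative loss stays close to that friend's. AdaBoost applies the same kind of reweighting to training examples rather than to advisors.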
The boosting extensions that followed from Friedman’s work (and new ones being developed) are all included in the excellent R package gbm by Greg Ridgeway.

Today, ensemble methods form the backbone of many data science applications. Random Forests has become particularly popular with modelers competing in Kaggle competitions, and according to Google Trends, Random Forests has surpassed AdaBoost in popularity.

In a future post we will explore several of these algorithms in R.

References

Breiman, Leo. Bagging Predictors. Technical Report No. 421, Sept 1994, Dept of Statistics, University of California, Berkeley.
Ho, T. Random Decision Forests. Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278-282, 1995.
Breiman, Leo. Random Forests - Random Features. Technical Report No. 567, Sept 1999, Dept of Statistics, University of California, Berkeley.
Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 1997.
Friedman, Jerome, Hastie, Trevor and Tibshirani, Robert. Additive Logistic Regression: A Statistical View of Boosting. Ann Stat, Vol 28, Number 2 (2000), 337-407.


More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire), overseeing the product management of S-PLUS and other statistical and data mining products.

David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on Twitter at @RevoDavid.
