Welcome!

@BigDataExpo Authors: Pat Romanski, Carmen Gonzalez, Elizabeth White, AppNeta Blog, Gerardo A Dada

Blog Feed Post

A Thumbnail History of Ensemble Methods

By Mike Bowles Ensemble methods are the backbone of machine learning techniques. However, it can be a daunting subject for someone approaching it for the first time, so we asked Mike Bowles, machine learning expert and serial entrepreneur to provide some context. Ensemble Methods are among the most powerful and easiest to use of predictive analytics algorithms and R programming language has an outstanding collection that includes the best performers – Random Forest, Gradient Boosting and Bagging as well as big data versions that are available through Revolution Analytics.  The phrase “Ensemble Methods” generally refers to building a large number of somewhat independent predictive models and then combining them by voting or averaging to yield very high performance.  Ensemble methods have been called crowd sourcing for machines.  Bagging, Boosting and Random Forest all have the objective of improving performance beyond what’s achievable with a binary decision tree, but the algorithms take different approaches to improving performance.  Bagging and Random Forests were developed to overcome variance and stability issues with binary decision trees.  The term “Bagging” was coined by the late Professor Leo Breiman of Berkeley.  Professor Breiman was instrumental in the development of decision trees for statistical learning and recognized that training and averaging a multitude of trees on different random subsets of data would reduce variance and improve stability.  The term comes from a shortening of “Bootstrap Aggregating” and the relation to bootstrap sampling is obvious.  Tin Kam Ho of Bell Labs developed Random Decision Forests as an example of a random subspace method.  The idea with Random Decision Forests was to train binary decision trees on random subsets of attributes (random subsets of columns of the training data).  Breiman and Cutler’s Random Forests method combined random subsampling of rows (Bagging) with random subsampling of columns.  The randomForest package in R was written by Professor Breiman and Adele Cutler.  Boosting methods grew out of work on computational learning theory.  The first algorithm of this type was called AdaBoost by its authors Freund and Shapire.  In the introduction to their paper they use the example of friends going to the race track regularly and betting on the horses.  One of the friends decides to devise a method of betting a fraction of his money with each of his friends and adjusting the fractions based on results so that his performance over time approaches the performance of his most winning friend.  The goal with Boosting is maximum predictive performance.  AdaBoost stood for a long time as the best example of a black box algorithm.  A practitioner could apply it without much parameter tweaking and it would yield superior performer while almost never overfitting.  It was a little mysterious.  In some of Professor Breiman’s papers on Random Forests, he compares performance with AdaBoost.  Professor Jerome Friedman and his Stanford colleagues Professors Hastie and Tibshirani authored a paper in 2000 that attempted to understand why AdaBoost was so successful.  The paper caused a storm of controversy.  The comments on the paper were longer than the paper itself.  Most of the comments centered around whether boosting was just another way of reducing variance or was doing something different by focusing on error reduction.  Professor Friedman offered several arguments and examples to demonstrate that boosting is more than just another variance reduction technique, but commenters did not reach a consensus. The understanding that Professor Friedman and his colleagues developed from analyzing AdaBoost led him to formulate the boosting method more directly.  That led to a number of several valuable extensions and improvements beyond AdaBoost – the ability to handle regression and multiclass problems, other performance measures besides squared error etc.  These features (and new ones being developed) are all included in the excellent R package gbm by Greg Ridgeway. Today, ensemble methods form the backbone of many data science applications. Random Forests has become particularly popular with modelers competing in Kaggle competitions and according to google trends Random Forests has surpassed AdaBoost in popularity. In a future post we will explore several of these algorithms in R. References Breiman  Leo, Bagging Predictors, Technical Report No. 421, Sept 1994, Dept of Statistics University of California, Berkeley. Ho, T., Random Decision Forests, Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278-282, 1995 Breiman, Leo Random Forest – Random Features, Technical Report No. 567, Sept 1999, Dept of Statistics University of California, Berkeley. Freund, Yoav and Schapire , Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55. 1997. Friedman, Jerome, Hastie, Trevor, Tibshirani, Robert Additive Logistic Regression: A Statistical View of Boosting, Ann Stat, Vol 28, Number 2, (2000), 337-655  

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@BigDataExpo Stories
In this strange new world where more and more power is drawn from business technology, companies are effectively straddling two paths on the road to innovation and transformation into digital enterprises. The first path is the heritage trail – with “legacy” technology forming the background. Here, extant technologies are transformed by core IT teams to provide more API-driven approaches. Legacy systems can restrict companies that are transitioning into digital enterprises. To truly become a lead...
The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
"MathFreeOn.com is a line coding platform for engineers and scientists. When they want to solve an engineering problem and they have to use software - they have to pay a lot of money for licenses - but with MathFreeOn you don't have to pay a lot of money. Just go to our site and write the code and you can check the result right away," explained Simon Lee, CMO of MathFreeOn, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Cla...
You have great SaaS business app ideas. You want to turn your idea quickly into a functional and engaging proof of concept. You need to be able to modify it to meet customers' needs, and you need to deliver a complete and secure SaaS application. How could you achieve all the above and yet avoid unforeseen IT requirements that add unnecessary cost and complexity? You also want your app to be responsive in any device at any time. In his session at 19th Cloud Expo, Mark Allen, General Manager of...
Bert Loomis was a visionary. This general session will highlight how Bert Loomis and people like him inspire us to build great things with small inventions. In their general session at 19th Cloud Expo, Harold Hannon, Architect at IBM Bluemix, and Michael O'Neill, Strategic Business Development at Nvidia, discussed the accelerating pace of AI development and how IBM Cloud and NVIDIA are partnering to bring AI capabilities to "every day," on-demand. They also reviewed two "free infrastructure" pr...
As data explodes in quantity, importance and from new sources, the need for managing and protecting data residing across physical, virtual, and cloud environments grow with it. Managing data includes protecting it, indexing and classifying it for true, long-term management, compliance and E-Discovery. Commvault can ensure this with a single pane of glass solution – whether in a private cloud, a Service Provider delivered public cloud or a hybrid cloud environment – across the heterogeneous enter...
In his session at Cloud Expo, Robert Cohen, an economist and senior fellow at the Economic Strategy Institute, provideed economic scenarios that describe how the rapid adoption of software-defined everything including cloud services, SDDC and open networking will change GDP, industry growth, productivity and jobs. This session also included a drill down for several industries such as finance, social media, cloud service providers and pharmaceuticals.
"Dice has been around for the last 20 years. We have been helping tech professionals find new jobs and career opportunities," explained Manish Dixit, VP of Product and Engineering at Dice, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
Extracting business value from Internet of Things (IoT) data doesn’t happen overnight. There are several requirements that must be satisfied, including IoT device enablement, data analysis, real-time detection of complex events and automated orchestration of actions. Unfortunately, too many companies fall short in achieving their business goals by implementing incomplete solutions or not focusing on tangible use cases. In his general session at @ThingsExpo, Dave McCarthy, Director of Products...
Infrastructure is widely available, but who’s managing inbound/outbound traffic? Data is created, stored, and managed online – who is protecting it and how? In his session at 19th Cloud Expo, Jaeson Yoo, SVP of Business Development at Penta Security Systems Inc., discussed how to keep any and all infrastructure clean, safe, and efficient by monitoring and filtering all malicious HTTP/HTTPS traffic at the OSI Layer 7. Stop attacks and web intruders before they can enter your network.
The many IoT deployments around the world are busy integrating smart devices and sensors into their enterprise IT infrastructures. Yet all of this technology – and there are an amazing number of choices – is of no use without the software to gather, communicate, and analyze the new data flows. Without software, there is no IT. In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, Dave McCarthy, Director of Products at Bsquare Corporation; Alan Williamson, Principal...
Businesses and business units of all sizes can benefit from cloud computing, but many don't want the cost, performance and security concerns of public cloud nor the complexity of building their own private clouds. Today, some cloud vendors are using artificial intelligence (AI) to simplify cloud deployment and management. In his session at 20th Cloud Expo, Ajay Gulati, Co-founder and CEO of ZeroStack, will discuss how AI can simplify cloud operations. He will cover the following topics: why clou...
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life sett...
"We analyze the video streaming experience. We are gathering the user behavior in real time from the user devices and we analyze how users experience the video streaming," explained Eric Kim, Founder and CEO at Streamlyzer, in this SYS-CON.tv interview at 19th Cloud Expo, held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
DevOps is being widely accepted (if not fully adopted) as essential in enterprise IT. But as Enterprise DevOps gains maturity, expands scope, and increases velocity, the need for data-driven decisions across teams becomes more acute. DevOps teams in any modern business must wrangle the ‘digital exhaust’ from the delivery toolchain, "pervasive" and "cognitive" computing, APIs and services, mobile devices and applications, the Internet of Things, and now even blockchain. In this power panel at @...
SYS-CON Events announced today that Fusion, a leading provider of cloud services, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Fusion, a leading provider of integrated cloud solutions to small, medium and large businesses, is the industry’s single source for the cloud. Fusion’s advanced, proprietary cloud service platform enables the integration of leading edge solutions in the cloud, including clou...
The Internet of Things will challenge the status quo of how IT and development organizations operate. Or will it? Certainly the fog layer of IoT requires special insights about data ontology, security and transactional integrity. But the developmental challenges are the same: People, Process and Platform and how we integrate our thinking to solve complicated problems. In his session at 19th Cloud Expo, Craig Sproule, CEO of Metavine, demonstrated how to move beyond today's coding paradigm and sh...
SYS-CON Events announced today that Dataloop.IO, an innovator in cloud IT-monitoring whose products help organizations save time and money, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Dataloop.IO is an emerging software company on the cutting edge of major IT-infrastructure trends including cloud computing and microservices. The company, founded in the UK but now based in San Fran...
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place June 6-8, 2017, at the Javits Center in New York City, New York, is co-located with 20th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry p...