
rxDTree(): a new type of tree algorithm for big data

by Joseph Rickert

The rxDTree() function included in the RevoScaleR package distributed with Revolution R Enterprise is an example of a new class of algorithms being developed to deal with very large data sets. Although the particulars differ, what these algorithms have in common is the use of approximations, methods of summarizing or compressing data, and built-in parallelism. I think it is really interesting to see something as basic to modern statistics as a tree algorithm rejuvenated this way.

In the nearly thirty years since Breiman et al. introduced classification and regression trees, they have become part of the foundation for modern nonparametric statistics, machine learning and data mining. The basic implementations of these algorithms in R's rpart() function (recursive partitioning and regression trees) and elsewhere have proved adequate for many large-scale, industrial-strength data analysis problems. Nevertheless, today's very large data sets ("Big Data") present significant challenges for decision trees, in part because of the need to sort all the numerical attributes used in a model in order to determine the split points.

One approach to dealing with this issue is to avoid sorting the raw data altogether by working with an approximation of the data. In a 2010 paper, Ben-Haim and Yom-Tov introduced a novel algorithm along these lines that uses histograms to build trees. The algorithm, explicitly designed for parallel computing, implements horizontal parallelism: each processing node sees all of the variables for a subset (chunk) of the data. These "compute" nodes build histograms of the data, and the master node integrates the histograms and builds the tree. The details of the algorithm, its behavior and its performance characteristics are described in a second, longer paper by the same authors.

One potential downside of the approach is that because the algorithm only examines a limited number of split points (the boundaries of the histogram bins), for a given data set it may produce a tree that differs from what rpart() would build. In practice, though, this is not as bad as it sounds. Increasing the number of bins improves the accuracy of the algorithm, and Ben-Haim and Yom-Tov provide both an analytical argument and empirical results showing that the error rate of trees built with their algorithm approaches the error rate of the standard tree algorithm.

rxDTree() is an implementation of the Ben-Haim and Yom-Tov algorithm designed for working with very large data sets in a distributed computing environment. Most of the parameters controlling the behavior of rxDTree() are similar to those of rpart(). However, rxDTree() provides an additional parameter, maxNumBins, which specifies the maximum number of bins to use in building the histograms and hence controls the accuracy of the algorithm. For small data sets where you can test it out, specifying a large number of bins will enable rxDTree() to produce exactly the same results as rpart(); a short sketch of this follows below.

Because of the computational overhead of the histogram-building mechanism, you might expect rxDTree() to be rather slow with small data. However, we have found that rxDTree() performs well relative to rpart() even for relatively small data sets. After the sketch, a benchmark script gives some idea of the performance that can be expected on a reasonably complex data set in which all 59 explanatory variables are numeric.
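To make the maxNumBins point concrete, here is a minimal sketch (assuming a Revolution R Enterprise session where the RevoScaleR package is available) that fits the same small classification tree with rpart() and with rxDTree(), using a bin count far larger than the number of rows so that the histogram boundaries effectively reproduce the exact split points:

# Minimal sketch: rpart() versus rxDTree() on the 81-row kyphosis data set.
# Assumes RevoScaleR is available (it ships with Revolution R Enterprise).
library(rpart)        # provides rpart() and the kyphosis data
library(RevoScaleR)   # provides rxDTree()

# Standard recursive-partitioning tree
rpartFit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Histogram-based tree; with far more bins than rows, the candidate split
# points include essentially every observed value, so the splits chosen
# should match those found by exact sorting
rxFit <- rxDTree(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 maxNumBins = 1000)

rpartFit
rxFit

The bin count of 1000 is an arbitrary illustrative choice; anything comfortably larger than the number of rows serves the same purpose.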
The script reads in the segmentationData set from the caret package, replicates the data to produce a file containing 2,021,019 rows, specifies a model and then runs it using both rpart() and rxDTree().

########## BENCHMARKING rxDTree ON A CLUSTER #############
#
# This script was created to show some simple benchmarks for the RevoScaleR
# rxDTree function for building classification and regression trees on large data sets.
# The benchmarks were run on a 5-node HPC cluster of Intel servers (16 GB of RAM per node).
# The script does the following:
#   1. Fetch the 2,019 row by 61 column segmentationData set from the caret package
#   2. Set up a compute context to run the code on a Microsoft HPC Cluster
#   3. Replicate the segmentationData to create a file with 2,021,019 rows
#   4. Set up the formula and other parameters for the model
#   5. Run rxDTree to build a classification model
#-------------------------------------------------------------------------------------------------
# Get segmentationData from the caret package
library(caret)
data(segmentationData)   # dim: [1] 2019 61
rxOptions(reportProgress = 0)

# Set up the compute context for the HPC cluster
grxTestComputeContext <- RxHpcServer(
    dataPath = c("C:/data"),
    headNode = "cluster-head2")
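The excerpt above is cut off just after the compute context is created. Purely as a hedged sketch, not the original script, the remaining steps (replicating the data, specifying the model and fitting it with both functions) might look something like the following, assuming the replicated rows have been written to an XDF file and also kept as an in-memory data frame; the names segBig.xdf, bigSegXdf and segBigDF, and the tuning values, are hypothetical placeholders.

# Hypothetical continuation of steps 3-5, for illustration only.
# Assumes the 2,021,019 replicated rows already exist both as the XDF file
# "C:/data/segBig.xdf" and as the in-memory data frame segBigDF.
library(rpart)
bigSegXdf <- RxXdfData("C:/data/segBig.xdf")

# 4. Build the model formula: Class modeled on the numeric predictors
#    (assuming the usual Cell, Case and Class columns of segmentationData)
xVars <- setdiff(names(segmentationData), c("Cell", "Case", "Class"))
form  <- as.formula(paste("Class ~", paste(xVars, collapse = " + ")))

# 5a. Time rpart() on the in-memory copy of the data
rpartTime <- system.time(
    rpartFit <- rpart(form, data = segBigDF)
)

# 5b. Time rxDTree() on the XDF file, distributed across the cluster
#     (maxDepth, cp and maxNumBins are placeholder settings)
rxSetComputeContext(grxTestComputeContext)
rxTreeTime <- system.time(
    rxTreeFit <- rxDTree(form, data = bigSegXdf,
                         maxDepth = 6, cp = 0.01, maxNumBins = 200)
)

rpartTime
rxTreeTime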

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.

David Smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for Revolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on Twitter at @RevoDavid.
