Welcome!

@DXWorldExpo Authors: Kevin Jackson, Xenia von Wedel, Ed Featherston, Liz McMillan, Pat Romanski

Blog Feed Post

Data Sets for Data Science

by Joseph Rickert Recently, I had the opportunity to be a member of a job panel for Mathematics, Economics and Statistics students at my alma mater, CSUEB (California State University East Bay). In the context of preparing for a career in data science a student at the event asked: “Where can I find good data sets?”. This triggered a number of thoughts: the first being that it was time to update the list of data sets that I maintain and blog about from time to time. So, thanks to that reminder I have added a few new links to the page, including a new section called Data Science Practice that links to some of the data sets used as examples in Doing Data Science by Rachel Schutt and Cathy O’Neil.  Additionally, I have provided a direct link to the BigData Tag on infochimps and pointed out that multiple song data sets are available. However, to do justice to student’s question it is necessary to give some thought to exactly what a “good” practice data set might look like. Here are three characteristics that I think a practice data set should have to be good: It should be big enough to pose some computational challenges without being so big that it requires a cluster or some specialized hardware just to get started. It should require some cleaning or pre-processing (making decisions about mission data for example) but not appear to be hopelessly corrupt; dirty but not to dirty. It should be rich enough that once you have gone through the trouble of accessing and cleaning it there are enough variables or features to suggest multiple questions to analyze, or make it possible to try out different machine learning algorithms. Here are three data sets that meet these criteria in ascending order of degree of difficulty: The first suggestion is the MovieLens data set which contains a million ratings applied to over 10,000 movies by more than 71,000 users. The download comes in two sizes, the full set, and a 100K subset. Both versions require working with multiple files. Near the top of anybody’s list of practice data sets, and second on my little list because of degree of difficulty is the airlines data set from the 2009 ASA challenge. This data set which contains the arrival and departure information for all domestic flights from 1987 to 2008 has become the “iris” data set for Big Data. With over 123M rows it is too big to it into your laptop’s memory and with 29 variables of different types it is rich enough to suggest several analyses. Moreover, although the version of the data set maintained on the ASA website is fixed and therefore perfect for benchmarking, the Research and Innovative Technology Administration Bureau of Transportation Statistics continues to add to the data on a monthly basis. Go to RITA to get all of the data collected since the ASA competition ended.    Last on my short list is the Million Song data set. This contains features and meta data for one million songs which were originally provided by the music intelligence company Echo Nest. The data is in the specialized HDF5 format which makes it somewhat of a challenge to access. The data set maintainers do provide wrapper functions to facilitate downloading the data and avoiding some of the complexities of the HDF5 format. However, there are no R wrappers! The last I checked, the maintainers had a paragraph about there being a problem with their code along with an invitation for R experts to contact them (This would clearly be for extra points.) For more details about the contents of the data set look here. As a final note, it is much easier use R to analyze the Public Data Sets available through Amazon Web Services now that you can run Revolution R Enterprise in the Amazon Cloud. We hope to have more to say about exactly how to go about doing this in a future post. However, everything you need to get started is in place including a 14 day free trial (Amazon charges apply) for Revolution R Enterprise. All you need is your own Amazon account. Please let me know if you have additional links to useful, publically available data sets that I have missed. We very much appreciate the contributions blog readers have made to the list of data sets.  

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@BigDataExpo Stories
Blockchain is a shared, secure record of exchange that establishes trust, accountability and transparency across business networks. Supported by the Linux Foundation's open source, open-standards based Hyperledger Project, Blockchain has the potential to improve regulatory compliance, reduce cost as well as advance trade. Are you curious about how Blockchain is built for business? In her session at 21st Cloud Expo, René Bostic, Technical VP of the IBM Cloud Unit in North America, discussed the b...
You know you need the cloud, but you’re hesitant to simply dump everything at Amazon since you know that not all workloads are suitable for cloud. You know that you want the kind of ease of use and scalability that you get with public cloud, but your applications are architected in a way that makes the public cloud a non-starter. You’re looking at private cloud solutions based on hyperconverged infrastructure, but you’re concerned with the limits inherent in those technologies.
A strange thing is happening along the way to the Internet of Things, namely far too many devices to work with and manage. It has become clear that we'll need much higher efficiency user experiences that can allow us to more easily and scalably work with the thousands of devices that will soon be in each of our lives. Enter the conversational interface revolution, combining bots we can literally talk with, gesture to, and even direct with our thoughts, with embedded artificial intelligence, whic...
The cloud era has reached the stage where it is no longer a question of whether a company should migrate, but when. Enterprises have embraced the outsourcing of where their various applications are stored and who manages them, saving significant investment along the way. Plus, the cloud has become a defining competitive edge. Companies that fail to successfully adapt risk failure. The media, of course, continues to extol the virtues of the cloud, including how easy it is to get there. Migrating...
Imagine if you will, a retail floor so densely packed with sensors that they can pick up the movements of insects scurrying across a store aisle. Or a component of a piece of factory equipment so well-instrumented that its digital twin provides resolution down to the micrometer.
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life settle...
Product connectivity goes hand and hand these days with increased use of personal data. New IoT devices are becoming more personalized than ever before. In his session at 22nd Cloud Expo | DXWorld Expo, Nicolas Fierro, CEO of MIMIR Blockchain Solutions, will discuss how in order to protect your data and privacy, IoT applications need to embrace Blockchain technology for a new level of product security never before seen - or needed.
Leading companies, from the Global Fortune 500 to the smallest companies, are adopting hybrid cloud as the path to business advantage. Hybrid cloud depends on cloud services and on-premises infrastructure working in unison. Successful implementations require new levels of data mobility, enabled by an automated and seamless flow across on-premises and cloud resources. In his general session at 21st Cloud Expo, Greg Tevis, an IBM Storage Software Technical Strategist and Customer Solution Architec...
Nordstrom is transforming the way that they do business and the cloud is the key to enabling speed and hyper personalized customer experiences. In his session at 21st Cloud Expo, Ken Schow, VP of Engineering at Nordstrom, discussed some of the key learnings and common pitfalls of large enterprises moving to the cloud. This includes strategies around choosing a cloud provider(s), architecture, and lessons learned. In addition, he covered some of the best practices for structured team migration an...
In his general session at 21st Cloud Expo, Greg Dumas, Calligo’s Vice President and G.M. of US operations, discussed the new Global Data Protection Regulation and how Calligo can help business stay compliant in digitally globalized world. Greg Dumas is Calligo's Vice President and G.M. of US operations. Calligo is an established service provider that provides an innovative platform for trusted cloud solutions. Calligo’s customers are typically most concerned about GDPR compliance, application p...
No hype cycles or predictions of a gazillion things here. IoT is here. You get it. You know your business and have great ideas for a business transformation strategy. What comes next? Time to make it happen. In his session at @ThingsExpo, Jay Mason, an Associate Partner of Analytics, IoT & Cybersecurity at M&S Consulting, presented a step-by-step plan to develop your technology implementation strategy. He also discussed the evaluation of communication standards and IoT messaging protocols, data...
Coca-Cola’s Google powered digital signage system lays the groundwork for a more valuable connection between Coke and its customers. Digital signs pair software with high-resolution displays so that a message can be changed instantly based on what the operator wants to communicate or sell. In their Day 3 Keynote at 21st Cloud Expo, Greg Chambers, Global Group Director, Digital Innovation, Coca-Cola, and Vidya Nagarajan, a Senior Product Manager at Google, discussed how from store operations and ...
"IBM is really all in on blockchain. We take a look at sort of the history of blockchain ledger technologies. It started out with bitcoin, Ethereum, and IBM evaluated these particular blockchain technologies and found they were anonymous and permissionless and that many companies were looking for permissioned blockchain," stated René Bostic, Technical VP of the IBM Cloud Unit in North America, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Conventi...
Smart cities have the potential to change our lives at so many levels for citizens: less pollution, reduced parking obstacles, better health, education and more energy savings. Real-time data streaming and the Internet of Things (IoT) possess the power to turn this vision into a reality. However, most organizations today are building their data infrastructure to focus solely on addressing immediate business needs vs. a platform capable of quickly adapting emerging technologies to address future ...
Sanjeev Sharma Joins June 5-7, 2018 @DevOpsSummit at @Cloud Expo New York Faculty. Sanjeev Sharma is an internationally known DevOps and Cloud Transformation thought leader, technology executive, and author. Sanjeev's industry experience includes tenures as CTO, Technical Sales leader, and Cloud Architect leader. As an IBM Distinguished Engineer, Sanjeev is recognized at the highest levels of IBM's core of technical leaders.
When it comes to cloud computing, the ability to turn massive amounts of compute cores on and off on demand sounds attractive to IT staff, who need to manage peaks and valleys in user activity. With cloud bursting, the majority of the data can stay on premises while tapping into compute from public cloud providers, reducing risk and minimizing need to move large files. In his session at 18th Cloud Expo, Scott Jeschonek, Director of Product Management at Avere Systems, discussed the IT and busine...
It’s conference season and, as you might expect, Jason and I have been on the road covering a bunch of them. It’s always great to see what the disruptive players in the market are doing — and this year did not disappoint. But there is one thing that repeatedly happens that just gets under my skin: transformation-washing. As Jason explained in a Forbes article over a year ago, ‘washing’ is when a vendor (or pundit) applies a buzzword loosely in an overt attempt to attach themselves to its buzz. ...
As many know, the first generation of Cloud Management Platform (CMP) solutions were designed for managing virtual infrastructure (IaaS) and traditional applications. But that's no longer enough to satisfy evolving and complex business requirements. In his session at 21st Cloud Expo, Scott Davis, Embotics CTO, explored how next-generation CMPs ensure organizations can manage cloud-native and microservice-based application architectures, while also facilitating agile DevOps methodology. He expla...
In a recent survey, Sumo Logic surveyed 1,500 customers who employ cloud services such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). According to the survey, a quarter of the respondents have already deployed Docker containers and nearly as many (23 percent) are employing the AWS Lambda serverless computing framework. It’s clear: serverless is here to stay. The adoption does come with some needed changes, within both application development and operations. Tha...
Widespread fragmentation is stalling the growth of the IIoT and making it difficult for partners to work together. The number of software platforms, apps, hardware and connectivity standards is creating paralysis among businesses that are afraid of being locked into a solution. EdgeX Foundry is unifying the community around a common IoT edge framework and an ecosystem of interoperable components.