Click here to close now.

Welcome!

Big Data Journal Authors: Carmen Gonzalez, Elizabeth White, Ian Khan, Liz McMillan, William Schmarzo

Blog Feed Post

Data Sets for Data Science

by Joseph Rickert Recently, I had the opportunity to be a member of a job panel for Mathematics, Economics and Statistics students at my alma mater, CSUEB (California State University East Bay). In the context of preparing for a career in data science a student at the event asked: “Where can I find good data sets?”. This triggered a number of thoughts: the first being that it was time to update the list of data sets that I maintain and blog about from time to time. So, thanks to that reminder I have added a few new links to the page, including a new section called Data Science Practice that links to some of the data sets used as examples in Doing Data Science by Rachel Schutt and Cathy O’Neil.  Additionally, I have provided a direct link to the BigData Tag on infochimps and pointed out that multiple song data sets are available. However, to do justice to student’s question it is necessary to give some thought to exactly what a “good” practice data set might look like. Here are three characteristics that I think a practice data set should have to be good: It should be big enough to pose some computational challenges without being so big that it requires a cluster or some specialized hardware just to get started. It should require some cleaning or pre-processing (making decisions about mission data for example) but not appear to be hopelessly corrupt; dirty but not to dirty. It should be rich enough that once you have gone through the trouble of accessing and cleaning it there are enough variables or features to suggest multiple questions to analyze, or make it possible to try out different machine learning algorithms. Here are three data sets that meet these criteria in ascending order of degree of difficulty: The first suggestion is the MovieLens data set which contains a million ratings applied to over 10,000 movies by more than 71,000 users. The download comes in two sizes, the full set, and a 100K subset. Both versions require working with multiple files. Near the top of anybody’s list of practice data sets, and second on my little list because of degree of difficulty is the airlines data set from the 2009 ASA challenge. This data set which contains the arrival and departure information for all domestic flights from 1987 to 2008 has become the “iris” data set for Big Data. With over 123M rows it is too big to it into your laptop’s memory and with 29 variables of different types it is rich enough to suggest several analyses. Moreover, although the version of the data set maintained on the ASA website is fixed and therefore perfect for benchmarking, the Research and Innovative Technology Administration Bureau of Transportation Statistics continues to add to the data on a monthly basis. Go to RITA to get all of the data collected since the ASA competition ended.    Last on my short list is the Million Song data set. This contains features and meta data for one million songs which were originally provided by the music intelligence company Echo Nest. The data is in the specialized HDF5 format which makes it somewhat of a challenge to access. The data set maintainers do provide wrapper functions to facilitate downloading the data and avoiding some of the complexities of the HDF5 format. However, there are no R wrappers! The last I checked, the maintainers had a paragraph about there being a problem with their code along with an invitation for R experts to contact them (This would clearly be for extra points.) For more details about the contents of the data set look here. As a final note, it is much easier use R to analyze the Public Data Sets available through Amazon Web Services now that you can run Revolution R Enterprise in the Amazon Cloud. We hope to have more to say about exactly how to go about doing this in a future post. However, everything you need to get started is in place including a 14 day free trial (Amazon charges apply) for Revolution R Enterprise. All you need is your own Amazon account. Please let me know if you have additional links to useful, publically available data sets that I have missed. We very much appreciate the contributions blog readers have made to the list of data sets.  

Read the original blog entry...

More Stories By David Smith

David Smith is Vice President of Marketing and Community at Revolution Analytics. He has a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He continued his association with S-PLUS at Insightful (now TIBCO Spotfire) overseeing the product management of S-PLUS and other statistical and data mining products.<

David smith is the co-author (with Bill Venables) of the popular tutorial manual, An Introduction to R, and one of the originating developers of the ESS: Emacs Speaks Statistics project. Today, he leads marketing for REvolution R, supports R communities worldwide, and is responsible for the Revolutions blog. Prior to joining Revolution Analytics, he served as vice president of product management at Zynchros, Inc. Follow him on twitter at @RevoDavid

@BigDataExpo Stories
There is no doubt that Big Data is here and getting bigger every day. Building a Big Data infrastructure today is no easy task. There are an enormous number of choices for database engines and technologies. To make things even more challenging, requirements are getting more sophisticated, and the standard paradigm of supporting historical analytics queries is often just one facet of what is needed. As Big Data growth continues, organizations are demanding real-time access to data, allowing immed...
SYS-CON Events announced today that DragonGlass, an enterprise search platform, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. After eleven years of designing and building custom applications, OpenCrowd has launched DragonGlass, a cloud-based platform that enables the development of search-based applications. These are a new breed of applications that utilize a search index as their backbone for data...
SYS-CON Events announced today that EnterpriseDB (EDB), the leading worldwide provider of enterprise-class Postgres products and database compatibility solutions, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. EDB is the largest provider of Postgres software and services that provides enterprise-class performance and scalability and the open source freedom to divert budget from more costly traditiona...
Gartner predicts that the bulk of new IT spending by 2016 will be for cloud platforms and applications and that nearly half of large enterprises will have cloud deployments by the end of 2017. The benefits of the cloud may be clear for applications that can tolerate brief periods of downtime, but for critical applications like SQL Server, Oracle and SAP, companies need a strategy for HA and DR protection. While traditional SAN-based clusters are not possible in these environments, SANless cluste...
Discussions about cloud computing are evolving into discussions about enterprise IT in general. As enterprises increasingly migrate toward their own unique clouds, new issues such as the use of containers and microservices emerge to keep things interesting. In this Power Panel at 16th Cloud Expo, moderated by Conference Chair Roger Strukhoff, panelists will address the state of cloud computing today, and what enterprise IT professionals need to know about how the latest topics and trends affec...
Software Defined Storage provides many benefits for customers including agility, flexibility, faster adoption of new technology and cost effectiveness. However, for IT organizations it can be challenging and complex to build your Enterprise Grade Storage from software. In his session at Cloud Expo, Paul Turner, CMO at Cloudian, looked at the new Original Design Manufacturer (ODM) market and how it is changing the storage world. Now Software Defined Storage companies can build Enterprise grade ...
To manage complex web services with lots of calls to the cloud, many businesses have invested in Application Performance Management (APM) and Network Performance Management (NPM) tools. Together APM and NPM tools are essential aids in improving a business's infrastructure required to support an effective web experience... but they are missing a critical component - Internet visibility. Internet connectivity has always played a role in customer access to web presence, but in the past few years u...
There's Big Data, then there's really Big Data from the Internet of Things. IoT is evolving to include many data possibilities like new types of event, log and network data. The volumes are enormous, generating tens of billions of logs per day, which raise data challenges. Early IoT deployments are relying heavily on both the cloud and managed service providers to navigate these challenges. In her session at Big Data Expo®, Hannah Smalltree, Director at Treasure Data, discussed how IoT, Big D...
SYS-CON Events announced today that MetraTech, now part of Ericsson, has been named “Silver Sponsor” of SYS-CON's 16th International Cloud Expo®, which will take place on June 9–11, 2015, at the Javits Center in New York, NY. Ericsson is the driving force behind the Networked Society- a world leader in communications infrastructure, software and services. Some 40% of the world’s mobile traffic runs through networks Ericsson has supplied, serving more than 2.5 billion subscribers.
In their general session at 16th Cloud Expo, Michael Piccininni, Global Account Manager – Cloud SP at EMC Corporation, and Mike Dietze, Regional Director at Windstream Hosted Solutions, will review next generation cloud services, including the Windstream-EMC Tier Storage solutions, and discuss how to increase efficiencies, improve service delivery and enhance corporate cloud solution development. Speaker Bios Michael Piccininni is Global Account Manager – Cloud SP at EMC Corporation. He has b...
There is little doubt that Big Data solutions will have an increasing role in the Enterprise IT mainstream over time. 8th International Big Data Expo, co-located with 17th International Cloud Expo - to be held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA - has announced its Call for Papers is open. As advanced data storage, access and analytics technologies aimed at handling high-volume and/or fast moving data all move center stage, aided by the cloud computing bo...
Cloud services are the newest tool in the arsenal of IT products in the market today. These cloud services integrate process and tools. In order to use these products effectively, organizations must have a good understanding of themselves and their business requirements. In his session at 15th Cloud Expo, Brian Lewis, Principal Architect at Verizon Cloud, outlined key areas of organizational focus, and how to formalize an actionable plan when migrating applications and internal services to the ...
"We have developers who are really passionate about getting their code out to customers, no matter what, in the shortest possible time. Operations are very focused on procedures and policies," explained Stan Klimoff, CTO of Qubell, in this SYS-CON.tv interview at DevOps Summit, held Nov 4–6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
Agility is top of mind for Cloud/Service providers and Enterprises alike. Policy Driven Data Center provides a policy model for application deployment by decoupling application needs from the underlying infrastructure primitives. In his session at 15th Cloud Expo, David Klebanov, a Technical Solutions Architect with Cisco Systems, discussed how it differentiates from the software-defined top-down control by offering a declarative approach to allow faster and simpler application deployment. Davi...
Disruptive macro trends in technology are impacting and dramatically changing the "art of the possible" relative to supply chain management practices through the innovative use of IoT, cloud, machine learning and Big Data to enable connected ecosystems of engagement. Enterprise informatics can now move beyond point solutions that merely monitor the past and implement integrated enterprise fabrics that enable end-to-end supply chain visibility to improve customer service delivery and optimize sup...
Over the last few years the healthcare ecosystem has revolved around innovations in Electronic Health Record (HER) based systems. This evolution has helped us achieve much desired interoperability. Now the focus is shifting to other equally important aspects - scalability and performance. While applying cloud computing environments to the EHR systems, a special consideration needs to be given to the cloud enablement of Veterans Health Information Systems and Technology Architecture (VistA), i.e....
Cloud is not a commodity. And no matter what you call it, computing doesn’t come out of the sky. It comes from physical hardware inside brick and mortar facilities connected by hundreds of miles of networking cable. And no two clouds are built the same way. SoftLayer gives you the highest performing cloud infrastructure available. One platform that takes data centers around the world that are full of the widest range of cloud computing options, and then integrates and automates everything. J...
With the arrival of the Big Data revolution, a data professional is expected to master a broad spectrum of complex domains including data processing, mathematics, programming languages, machine learning techniques, and business knowledge. While this mastery is undoubtedly important, this narrow focus on tool usage has divorced many from the imagination required to solve real-world problems. As the demand for analysis increases, the data science community must transform from tool experts to "data...
SYS-CON Media announced today that 9 out of 10 " most read" DevOps articles are published by @DevOpsSummit Blog. Launched in October 2014, @DevOpsSummit Blog offers top articles, news stories, and blog posts from the world's well-known experts and guarantees better exposure for its authors than any other publication. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dy...
15th Cloud Expo, which took place Nov. 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA, expanded the conference content of @ThingsExpo, Big Data Expo, and DevOps Summit to include two developer events. IBM held a Bluemix Developer Playground on November 5 and ElasticBox held a Hackathon on November 6. Both events took place on the expo floor. The Bluemix Developer Playground, for developers of all levels, highlighted the ease of use of Bluemix, its services and functionalit...