Finding the Right Little Data

Even with the great strides technology has taken, data quality remains a tremendous challenge for genealogy researchers

Over the years, one of my favorite pastimes has been working on the family genealogy. I first started work on it in the early 1990s. At that time, records were not digitized, and research involved going to libraries, newspapers, and various local, state, and federal archives. There one would have to sift through reams of old records, documents, and microfiche. If you were lucky, someone had created printed indices of the information contained in those documents. These indices, when they existed, were based on manual transcription of the data and were prone to data quality problems, which are inherent in transcribing information from what were frequently old handwritten documents. The challenge was then to find the nuggets of information that related and connected to the individuals you were trying to locate in your research.

For anyone operating in the Big Data space of today, this may all sound very familiar. Genealogists, amateur and otherwise, were dealing with the challenges of Big Data, unstructured data, and data quality long before the terms became technology buzzwords. There are tools and products today to help; who hasn't seen the commercials for Ancestry.com? However, even with technology, there are still challenges. These challenges are not unique to genealogy, and they make a good lens for viewing and discussing the general business needs for Big Data. Let's take a closer look at some of them.

Data Quality
Even with the great strides technology has taken, data quality remains a tremendous challenge for genealogy researchers. Let's take U.S. Census data as an example. Every 10 years, the U.S. government conducts a census of the population, and the results of these censuses become public 72 years after they are taken (the 1940 Census material just recently became available). U.S. Census data is a gold mine for genealogy research.

Below is a sample from the 1930 Federal census. Census forms were filled out by individuals going door to door and asking the residents questions. However, there are a number of data quality factors you must take into consideration. The sample here has fairly good quality handwriting, although that's not always the case. Also, you are constrained by the census taker's interpretation of the person's answers and pronunciation. This can result, for example, in several different spellings of the same name.

When this document gets transcribed, you could still have multiple sources of problems with the data quality. The original census taker could have written it down incorrectly or the person transcribing it could have made a transcription error.
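One common way to cope with spelling variations like these is approximate (fuzzy) name matching. The sketch below is only an illustration and is not how any particular genealogy service works: it uses `SequenceMatcher` from Python's standard `difflib` module, and the function names, the sample surnames, and the 0.8 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and surrounding whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def candidate_matches(query: str, transcribed: list[str],
                      threshold: float = 0.8) -> list[tuple[str, float]]:
    """Return transcribed names at or above the threshold, most similar first.

    The threshold is an illustrative assumption; a real search would tune it.
    """
    scored = [(name, name_similarity(query, name)) for name in transcribed]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical index entries for a surname recorded several ways.
    index_entries = ["Schmitt", "Smith", "Schmid", "Jones"]
    for name, score in candidate_matches("Schmidt", index_entries):
        print(f"{name}: {score:.2f}")
```

A match found this way is still only a candidate. Whether "Schmid" and "Schmidt" are the same person has to be confirmed from the rest of the record.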

This challenge is not unique to genealogy research. Data quality has been an issue in IT systems since the first IT system. In the world of Big Data, with unstructured data (such as social media) and things like crowd-sourced data, it can become a daunting challenge. As with any challenge, we must understand the impact of those issues, the risk, and what can be done to mitigate that risk. In my example above, Ancestry.com takes an interesting approach to the mitigation. Given that they have millions of records based on scanned documents, checking each one is beyond reasonable expectations, so they crowd-source corrections. As a customer, I locate a particular record for someone I am looking for: that little data in all the Big Data. If I notice some type of error, I can flag that record, categorize the error, and provide what I believe is the correct information. Ancestry will then look at my correction and, if appropriate, cleanse the transcribed data.
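The flag, categorize, and review steps described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not Ancestry's actual data model or API: the class names, the two error categories, and the choice to keep the previous value alongside each approved correction are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    # Hypothetical categories; a real system would define its own.
    TRANSCRIPTION = "transcription error"
    ORIGINAL_DOCUMENT = "error in the original document"


@dataclass
class Correction:
    field_name: str
    proposed_value: str
    category: ErrorCategory
    submitted_by: str
    approved: Optional[bool] = None        # None means awaiting review
    previous_value: Optional[str] = None   # filled in when approved


@dataclass
class TranscribedRecord:
    record_id: str
    fields: dict[str, str]
    corrections: list[Correction] = field(default_factory=list)

    def flag(self, correction: Correction) -> int:
        """Queue a user-submitted correction; the record itself is unchanged."""
        self.corrections.append(correction)
        return len(self.corrections) - 1

    def review(self, index: int, approve: bool) -> None:
        """Record the reviewer's decision and apply the value only if approved."""
        correction = self.corrections[index]
        if correction.approved is not None:
            raise ValueError("correction has already been reviewed")
        correction.approved = approve
        if approve:
            correction.previous_value = self.fields.get(correction.field_name)
            self.fields[correction.field_name] = correction.proposed_value


if __name__ == "__main__":
    record = TranscribedRecord("census-1930-0001", {"surname": "Smyth"})
    idx = record.flag(
        Correction("surname", "Smith", ErrorCategory.TRANSCRIPTION, "member42")
    )
    record.review(idx, approve=True)
    print(record.fields["surname"], "(was", record.corrections[idx].previous_value + ")")
```

Keeping the submission separate from the record until it is reviewed means a mistaken correction cannot silently introduce a new data quality problem.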

Data Pedigree
Even though we are discussing genealogy, data pedigree is not about the family tree. Data pedigree is 'Where did the data come from?' and 'How reliable is that source?' If, as an organization, you own the data, that is not an issue. In today's Big Data world, many sources are outside of your direct control (unstructured social media data, crowd-sourced data). For genealogy research, data pedigree has always been an issue and a concern. A date of birth is a lot more reliable from a town birth record than from, say, the census example above, where the information is 'the age on last birthday, as provided in the interview' (I have seen an individual's age vary by multiple years across sequential census forms). To return to my Ancestry.com example: in addition to source records, Ancestry members can make their own research available for online search and sharing. When using others' data (i.e., crowd-sourced research), one must always feel comfortable with the reliability of the source. Ancestry allows you to identify what your source of information was, and you can identify multiple sources (for example, I may source a date of birth from a birth record, a marriage record, and a death record). That information is more reliable than a date of birth with no source cited. When I find a potential match (again, that little data I am truly looking for), I can determine whether it truly is a match or possibly a false correlation.
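To make the idea of pedigree concrete, here is a rough sketch of how cited sources might be weighed when several records disagree about a date of birth. Everything in it is an assumption made for illustration: the reliability scores are invented, the source type names are hypothetical, and simply summing the scores per value is a crude heuristic rather than an established method.

```python
from dataclasses import dataclass

# Hypothetical reliability scores (0..1). These numbers are illustrative
# assumptions only and do not come from any standard or product.
SOURCE_RELIABILITY = {
    "birth_record": 0.95,
    "marriage_record": 0.80,
    "death_record": 0.75,
    "census": 0.50,
    "unsourced": 0.10,
}


@dataclass(frozen=True)
class Claim:
    value: str        # the claimed fact, e.g. a date of birth
    source_type: str  # key into SOURCE_RELIABILITY; unknown types count as unsourced


def best_supported_value(claims: list[Claim]) -> tuple[str, float]:
    """Sum the reliability of the sources behind each distinct value.

    Returns the value with the highest total, together with that total.
    """
    if not claims:
        raise ValueError("no claims to evaluate")
    totals: dict[str, float] = {}
    for claim in claims:
        weight = SOURCE_RELIABILITY.get(claim.source_type,
                                        SOURCE_RELIABILITY["unsourced"])
        totals[claim.value] = totals.get(claim.value, 0.0) + weight
    return max(totals.items(), key=lambda item: item[1])


if __name__ == "__main__":
    claims = [
        Claim("1902-03-14", "birth_record"),
        Claim("1902-03-14", "marriage_record"),
        Claim("1903", "census"),
        Claim("1901", "unsourced"),
    ]
    value, support = best_supported_value(claims)
    print(value, round(support, 2))
```

The scoring matters less than the habit it encodes: every value carries its source with it, so a date backed by a birth record and a marriage record can be told apart from one with no source cited.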

Similarly, in any Big Data implementation, we must understand the pedigree of our data sources. This impacts any analytics we perform and the resulting correlations. If we don't, we run the risk of false correlations and assumptions. For some entertaining examples of false correlations, check out www.tylervigen.com.

Finding That Gem of Little Data in the Huge Oceans of Big Data
The ultimate value of Big Data is not the huge ocean of data. It's being able to find the gems of little data that provide the information you seek. In genealogy, it is wonderful that I have millions of public records, documents, and other genealogy research available to sift through, but that's not the value. The value is when I find that record for that one individual in the family tree I have been trying to find. The analysis and matching of the data depend heavily on the challenges we have been discussing: data quality and data pedigree. The same is true for any Big Data implementation. Big Data without a good understanding of the data is just a big pile of data taking up space.

No technology negates the need for good planning and design. Big Data is not just about storing structured and unstructured data. It's not just about providing the latest and greatest analytic tools. As technologists, we must work with the business to plan and design how to leverage and balance the data and its analysis. Work with the business to ensure there is a correct understanding of the data that is available, its quality, its pedigree, and the impact of each. Then the true value of Big Data will shine through as all the gems of little data are found.

This post is sponsored by SAS and Big Data Forum.

More Stories By Ed Featherston

Ed Featherston is VP, Principal Architect at Cloud Technology Partners. He brings 35 years of technology experience in designing, building, and implementing large complex solutions. He has significant expertise in systems integration, Internet/intranet, and cloud technologies. He has delivered projects in various industries, including financial services, pharmacy, government and retail.
