Finding the Right Little Data

Even with the great strides technology has taken, data quality remains a tremendous challenge for genealogy researchers

Over the years, one of my favorite pastimes has been working on the family genealogy. I first started work on it in the early 1990s. At that time, records were not digitized, and research involved going to libraries, newspapers, and various local, state, and federal archives. There one would have to sift through reams of old records, documents, and microfiche. If you were lucky, someone had created printed indices of the information contained in those documents. These indices, when they existed, were based on manual transcription of the data and were prone to the data quality problems inherent in transcribing information from what were frequently old handwritten documents. The challenge was then to find the nuggets of information that would relate and connect to the individuals you were trying to locate in your research.

For anyone operating in the Big Data space of today, this may all sound very familiar. Genealogists, amateur and otherwise, were dealing with the challenges of Big Data, unstructured data, and data quality long before the terms became technology buzzwords. There are tools and products today to help; who hasn't seen the commercials for Ancestry.com? However, even with technology, there are still challenges. These challenges are not unique to genealogy, and they make a good lens for viewing and discussing the business needs for Big Data in general. Let's take a closer look at some of them.

Data Quality
Even with the great strides technology has taken, data quality remains a tremendous challenge for genealogy researchers. Let's take U.S. Census data as an example. Every 10 years, the U.S. government conducts a census of the population, and the results of these censuses become public 72 years after they are taken (the 1940 Census material just recently became available). U.S. Census data is a gold mine for genealogy research.

Below is a sample from the 1930 Federal census. Census forms were filled out by individuals going door to door and asking the residents questions. However, there are a number of data quality factors you must take into consideration. The sample here has fairly good quality handwriting, although that's not always the case. Also, you are constrained by the census taker's interpretation of the person's answers and pronunciation, which could result in different spellings of the same name.

When this document gets transcribed, you could still have multiple sources of problems with the data quality. The original census taker could have written it down incorrectly or the person transcribing it could have made a transcription error.
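One common way to cope with spelling variations like these is approximate name matching rather than exact lookup. The following is a minimal sketch, assuming Python's standard `difflib` module as the similarity measure; the function names, the sample surnames, and the 0.8 threshold are illustrative assumptions, not how any particular genealogy service works.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def candidate_matches(query, indexed_names, threshold=0.8):
    """Return (name, score) pairs at or above the threshold, best first.

    The threshold is an arbitrary starting point; a real index would
    need tuning against known-good matches.
    """
    scored = [(name, name_similarity(query, name)) for name in indexed_names]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up index entries standing in for transcribed census surnames.
    index = ["Fetherston", "Featherstone", "Smith"]
    for name, score in candidate_matches("Featherston", index):
        print(f"{name}: {score:.2f}")
```

A looser threshold surfaces more variant spellings at the cost of more false candidates, so each hit still has to be checked against other evidence.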

This challenge is not unique to genealogy research. Data quality has been an issue in IT systems since the first IT system. In the world of Big Data, with unstructured data (such as social media) and crowd-sourced data, it can become a daunting challenge. As with any challenge, we must understand the impact of those issues, the risk, and what can be done to mitigate that risk. In my example above, Ancestry.com takes an interesting approach to the mitigation. Because they have millions of records based on scanned documents, checking each one is beyond reasonable expectations, so they crowd-source corrections. As a customer, I locate a particular record for someone I am looking for: that little data in all the Big Data. If I notice some type of error, I can flag that record, categorize the error, and provide what I believe is the correct information. Ancestry will then look at my correction and, if appropriate, cleanse the transcribed data.
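The flag, categorize, propose, and review cycle described above can be sketched as a small data model. This is a hypothetical illustration only: the class names, fields, and error categories are assumptions and do not reflect Ancestry's actual system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Correction:
    field_name: str       # which transcribed field is believed wrong
    category: str         # e.g. "transcription error" (hypothetical label)
    proposed_value: str   # what the user believes is correct
    status: Status = Status.PENDING


@dataclass
class TranscribedRecord:
    record_id: str
    fields: dict
    corrections: list = field(default_factory=list)

    def flag(self, field_name, category, proposed_value):
        """A user flags a field and proposes a replacement value."""
        if field_name not in self.fields:
            raise KeyError(f"unknown field: {field_name}")
        correction = Correction(field_name, category, proposed_value)
        self.corrections.append(correction)
        return correction

    def review(self, correction, approve):
        """A reviewer accepts or rejects a pending correction."""
        if correction.status is not Status.PENDING:
            raise ValueError("correction has already been reviewed")
        if approve:
            # Only an approved correction changes the transcribed data.
            self.fields[correction.field_name] = correction.proposed_value
            correction.status = Status.APPROVED
        else:
            correction.status = Status.REJECTED
```

Keeping the proposed value separate from the record until review means an unreviewed or rejected suggestion never alters the transcription.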

Data Pedigree
Even though we are discussing genealogy, data pedigree is not about the family tree. Data pedigree is 'Where did the data come from?' and 'How reliable is that source?' If, as an organization, you own the data, that is not an issue. In today's Big Data world, many sources are outside of your direct control (unstructured social media data, crowd-sourced data). For genealogy research, data pedigree has always been an issue and a concern. A date of birth is a lot more reliable from a town birth record than from, say, the census example above, where the information is 'the age on last birthday, as provided in the interview' (I have seen variations of multiple years across sequential census forms for an individual). In my Ancestry.com example again, in addition to source records, Ancestry members can make their own research available for online search and sharing. When using others' data (i.e., crowd-sourced research), one must always feel comfortable with the reliability of the source. Ancestry allows you to identify what your source of information was, and to cite multiple sources (for example, I may source a date of birth from a birth record, a marriage record, and a death record). That information is more reliable than a date of birth with no source cited. When I find a potential match (again, that little data I am truly looking for), I can determine whether it truly is a match or possibly a false correlation.
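To make the pedigree idea concrete, here is a minimal sketch of weighing claims by their source and of checking whether two census ages can describe the same person. The reliability ranking is an assumption: the text only says a birth record beats a census entry and a sourced date beats an unsourced one, so where marriage and death records sit is a guess, and all names here are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ranking, higher is more trustworthy. Only "birth record
# above census" and "sourced above unsourced" come from the discussion
# above; the placement of marriage and death records is an assumption.
SOURCE_RELIABILITY = {
    "birth_record": 3,
    "marriage_record": 2,
    "death_record": 2,
    "census": 1,
    "unsourced": 0,
}


@dataclass(frozen=True)
class Claim:
    value: str
    source: str


def most_reliable(claims):
    """Return the claim backed by the highest-ranked source."""
    return max(claims, key=lambda c: SOURCE_RELIABILITY.get(c.source, 0))


def birth_year_range(census_year, age_last_birthday):
    """Possible birth years implied by 'age on last birthday'.

    Someone aged N on census day was born in either
    census_year - N - 1 or census_year - N.
    """
    latest = census_year - age_last_birthday
    return (latest - 1, latest)


def ranges_agree(first, second):
    """True if two inclusive (low, high) year ranges overlap."""
    return first[0] <= second[1] and second[0] <= first[1]
```

For instance, an age of 40 in the 1930 census implies a birth year of 1889 or 1890. If the 1920 census lists the same person as 28, the implied range is 1891 to 1892, the two ranges do not overlap, and at least one entry is wrong.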

Similarly, in any Big Data implementation, we must understand the pedigree of our data sources. This impacts any analytics we perform and the resulting correlations. If you don't, you run the risk of potentially false correlations and assumptions. For some entertaining examples of false correlations check out www.tylervigen.com.

Finding That Gem of Little Data in the Huge Oceans of Big Data
The ultimate value of Big Data is not the huge ocean of data. It's being able to find the gems of little data that provide the information you seek. In genealogy, it is wonderful that I have millions of public records, documents, and other genealogy research available to sift through, but that's not the value. The value is when I find that record for the one individual in the family tree I have been trying to find. Analyzing and matching the data depends heavily on the challenges we have been discussing: data quality and data pedigree. The same is true for any Big Data implementation. Big Data without a good understanding of the data is just a big pile of data taking up space.

No technology negates the need for good planning and design. Big Data is not just about storing structured and unstructured data. It's not just about providing the latest and greatest analytic tools. As technologists, we must work with the business to plan and design how to leverage and balance the data and its analysis. Work with the business to ensure there is a correct understanding of the data that is available, its quality, its pedigree, and the impact of those. Then the true value of Big Data will shine through as all the gems of little data are found.

This post is sponsored by SAS and Big Data Forum.

More Stories By Ed Featherston

Ed Featherston is VP, Principal Architect at Cloud Technology Partners. He brings 35 years of technology experience in designing, building, and implementing large complex solutions. He has significant expertise in systems integration, Internet/intranet, and cloud technologies. He has delivered projects in various industries, including financial services, pharmacy, government and retail.
