Data Science & Ancestry

Ancestry is a topic of curiosity for many people who want to discover their origins and history. Today, data science helps them dig into their family history and build their family trees.

By Niels Reinhard, Idalab.

History is exciting, especially if we can relate to it. If our ancestors were involved in a historic event, we are much more likely to attribute importance to it and to develop a desire to know the details well beyond storytelling. Who were our ancestors? What traces did they leave in history? How far back can we track their bloodlines? Many people share a passion for ancestry: they travel to archives in different countries and piece together information from different documents to gain new insights about their roots. The internet provides those efforts with new and powerful research tools. Not only are historic documents being digitized, but online networks of people with a passion for ancestry also allow for new ways to gather insights about one’s family tree. Interestingly, it is data science that powers this conquest of family history.

The question of name distinguishability throughout the centuries

Anyone who has been involved in genealogy and has tried to track down distant relatives from the 16th century is well aware of one of the key challenges: surnames change over time. The current surname is most often just a derivative of previous versions, phonetically adjusted over time. Some names change completely; others have gradually evolved over the centuries along with the rules of orthography. Connecting the dots when combing through dusty documents in church and municipal archives is already a challenging endeavor, and automating this process at scale is harder still.

There are fairly well-functioning machine learning algorithms that account for potential phonetic transformations and spelling mistakes and allow for a clustering of first names and surnames. Nevertheless, the digitization of personal research endeavors, and thus of the respective family trees, allows for further fine-tuning of those algorithmic libraries. For one, more information about the actual historic name transformations is feeding the algorithms, not just assumptions about potential phonetic iterations. As a matter of fact, these adaptations improve performance, which in turn enables genealogy research platforms to offer their users comprehensive searches.
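To make the idea of phonetic clustering concrete, here is a minimal sketch in Python. It uses the classic Soundex code as a stand-in for the phonetic algorithms mentioned above; it is not the method of any particular genealogy platform, and the helper names (soundex, group_by_sound) are chosen here purely for illustration. Names that receive the same four-character key are treated as candidate variants of each other.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, y, h and w have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the four-character Soundex key for a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Add a digit only when it differs from the previously coded sound.
        if code and code != prev:
            key += code
        # h and w do not separate equal sounds; vowels do (they reset prev).
        if ch not in "hw":
            prev = code
    # Pad with zeros or truncate so the key is always four characters long.
    return (key + "000")[:4]


def group_by_sound(names):
    """Group spelling variants that share the same phonetic key."""
    groups = defaultdict(list)
    for n in names:
        groups[soundex(n)].append(n)
    return dict(groups)


if __name__ == "__main__":
    variants = ["Smith", "Smyth", "Schmidt", "Meyer", "Meier", "Maier"]
    for key, members in group_by_sound(variants).items():
        print(key, members)
```

A fixed rule set like this only encodes assumptions about how names might have shifted. The improvement described above comes from replacing or tuning such rules with name changes actually observed in users' digitized family trees.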

Genealogy Search Algorithms – helping to find the relevant “John Smith”

At the same time, searching for a name, unless it is extremely rare, often does not narrow the results down to a subset that can be checked manually. With millions of documents available for reference, a lot of context information about each individual is available. Certain individuals might thus, for example, automatically be tagged with geographic locations. Sophisticated search algorithms make historic data accessible to the time-constrained researcher and make the casual trip to a distant library obsolete. One platform claims that it has digitized more than 200 billion documents (pictures, bureaucratic files, etc.) through uploads by its users and through access to archives. Powerful backend algorithms are necessary to make sense of this overflow of history and to surface precisely the information the user is searching for. At the same time, when users build their family tree online, the platform is able to automatically give hints and recommendations to fill in blank spots within the different branches. Given a large and active user base, machine learning is used to power the record linking that forms the engine for these “recommendations” and family history discoveries.
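How context can single out the right “John Smith” may be easier to see in a small example. The Python sketch below scores candidate records against a query by combining name similarity with birth year and place. Everything in it is an assumption made for illustration: the Record fields, the hand-picked weights and the ten-year window are hypothetical, and a production system would learn such parameters from data rather than hard-code them.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    """Hypothetical, heavily simplified view of a historic record."""
    name: str
    birth_year: int
    place: str


def name_similarity(a, b):
    """String similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(query, candidate, max_year_gap=10):
    """Combine name, birth year and place into one score between 0 and 1.

    The weights 0.5 / 0.3 / 0.2 are illustrative assumptions only.
    """
    name_part = name_similarity(query.name, candidate.name)
    gap = abs(query.birth_year - candidate.birth_year)
    # Full credit for the same year, falling linearly to zero at max_year_gap.
    year_part = max(0.0, 1.0 - gap / max_year_gap)
    place_part = 1.0 if query.place.lower() == candidate.place.lower() else 0.0
    return 0.5 * name_part + 0.3 * year_part + 0.2 * place_part


def rank_candidates(query, candidates):
    """Return the candidates sorted from most to least likely match."""
    return sorted(candidates, key=lambda c: match_score(query, c), reverse=True)


if __name__ == "__main__":
    query = Record("John Smith", 1850, "Boston")
    candidates = [
        Record("John Smith", 1912, "London"),
        Record("Jon Smyth", 1851, "Boston"),
        Record("John Smith", 1850, "Boston"),
    ]
    for rec in rank_candidates(query, candidates):
        print(round(match_score(query, rec), 2), rec)
```

In this toy data set, the misspelled “Jon Smyth” born in Boston in 1851 ranks above the correctly spelled “John Smith” born in London in 1912, because year and place agree with the query. That is the sense in which context information, not the name alone, narrows the search.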

Does this take away the mystery of genealogy, the romanticized time spent in archives? Not really, but it offers a more cost-effective way to acquire information about family history and to find precisely that one “John Smith”. Those who desire physical journeys and the ups and downs of the information hunt are certainly free to continue them.

Using all that data to create a better understanding of history

While a central platform for managing family tree research has many obvious advantages, it also needs to be mentioned that users might err when composing their family tree, whether intentionally or not. However, modern technology should be sophisticated enough to account for these flaws.

The potential is simply too attractive to ignore: online genealogy platforms could contribute to the creation of a fully personalized immersion in history. Registered users could, for example, get location-based information about their family members’ historic presence at a given place. Linked with relevant documents (personal and general), it would function almost like a portable, location-based museum. With increasingly sophisticated image processing, it could even be sufficient to take a picture of the location to get contextual information.

The organization of family history is fascinating, as it is composed of so many intersecting family trees. Attaching and providing information at the relevant nodes with data science is a key challenge. It not only benefits users on their specific research quest, but also forms the backbone of a more personalized history experience. To add some craziness, one platform has for several years already been trying out a DNA feature, which could allow for a more fine-grained analysis of potential connections between users.

Original. Reposted by permission.

Bio: Niels Reinhard works as a Data Strategist at idalab, an agency for data science in Berlin, Germany. Working with companies from the mobility, fintech and public sector, he identifies new areas for innovative value creation through data science.