Thursday, September 21, 2006

Automated Merger of GEDCOM Files

Have you ever had the misfortune of having to merge two GEDCOM files while doing family history research? I hope not. It’s dehumanizing. Dehumanizing? Yes, because doing a job that should be done 99% by a machine is most definitely de-human-izing.

Here’s a typical example:
Billy Bob Jones and Suzy Lee Jones are a brother-and-sister team working together on their family’s genealogy. The two of them live two or three time zones apart, so they each maintain their own files and coordinate their research by email.

One day, Billy makes an amazing discovery: he finds the birth date of their great-great-great-great-great grandfather Olaf. What’s more, he discovers that Olaf’s death date was off by 5 years, so he updates that. Billy emails Suzy to tell her the exciting news, and Suzy requests that he send an updated version of the GEDCOM file so she can benefit from his new research. So Billy sends the requested file.

Suzy imports the file, which only contains Olaf and his ancestors, a total of 200 people (impressive, it’s true). Then she begins to merge all of the now-duplicate individuals, sources, events, and so on. Five hours later, she finishes, and now wonders whether five hours of tedious merging was a fair price to pay for the updated information.

Here’s what should happen:
Upon receiving the GEDCOM file from Billy, Suzy imports it into her family history software, which informs her that one individual has been updated, with one fact modified and another added. She tells the program that this is A-OK and she proceeds to make further Amazing Discoveries.

Here’s what should __really__ happen:
Upon updating the Olaf information, Billy tells his genealogy program to notify Suzy of all updates made since they last synchronized their records. The next time Suzy opens her genealogy program, it notifies her of the new information provided by Billy, and she approves the merger.



People have been thinking about this problem for years. The sad reality, however, is that such an advanced merge capability does not seem to be available in any current consumer software (as far as I can tell). Well, why not? Part of the reason is surely that reliably determining the differences between family trees is a complex problem. But what if both versions of the tree are guaranteed to have a common individual? In other words, what if we know that Billy and Suzy’s parents are going to exist in both of their files? What if, in fact, they exist in both files with the same record IDs? What if the family history software allows the two to synchronize their data by specifying the ID of a common root individual?

If we can depend on constant record IDs, our job is much simpler. So what if Billy imports his current data into a new system, after which Suzy synchronizes with his system, essentially copying over all of his data, with matching record IDs and an agreed-upon common individual? Then, when one of the two makes a change, resynchronizing involves no more than comparing the corresponding ancestors of the common individual and notifying the user of any differences.
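To make that comparison concrete, here is a rough sketch in Python. It is not geddiff's actual code and it does not use libgedcom; the dictionary layout, the function name diff_ancestors, the record IDs, and the dates are all made up for illustration. It assumes both files have already been parsed into a mapping from record ID to that person's facts and parent IDs, and that the IDs match on both sides.

```python
from collections import deque

# Hypothetical parsed form of a GEDCOM file (not a real library's format):
#   record ID -> {"facts": {tag: value}, "parents": [record IDs]}

def diff_ancestors(mine, theirs, root_id):
    """Compare the ancestors of root_id in two trees that share record IDs.

    Returns a list of (record_id, tag, my_value, their_value) tuples,
    where a value of None means the fact is absent on that side.
    """
    empty = {"facts": {}, "parents": []}
    changes = []
    seen = set()
    queue = deque([root_id])
    while queue:
        rec_id = queue.popleft()
        if rec_id in seen:
            continue
        seen.add(rec_id)
        a = mine.get(rec_id, empty)
        b = theirs.get(rec_id, empty)
        # Compare every fact tag that appears on either side.
        for tag in sorted(set(a["facts"]) | set(b["facts"])):
            old = a["facts"].get(tag)
            new = b["facts"].get(tag)
            if old != new:
                changes.append((rec_id, tag, old, new))
        # Keep walking upward through parents known to either tree.
        queue.extend(a["parents"] + b["parents"])
    return changes

# Made-up data: Suzy's copy, and Billy's copy after his discovery.
suzy = {
    "@I1@": {"facts": {"NAME": "Pa /Jones/"}, "parents": ["@I2@"]},
    "@I2@": {"facts": {"NAME": "Olaf /Jones/", "DEAT": "1805"}, "parents": []},
}
billy = {
    "@I1@": {"facts": {"NAME": "Pa /Jones/"}, "parents": ["@I2@"]},
    "@I2@": {"facts": {"NAME": "Olaf /Jones/", "BIRT": "1730", "DEAT": "1810"},
             "parents": []},
}

changes = diff_ancestors(suzy, billy, "@I1@")
for rec_id, tag, old, new in changes:
    print(rec_id, tag, old, "->", new)
```

With this toy data the report contains two entries for Olaf: a birth date that Suzy lacks and a death date that differs. A real tool would also have to compare sources, families, and notes, and decide what a record missing from one side means, none of which this sketch attempts.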

That’s where my project comes in. ‘geddiff’ is designed to be the program that does that comparison. It’s straightforward (is it not?). It limits its scope so the algorithm stays relatively simple. It will integrate easily with revision control systems such as Subversion. It will be licensed under the LGPL, allowing anybody to link to it or call it externally, while ensuring that the source code itself remains open. It will be based on libgedcom, avoiding unnecessary duplication of effort.

Is this relevant? Is it doable? Does anybody want to help? Let me know!

References:
Geddiff project at Google Code: http://code.google.com/p/geddiff/
Beyond Project, discussing many of the same ideas: http://www.beyondproject.org/
BeyondGen, related discussion group: http://groups.google.com/group/beyondgen?lnk=li