For some odd reason, this week at work I’ve been pulled back into data cleaning and data structuring. St. Mungos, a local hospital is trying to bounce their service data against Mayhew Insurance claims data. We received several hundred scrolls of names and claims. None of their data is quite structured like our claims and enrollment data, so we’ve been building a crosswalking system. For data security purposes, none of the hospital data contains Social Security numbers and for Pain In The Ass purposes, none of their data contains the Mayhew Insurance coverage number.
So that means we have to find the identities of people in their service universe by a combination of birth date, last name, first name, provider, and date of service. Creating automated queries that match on a significant number of not quite unique keys has allowed for very high confidence matches on 97% of the probable unique individuals. The remaining 3% are manual look-ups and judgement calls.
For instance, is Harold J. Potter with a DOB of 7/31/80 from the hospital file the same person as Harry Potter DOB 7/31/80, with multiple dates of death in the insurance file the same individual? Eyeballing the data and making an informed judgement, I am guessing at 95% confidence that this is the same person. How about Sirius R. Black and S. Regulas Black? Are they the same person? Probably, but not definately. Being slightly more serious, how do you match Markos Moulitsas Zúniga to Kos Moulitsas where the DOBs are off by a day? Going through the 3% is costing the company at least a dollar per name, and probably closer to $1.50/name.
For the purposes of my current task getting 98% or 99% of the data set matched at high confidence is more than sufficient. The research plan allowed for a 5% mismatch rate.
Eyeballing the names that are in the unmatchable bucket, there are three notable groups. The first are recent births where the name is Baby Girl Smith matched against 15 girls born on that date in the shared claims universe. These are no big deal. The second group of unmatched names are women whose relationship status has changed multiple times. My wife would be in this group as the hospital data has her as Jane Doe while the insurance data has her as Jane Mayhew. The final group are people with names that fall outside of culturally dominant naming conventions. Names with extra punctuation and names whose ordering structures don’t follow the First/Last or First/Maiden-Last convention fall out very readily.
This is important because error is acceptable even if we are willing to spend money to minimize the error. Verifying voter rolls based on name and date of birth matches is guaranteed to produce significant unmatched. There is a good argument that voting rolls should be as up to date as possible, but to do that in a good faith manner requires spending serious money as one person dropped from the rolls who should not have been dropped is one too many. $5 to $10 per name at 100% confidence would be reasonable estimate of the cost of scrubbing an entire list of votes. The vast majority of names would be passed at a dime per name, while the last 5% to 1% might cost $2 per name to verify, and the last 1% will cost $50 per name to verify.
Scrubbing an entire state for pennies per name is a guarantee of producing false positives and statistically certain mismatches. And given who is more likely to have mismatched names between data sets, cynicism should reign.
Amir Khalid
I think I can guess what you’ve been reading lately. ;)
Jim Parish
Shortly after my father died, we discovered that the date he and we had always celebrated as his birthday was not the date on his birth certificate. We have no idea which date was actually correct.
Richard Mayhew
@Amir Khalid: Actually, I have not read HP in at least the past 15 months, and I am holding off on re-reading it until my daughter is slightly older so we can enjoy 3-4 pages a night for the next 5 years together.
I was looking for some good examples that I could use with absolutely no privacy concerns . So Harry Potter it was…
Fester Addams
At least no one’s name was
Robert’); DROP TABLE Patients;–
(XKCD #327)
Mike J
@Fester Addams: Not an issue for competent DB admins.
patrick II
your numbers assume good intentions for the people trying to match names. Jeb Bush in Florida in 2000 hired a company who’s algorithm was refined enough that it differentiated between the various criminals he was trying to wipe off the poles and actual people with the right to vote. so he fired them and hired a company with a less rigorous algorithm with more false matches and got his brother elected.
Richard mayhew
@patrick II: exactly. This is just a bullshit detection tool post
Roger Moore
How about Sirius R. Black and Stubby Boardman? I’ve heard they’re actually the same person.
Eric S.
I’ve got a friend whose first name is “Jean-Paul”. On almost every legal document or financial form he is known as “Jean”. Most computer systems just balk and don’t know what to do with the hyphen.
Another friend has the same first and last name (like “James James”) and he is Junior. His credit report shows he’s owned cars and houses prior to his birth. He’s had it scrubbed and they pop back up.
All of us are programmers and we kind of shrug with these things happen attitude. It doesn’t make them any less troublesome and for the most part this is with people with good intentions.
BillinGlendaleCA
@Jim Parish: My wife’s birthday on legal docs(birth certificate, passport, driver’s license, etc) is not her real birthday, it was the date her father got around to registering her birth(about 7 months later).
FlyingToaster
I remember the multiple phone calls to our Blue when BabyGirl Toaster turned into WarriorBabyGirl Doktor. The hospital sent them the damn birth certificate with everyone’s name on it, and they still couldn’t figure out who was who, until HerrDoktor drove over to Cambridge and got three copies and faxed it to them (again).
Her Blue card did turn up when she was 7 weeks old.
RSA
@Eric S.:
My last name has a period and a space in it. I can’t count the times a computer system has told me that my last name is invalid. Or that a clerk has had to try several variations to find me in the system.
It used to be irritating. Okay, it still is, but I’m more used to it.
Old Dan and Little Ann
My 48 year old sister has been writing Kelly Anne her whole life as first and middle name. My mom told me yesterday my sister recently looked at her original birth certificate and Kelley Ann was typed on it. Whoops.
piratedan
@RSA: the problem is that most medical systems have an issue as to when to use the hyphen as a “punctuation” versus an “arithmetic” expression.Plus you’re talking about delineating names into easily defined fields, it used to be just a first name, middle name and last name. Now you have suffixes, hyphens, apostrophes and fourth and fifth names. This is because we’re no longer as homogenized as a country as we were prior to the mid 70’s when medicine started to get invaded by tech and it’s still slow to catch up culturally. This makes it difficult for everyone from admitting clerks to nurses to try and decipher just who is whom.. The other side of this coin is that medical data is so space intensive, that usual transmission of data isn’t just solely numbers, but entire forms, medical histories and insurance requirements.
With many hospitals losing money (be they for profit or not) the idea of springing for millions of dollars to upgrade is difficult to sell to folks. Yet the idea that you can handle Risk Management better by reducing patient misidentification is a huge motivator and when the ACA was passed, there were numerous incentives that many hospitals took advantage of to try and keep up with the age.
The thing is, you have the same issues with airlines and other businesses ( I bet even banking has their issues as well).
Annamal
Slightly off from the point you are making (which is absolutely valid) but my partner is trying to do very similar sorts of data matching, just with parish records from the 18th and 19th centuries.
Since the preference is for 100% certainty of matches, ironically the foreign or peculiar names ( or those with odd occupations) actually come with a bit of an advantage since the data is full of John Smiths who were labourers and can not be matched but Thomas Peach who was a petrefactioner is much easier to identify.
We’re using metaphone3 to account for the scattershot approach to spelling back then.
RSA
@piratedan: Nice summary! Thanks.
WaterGirl
Excellent example and great post, Richard!
Roger Moore
@piratedan:
Names have always been more complex than first, middle, and last, or at least they have been since well before the mid 70’s. There are plenty of Western European names that involve spaces and punctuation (Jean-Pierre, O’Malley, von Trapp, St. John, etc.) that should have clued people in. I suspect that a lot of the problem was that computer back in the 70s were underpowered by modern standards so they didn’t want to deal with a free-form name. So they made the simplifying assumption that names could fit into their nice little boxes and put the weight of dealing with outliers on the unlucky people who didn’t fit their patterns. Once that pattern was started, it got stuck by backwards compatibility.
piratedan
@Roger Moore: to a degree, part of the issue is history itself, a lot of the systems got around the exclusion of punctuation values by allowing the space or simply ignoring the punctuation itself. Why is the punctuation such a big deal? Goes back to the languages that were used to handle medical data because back in the day, space used to be an issue and cramming the amount of data intensive records, that were filled with diagnosis codes (ICD values, I believe that we’re now using ICD 10 these days) textual reports and raw data results from instrumentation and many of the systems were compartmentalized, i.e. dietary has theirs, as does radiology, the lab, the hospital information system and some have separate ER systems as well as some hospitals separate out their medical records. We haven’t even covered the instrumentation that is used, because now they transmit images, films from biopsies, autopsies, EKG’s etc
So you could chalk that up to being underpowered but a lot of the issues come back to space, because back in the day, the amount of space that you had available to store medical record data was limited. Now that this is much less of a concern, we’re seeing the integration of a lot more windows style information being laid over databases with middle ware easing the translation from the look to how the data is stored. Still a long way to go because of the limitations on how the systems talk to each other. Even one size fits all systems from a sole vendor have issues because their individual modules aren’t even fully integrated (as many of these systems bought out competing systems for one module and are STILL integrating them into seamless communication modules.
Zoe Johnson
Oh man, I wish I was in your shoes. I can’t think of anything more fun than matching lists of names and cleaning. I used to match motor and voter lists to get jury lists. 98% was our goal. I used SAS and especially soundex coding to get em close enough to figure it out. I got forcibly retired because I was to old at 58. Damn I love data cleaning and list matching. Most fun I can think of other than playing with grandbabies. I envy you. Those were the good old days.
Big Jim Slade
Jeb Bush and Katherine Harris should be in prison for purposefully disenfranchising thousands of citizens.