Open GoogleCodeExporter opened 9 years ago
I see your point about them being the same person. I think considering them
titles might lead to strange output if you want to output the names as a string,
e.g. "PhD J. Smith".
Were you specifically trying to do equals comparison to find if they're the
same person? Perhaps the issue is more that Jr. and Sr. should be treated
differently than PhD and MD when testing if names are equal?
Original comment by dere...@gmail.com
on 1 Apr 2012 at 2:57
Name parsing is a hard problem; ultimately I think you'd want a statistical,
machine learning approach, but you can probably get pretty far with rules.
The two issues are: 1) some suffixes are part of your name, some aren't; and 2)
some titles come before your name, some after.
You could solve both by splitting titles into pre- and post-titles, and making
suffixes just ('jr','sr','2','i','ii','iii','iv','v').
I was not using equals to find if they're the same person, because that's a
slippery slope with a probabilistic answer. I would like to distinguish names
and suffixes from titles, and I would like to be able to treat ' '.join([first,
middle, last, suffix]) as a name and use title_list as "metadata," regardless of
where those titles might have appeared.
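A minimal sketch of that split, assuming hypothetical PRE_TITLES and POST_TITLES sets (the entries are illustrative, not the library's lists) and handling only names written without a "Last, First" comma. Titles are peeled off either end into title_list; whatever is left, including a suffix such as "Jr.", stays in the name.

```python
# Hypothetical title sets for illustration; not taken from the library.
PRE_TITLES = {'dr', 'mr', 'mrs', 'ms', 'prof'}
POST_TITLES = {'phd', 'md', 'esq'}


def normalize(token):
    """Lowercase a token and drop periods and commas, so 'Jr.,' becomes 'jr'."""
    return token.lower().replace('.', '').replace(',', '')


def split_name(full_name):
    """Return (name, title_list) for a name with no 'Last, First' comma."""
    tokens = full_name.split()
    titles = []
    # Peel pre-titles off the front.
    while tokens and normalize(tokens[0]) in PRE_TITLES:
        titles.append(tokens.pop(0))
    # Peel post-titles off the back, preserving their original order.
    trailing = []
    while tokens and normalize(tokens[-1]) in POST_TITLES:
        trailing.insert(0, tokens.pop())
    titles.extend(trailing)
    # Anything left (first, middle, last, suffix) is treated as the name.
    name = ' '.join(t.strip(',') for t in tokens)
    return name, [t.strip(',') for t in titles]


name, title_list = split_name("Dr. John Q. Smith Jr., PhD")
print(name)        # John Q. Smith Jr.
print(title_list)  # ['Dr.', 'PhD']
```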
Original comment by jayqhac...@gmail.com
on 2 Apr 2012 at 4:26
I played with adding a new list to keep track of titles that were added at the
end. If we treat the suffixes as a definitive and complete list, then we can
assume anything else is a title. The initials "i" and "v" are problematic, but
we could probably assume that they are initials in the case of "John V".
I like the idea of separating out the parts of the name that definitely signify
another person, and your definition of suffix. Thinking about it, I guess a
suffix always comes directly after the name? For example, you wouldn't have
"John Doe, PhD, Jr". Also, the case of having two suffixes seems somewhat remote,
e.g. "Smith, John E, III, Jr"? So I guess that would make the patterns look
something like this.
# no commas: title first middle middle middle last suffix|title_suffix title_suffix
# suffix comma: title first middle last, suffix|title_suffix [, title_suffix]
# lastname comma: last, title first middles[,] suffix|title_suffix [, title_suffix]
SUFFIXES = set((
    'jr', 'sr', '2', 'i', 'ii', 'iii', 'iv', 'v',
))
TITLE_SUFFIXES = set((
    'phd', 'md', 'esquire', 'esq', 'clu', 'chfc', 'cfp',
))
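As a rough illustration of the "suffix comma" pattern only, the trailing comma-separated pieces could be sorted with these two sets. The helper name classify_trailing and the "unknown" bucket are assumptions made for this sketch, not part of the library.

```python
SUFFIXES = set(('jr', 'sr', '2', 'i', 'ii', 'iii', 'iv', 'v'))
TITLE_SUFFIXES = set(('phd', 'md', 'esquire', 'esq', 'clu', 'chfc', 'cfp'))


def classify_trailing(full_name):
    """Split 'name, piece[, piece]' and sort the pieces after the first comma.

    Returns (name, suffixes, title_suffixes, unknown). Hypothetical helper.
    """
    parts = [p.strip() for p in full_name.split(',')]
    name, trailing = parts[0], parts[1:]
    suffixes, title_suffixes, unknown = [], [], []
    for piece in trailing:
        key = piece.lower().replace('.', '')  # 'Jr.' -> 'jr'
        if key in SUFFIXES:
            suffixes.append(piece)
        elif key in TITLE_SUFFIXES:
            title_suffixes.append(piece)
        else:
            unknown.append(piece)
    return name, suffixes, title_suffixes, unknown


print(classify_trailing("John E Smith, III, PhD"))
# ('John E Smith', ['III'], ['PhD'], [])
```

A non-empty "unknown" list (for instance ['John'] from "Smith, John") could be one signal that the input follows the "lastname comma" pattern instead.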
I got as far as finding that equality test would need to be updated. It got me
wondering if perhaps we should change the equality test, per your example, to
test that ' '.join([first, middle, last, suffix]) are the same. Perhaps it's easy
enough for someone to test if unicode() representations are equal on their own
if they want titles too. Or maybe that's too smart.
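A sketch of what that equality could look like, using a hypothetical SimpleName class rather than the library's own name class. Comparing case-insensitively is an assumption made here. In Python 3, str() plays the role unicode() had in Python 2.

```python
class SimpleName:
    """Hypothetical stand-in for a parsed name; not the library's class."""

    def __init__(self, first='', middle='', last='', suffix='', titles=()):
        self.first = first
        self.middle = middle
        self.last = last
        self.suffix = suffix
        self.titles = list(titles)

    def _parts(self):
        return [self.first, self.middle, self.last, self.suffix]

    def name_key(self):
        # Only the parts that identify the person; titles are left out.
        return ' '.join(p for p in self._parts() if p).lower()

    def __eq__(self, other):
        if not isinstance(other, SimpleName):
            return NotImplemented
        return self.name_key() == other.name_key()

    def __hash__(self):
        # Keep hashing consistent with __eq__.
        return hash(self.name_key())

    def __str__(self):
        # Full representation, titles included.
        return ' '.join(p for p in self.titles + self._parts() if p)


a = SimpleName('John', 'Q', 'Smith', 'Jr', titles=['Dr'])
b = SimpleName('John', 'Q', 'Smith', 'Jr')
print(a == b)            # True: titles are ignored
print(str(a) == str(b))  # False: the string forms still differ
```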
Original comment by dere...@gmail.com
on 12 Feb 2013 at 9:03
That sounds like a reasonable approach. I don't personally use equality, but
you might consider having it do the "dumb" least-surprise exact comparison, and
adding a similarity method that returns a float between 0.0 and 1.0, eventually aiming
for something like the probability that these two names reference the same
person.
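As a placeholder for such a method, here is a minimal sketch built on the standard library's difflib.SequenceMatcher. It measures plain string similarity, not the probability that two names refer to the same person, and the function name name_similarity is hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a float in [0.0, 1.0]; 1.0 means identical after normalization."""
    # Normalize case and collapse runs of whitespace before comparing.
    norm_a = ' '.join(a.lower().split())
    norm_b = ' '.join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


print(name_similarity("John Smith", "john   smith"))  # 1.0
```

A real implementation would presumably weight the parts differently (a mismatched suffix matters more than a mismatched title), which a character-level ratio cannot capture.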
Also, watch out for "King John V." ;)
Original comment by jayqhac...@gmail.com
on 12 Feb 2013 at 7:55
Original issue reported on code.google.com by
jayqhac...@gmail.com
on 7 Feb 2012 at 6:16