yhydhx / python-nameparser

Automatically exported from code.google.com/p/python-nameparser

Ph.D., Esq., M.D., C.F.P., etc. are titles, not suffixes #7

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
In version 0.2, Ph.D., Esq., M.D. and other titles are classified as suffixes.
This is perhaps convenient for parsing, since they appear at the end of a name,
but they are in fact titles. A suffix distinguishes people and is part of your
legal name; a title does not and (in most countries) is not. "J. Smith Jr."
and "J. Smith Sr." are certainly different people, whereas "J. Smith", "J.
Smith, PhD" and "J. Smith, MD" may or may not be.

I propose that titles end up in the .title field and suffixes end up in the
.suffix field.

Original issue reported on code.google.com by jayqhac...@gmail.com on 7 Feb 2012 at 6:16

GoogleCodeExporter commented 9 years ago
I see your point about them being the same person. I think considering them
titles might lead to strange output if you want to output the names as a
string, e.g. "PhD J. Smith".

Were you specifically trying to do equals comparison to find if they're the 
same person? Perhaps the issue is more that Jr. and Sr. should be treated 
differently than PhD and MD when testing if names are equal?

Original comment by dere...@gmail.com on 1 Apr 2012 at 2:57

GoogleCodeExporter commented 9 years ago
Name parsing is a hard problem; ultimately I think you'd want a statistical, 
machine learning approach, but you can probably get pretty far with rules.

The two issues are: 1) some suffixes are part of your name, some aren't; and 2) 
some titles come before your name, some after.  

You could solve both by splitting titles into pre- and post-titles, and making 
suffixes just ('jr','sr','2','i','ii','iii','iv','v').

I was not using equals to find if they're the same person, because that's a
slippery slope with a probabilistic answer. I would like to distinguish names
and suffixes from titles, and I would like to be able to treat
' '.join((first, middle, last, suffix)) as a name and use title_list as
"metadata," regardless of where those titles might have appeared.
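
A minimal sketch of that split, assuming a hypothetical ParsedName container (this is not the library's own class, and the field names are only illustrative). The name is built from first, middle, last and suffix, while titles sit in title_list no matter which end of the input they came from:

```python
from dataclasses import dataclass, field


@dataclass
class ParsedName:
    # Hypothetical container for illustration; not nameparser's own class.
    first: str = ""
    middle: str = ""
    last: str = ""
    suffix: str = ""
    # Titles found before or after the name, kept apart from the name itself.
    title_list: list = field(default_factory=list)

    def name(self) -> str:
        # str.join takes a single iterable; empty parts are skipped so a
        # missing middle name does not leave a double space.
        parts = (self.first, self.middle, self.last, self.suffix)
        return " ".join(p for p in parts if p)


smith = ParsedName(first="J.", last="Smith", suffix="Jr.", title_list=["PhD"])
print(smith.name())        # the name, without titles
print(smith.title_list)    # the "metadata"
```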

Original comment by jayqhac...@gmail.com on 2 Apr 2012 at 4:26

GoogleCodeExporter commented 9 years ago
I played with adding a new list to keep track of titles that were added at the 
end. If we treat the suffixes as a definitive and complete list, then we can 
assume anything else is a title. The initials "i" and "v" are problematic, but 
we could probably assume that they are initials in the case of "John V". 

I like the idea of separating out the parts of the name that definitely signify 
another person, and your definition of suffix. Thinking about it, I guess a 
suffix always comes directly after the name? Like you wouldn't have "John Doe,
PhD, Jr". Also the case of having 2 suffixes seems somewhat remote, e.g.
"Smith, John E, III, Jr"? So I guess that would make the patterns look
something like this.

# no commas:      title first middle middle middle last suffix|title_suffix title_suffix
# suffix comma:   title first middle last, suffix|title_suffix [, title_suffix]
# lastname comma: last, title first middles[,] suffix|title_suffix [,title_suffix]

SUFFIXES = set((
    'jr','sr','2','i','ii','iii','iv','v',
))

TITLE_SUFFIXES = set((
    'phd','md','esquire','esq','clu','chfc','cfp',
))
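
As a rough illustration of how the two sets could be used, here is a sketch that sorts the pieces following a name into suffixes and title suffixes. The helper names normalize and classify_trailing are made up for this example, and the period-stripping is an assumption about how pieces would be compared. It does not try to resolve the "John V" ambiguity, and it keeps unrecognized pieces separate rather than assuming they are titles:

```python
SUFFIXES = {'jr', 'sr', '2', 'i', 'ii', 'iii', 'iv', 'v'}
TITLE_SUFFIXES = {'phd', 'md', 'esquire', 'esq', 'clu', 'chfc', 'cfp'}


def normalize(piece):
    # Drop periods and case so that 'Ph.D.' and 'PhD' compare the same.
    return piece.replace('.', '').strip().lower()


def classify_trailing(pieces):
    """Split trailing pieces into (suffixes, title_suffixes, unknown)."""
    suffixes, titles, unknown = [], [], []
    for piece in pieces:
        key = normalize(piece)
        if key in SUFFIXES:
            suffixes.append(piece)
        elif key in TITLE_SUFFIXES:
            titles.append(piece)
        else:
            # Not in either list; the caller decides what to do with it.
            unknown.append(piece)
    return suffixes, titles, unknown


print(classify_trailing(['Jr.', 'Ph.D.', 'Esq']))
```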

I got as far as finding that the equality test would need to be updated. It got
me wondering if perhaps we should change the equality test, per your example, to
test that ' '.join((first, middle, last, suffix)) are the same. Perhaps it's
easy enough for someone to test if unicode() representations are equal on their
own if they want titles too. Or maybe that's too smart.
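
Roughly what that equality test might look like, using a hypothetical Name class rather than the library's actual one. Comparing case-insensitively is an extra assumption here, not something decided above:

```python
class Name:
    # Hypothetical stand-in for a parsed name, for illustration only.
    def __init__(self, first="", middle="", last="", suffix="", titles=()):
        self.first, self.middle = first, middle
        self.last, self.suffix = last, suffix
        self.titles = list(titles)

    def _key(self):
        # Only name parts and suffix take part in comparison; titles do not.
        parts = (self.first, self.middle, self.last, self.suffix)
        return " ".join(p for p in parts if p).lower()

    def __eq__(self, other):
        if not isinstance(other, Name):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Keep hashing consistent with __eq__.
        return hash(self._key())


phd = Name("J.", last="Smith", titles=["PhD"])
md = Name("J.", last="Smith", titles=["MD"])
print(phd == md)
```

With this, "J. Smith, PhD" and "J. Smith, MD" compare equal, while "J. Smith Jr." and "J. Smith Sr." do not.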

Original comment by dere...@gmail.com on 12 Feb 2013 at 9:03

GoogleCodeExporter commented 9 years ago
That sounds like a reasonable approach.  I don't personally use equality, but 
you might consider having it do the "dumb" least-surprise exact comparison, and 
adding a similarity method that returns a float in 0.0 - 1.0, eventually aiming 
for something like the probability that these two names reference the same 
person.
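
For a sense of the shape such a method could take, here is a sketch built on the standard library's difflib. The function name similarity and the normalization step are assumptions, and the score is only a character-level ratio in 0.0 - 1.0, not a calibrated probability that two names refer to the same person:

```python
from difflib import SequenceMatcher


def similarity(name_a: str, name_b: str) -> float:
    """Return a score in [0.0, 1.0]; 1.0 means the normalized strings match."""
    # Lowercase, drop periods and collapse whitespace before comparing.
    a = " ".join(name_a.lower().replace(".", "").split())
    b = " ".join(name_b.lower().replace(".", "").split())
    return SequenceMatcher(None, a, b).ratio()


print(similarity("J. Smith Jr.", "J. Smith Sr."))
```

A real version would presumably weigh first, middle and last names differently, but the float return value leaves room for that later.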

Also, watch out for "King John V."  ;)

Original comment by jayqhac...@gmail.com on 12 Feb 2013 at 7:55