initialisms and abbreviations of complex names

kgilbert-cmu commented 8 years ago

So, now that names have somewhat stabilized and people are giving positive feedback about the new foreign names, I wanted to revisit one of the aspects I was working on right before the new name generation went into place.

(1) What do we do with complex names? and (2) How should we make complex names?

The first is the easier question, and fairly methodical. France and Canada are the two largest sources of foreign names, and we get fun easter eggs like "J.D." and "J.R." on the first names side as well as "Alexander-Walker" on the last names side. France has a dozen hyphenated names, like Jean-Emmanuel, Jean-Frederic, Jean-Michel, and Jean-Philippe, among other non-Jean names. So basically what I'm saying is that we know these names exist in our names.js database. Should a name like "J.D. Walker" appear on the draftScouting page as literally "J.D. Walker" or "J. Walker"? I presume the former. What about "Jean-Philippe Eliezer-Vanerot" or "Guy-Marc Jean-Baptiste-Adolphe"? Those are all legitimate names under the current generator. I'm just going to suggest changing my original code to interpret hyphens and periods in the first name as if they were spaces, so that we'll get "J.P. Eliezer-Vanerot" and "G.M. Jean-Baptiste-Adolphe".
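
For illustration, a minimal sketch of the "treat hyphens and periods as spaces" rule. This is not the actual draftScouting.js code; `initialsOf` and `abbreviateName` are hypothetical names, and the sketch always abbreviates rather than deciding when a page should show the full name.

```javascript
// Hypothetical helper, not the real draftScouting.js implementation.
// Split the first name on spaces, hyphens, and periods, then keep the
// first letter of each piece followed by a period.
function initialsOf(firstName) {
    return firstName
        .split(/[\s.\-]+/)
        .filter((part) => part.length > 0)
        .map((part) => part[0].toUpperCase() + ".")
        .join("");
}

// Abbreviated display form: initials of the first name, last name untouched.
function abbreviateName(firstName, lastName) {
    return `${initialsOf(firstName)} ${lastName}`;
}

console.log(abbreviateName("Jean-Philippe", "Eliezer-Vanerot"));
console.log(abbreviateName("Guy-Marc", "Jean-Baptiste-Adolphe"));
console.log(abbreviateName("J.D.", "Walker"));
```

Because an already-initialized name like "J.D." splits into "J" and "D", it comes back out unchanged, which matches the "J.D. Walker" behavior presumed above.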

The second is... significantly harder. I'd like to be able to say that we're knowledgeable enough of cultural customs in a hundred countries that we could make accurate estimates of the hyphenation and middle name prevalence in each country's language. However, I'm almost certain we can't do that. I'd like to leave it an open question of how we can extrapolate the webcrawler's basketball data to build names that are indicative of what's in the greater basketball population of a country, regardless of whether a player of that specific name has played basketball in the real world.

matsonj commented 8 years ago

Is it possible to solve this simply - i.e. an additional field vs. a complex algorithm? "First name initial" or similar?

dumbmatter commented 8 years ago

Should a name like "J.D. Walker" appear on the draftScouting page as literally "J.D. Walker" or "J. Walker"? I presume the former.

Correct.

What about "Jean-Philippe Eliezer-Vanerot" or "Guy-Marc Jean-Baptiste-Adolphe"? Those are all legitimate names under the current generator. I'm just going to suggest changing my original code to interpret hyphens and periods in the first name as if they were spaces, so that we'll get "J.P. Eliezer-Vanerot" and "G.M. Jean-Baptiste-Adolphe".

That would be awesome. Or maybe it should be J.-P. and G.-M., I think I've seen abbreviations like that. Off to Google... yep.
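
A rough sketch of the hyphen-preserving variant, in case it helps. `initialsKeepingHyphens` is a made-up name, not existing code: it splits on hyphens first, abbreviates each segment, and then joins the segments back with hyphens.

```javascript
// Hypothetical helper: abbreviate a first name but keep its hyphens,
// so "Jean-Philippe" becomes "J.-P." rather than "J.P.".
function initialsKeepingHyphens(firstName) {
    const abbreviateSegment = (segment) =>
        segment
            .split(/[\s.]+/)
            .filter((part) => part.length > 0)
            .map((part) => part[0].toUpperCase() + ".")
            .join("");

    return firstName
        .split("-")
        .map(abbreviateSegment)
        .filter((segment) => segment.length > 0)
        .join("-");
}

console.log(initialsKeepingHyphens("Jean-Philippe"));
console.log(initialsKeepingHyphens("Guy-Marc"));
console.log(initialsKeepingHyphens("J.D."));
```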

The second is... significantly harder. I'd like to be able to say that we're knowledgeable enough of cultural customs in a hundred countries that we could make accurate estimates of the hyphenation and middle name prevalence in each country's language. However, I'm almost certain we can't do that. I'd like to leave it an open question of how we can extrapolate the webcrawler's basketball data to build names that are indicative of what's in the greater basketball population of a country, regardless of whether a player of that specific name has played basketball in the real world.

That's a hard problem. But you could do a 90% solution: hyphenation only in the US and Canada. So sometimes, pick two last names randomly and hyphenate.

On the downside, I really fucking hate hyphenated last names and I wish they didn't exist. It's such a short-sighted solution. Does not scale. What happens if someone with a hyphenated last name marries someone with a hyphenated last name? Madness. So selfish and short-sighted of parents to do this to kids. Let alone the hassle of writing long fucking last names on every form for the rest of the kid's life. Pick one damn name.

Is it possible to solve this simply - i.e. an additional field vs. a complex algorithm? "First name initial" or similar?

Yeah, but that sounds less fun :)

kgilbert-cmu commented 8 years ago

Does not scale. What happens if someone with a hyphenated last name marries someone with a hyphenated last name?

That is exactly what I was alluding to with the examples I gave. I would have to re-roll a random name if I draw a hyphenated name for one of the hyphenated halves. Wouldn't want a "Jean-Philippe Guy-Marc Eliezer-Vanerot-Jean-Baptiste-Adolphe". Little J.P.G.M E-V-J-B-A would dominate the Hall of Fame of census forms.
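
Roughly what the re-roll could look like, as a sketch only. `pickRandom`, `pickUnhyphenated`, and `hyphenatedLastName` are hypothetical helpers, and the `rng` parameter is an assumption added so the behavior can be tested deterministically.

```javascript
// Hypothetical sketch; none of these helpers exist in the code base.
function pickRandom(names, rng) {
    return names[Math.floor(rng() * names.length)];
}

// Keep drawing until the name has no hyphen. The try limit guards
// against a pool that contains only hyphenated names.
function pickUnhyphenated(names, rng, maxTries = 100) {
    for (let i = 0; i < maxTries; i++) {
        const name = pickRandom(names, rng);
        if (!name.includes("-")) {
            return name;
        }
    }
    throw new Error("no unhyphenated name found");
}

// Build a hyphenated last name from two unhyphenated halves.
function hyphenatedLastName(lastNames, rng = Math.random) {
    const first = pickUnhyphenated(lastNames, rng);
    const second = pickUnhyphenated(lastNames, rng);
    return `${first}-${second}`;
}

console.log(hyphenatedLastName(["Eliezer-Vanerot", "Walker", "Smith"]));
```

Each half is re-rolled independently, so the result always has exactly one hyphen no matter how many hyphenated names sit in the pool.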

matsonj commented 8 years ago

It sounds like a "hybrid" solution is necessary here, so the issue is defining the scope. If we can identify what the "out of scope" names are, we can force a re-roll in those scenarios. That would, I think, solve both problems: it lets us solve the problem algorithmically so it scales, but also handles those pesky exceptions.

battaile commented 8 years ago

Is it possible to solve this simply - i.e. an additional field vs. a complex algorithm? "First name initial" or similar?

Haha, less fun or not, I like this approach; it's close to how .NET handles similar problems.

kgilbert-cmu commented 8 years ago

Exploratory data analysis:

There are 5903 unique first names defined in data/names.js. There are 7630 total across all the countries.

Of these 7630, 42 include hyphens (41 uniques), 7 include spaces (all 7 unique), and 65 include periods (54 unique).
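
A sketch of how counts like these can be tallied. The sample list is made up, and the real data/names.js is organized differently, so this only shows the counting logic; `countComplex` is a hypothetical name.

```javascript
// Made-up sample; the real names.js data is larger and shaped differently.
const sampleFirstNames = [
    "J.R.", "Jean-Michel", "Billy Ray", "J. Robert", "St. Paul", "Kris", "J.R.",
];

// For each separator, count total and unique names that contain it.
// A name containing two separators is counted once under each.
function countComplex(names) {
    const tally = (pattern) => {
        const hits = names.filter((name) => pattern.test(name));
        return { total: hits.length, unique: new Set(hits).size };
    };
    return {
        hyphens: tally(/-/),
        spaces: tally(/ /),
        periods: tally(/\./),
        // Names that land in both the spaces and the periods buckets.
        spaceAndPeriod: names.filter((name) => / /.test(name) && /\./.test(name)),
    };
}

console.log(countComplex(sampleFirstNames));
```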

"J.R." is of course the most popular initialized name.

"J. Robert" and "St. Paul" are funnily enough double-counted, being in spaces and periods at the same time, and "J. Robert" is another example of the form "J.R.".

The stunner to me is the extreme drop-off (compared to what I expected) of first and middle names. Here is the entire list of two-name names:

Billy Ray
Hot Rod
J. Robert
Ja Ja
Jay Jay
St. Paul
Chu Chu

Really? That cements in my mind that we need some sort of middle name generator, because I have a feeling that the web scraper probably picked up a name like "Jose Juan Barea" as just "Jose Barea" and "James Michael McAdoo" as just "James McAdoo". Can you quickly confirm the crawler's logic for complex names, @dumbmatter?

Another thing that surprised me is the lack of "Jr." and "III" names. We've got "Larry Nance, Jr." and "Tim Hardaway, Jr." kicking around the NBA today as we speak. Grepping the names.js file for "Jr." hilariously only pops up the name "Jrue". Whoops, wrong regex.
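
For the record, the reason "Jrue" shows up: an unescaped `.` in a regular expression matches any character, so the pattern `Jr.` matches the "Jru" in "Jrue". A small sketch with a made-up list of names:

```javascript
// Made-up sample list, only to show the difference between the patterns.
const sampleNames = ["Jrue", "Larry Nance, Jr.", "Tim Hardaway, Jr."];

const loose = /Jr./;   // "." matches any character, so "Jru" matches
const strict = /Jr\./; // "\." matches only a literal period

const looseHits = sampleNames.filter((name) => loose.test(name));
const strictHits = sampleNames.filter((name) => strict.test(name));

console.log(looseHits);
console.log(strictHits);
```

The same escaping applies to grep, where an unescaped `.` is also a wildcard.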

I think last names selection should stay as it is for right now, in the interest of not changing too many things at once. Let's return to last names as a separate issue, and deal with first names in the short term.

I propose two changes to begin with:

Let's add "Jr." and "III" randomly to last names generated in the USA. I'll start with a 3% population based on old data from the 1940s until someone can find a better source with a better estimate. It will be appended to the last name but will otherwise not affect any calculations. No names currently use either, so we won't run into any conflicts of "Quincy Smith, Jr. Jr." or "John Wayne, Sr. III". I propose 3% get Jr., and 3% of that 3% (so .09%) will get III.
102 out of 5903 scraped names are "complex" with either a period, space, or hyphen. This suggests that the "complex name frequency" is somewhere around 2%. I don't know if I buy that, because I kind of expected it to be closer to 5%, but I'll start with 2% for now. 2% of the time, if the player's name is "simple", then I will draw another "simple" name and make a complex first name. So I roll John, then roll a random uniform less than .02, and then I roll a second name. If the second name is "Jean-Marc", I roll again, otherwise, if it's "Kris", then I create the new name "John Kris". 50% of the time, I'll preemptively initialize this to create "J.K." but will otherwise keep "John Kris".
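
To make the second proposal concrete, here is a sketch under stated assumptions: `isSimple`, `drawSimple`, and `maybeComplexFirstName` are hypothetical names, the pool is made up, and `rng` is a parameter added only so the rolls can be replayed in a test.

```javascript
// A "simple" name has no space, period, or hyphen.
const isSimple = (name) => !/[ .\-]/.test(name);

// Re-roll until a simple name comes up (bounded, in case the pool has none).
function drawSimple(pool, rng, maxTries = 100) {
    for (let i = 0; i < maxTries; i++) {
        const name = pool[Math.floor(rng() * pool.length)];
        if (isSimple(name)) {
            return name;
        }
    }
    throw new Error("no simple name found");
}

// With probability complexRate, extend a simple first name with a second
// simple name; half of those are collapsed to initials such as "J.K.".
function maybeComplexFirstName(first, pool, rng = Math.random, complexRate = 0.02) {
    if (!isSimple(first) || rng() >= complexRate) {
        return first;
    }
    const second = drawSimple(pool, rng);
    if (rng() < 0.5) {
        return `${first[0]}.${second[0]}.`;
    }
    return `${first} ${second}`;
}

console.log(maybeComplexFirstName("John", ["Jean-Marc", "Kris"]));
```

With the default 2% rate, most calls return the original first name unchanged.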

Finally, it goes without saying (having been approved by Jeremy earlier in this thread) that I will update my original change to draftScouting.js to accept hyphens and periods as additional initialism characters on the draftScouting page.

Anyone got feedback? This was one half ramble and one half "I see a discrepancy in our data HULK SMASH MUST FIX."

kgilbert-cmu commented 8 years ago

I just came up with another terrible idea that I want to add. During any offseason after a player's 35th birthday, they have a 0.1% chance to throw an eventLog message that they are celebrating the birth of a son and will change their name to ${firstName} ${lastName}, Sr.
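
Something like this, as a sketch. The player object shape, `maybeBecomeSenior`, and `chanceWithin` are all hypothetical; the real eventLog API is not shown here, so the function just returns the message text.

```javascript
// Hypothetical sketch: roll once per offseason for players aged 35 or older.
// Returns the event message, or null when nothing happens.
function maybeBecomeSenior(player, rng = Math.random, chance = 0.001) {
    if (player.age < 35 || player.lastName.endsWith(", Sr.")) {
        return null;
    }
    if (rng() >= chance) {
        return null;
    }
    player.lastName = `${player.lastName}, Sr.`;
    return `${player.firstName} ${player.lastName} is celebrating the birth of a son.`;
}

// Probability that the event fires at least once across several offseasons.
function chanceWithin(offseasons, chance = 0.001) {
    return 1 - Math.pow(1 - chance, offseasons);
}

console.log(chanceWithin(5));
```

Even over five eligible offseasons the cumulative chance stays close to five times the per-season chance, so it should be rare.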

battaile commented 8 years ago

Do they actually start calling the father "Sr" as soon as the child is born though? Like I don't remember hearing the name "Ken Griffey, Sr." back in the 70s when Jr was still a kid, they just called him "Ken Griffey" until Jr started balling and they needed some way to distinguish the two.

kgilbert-cmu commented 8 years ago

I present Steve Smith, Sr.

battaile commented 8 years ago

I counter with John Stallworth and his son John Stallworth, Jr. (who I knew from being the a-hole in our fantasy football league that always reneged on trades)... http://highschoolsports.al.com/news/article/6144966258261842369/john-stallworth-selected-by-fans-as-tuscaloosas-all-time-best-football-player/

edit: As your example shows though it does happen. Seems like a rarity to me but I guess 1/1000 after age 35 would lead to it being a rarity in BBGM also.

dumbmatter commented 8 years ago

Can you quickly confirm the crawler's logic for complex names

Your post is correct. Also Jr/Sr/etc are dropped in the names list.

I like adding Jr/Sr, it'll add some character to the names. Just make sure that table sorting by last name can handle it. I can help with that if needed.
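
One way the sorting could cope, as a sketch only. This is not the existing table code; `lastNameSortKey` and `sortByLastName` are hypothetical, and it assumes the sort key is derived from the full display name, where a naive "last word" rule would pick up "Jr." instead of the surname.

```javascript
// Strip a trailing generational suffix (with or without a comma).
const SUFFIX_PATTERN = /,?\s+(Jr\.|Sr\.|III)$/;

// Sort key: the last word of the name once any suffix is removed.
function lastNameSortKey(fullName) {
    const withoutSuffix = fullName.replace(SUFFIX_PATTERN, "");
    const parts = withoutSuffix.trim().split(/\s+/);
    return parts[parts.length - 1].toLowerCase();
}

// Returns a new array sorted by last name, ignoring suffixes.
function sortByLastName(fullNames) {
    return [...fullNames].sort((a, b) =>
        lastNameSortKey(a).localeCompare(lastNameSortKey(b))
    );
}

console.log(sortByLastName(["Tim Hardaway, Jr.", "Larry Nance Jr.", "Steve Adams"]));
```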

I'm not sure if I like the other idea for complex name generation. Reasons:

Initial first names are already handled fine by the current method. Your method would produce weird initials, while the current method produces only common ones. I think it'd be weird to see players with initials like F.J. and S.W. and everything else possible. Just doesn't happen in the real world with any noticeable frequency.
For dual first names (Jose Juan) the current situation is not good, admittedly. However I don't think it's as simple as just picking two random names, you might wind up with stuff that sounds weird. Like Jose Juan sounds fine, but Shaquille Muhammad doesn't.

kgilbert-cmu commented 8 years ago

Initial first names are already handled fine by the current method. Your method would produce weird initials, while the current method produces only common ones. I think it'd be weird to see players with initials like F.J. and S.W. and everything else possible. Just doesn't happen in the real world with any noticeable frequency.

I definitely felt kind of the same way. I got a couple "C. B." and "B. R." names I didn't like. I can remove it. Another super obvious thing about it was that my initialization scheme didn't remove the spare space so it was really obvious that those names would stand out.

For dual first names (Jose Juan) the current situation is not good, admittedly. However I don't think it's as simple as just picking two random names, you might wind up with stuff that sounds weird. Like Jose Juan sounds fine, but Shaquille Muhammad doesn't.

Yes. https://i.redd.it/tvngxz1v0p6x.png

Though I would like to keep it in mind, because it does produce some truly stupendous names.

[Screenshot of generated names, 2016-07-05 11:28:54 AM]

So how about this:

keep Jr./III for English-speaking countries
remove initFn, which initializes first names
when drawing a middle name, initialize it automatically in the U.S. (so your two examples would render as Shaquille M. O'Neal and Jose J. ${lastName}) but use the name fully in foreign countries.

I'm picking the U.S. because of the immense diversity of names, but even the double-foreigns rule breaks down in the "Jasko Kenny" example, just because there's at least one dude from Sweden named 'Kenny'.

dumbmatter commented 8 years ago

I like that plan. And having weird names sometimes is okay as long as it's the exception not the rule.

kgilbert-cmu commented 7 years ago

Current PR follows agreed-upon plan:

keep Jr./III for English-speaking countries only. 0.6% of players will be First Last Jr., and 0.4% will be First Last III. These numbers can of course be adjusted, but they were guesstimated because 3% (as suggested in earlier post in this thread) was absurdly high and seemed too obvious.
initFn is gone
middle names are drawn for 5% of players. Half of that 5% will have the middle name initialized immediately, and any player from the USA (where the first name pool is most diverse) will always have it initialized. The remaining 2.5% of players from countries other than the USA will therefore display the full middle name.
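
A sketch of how the three rules above fit together. This is not the PR's code: the function names, the `ENGLISH_SPEAKING` set (its membership is a guess), and the `rng` parameter are all assumptions for illustration.

```javascript
// Assumed list; the thread does not say which countries count as English-speaking.
const ENGLISH_SPEAKING = new Set(["USA", "Canada", "England", "Australia"]);

// 0.6% Jr. and 0.4% III, English-speaking countries only.
function pickSuffix(country, rng) {
    if (!ENGLISH_SPEAKING.has(country)) {
        return "";
    }
    const roll = rng();
    if (roll < 0.006) return "Jr.";
    if (roll < 0.01) return "III";
    return "";
}

// 5% of players get a middle name. It is reduced to an initial for USA
// players always, and for other countries half of the time.
function pickMiddle(country, pool, rng) {
    if (rng() >= 0.05) {
        return "";
    }
    const middle = pool[Math.floor(rng() * pool.length)];
    const initialize = country === "USA" || rng() < 0.5;
    return initialize ? `${middle[0]}.` : middle;
}

function buildName(first, last, country, pool, rng = Math.random) {
    const middle = pickMiddle(country, pool, rng);
    const suffix = pickSuffix(country, rng);
    return [first, middle, last, suffix].filter(Boolean).join(" ");
}

console.log(buildName("Shaquille", "O'Neal", "USA", ["Muhammad"]));
```

Empty middle names and suffixes are filtered out before joining, so the common case is still plain "First Last".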

zengm-games / zengm

initialisms and abbreviations of complex names #127