Open kgilbert-cmu opened 8 years ago
Is it possible to solve this simply - i.e. an additional field vs. a complex algorithm? "First name initial" or similar?
Should a name like "J.D. Walker" appear on the draftScouting page as literally "J.D. Walker" or "J. Walker"? I presume the former.
Correct.
What about "Jean-Philippe Eliezer-Vanerot" or "Guy-Marc Jean-Baptiste-Adolphe"? Those are all legitimate names under the current generator. I'm just going to suggest changing my original code to interpret hyphens and periods in the first name as if they were spaces, so that we'll get "J.P. Eliezer-Vanerot" and "G.M. Jean-Baptiste-Adolphe".
That would be awesome. Or maybe it should be J.-P. and G.-M., I think I've seen abbreviations like that. Off to Google... yep.
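For illustration, here is a minimal sketch of that abbreviation rule. `abbreviateFirstName` and its `keepHyphens` flag are hypothetical names, not the actual draftScouting.js code. It assumes spaces, periods, and hyphens all act as separators, and that the hyphen is optionally re-emitted between initials.

```javascript
// Hypothetical helper (not the real draftScouting.js code).
// Treats spaces, periods, and hyphens in a first name as separators
// and builds initials from each resulting part.
function abbreviateFirstName(firstName, keepHyphens = true) {
    const groups = firstName
        .split("-")
        .map((group) =>
            group
                .split(/[.\s]+/)
                .filter((part) => part.length > 0)
                .map((part) => part[0].toUpperCase() + ".")
                .join(""),
        )
        .filter((group) => group.length > 0);

    // keepHyphens = true gives "J.-P."; false gives "J.P."
    return groups.join(keepHyphens ? "-" : "");
}

console.log(abbreviateFirstName("Jean-Philippe")); // J.-P.
console.log(abbreviateFirstName("Guy-Marc", false)); // G.M.
console.log(abbreviateFirstName("J.D.")); // J.D.
```

Names that are already initials, like "J.D.", come out unchanged, which matches the "literally J.D. Walker" answer above.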
The second is... significantly harder. I'd like to be able to say that we're knowledgeable enough about cultural customs in a hundred countries that we could make accurate estimates of the hyphenation and middle name prevalence in each country's language. However, I'm almost certain we can't do that. I'd like to leave it an open question of how we can extrapolate the webcrawler's basketball data to build names that are indicative of what's in the greater basketball population of a country, regardless of whether a player of that specific name has played basketball in the real world.
That's a hard problem. But you could do a 90% solution: hyphenation only in the US and Canada. So sometimes, pick two last names randomly and hyphenate.
On the downside, I really fucking hate hyphenated last names and I wish they didn't exist. It's such a short-sighted solution. Does not scale. What happens if someone with a hyphenated last name marries someone with a hyphenated last name? Madness. So selfish and short-sighted of parents to do this to kids. Let alone the hassle of writing long fucking last names on every form for the rest of the kid's life. Pick one damn name.
Is it possible to solve this simply - i.e. an additional field vs. a complex algorithm? "First name initial" or similar?
Yeah, but that sounds less fun :)
Does not scale. What happens if someone with a hyphenated last name marries someone with a hyphenated last name?
That is exactly what I was alluding to with the examples I gave. I would have to re-roll a random name if I draw a hyphenated name for one of the hyphenated halves. Wouldn't want a "Jean-Philippe Guy-Marc Eliezer-Vanerot-Jean-Baptiste-Adolphe". Little J.P.G.M. E-V-J-B-A would dominate the Hall of Fame of census forms.
It sounds like a "hybrid" solution is necessary here, so the issue is defining the scope. If we can identify what the "out of scope" names are, we can force a re-roll in those scenarios. I think that would solve both problems: it lets us solve the problem algorithmically so it scales, while also handling those pesky exceptions.
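A rough sketch of the re-roll idea, assuming "out of scope" means a half that is already hyphenated or a repeat of the other half. `drawHyphenatedLastName`, the injectable `random` argument, and the `maxRerolls` cap are all hypothetical, and the name list is assumed to be a flat array of strings.

```javascript
// Hypothetical sketch: build a hyphenated last name from two simple
// last names, re-rolling any draw that is out of scope.
function drawHyphenatedLastName(lastNames, random = Math.random, maxRerolls = 20) {
    const pick = () => lastNames[Math.floor(random() * lastNames.length)];
    const halves = [];

    for (let attempt = 0; attempt < maxRerolls && halves.length < 2; attempt++) {
        const candidate = pick();
        // Out of scope: already-hyphenated names, or repeating the first half.
        if (candidate.includes("-") || halves.includes(candidate)) {
            continue;
        }
        halves.push(candidate);
    }

    // Give up (undefined) rather than loop forever on a bad name list.
    return halves.length === 2 ? halves.join("-") : undefined;
}
```

The cap on re-rolls is a design choice for the sketch: it keeps the function from spinning if a country's list happens to contain only hyphenated names.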
Is it possible to solve this simply - i.e. an additional field vs. a complex algorithm? "First name initial" or similar?
Haha, less fun or not, I like this approach. It's close to how .NET handles similar problems.
Exploratory data analysis:
There are 5903 unique first names defined in data/names.js. There are 7630 total across all the countries.
Of these 7630, 42 include hyphens (41 unique), 7 include spaces (all 7 unique), and 65 include periods (54 unique).
"J.R." is of course the most popular initialized name.
"J. Robert" and "St. Paul" are funnily enough double-counted, being in spaces and periods at the same time, and "J. Robert" is another example of the form "J.R.".
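Counts like these can be reproduced with something along the following lines. This is only a sketch: `countComplexNames` is a hypothetical helper, and it assumes the first names have already been flattened into one array across all countries, which may not match how data/names.js is actually laid out.

```javascript
// Hypothetical sketch: count first names containing hyphens, spaces,
// or periods, both in total and as unique values.
function countComplexNames(firstNames) {
    const markers = { hyphens: "-", spaces: " ", periods: "." };
    const counts = {};

    for (const [label, marker] of Object.entries(markers)) {
        const matches = firstNames.filter((name) => name.includes(marker));
        counts[label] = {
            total: matches.length,
            unique: new Set(matches).size,
        };
    }

    return counts;
}
```

A name like "J. Robert" lands in both the spaces and periods buckets, which is the double-counting mentioned above.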
The stunner to me is the extreme drop-off (compared to what I expected) of first and middle names. Here is the entire list of two-name names:
Really? That cements in my mind that we need some sort of middle name generator, because I have a feeling that the web scraper probably picked up a name like "Jose Juan Barea" as just "Jose Barea" and "James Michael McAdoo" as just "James McAdoo". Can you quickly confirm the crawler's logic for complex names, @dumbmatter?
Another thing that surprised me is the lack of "Jr." and "III" names. We've got "Larry Nance, Jr." and "Tim Hardaway, Jr." kicking around the NBA today as we speak. Grepping the names.js file for "Jr." hilariously only pops up the name "Jrue". Whoops, wrong regex.
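For anyone repeating that search, the "wrong regex" is most likely the unescaped period, which matches any character. A small sketch (the constant and function names are made up for illustration):

```javascript
// "." is a wildcard in a regular expression, so this also matches "Jrue".
const looseJunior = /Jr./;

// Escaping the period matches only a literal "Jr.".
const strictJunior = /Jr\./;

// Hypothetical helper for checking a full name string.
function hasJuniorSuffix(name) {
    return strictJunior.test(name);
}

console.log(looseJunior.test("Jrue")); // true
console.log(hasJuniorSuffix("Jrue")); // false
console.log(hasJuniorSuffix("Larry Nance, Jr.")); // true
```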
I think last names selection should stay as it is for right now, in the interest of not changing too many things at once. Let's return to last names as a separate issue, and deal with first names in the short term.
I propose two changes to begin with:
Finally, it goes without saying (having been approved by Jeremy earlier in this thread) that I will update my original change to draftScouting.js to accept hyphens and periods as additional initialism characters on the draftScouting page.
Anyone got feedback? This was one half ramble and one half "I see a discrepancy in our data HULK SMASH MUST FIX."
I just came up with another terrible idea that I want to add. During any offseason after a player's 35th birthday, they have a 0.1% chance to throw an eventLog message that they are celebrating the birth of a son and will change their name to ${firstName} ${lastName}, Sr.
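A sketch of how that offseason check might look. The player shape (`age`, `firstName`, `lastName`), the function name, and the message wording are all assumptions for illustration, not the game's real data model or eventLog API.

```javascript
// Hypothetical sketch of the "Sr." idea: called once per offseason.
// Returns an event message if the rename happens, otherwise null.
function maybeRenameToSenior(player, random = Math.random) {
    const alreadySenior = player.lastName.endsWith(", Sr.");
    const eligible = player.age >= 35 && !alreadySenior;

    // 0.1% chance per offseason for eligible players.
    if (!eligible || random() >= 0.001) {
        return null;
    }

    player.lastName = `${player.lastName}, Sr.`;
    return `${player.firstName} ${player.lastName} is celebrating the birth of a son.`;
}
```

Passing in `random` makes the 1-in-1000 roll easy to force in a test.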
Do they actually start calling the father "Sr" as soon as the child is born though? Like I don't remember hearing the name "Ken Griffey, Sr." back in the 70s when Jr was still a kid, they just called him "Ken Griffey" until Jr started balling and they needed some way to distinguish the two.
I present Steve Smith, Sr.
I counter with John Stallworth and his son John Stallworth, Jr. (who I knew from being the a-hole in our fantasy football league that always reneged on trades) ... http://highschoolsports.al.com/news/article/6144966258261842369/john-stallworth-selected-by-fans-as-tuscaloosas-all-time-best-football-player/
edit: As your example shows though it does happen. Seems like a rarity to me but I guess 1/1000 after age 35 would lead to it being a rarity in BBGM also.
Can you quickly confirm the crawler's logic for complex names
Your post is correct. Also Jr/Sr/etc are dropped in the names list.
I like adding Jr/Sr, it'll add some character to the names. Just make sure that table sorting by last name can handle it. I can help with that if needed.
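One way the sorting concern could be handled, as a sketch only: strip the suffix before comparing. `lastNameSortKey` and `sortByLastName` are hypothetical helpers, and the suffix list is an assumption, not what the game's tables actually do.

```javascript
// Hypothetical sketch: ignore generational suffixes when sorting by last name.
function lastNameSortKey(lastName) {
    return lastName.replace(/,?\s+(Jr\.|Sr\.|II|III|IV)$/, "").toLowerCase();
}

// Returns a new array sorted by the suffix-free key.
function sortByLastName(lastNames) {
    return [...lastNames].sort((a, b) =>
        lastNameSortKey(a).localeCompare(lastNameSortKey(b)),
    );
}

console.log(sortByLastName(["Nance, Jr.", "Hardaway, Jr.", "McAdoo"]));
// [ 'Hardaway, Jr.', 'McAdoo', 'Nance, Jr.' ]
```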
I'm not sure if I like the other idea for complex name generation. Reasons:
Initial first names are already handled fine by the current method. Your method would produce weird initials, while the current method produces only common ones. I think it'd be weird to see players with initials like F.J. and S.W. and everything else possible. Just doesn't happen in the real world with any noticeable frequency.
I definitely felt kind of the same way. I got a couple "C. B." and "B. R." names I didn't like. I can remove it. Another super obvious thing about it was that my initialization scheme didn't remove the spare space so it was really obvious that those names would stand out.
For dual first names (Jose Juan) the current situation is not good, admittedly. However I don't think it's as simple as just picking two random names, you might wind up with stuff that sounds weird. Like Jose Juan sounds fine, but Shaquille Muhammad doesn't.
Yes. https://i.redd.it/tvngxz1v0p6x.png
Though I would like to keep it in mind, because it does produce some truly stupendous names.
So how about this:
Shaquille M. O'Neal and Jose J. ${lastName}) but use the name fully in foreign countries. I'm picking the U.S. because of the immense diversity of names, but even the double-foreigns rule breaks down in the "Jasko Kenny" example, just because there's at least one dude from Sweden named 'Kenny'.
I like that plan. And having weird names sometimes is okay as long as it's the exception not the rule.
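The full list of steps for this plan is not preserved above, so the following is only a guess at its shape based on the surviving examples: abbreviate the second first name to an initial for U.S. players and keep it in full elsewhere. `formatDualFirstName` and the `"USA"` country label are assumptions for illustration.

```javascript
// Hypothetical sketch: U.S. players get a middle initial,
// players from other countries keep both names in full.
function formatDualFirstName(first, second, country) {
    if (country === "USA") {
        return `${first} ${second[0].toUpperCase()}.`;
    }
    return `${first} ${second}`;
}

console.log(formatDualFirstName("Shaquille", "Muhammad", "USA")); // Shaquille M.
console.log(formatDualFirstName("Jose", "Juan", "Spain")); // Jose Juan
```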
Current PR follows agreed-upon plan:
So, now that names have somewhat stabilized and people are giving positive feedback about the new foreign names, I wanted to revisit one of the aspects I was working on right before the new name generation went into place.
(1) What do we do with complex names? and (2) How should we make complex names?
The first is the easier question, and fairly methodical. France and Canada are the two largest sources of foreign names, and we get fun easter eggs like "J.D." and "J.R." on the first names side as well as "Alexander-Walker" on the last names side. France has a dozen hyphenated names, like Jean-Emmanuel, Jean-Frederic, Jean-Michel, and Jean-Philippe, among other non-Jean names. So basically what I'm saying is that we know these names exist in our names.js database. Should a name like "J.D. Walker" appear on the draftScouting page as literally "J.D. Walker" or "J. Walker"? I presume the former. What about "Jean-Philippe Eliezer-Vanerot" or "Guy-Marc Jean-Baptiste-Adolphe"? Those are all legitimate names under the current generator. I'm just going to suggest changing my original code to interpret hyphens and periods in the first name as if they were spaces, so that we'll get "J.P. Eliezer-Vanerot" and "G.M. Jean-Baptiste-Adolphe".
The second is... significantly harder. I'd like to be able to say that we're knowledgeable enough about cultural customs in a hundred countries that we could make accurate estimates of the hyphenation and middle name prevalence in each country's language. However, I'm almost certain we can't do that. I'd like to leave it an open question of how we can extrapolate the webcrawler's basketball data to build names that are indicative of what's in the greater basketball population of a country, regardless of whether a player of that specific name has played basketball in the real world.