| Author |
Message |
ekim256
Joined: 19 Sep 2008 Posts: 3
|
Posted: Wed Oct 15, 2008 11:16 pm Post subject: Given Names, Gender, Ethnicity |
|
|
| I have a database of names. Does anyone know how I can guess their gender and ethnicity based on their first and last name? |
|
|
 |
adam Data Mining Guru
Joined: 23 Jan 2008 Posts: 20
|
Posted: Thu Oct 16, 2008 2:04 pm Post subject: |
|
|
| Do you know the primary country of residence for these people? |
|
|
 |
ekim256
Joined: 19 Sep 2008 Posts: 3
|
Posted: Thu Oct 16, 2008 2:06 pm Post subject: |
|
|
| Yes, Canada. I also know that a good portion of the sample (about 80%, if I had to guess) are immigrants. |
|
|
 |
adam Data Mining Guru
Joined: 23 Jan 2008 Posts: 20
|
Posted: Thu Oct 16, 2008 3:24 pm Post subject: |
|
|
Does anyone in your population have a known gender and ethnicity? If not, try to get a good sample of names (with genders and ethnicities) from somewhere.
This is how I see it: the first name will predict the gender and the last name will predict the ethnicity, so you're looking at two models. Say you start with the gender problem. I would use the first names to derive as many variables as you can: Last Letter, Last 2 Letters, Last 3 Letters, First Letter, First 2 Letters, First 3 Letters, Number of Vowels, Length, etc. Then use a feature selection method, or perhaps just a decision tree, to see whether any of them are predictive. If you find some predictive variables, explore them further and you may discover more.
Then you could repeat the same process for the ethnicity problem. Keep in mind that I have zero experience in computational linguistics, but this is probably how I would approach the problem. |
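The variable-derivation step described above could be sketched roughly as follows. This is only a minimal illustration in plain Python, not a tested approach: the function names (`name_features`, `information_gain`, `rank_features`) and the tiny labelled `SAMPLE` are made up for the example, and information gain is used here as one possible stand-in for "a feature selection method".

```python
import math
from collections import Counter, defaultdict

VOWELS = set("aeiou")


def name_features(first_name):
    """Derive the simple string variables listed in the post from one first name."""
    n = first_name.strip().lower()
    return {
        "last_1": n[-1:],
        "last_2": n[-2:],
        "last_3": n[-3:],
        "first_1": n[:1],
        "first_2": n[:2],
        "first_3": n[:3],
        "n_vowels": sum(ch in VOWELS for ch in n),
        "length": len(n),
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting the rows on one variable."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(names, labels):
    """Return (variable, gain) pairs sorted from most to least informative."""
    rows = [name_features(n) for n in names]
    gains = {k: information_gain([r[k] for r in rows], labels) for k in rows[0]}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy sample, invented purely for illustration; not real data.
SAMPLE = [("Maria", "F"), ("Anna", "F"), ("Julia", "F"), ("Sophia", "F"),
          ("John", "M"), ("Mark", "M"), ("Peter", "M"), ("David", "M")]

if __name__ == "__main__":
    names, labels = zip(*SAMPLE)
    for feature, gain in rank_features(names, labels):
        print(f"{feature:8s} {gain:.3f}")
```

Note that information gain favours variables with many distinct values (on a small sample, Last 3 Letters is nearly unique per name), so any ranking like this should be checked on held-out names before it is trusted. The same functions could be pointed at last names and ethnicity labels for the second model.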
|
|
 |
TimManns Data Mining Guru
Joined: 25 Sep 2006 Posts: 37 Location: Sydney
|
Posted: Thu Oct 16, 2008 6:39 pm Post subject: ethnicity for what purpose? |
|
|
Maybe I'm stating the obvious here, but I'm checking just in case.
There has not been any mention of what these will be used for...
As a data cleaning exercise for later reporting, it would be fine: say, to make sure we communicate with the customer using a message tailored to their ethnic origin, or to describe how many people of ethnic origin x, y, or z voted for a presidential candidate.
But don't use them as inputs to a predictive model for something like credit risk or insurance claims, or for anything else that limits whom you take action with (with maybe some exceptions, say for medical reasons).
Cheers
Tim |
|
|
 |
|
|
|
|
|