random: How to keep in data frame every ID once preferring largest value fist in case of tie newest in case of tie random

lundi 20 avril 2020

How to keep in data frame every ID once preferring largest value fist in case of tie newest in case of tie random

I got a large DF with about 250K obs. and 19 variables about peoples graduation data. People are in the DF multiple times. I need to make two new DF where every person appears only once (ID) per DF. My true DF has more columns but for the example i took out 3 that are needed and 2 as an example of other background information in my DF.

ID - a persons unique ID

Level - Shows the level of education. Higher the number the better: Example 8 = Phd, 7 = Masters

Year - The year of the graduation

Field - background information on the field of study. We need to keep this data.

Gender - Gender of the person. We need to keep this data.

 Orignal DF named "Graduations"
ID    Level     Year     Field     Gender
1       4       2016      31         M
1       5       2016      43         M
2       6       2010      12         F
2       7       2012      12         F
2       8       2017      19         F
3       5       2011      12         F
3       5       2009      31         F
4       6       2018      43         M
4       6       2018      44         M
5       5       2015      19         M
5       6       2011      32         M

DF 1 is named "Highest" meaning highest level of graduation. In this DF we only keep rows where the person has reached their highest level of education. If a person has reached it more then once then we keep the row with the highest and newest graduation and if both are a tie then we choose one randomly.

 "Highest"
ID    Level     Year     Field     Gender
1       5       2016      43         M   (Row kept because it was the highest level)
2       8       2017      19         F   (Row kept because it was the highest level)
3       5       2011      12         F   (Row kept because it was the highest level and newest)
4       6       2018      43         M   (Row chosen randomly)
5       6       2011      32         M   (Row Kept because it was the Highest level)

DF 2 is named "Last" meaning the last obtained level of graduation. In this DF we only keep row with the newest data. If a person has had more then one graduation per year then we choose the highest level if that is also a tie then at random (Does not need to be the same random row as was in DF1)

 "Last"
ID    Level     Year     Field     Gender
1       5       2016      43         M     (Row kept because it was the newest and highest level)
2       8       2017      19         F     (Row kept because it was the newest)
3       5       2011      12         F     (Row kept because it was the newest)
4       6       2018      44         M     (Row chosen randomly)
5       5       2015      19         M     (Row Kept because it was the newest)

I tried to search for it on the stackoverflow but did not find what i was looking for. I might have used wrong keywords.

I prefer base R functions but if needed the most common packages are also ok. This is because this DF is on a secure computer and installing uncommon packages takes forever as i must make a request for it. Most common ones are already installed.

Thank you for your help.

random

lundi 20 avril 2020

How to keep in data frame every ID once preferring largest value fist in case of tie newest in case of tie random

Aucun commentaire:

Enregistrer un commentaire