lundi 20 avril 2020

How to keep in data frame every ID once preferring largest value fist in case of tie newest in case of tie random

I got a large DF with about 250K obs. and 19 variables about peoples graduation data. People are in the DF multiple times. I need to make two new DF where every person appears only once (ID) per DF. My true DF has more columns but for the example i took out 3 that are needed and 2 as an example of other background information in my DF.

ID - a persons unique ID

Level - Shows the level of education. Higher the number the better: Example 8 = Phd, 7 = Masters

Year - The year of the graduation

Field - background information on the field of study. We need to keep this data.

Gender - Gender of the person. We need to keep this data.

 Orignal DF named "Graduations"
ID    Level     Year     Field     Gender
1       4       2016      31         M
1       5       2016      43         M
2       6       2010      12         F
2       7       2012      12         F
2       8       2017      19         F
3       5       2011      12         F
3       5       2009      31         F
4       6       2018      43         M
4       6       2018      44         M
5       5       2015      19         M
5       6       2011      32         M

DF 1 is named "Highest" meaning highest level of graduation. In this DF we only keep rows where the person has reached their highest level of education. If a person has reached it more then once then we keep the row with the highest and newest graduation and if both are a tie then we choose one randomly.

 "Highest"
ID    Level     Year     Field     Gender
1       5       2016      43         M   (Row kept because it was the highest level)
2       8       2017      19         F   (Row kept because it was the highest level)
3       5       2011      12         F   (Row kept because it was the highest level and newest)
4       6       2018      43         M   (Row chosen randomly)
5       6       2011      32         M   (Row Kept because it was the Highest level)

DF 2 is named "Last" meaning the last obtained level of graduation. In this DF we only keep row with the newest data. If a person has had more then one graduation per year then we choose the highest level if that is also a tie then at random (Does not need to be the same random row as was in DF1)

 "Last"
ID    Level     Year     Field     Gender
1       5       2016      43         M     (Row kept because it was the newest and highest level)
2       8       2017      19         F     (Row kept because it was the newest)
3       5       2011      12         F     (Row kept because it was the newest)
4       6       2018      44         M     (Row chosen randomly)
5       5       2015      19         M     (Row Kept because it was the newest)

I tried to search for it on the stackoverflow but did not find what i was looking for. I might have used wrong keywords.

I prefer base R functions but if needed the most common packages are also ok. This is because this DF is on a secure computer and installing uncommon packages takes forever as i must make a request for it. Most common ones are already installed.

Thank you for your help.




Aucun commentaire:

Enregistrer un commentaire