I got a large DF with about 250K obs. and 19 variables about peoples graduation data. People are in the DF multiple times. I need to make two new DF where every person appears only once (ID) per DF. My true DF has more columns but for the example i took out 3 that are needed and 2 as an example of other background information in my DF.
ID - a persons unique ID
Level - Shows the level of education. Higher the number the better: Example 8 = Phd, 7 = Masters
Year - The year of the graduation
Field - background information on the field of study. We need to keep this data.
Gender - Gender of the person. We need to keep this data.
Orignal DF named "Graduations" ID Level Year Field Gender 1 4 2016 31 M 1 5 2016 43 M 2 6 2010 12 F 2 7 2012 12 F 2 8 2017 19 F 3 5 2011 12 F 3 5 2009 31 F 4 6 2018 43 M 4 6 2018 44 M 5 5 2015 19 M 5 6 2011 32 M
DF 1 is named "Highest" meaning highest level of graduation. In this DF we only keep rows where the person has reached their highest level of education. If a person has reached it more then once then we keep the row with the highest and newest graduation and if both are a tie then we choose one randomly.
"Highest" ID Level Year Field Gender 1 5 2016 43 M (Row kept because it was the highest level) 2 8 2017 19 F (Row kept because it was the highest level) 3 5 2011 12 F (Row kept because it was the highest level and newest) 4 6 2018 43 M (Row chosen randomly) 5 6 2011 32 M (Row Kept because it was the Highest level)
DF 2 is named "Last" meaning the last obtained level of graduation. In this DF we only keep row with the newest data. If a person has had more then one graduation per year then we choose the highest level if that is also a tie then at random (Does not need to be the same random row as was in DF1)
"Last" ID Level Year Field Gender 1 5 2016 43 M (Row kept because it was the newest and highest level) 2 8 2017 19 F (Row kept because it was the newest) 3 5 2011 12 F (Row kept because it was the newest) 4 6 2018 44 M (Row chosen randomly) 5 5 2015 19 M (Row Kept because it was the newest)
I tried to search for it on the stackoverflow but did not find what i was looking for. I might have used wrong keywords.
I prefer base R functions but if needed the most common packages are also ok. This is because this DF is on a secure computer and installing uncommon packages takes forever as i must make a request for it. Most common ones are already installed.
Thank you for your help.
Aucun commentaire:
Enregistrer un commentaire