I am trying to build a machine learning model to predict which bank loans will default. My dataset has over 540,000 rows and I would like to sample it down to fewer than 100,000. I am currently considering two options: randomly sampling in proportion to how often each U.S. state occurs in the dataset, or using the Bureau of Economic Analysis' 8 economic regions and selecting one of them:
- New England: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island and Vermont
- Mideast: Delaware, District of Columbia, Maryland, New Jersey, New York and Pennsylvania
- Great Lakes: Illinois, Indiana, Michigan, Ohio and Wisconsin
- Plains: Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota and South Dakota
- Southeast: Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, Virginia and West Virginia
- Southwest: Arizona, New Mexico, Oklahoma and Texas
- Rocky Mountain: Colorado, Idaho, Montana, Utah and Wyoming
- Far West: Alaska, California, Hawaii, Nevada, Oregon and Washington
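To make the second option concrete, here is a minimal sketch of mapping states to the regions listed above and keeping a single region. It assumes a pandas DataFrame with a `State` column of two-letter postal codes; the column name and the helper names `add_region` and `select_region` are hypothetical, not taken from my actual dataset.

```python
import pandas as pd

# BEA regions from the list above, written as two-letter postal codes
# (assumption: the dataset stores states in this form).
BEA_REGIONS = {
    "New England": ["CT", "ME", "MA", "NH", "RI", "VT"],
    "Mideast": ["DE", "DC", "MD", "NJ", "NY", "PA"],
    "Great Lakes": ["IL", "IN", "MI", "OH", "WI"],
    "Plains": ["IA", "KS", "MN", "MO", "NE", "ND", "SD"],
    "Southeast": ["AL", "AR", "FL", "GA", "KY", "LA", "MS",
                  "NC", "SC", "TN", "VA", "WV"],
    "Southwest": ["AZ", "NM", "OK", "TX"],
    "Rocky Mountain": ["CO", "ID", "MT", "UT", "WY"],
    "Far West": ["AK", "CA", "HI", "NV", "OR", "WA"],
}

# Invert the mapping so each state code points to its region.
STATE_TO_REGION = {
    state: region
    for region, states in BEA_REGIONS.items()
    for state in states
}


def add_region(df: pd.DataFrame, state_col: str = "State") -> pd.DataFrame:
    """Return a copy of df with a 'Region' column derived from state_col."""
    out = df.copy()
    out["Region"] = out[state_col].map(STATE_TO_REGION)
    return out


def select_region(df: pd.DataFrame, region: str,
                  state_col: str = "State") -> pd.DataFrame:
    """Keep only the rows whose state belongs to the given BEA region."""
    return df[df[state_col].isin(BEA_REGIONS[region])]
```

For example, `select_region(loans, "Great Lakes")` would keep only loans from Illinois, Indiana, Michigan, Ohio and Wisconsin, where `loans` stands in for the full dataset.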
Across states, the highest average default rate is ~40% and the lowest is ~6%.
The standard deviation of default rates for each of the 8 regions is as follows (with row counts):
- New England: 6.07% (58k)
- Mideast: 7.65% (61k)
- Great Lakes: 6.75% (101k)
- Plains: 6.62% (54k)
- Southeast: 8.78% (127k)
- Southwest: 3.56% (27k)
- Rocky Mountain: 5.3% (25k)
- Far West: 8.5% (86k)
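For comparison, the first option (sampling in proportion to how often each state occurs) could be sketched as below. This is only an illustration under assumptions: the column names `State` and `Default` and the function name `stratified_sample` are hypothetical, and it relies on pandas' `DataFrameGroupBy.sample` (available from pandas 1.1).

```python
import pandas as pd


def stratified_sample(df: pd.DataFrame, n_total: int,
                      strata_cols=("State",), seed: int = 42) -> pd.DataFrame:
    """Draw roughly n_total rows, keeping each stratum's share of the data.

    Every group defined by strata_cols is sampled with the same fraction,
    so the state proportions in the sample match those in the full data
    (up to rounding within each group).
    """
    frac = n_total / len(df)
    return df.groupby(list(strata_cols)).sample(frac=frac, random_state=seed)
```

Passing `strata_cols=("State", "Default")` instead would also keep the default rate within each state close to its value in the full dataset, which may matter given how widely the rates vary between states.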
Is there a standardised way of deciding this, or is it just done by feel? If anyone could steer me in the right direction I'd really appreciate it. Thank you very much.