Monday, June 15, 2020

Using original data in output of random forest

I have built a random forest model with sklearn in Python to predict 'pages' from several size features. My data also contains a variable 'seconds' that was originally used to determine 'pages', but I do not want it to be one of the features used to predict 'pages'. Is there any way to still include this variable in the output? My code is below, and a sample of the data was attached as an image. I would like the output to include 'seconds', even though it is not in the test data. Thank you!

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and target; 'seconds' is deliberately left out of X
X = dataset[['size2', 'size3', 'size4']]
y = dataset["pages"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale the features (fit on the training split only).
# Note: after scaling, X_train and X_test are NumPy arrays, so the original row index is gone.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Combine the scaled test features with the predictions
df = pd.DataFrame(X_test)
pred = pd.concat([df, pd.Series(y_pred, name="label")], axis=1)
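One possible way to get 'seconds' into the output, sketched here under some assumptions: train_test_split keeps the original row labels on the DataFrame pieces it returns, so if the scaled arrays are stored under separate names (X_train_scaled and X_test_scaled are names I made up), the unscaled X_test still carries the index needed to look 'seconds' up in dataset. The small dataset below is invented purely so the snippet runs on its own. It is not the real data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real dataset (values are made up for illustration).
dataset = pd.DataFrame({
    "size2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "size3": [2, 1, 4, 3, 6, 5, 8, 7, 10, 9],
    "size4": [5, 3, 6, 2, 7, 1, 8, 4, 9, 0],
    "seconds": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    "pages": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
})

features = ["size2", "size3", "size4"]
X = dataset[features]          # 'seconds' is not a feature
y = dataset["pages"]

# The returned DataFrames keep the row labels of the original dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Store the scaled arrays under new names so X_test keeps its index.
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)

# Build the output from the unscaled test rows, then look up 'seconds' by row label.
pred = X_test.copy()
pred["seconds"] = dataset.loc[X_test.index, "seconds"]
pred["label"] = y_pred
print(pred)
```

The model never sees 'seconds'; it is only joined back onto the test rows afterwards. Another option would be to pass dataset["seconds"] as an extra argument to train_test_split, which splits every array it is given with the same row selection.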


