You are using this method in both training and testing:
def encode_string(cat_features):
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    enc_cat_features = enc.transform(cat_features)
    ohe = preprocessing.OneHotEncoder()
    encoded = ohe.fit(enc_cat_features.reshape(-1, 1))
    return encoded.transform(enc_cat_features.reshape(-1, 1)).toarray()
by calling:
Features = encode_string(combined_custs['CountryRegionName'])
for col in categorical_columns:
    temp = encode_string(combined_custs[col])
    Features = np.concatenate([Features, temp], axis=1)
But as I said in my comment above, you need to apply the same preprocessing to the test data as you did to the training data.
What happens here is that during testing the encoders are fitted again, this time on x_test_data, so the encoding depends on which values happen to be present there. A string value that got the number 0 during training may now get the number 1, and the order (and possibly the number) of columns in your final Features changes.
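The mismatch described above can be reproduced with a small sketch. The country values here are made up for illustration and are not taken from your data. The only behaviour it relies on is that LabelEncoder assigns integers according to the position of each value in its fitted classes_ array.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values, for illustration only
train_values = ["France", "Germany", "United States"]
test_values = ["Germany", "United States"]  # "France" does not occur at test time

train_enc = LabelEncoder().fit(train_values)
test_enc = LabelEncoder().fit(test_values)  # the mistake: refitting on test data

# The integer code of a value is its index in classes_
train_codes = {str(c): i for i, c in enumerate(train_enc.classes_)}
test_codes = {str(c): i for i, c in enumerate(test_enc.classes_)}

print(train_codes)
print(test_codes)
```

In this sketch "Germany" is encoded differently by the two encoders, and the one-hot output built on top of them would have three columns in training but only two in testing.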
To solve this, you need to save the LabelEncoder and OneHotEncoder for each column separately.
So during training, do this:
import pickle
from sklearn import preprocessing

def encode_string(cat_features):
    # cat_features is assumed to be a pandas Series; its name is the column name
    col_name = cat_features.name
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    enc_cat_features = enc.transform(cat_features)
    # Save the fitted LabelEncoder for this column
    with open('./' + col_name + '_encoder.pickle', 'wb') as encoder_file:
        pickle.dump(enc, encoder_file)
    ohe = preprocessing.OneHotEncoder()
    encoded = ohe.fit(enc_cat_features.reshape(-1, 1))
    # Same for the fitted OneHotEncoder
    with open('./' + col_name + '_ohe.pickle', 'wb') as ohe_file:
        pickle.dump(ohe, ohe_file)
    return encoded.transform(enc_cat_features.reshape(-1, 1)).toarray()
And then, during testing:
def encode_string(cat_features):
    # cat_features is assumed to be a pandas Series; its name is the column name
    col_name = cat_features.name
    # Load the previously saved LabelEncoder
    with open('./' + col_name + '_encoder.pickle', 'rb') as file:
        enc = pickle.load(file)
    # No fitting, only transform
    enc_cat_features = enc.transform(cat_features)
    # Same for the OneHotEncoder
    with open('./' + col_name + '_ohe.pickle', 'rb') as file:
        ohe = pickle.load(file)
    return ohe.transform(enc_cat_features.reshape(-1, 1)).toarray()
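The two functions above can also be folded into one helper with a flag, which makes the train/test round trip easy to check. This is only a sketch under some assumptions: encode_column, its fit and directory parameters, and the sample data are hypothetical names introduced here, and each column is assumed to be a pandas Series whose name is used for the pickle file names.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn import preprocessing


def encode_column(series, directory, fit):
    """One-hot encode a Series (hypothetical helper).

    fit=True: fit both encoders and save them. fit=False: load the saved ones.
    """
    enc_path = os.path.join(directory, series.name + "_encoder.pickle")
    ohe_path = os.path.join(directory, series.name + "_ohe.pickle")
    if fit:
        enc = preprocessing.LabelEncoder().fit(series)
        ohe = preprocessing.OneHotEncoder().fit(enc.transform(series).reshape(-1, 1))
        with open(enc_path, "wb") as f:
            pickle.dump(enc, f)
        with open(ohe_path, "wb") as f:
            pickle.dump(ohe, f)
    else:
        with open(enc_path, "rb") as f:
            enc = pickle.load(f)
        with open(ohe_path, "rb") as f:
            ohe = pickle.load(f)
    # Transform only; the fitted mapping from training is reused as-is
    codes = enc.transform(series).reshape(-1, 1)
    return ohe.transform(codes).toarray()


# Made-up data, for illustration only
train = pd.DataFrame({"CountryRegionName": ["France", "Germany", "United States"]})
test = pd.DataFrame({"CountryRegionName": ["United States", "Germany"]})

with tempfile.TemporaryDirectory() as tmp:
    train_features = encode_column(train["CountryRegionName"], tmp, fit=True)
    test_features = encode_column(test["CountryRegionName"], tmp, fit=False)

print(train_features.shape, test_features.shape)
```

Because the test call reuses the encoders fitted on the training data, both matrices have the same columns in the same order. Note that LabelEncoder.transform raises a ValueError if the test data contains a value it never saw during fitting, so such values need to be handled before encoding.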