[Solved] Why couldn’t I predict directly using Features Matrix?


You are using this method in both training and testing:

def encode_string(cat_features):
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    enc_cat_features = enc.transform(cat_features)
    ohe = preprocessing.OneHotEncoder()
    encoded = ohe.fit(enc_cat_features.reshape(-1,1))
    return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()

by calling:

Features = encode_string(combined_custs['CountryRegionName'])
for col in categorical_columns:
    temp = encode_string(combined_custs[col])
    Features = np.concatenate([Features, temp],axis=1)

But as I said in my comment above, you need to apply the same preprocessing to the test data as you did to the training data.

What happens here is that, during testing, the encoders are fitted again on x_test_data, so the encoding depends on which values appear in that data (LabelEncoder numbers the sorted unique values it sees). A string value that got the number 0 during training may now get the number 1, and the number and order of columns in your final Features change.
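
A small sketch of the effect, assuming made-up country values (the two lists are illustrative and not taken from the question's data). Fitting a fresh LabelEncoder on each set gives the same string a different integer code, and a different number of classes:

```python
from sklearn import preprocessing

# Hypothetical values standing in for a column such as CountryRegionName
train_values = ["Australia", "Canada", "France", "Germany"]
test_values = ["Canada", "Germany"]  # the test data contains fewer categories

train_enc = preprocessing.LabelEncoder().fit(train_values)
test_enc = preprocessing.LabelEncoder().fit(test_values)  # refitting is the mistake

# The same string maps to different integers under the two encoders
train_code = int(train_enc.transform(["Germany"])[0])
test_code = int(test_enc.transform(["Germany"])[0])
print(train_code, test_code)

# The one-hot width follows the number of classes, so it differs as well
print(len(train_enc.classes_), len(test_enc.classes_))
```

With these values, "Germany" is encoded as 3 by the training encoder and as 1 by the refitted one, and the one-hot output would have 4 columns in training but only 2 in testing.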

To solve this, you need to save the LabelEncoder and OneHotEncoder for each column separately.

So during training, do this:

import pickle
from sklearn import preprocessing

def encode_string(cat_features):
    # cat_features is a pandas Series; its name (the column name) keys the saved files
    col_name = cat_features.name

    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    enc_cat_features = enc.transform(cat_features)

    # Save the fitted LabelEncoder for this column
    with open('./' + col_name + '_encoder.pickle', 'wb') as encoder_file:
        pickle.dump(enc, encoder_file)

    ohe = preprocessing.OneHotEncoder()
    encoded = ohe.fit(enc_cat_features.reshape(-1,1))

    # Same for the fitted OneHotEncoder
    with open('./' + col_name + '_ohe.pickle', 'wb') as ohe_file:
        pickle.dump(encoded, ohe_file)

    return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()

And then, during testing:

import pickle

def encode_string(cat_features):
    # Use the column name to find the files saved during training
    col_name = cat_features.name

    # Load the previously saved LabelEncoder
    with open('./' + col_name + '_encoder.pickle', 'rb') as file:
        enc = pickle.load(file)

    # No fitting, only transform
    enc_cat_features = enc.transform(cat_features)

    # Same for the OneHotEncoder
    with open('./' + col_name + '_ohe.pickle', 'rb') as file:
        encoded = pickle.load(file)

    return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()
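
To check the idea end to end, here is a minimal self-contained sketch of the fit-once, transform-later round trip. The helper names `fit_and_save` and `load_and_transform`, the sample values, and the temporary directory are all assumptions for illustration. It also bundles both encoders into a single pickle file per column, purely to keep the sketch short:

```python
import os
import pickle
import tempfile

from sklearn import preprocessing

def fit_and_save(values, col_name, out_dir):
    """Training: fit both encoders on one column and pickle them."""
    enc = preprocessing.LabelEncoder().fit(values)
    codes = enc.transform(values).reshape(-1, 1)
    ohe = preprocessing.OneHotEncoder().fit(codes)
    with open(os.path.join(out_dir, col_name + "_encoders.pickle"), "wb") as f:
        pickle.dump((enc, ohe), f)
    return ohe.transform(codes).toarray()

def load_and_transform(values, col_name, out_dir):
    """Testing: load the fitted encoders and only transform."""
    with open(os.path.join(out_dir, col_name + "_encoders.pickle"), "rb") as f:
        enc, ohe = pickle.load(f)
    codes = enc.transform(values).reshape(-1, 1)
    return ohe.transform(codes).toarray()

# Hypothetical data for one categorical column
train_values = ["Australia", "Canada", "France", "Germany"]
test_values = ["Germany", "Canada"]

with tempfile.TemporaryDirectory() as tmp:
    train_features = fit_and_save(train_values, "CountryRegionName", tmp)
    test_features = load_and_transform(test_values, "CountryRegionName", tmp)

# Both matrices have the same number of columns, in the same order
print(train_features.shape, test_features.shape)
```

Here the test matrix keeps all 4 columns even though only 2 categories occur in the test values. Note that LabelEncoder.transform raises a ValueError for a value it did not see during fitting, so test data with new categories still needs separate handling.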
