You cannot use
target, predicted = pipe.fit_predict(df)
with your pipeline, because fit_predict() is only available when the final estimator of the pipeline implements fit_predict() itself, and a regressor such as LinearRegression does not. From the documentation:
Valid only if the final estimator implements fit_predict.
Also, fit_predict() returns only the predictions, so you cannot unpack the result with target, predicted =
but should assign it to a single name with predicted =
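As a quick check of this point, here is a minimal sketch showing that a pipeline exposes fit_predict() only when its final step has it. The toy array is made up, and KMeans is used purely as an example of an estimator that does implement fit_predict; it is not part of your code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration only
X = np.arange(20, dtype=float).reshape(10, 2)

# Final step is a regressor: it has fit() and predict(), but no fit_predict()
reg_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])

# Final step is a clusterer that implements fit_predict()
clu_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KMeans(n_clusters=2, n_init=10, random_state=0)),
])

reg_has = hasattr(reg_pipe, 'fit_predict')
clu_has = hasattr(clu_pipe, 'fit_predict')
print(reg_has, clu_has)

# fit_predict() returns one array, so it is assigned to a single name
labels = clu_pipe.fit_predict(X)
print(labels.shape)
```

For a regressor as the last step, the usual route is pipe.fit(X, y) followed by pipe.predict(X).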
You got the error
ValueError: setting an array element with a sequence.
because StandardScaler() is handed the (data, y) tuple returned by MakeFeatures.transform() instead of a single 2-D array of features.
This happens because the call pipe.fit_predict(df)
passes only an ‘X’ and no ‘y’ to the pipeline. That is fine for the first step, “MakeFeatures”, since it accepts an ‘X’ and returns ‘data’ and ‘y’. But the pipeline never picks up a ‘y’ from a transformer's output: it forwards the whole return value as the ‘X’ of the next step, and the only ‘y’ it uses is the one passed at the start of the fit_predict() call.
Have a look at the documentation of the method here: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.fit_predict
It states for the ‘y’ parameter:
Training targets. Must fulfill label requirements for all steps of the pipeline.
So that ‘y’ is passed to every step of the pipeline, but you did not supply one, so it is None.
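To make the handling of ‘y’ concrete, here is a small sketch, assuming made-up toy data and a hypothetical pass-through transformer called RecordY (not part of scikit-learn or of your code) that simply remembers which ‘y’ it was fitted with:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordY(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through step that stores the y it receives in fit()."""

    def fit(self, X, y=None):
        self.seen_y_ = y
        return self

    def transform(self, X):
        return X


# Toy data, made up for illustration only
X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

pipe = Pipeline([('record', RecordY()), ('scaler', StandardScaler())])

# No y given to the pipeline: the step is fitted with y=None
pipe.fit(X)
seen_without_y = pipe.named_steps['record'].seen_y_

# y given at the start of the call: the same y reaches the step
pipe.fit(X, y)
seen_with_y = pipe.named_steps['record'].seen_y_

print(seen_without_y)
print(seen_with_y)
```

The only ‘y’ a step ever sees is the one given to the pipeline call itself; nothing a transformer returns is treated as a target.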
What effectively happens in your current pipeline is therefore this:
makeF = MakeFeatures(df, 2, 24)
transformed_df = makeF.fit_transform(df)  # a (data, y) tuple, not a 2-D array
sc = StandardScaler()
sc.fit(transformed_df)
and the last call raises ValueError: setting an array element with a sequence.
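The failure can be reproduced in isolation. The following is a sketch under the assumption that the scaler receives a (features, target) tuple like the one your transformer returns. The small DataFrame and list are made up, and the exact error message may vary with the NumPy and scikit-learn versions installed.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the (data, y) pair returned by the transformer
features = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                         'b': [4.0, 3.0, 2.0, 1.0]})
target = [10, 20, 30, 40]

# Passing the tuple: the scaler cannot turn it into one numeric 2-D array
try:
    StandardScaler().fit((features, target))
    tuple_failed = False
except ValueError as err:
    tuple_failed = True
    print(type(err).__name__)

# Passing only the feature table works as expected
scaler = StandardScaler().fit(features)
print(scaler.mean_)
```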
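So I suggest updating your code like this, building the features and the target outside the pipeline and keeping only the scaler and the model inside it: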
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Reproducible toy data: 100 observations at 20-minute intervals
np.random.seed(1)
rows, cols = 100, 1
data = np.random.randint(100, size=(rows, cols))
tidx = pd.date_range('2019-01-01', periods=rows, freq='20min')
df = pd.DataFrame(data, columns=['num_orders'], index=tidx)

class MakeFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, X, max_lag=None, rolling_mean_day=None, rolling_mean_month=None):
        # Resample to hourly totals once, at construction time
        self.X = X.resample('1h').sum()
        self.max_lag = max_lag
        self.rolling_mean_day = rolling_mean_day
        self.rolling_mean_month = rolling_mean_month

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        data = pd.DataFrame(index=self.X.index)
        data['num_orders'] = self.X['num_orders']
        data['year'] = self.X.index.year
        data['month'] = self.X.index.month
        data['day'] = self.X.index.day
        data['dayofweek'] = self.X.index.dayofweek
        data['detrend'] = self.X['num_orders'].shift() - self.X['num_orders']
        if self.max_lag:
            for lag in range(1, self.max_lag + 1):
                data['lag_{}'.format(lag)] = data['detrend'].shift(lag)
        if self.rolling_mean_day:
            data['rolling_mean_24'] = data['detrend'].shift().rolling(self.rolling_mean_day).mean()
        if self.rolling_mean_month:
            # Note: this reuses (and overwrites) the 'rolling_mean_24' column
            data['rolling_mean_24'] = data['detrend'].shift().rolling(self.rolling_mean_month).mean()
        # Drop 'year' when it is constant, since it then carries no information
        if data['year'].nunique() == 1:
            data = data.drop('year', axis=1)
        data = data.dropna()
        # Split the target off from the features
        y = data['num_orders']
        data = data.drop('num_orders', axis=1)
        return data, list(y)

# The pipeline now contains only steps that take X and return a single array
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('Model', LinearRegression())
])

# Build the features and the target outside the pipeline
makeF = MakeFeatures(df, 2, 24)
makeF.fit(df)
data, y = makeF.transform(df)
pipe.fit(data, y)  # y is the target returned by MakeFeatures.transform()
Then you can use your pipeline to predict and evaluate the performance, for instance with r2_score (computed here on the same data the pipeline was fitted on):
from sklearn.metrics import r2_score
predictions = pipe.predict(data)
r2_score(y,predictions)