[Solved] Hello, two questions about sklearn.Pipeline with custom transformer for timeseries [closed]

Question

You cannot use

target, predicted = pipe.fit_predict(df)

with the pipeline you defined, because fit_predict() is only available when the final estimator implements that method itself. From the documentation:

Valid only if the final estimator implements fit_predict.

Also, it would return only the predictions, so you cannot unpack the result with target, predicted = ...; use predicted = ... instead.
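
To see this for yourself, here is a minimal sketch that checks whether a pipeline exposes fit_predict(). The toy data and the variable names (X_demo, reg_pipe, cluster_pipe) are made up for illustration, and KMeans is used only as an example of an estimator that does implement fit_predict():

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.rand(30, 2)  # toy data, not the data from the question

# LinearRegression has no fit_predict(), so the pipeline does not expose one either
reg_pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
reg_has_fit_predict = hasattr(reg_pipe, "fit_predict")

# KMeans implements fit_predict(), so a pipeline ending in KMeans exposes it
cluster_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
cluster_has_fit_predict = hasattr(cluster_pipe, "fit_predict")

# fit_predict() returns one array (here: cluster labels), not a (target, predicted) pair
labels = cluster_pipe.fit_predict(X_demo)
print(reg_has_fit_predict, cluster_has_fit_predict, labels.shape)
```

For a regressor as the last step, call fit() and then predict() instead.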

You got the error

ValueError: setting an array element with a sequence.

because StandardScaler() does not receive a plain 2-D array: it is handed the output of MakeFeatures.transform(), which is a (data, y) tuple.

With the call pipe.fit_predict(df) you pass only an ‘X’ and no ‘y’ to the pipeline. That is fine for the first step of the pipeline, “MakeFeatures”, since it accepts an ‘X’ and returns ‘data’ and ‘y’. But the pipeline never uses that returned ‘y’ as the target: the target has to be passed in at the start of the fit_predict() call, and whatever a step returns is handed to the next step as its ‘X’.

Have a look at the documentation of the method here: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.fit_predict

For the ‘y’ parameter it states:

Training targets. Must fulfill label requirements for all steps of the
pipeline.

So the ‘y’ you pass in is used as the ‘y’ for every step of the pipeline. You did not pass one, so it is None.
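
A small sketch can make this visible. RecordY is a hypothetical pass-through transformer (not part of scikit-learn or of your code) that just remembers which ‘y’ the pipeline handed to it; the data is random toy data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class RecordY(BaseEstimator, TransformerMixin):
    """Hypothetical pass-through step that stores the y it was fitted with."""

    def fit(self, X, y=None):
        self.seen_y_ = y
        return self

    def transform(self, X):
        return X


rng = np.random.RandomState(0)
X_demo = rng.rand(20, 3)       # toy features
y_demo = X_demo.sum(axis=1)    # toy target

# fit() called with X only: the step sees y=None
only_x = Pipeline([("record", RecordY())])
only_x.fit(X_demo)
y_without = only_x.named_steps["record"].seen_y_

# fit() called with X and y: the step sees exactly the y that was passed in
with_y = Pipeline([("record", RecordY()), ("model", LinearRegression())])
with_y.fit(X_demo, y_demo)
y_with = with_y.named_steps["record"].seen_y_

print(y_without is None, np.array_equal(y_with, y_demo))
```

A step can only read the ‘y’ it is given; it has no way to create or replace the target for the steps after it.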

What basically happens with your current pipeline is therefore this:

makeF = MakeFeatures(df, 2 , 24)
transformed_df = makeF.fit_transform(df)

sc = StandardScaler()
sc.fit(transformed_df)

and that last call raises ValueError: setting an array element with a sequence.
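
The same failure can be reproduced without your data. In this sketch ReturnsTuple is a hypothetical stand-in for a transformer whose transform() returns a (features, target) pair; it is not your MakeFeatures class, and the input is random toy data:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ReturnsTuple(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that returns (features, target) instead of one array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # first column plays the role of the target, the rest are the features
        return X[:, 1:], list(X[:, 0])


rng = np.random.RandomState(0)
X_demo = rng.rand(10, 3)

broken_pipe = Pipeline([("features", ReturnsTuple()), ("scaler", StandardScaler())])

error = None
try:
    # the scaler is handed the whole tuple as its X and cannot turn it into a 2-D array
    broken_pipe.fit(X_demo)
except ValueError as exc:
    error = exc

print(type(error).__name__)
```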

So I suggest updating your code like this:

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline 
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LinearRegression

np.random.seed(1)

rows,cols = 100,1
data = np.random.randint(100, size = (rows,cols))
tidx = pd.date_range('2019-01-01', periods=rows, freq='20min') 
df = pd.DataFrame(data, columns=['num_orders'], index=tidx)
      
class MakeFeatures(BaseEstimator, TransformerMixin):

  def __init__(self, X, max_lag = None, rolling_mean_day = None, rolling_mean_month = None):
      self.X = X.resample('1H').sum()  # on pandas >= 2.2 use '1h'; the upper-case 'H' alias is deprecated there
      self.max_lag = max_lag
      self.rolling_mean_day = rolling_mean_day
      self.rolling_mean_month = rolling_mean_month
          
  def fit(self, X, y=None):  # accept y so the signature matches the scikit-learn convention
      return self

  def transform(self, X):
      data = pd.DataFrame(index = self.X.index)
      data['num_orders'] = self.X['num_orders']
      data['year'] = self.X.index.year
      data['month'] = self.X.index.month
      data['day'] = self.X.index.day
      data['dayofweek'] = self.X.index.dayofweek
      
      data['detrend'] = self.X['num_orders'].shift() - self.X['num_orders']
      
      if self.max_lag:
          for lag in range(1, self.max_lag + 1):
              data['lag_{}'.format(lag)] = data['detrend'].shift(lag)
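      if self.rolling_mean_day:
          data['rolling_mean_{}'.format(self.rolling_mean_day)] = data['detrend'].shift().rolling(self.rolling_mean_day).mean()
      
      if self.rolling_mean_month:
          # name the column after its own window so it does not overwrite the daily rolling mean
          data['rolling_mean_{}'.format(self.rolling_mean_month)] = data['detrend'].shift().rolling(self.rolling_mean_month).mean()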
      
      # drop 'year' when every row falls in the same year, since a constant column carries no information
      if data['year'].nunique() == 1:
          data = data.drop('year', axis = 1)
      
      data = data.dropna()
      
      y = data['num_orders']
      data = data.drop('num_orders', axis = 1)  # axis must be passed by keyword on pandas >= 2.0
      
      return data, list(y)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('Model', LinearRegression())
])

# build the features and the target outside the pipeline, then fit the pipeline on both
makeF = MakeFeatures(df, 2, 24)
makeF.fit(df)
data, y = makeF.transform(df)
pipe.fit(data, y)  # y is the target returned by MakeFeatures.transform()

Then you can use your pipeline to predict on your data and evaluate the performance, for instance with r2_score:

from sklearn.metrics import r2_score

predictions = pipe.predict(data)
r2_score(y,predictions)
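
Note that the score above is computed on the same rows the pipeline was fitted on, so it will tend to look better than it would on new data. If you want a more honest number for a time series, one option is to hold out the most recent rows. This is only a sketch under assumptions: the series, the two lag features and all variable names here are made up for illustration and are not your MakeFeatures output.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
idx = pd.date_range("2019-01-01", periods=200, freq="D")
series = pd.Series(rng.randint(100, size=200).astype(float), index=idx, name="num_orders")

# two illustrative lag features; the first two rows have no lags and are dropped
features = pd.DataFrame({"lag_1": series.shift(1), "lag_2": series.shift(2)}).dropna()
target = series.loc[features.index]

# shuffle=False keeps the time order: train on the past, test on the most recent 25 %
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, shuffle=False
)

holdout_pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
holdout_pipe.fit(X_train, y_train)

test_r2 = r2_score(y_test, holdout_pipe.predict(X_test))
print(len(X_train), len(X_test), test_r2)
```

Because the toy series is pure noise, the hold-out score here is not meaningful in itself; the point is only the order of the steps.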

Accepted Answer