Pipelines with sklearn


Even with modern algorithms from the field of deep learning, data preparation remains necessary, either to increase a model's performance or to reduce computation time. During data preparation, missing values are imputed, outliers are detected, or variables are log-transformed, for example. At statworx, this also constitutes a large part of the work in a project, which is why this post aligns with previous blog posts on data preparation, such as the comparison of data preparation functions of different software packages. The pipelines presented here allow many transformation steps to be kept clear and reproducible.
The usual approach with sklearn
In scikit-learn ("sklearn"), probably the most important machine learning library for Python, a separate module with special functions for data processing has been implemented. The various functions are applied to both the training and the test dataset via fit_transform(). The application must be done individually for each dataset; this prevents any influence of the test data on the training dataset ("data leakage"). The following code, for example, imputes missing values and scales the data:
from sklearn.preprocessing import Imputer, StandardScaler

# Impute missing values with the column mean
impute = Imputer(strategy="mean")
X_train_impute = impute.fit_transform(X_train)
X_test_impute = impute.fit_transform(X_test)

# Scale the variables
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_impute)
X_test_scaled = scaler.fit_transform(X_test_impute)
Even in this small example, it becomes apparent that with more elaborate transformations, applying each step separately to the training and test data quickly becomes confusing. This is where pipelines offer a significant advantage.
The use of pipelines
To simplify these steps, the Pipeline function was introduced in sklearn. The individual transformations are simply passed to it, and sklearn takes care of passing the output of each step on to the next.
Now we replace the missing values and center the data using a pipeline:
from sklearn.pipeline import Pipeline

# Definition of the transformations
impute = Imputer(strategy="mean")
scaler = StandardScaler()

# Definition of the pipeline
pipe = Pipeline(steps=[('impute', impute), ('scaler', scaler)])

# Apply to training and test data
X_train = pipe.fit_transform(X_train)
X_test = pipe.fit_transform(X_test)
The Pipeline function is passed a list of tuples, each specifying a name and the corresponding transformation. With the make_pipeline function, naming the steps can be omitted, resulting in pipe = make_pipeline(impute, scaler).
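For illustration, a minimal sketch of the same pipeline built with make_pipeline, reusing the impute and scaler objects defined above (the automatically generated step names depend on the class names):
from sklearn.pipeline import make_pipeline

# Step names are derived automatically from the lowercased class names
pipe = make_pipeline(impute, scaler)
print(pipe.named_steps)
X_train = pipe.fit_transform(X_train)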
Algorithms can also be applied directly within a pipeline: the corresponding estimator is simply included as another tuple. Here, a random forest regression is shown as an example:
from sklearn.ensemble import RandomForestRegressor

# Transformations
imputer = Imputer(strategy="mean")
scaler = StandardScaler()

# Random forest regression
regressor = RandomForestRegressor(n_estimators=100)

# Build and fit the pipeline (the target y_train is required for the regressor)
pipe = Pipeline(steps=[('imp', imputer), ('scaler', scaler), ('regressor', regressor)])
pipe.fit(X_train, y_train)
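Once fitted, the whole pipeline behaves like a single estimator. A brief usage sketch, assuming a held-out X_test and y_test:
# The pipeline applies all transformations before predicting
y_pred = pipe.predict(X_test)

# The score method of the final estimator (here R²) is exposed as well
print(pipe.score(X_test, y_test))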
The great flexibility of pipelines and sklearn is demonstrated above all by the FunctionTransformer() function implemented in sklearn. It turns any user-defined function into a transformer that can be used in a pipeline, i.e. an object to which the methods fit and transform can be applied. For example, variables can be squared as part of the pipeline using square = FunctionTransformer(np.square).
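A minimal sketch of this, reusing the impute step from above and assuming numeric input data:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Turn np.square into a transformer with fit and transform methods
square = FunctionTransformer(np.square)

# It can be used on its own ...
X_train_squared = square.fit_transform(X_train)

# ... or as a regular step in a pipeline
pipe = Pipeline(steps=[('impute', impute), ('square', square)])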
More complicated transformations can be created using TransformerMixin. For this, a class is created and the associated fit and transform methods are defined. The following example shows a transformer that imputes missing values in categorical variables with the most frequent value:
from sklearn.base import TransformerMixin

# Impute the most frequent value (expects a pandas DataFrame)
class Imputer_Most_Frequent(TransformerMixin):
    def fit(self, X, y=None):
        # Learn the most frequent value of each column from the training data
        self.fill = X.mode().iloc[0]
        return self
    def transform(self, X, y=None):
        # Replace missing values with the values learned during fit
        return X.fillna(self.fill)
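A brief usage sketch, assuming X_train and X_test are pandas DataFrames with categorical columns:
# The custom transformer behaves like any built-in transformer
impute_cat = Imputer_Most_Frequent()
X_train_filled = impute_cat.fit_transform(X_train)
X_test_filled = impute_cat.transform(X_test)

# ... and can also be used as a pipeline step
pipe = Pipeline(steps=[('impute_cat', Imputer_Most_Frequent())])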
Summary
A model only works if the data has been properly prepared for the algorithm. Careful data preparation can also save a lot of time during modeling, for example because scaling simplifies the values. In addition, new features can easily be generated from a cleanly prepared dataset. At statworx, we therefore carry out complex and lengthy preparation steps in our client projects. Pipelines help us keep the code clearer, simpler, and thus easier to understand. It is often assumed that most of a data scientist's work consists of selecting and tuning algorithms; a survey by CrowdFlower from 2016, however, showed that most of the time is actually spent cleaning and transforming data.