scikit-learn's Pipeline and Friends
When I started writing Python some eight or nine years ago, one of the first libraries I was introduced to was scikit-learn.
Being a beginner, I had to learn how to "read the docs," which can be challenging for anyone exploring unfamiliar software.
Articles, videos, docstrings, example code snippets, source code, and even unit tests: I see them all as pieces of the same puzzle.
When one didn't make sense, I could search for another to help me better understand the puzzle.
The more I studied scikit-learn, the more I'd try to rebuild examples using techniques I was comfortable with: pandas, numpy, hand-written math and sketches.
Fairly quickly I noticed the Pipeline was used in a lot of code examples.
The idea of a pipeline wasn't new (the output of one operation is the input to the next), but knowing how to read and use one was.
Once I figured it out, the Pipeline earned a place in my toolbox and became part of my signature style when working in machine learning.
I now find it much easier to use the scikit-learn API, as well as to contribute to its source code.
With that, I'd like to share some alternative approaches to building a machine learning model using scikit-learn's Pipeline and friends: FeatureUnion, ColumnTransformer, and FunctionTransformer.
Setup
I'll start with the prerequisites. To guarantee that you get the same results as I do, I suggest installing the following package versions. You can do that by uncommenting the cell below and running it.
# !pip install numpy==1.24.3
# !pip install pandas==2.1.4
# !pip install scikit-learn==1.3.2
Below are the initial imports. I'll introduce more in each section as we go. The comments should describe what each is going to be used for, but if you have questions drop a comment at the bottom of the post.
import warnings # To suppress some warnings.
import numpy as np # For numerical computation when a dataframe isn't available.
import pandas as pd # For reading/manipulating data.
from sklearn.impute import SimpleImputer # For imputing missing values.
from sklearn.linear_model import LogisticRegression # Simple classifier.
from sklearn.model_selection import train_test_split # Split train data into train/val.
from sklearn.preprocessing import MinMaxScaler # Simple preprocessing step.
# Filtering out a scikit-learn warning related to the `LogisticRegression` model
# not converging in the last section. This does not pertain to the tutorial so it
# will be hidden.
warnings.filterwarnings(action="ignore", module="sklearn")
I'm using the Titanic data set from Kaggle. You can download it from here.
The goal of the competition is to build a classification model that can correctly predict if a passenger survived the Titanic.
I'm not aiming for a state-of-the-art model here; I'm sharing how one might build a model and then introduce the Pipeline et al.
# Read the Titanic data set.
train = pd.read_csv("train.csv", index_col="PassengerId")
test = pd.read_csv("test.csv", index_col="PassengerId")
# Separate X and y.
X = train.drop(columns="Survived")
y = train.Survived
# Split into train/val data.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0, stratify=y)
Multi-step Transformation
Let's start off simple. Suppose we want to use the continuous features to predict survival.
# Limit to features with dtype float.
cont_cols = X_train.select_dtypes(include="float").columns
X_train_float = X_train[cont_cols]
X_val_float = X_val[cont_cols]
We could scale the features between $[0, 1]$ using MinMaxScaler.
# Scale data.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_float)
X_val_scaled = scaler.transform(X_val_float)
Use the SimpleImputer to fill any missing values with the average across all passengers.
# Impute data.
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_scaled)
X_val_imputed = imputer.transform(X_val_scaled)
And fit our LogisticRegression with our scaled-then-imputed data and check the score on our held-out data.
# Fit the model and grade against val data.
clf = LogisticRegression(random_state=0)
clf.fit(X_train_imputed, y_train)
clf.score(X_val_imputed, y_val)
0.6457399103139013
It's a straightforward and easy-to-understand series of steps.
- Scale
- Impute
- Fit
Here's how we'd do it with a Pipeline.
# Import Pipeline.
from sklearn.pipeline import Pipeline
# Limit to features with dtype float.
cont_cols = X_train.select_dtypes(include="float").columns
X_train_float = X_train[cont_cols]
X_val_float = X_val[cont_cols]
# Define a pipe with three steps: scale -> impute -> classify.
pipe = Pipeline(
steps=[
("scale", MinMaxScaler()),
("impute", SimpleImputer()),
("clf", LogisticRegression(random_state=0)),
],
)
pipe.fit(X_train_float, y_train)
pipe.score(X_val_float, y_val)
0.6457399103139013
In 10 lines of code (with whitespace) we created a process to take data as input, scale it, impute it, and classify it.
Here's what our pipe looks like.
# HTML representation of the pipe.
pipe
Pipeline(steps=[('scale', MinMaxScaler()), ('impute', SimpleImputer()), ('clf', LogisticRegression(random_state=0))])
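If the diagram doesn't render in your environment, you can ask scikit-learn for the HTML view explicitly. This is optional and just a sketch; display="diagram" is already the default in recent scikit-learn versions.
# Optional: force the interactive HTML diagram view in notebooks.
from sklearn import set_config
set_config(display="diagram")
pipe  # Re-displaying now renders the collapsible diagram.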
Compared to the previous method, the pipe keeps everything in one composable object.
We don't have to store each transformer or estimator as its own variable that we may or may not forget later.
And we don't have to store the output after each transformation, which includes both the training and validation data.
All together we create four variables: cont_cols, X_train_float, X_val_float, and pipe.
The previous method has ten: cont_cols, X_train_float, X_val_float, scaler, X_train_scaled, X_val_scaled, imputer, X_train_imputed, X_val_imputed, and clf.
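Everything also stays inspectable after fitting, and the whole pipe can go straight into utilities like cross_val_score. A quick sketch, assuming the pipe fitted above; attribute names like data_max_ and statistics_ come from MinMaxScaler and SimpleImputer respectively.
# Fitted steps live inside the pipe -- no loose variables to track.
pipe.named_steps["scale"].data_max_  # Per-feature maxima learned by the scaler.
pipe["impute"].statistics_  # Column means the imputer fills with.
pipe[-1].coef_  # The classifier's learned weights.
# Import cross_val_score for k-fold evaluation.
from sklearn.model_selection import cross_val_score
# Because the pipe is one estimator, each fold re-fits the scaler and
# imputer on that fold's training data -- no leakage from held-out rows.
cross_val_score(pipe, X[cont_cols], y, cv=5).mean()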
Expanding Feature Space
What if we wanted to transform features in not one, but two ways?
First let's use the OneHotEncoder to transform some categorical/ordinal features.
Then we'll independently use the TargetEncoder to transform those same features.
# Import one-hot and target encoding transformers
from sklearn.preprocessing import OneHotEncoder, TargetEncoder
We will consider features with fewer than 10 unique values to be categorical/ordinal.
# Get the number of unique values in each feature sorted ascendingly.
X_train.nunique().sort_values()
Sex           2
Pclass        3
Embarked      3
SibSp         7
Parch         7
Age          82
Cabin       120
Fare        216
Ticket      537
Name        668
dtype: int64
# Limit to categorical/ordinal features.
cat_ord_cols = ["Sex", "Pclass", "Embarked", "SibSp", "Parch"]
X_train_enc = X_train[cat_ord_cols]
X_val_enc = X_val[cat_ord_cols]
We one-hot-encode the features, dropping the first category and tracking only the five most frequent values. This helps us avoid "exploding" our feature space.
# One-hot-encode categorical/ordinal features.
ohe = OneHotEncoder(drop="first", sparse_output=False, max_categories=5)
X_train_ohe = ohe.fit_transform(X_train_enc)
X_val_ohe = ohe.transform(X_val_enc)
Target encoding only transforms the values of each feature, so we don't have to worry about the number of columns growing.
# Target encode categorical/ordinal features.
tgt = TargetEncoder(random_state=0)
X_train_tgt = tgt.fit_transform(X_train_enc, y_train)
X_val_tgt = tgt.transform(X_val_enc)
We stack our two transformed data sets along the column-axis, keeping our row count the same but increasing the number of columns.
# Join One-hot-encoded features with target-encoded features.
X_train_feat_union = np.hstack((X_train_ohe, X_train_tgt))
X_val_feat_union = np.hstack((X_val_ohe, X_val_tgt))
clf.fit(X_train_feat_union, y_train)
clf.score(X_val_feat_union, y_val)
0.7847533632286996
It looks like there's some signal coming from the cat_ord_cols!
Using a FeatureUnion saves us variable assignments, similar to the Pipeline, and allows us to execute all transformers in parallel by setting n_jobs=-1.
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion
# Define a feature union with two transformers: ohe & tgt.
cat_ord = FeatureUnion(
transformer_list=[
("ohe", OneHotEncoder(drop="first", sparse_output=False, max_categories=5)),
("tgt", TargetEncoder(random_state=0)),
],
n_jobs=-1, # Execute all transformers in parallel.
)
Because of how the scikit-learn API functions, we can drop the cat_ord into a pipe as a component.
# Define a pipeline to run the cat_ord, and then classify.
pipe = Pipeline(
steps=[
("cat_ord", cat_ord),
("clf", LogisticRegression(random_state=0)),
],
)
pipe.fit(X_train_enc, y_train)
pipe.score(X_val_enc, y_val)
0.7847533632286996
The HTML representation of the pipe shows which items will be run in parallel (horizontal) and which will be run sequentially (vertical).
pipe
Pipeline(steps=[('cat_ord', FeatureUnion(n_jobs=-1, transformer_list=[('ohe', OneHotEncoder(drop='first', max_categories=5, sparse_output=False)), ('tgt', TargetEncoder(random_state=0))])), ('clf', LogisticRegression(random_state=0))])
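Named steps have another perk: hyperparameters become addressable with the step__param convention, even through the nested FeatureUnion. Below is an illustrative sketch with a deliberately tiny grid, not a tuned search.
# Import GridSearchCV for an exhaustive parameter search.
from sklearn.model_selection import GridSearchCV
# Nested names chain with double underscores: <step>__<substep>__<param>.
grid = GridSearchCV(
    estimator=pipe,
    param_grid={
        "cat_ord__ohe__max_categories": [5, 10],
        "clf__C": [0.1, 1.0, 10.0],
    },
    cv=5,
)
grid.fit(X_train_enc, y_train)
grid.best_params_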
Transform Feature Subsets
We can now pipe data through a series of transformers (sequentially and in parallel) into a classifier.
How would we go about joining the features from both of our pipelines together and using that as input to a model?
Let's define a cont_pipe to handle the preprocessing of our continuous features.
# Limit to features with dtype float.
cont_cols = X_train.select_dtypes(include="float").columns
X_train_float = X_train[cont_cols]
X_val_float = X_val[cont_cols]
# Continuous feature pipeline.
cont_pipe = Pipeline(
steps=[
("scale", MinMaxScaler()),
("impute", SimpleImputer()),
],
)
X_train_cont = cont_pipe.fit_transform(X_train_float)
X_val_cont = cont_pipe.transform(X_val_float)
And let's also use our cat_ord to preprocess our cat_ord_cols.
# Limit to categorical/ordinal features.
cat_ord_cols = ["Sex", "Pclass", "Embarked", "SibSp", "Parch"]
X_train_enc = X_train[cat_ord_cols]
X_val_enc = X_val[cat_ord_cols]
# Categorical/ordinal feature union.
cat_ord = FeatureUnion(
transformer_list=[
("ohe", OneHotEncoder(drop="first", sparse_output=False, max_categories=5)),
("tgt", TargetEncoder(random_state=0)),
],
)
X_train_feat_union = cat_ord.fit_transform(X_train_enc, y_train)
X_val_feat_union = cat_ord.transform(X_val_enc)
We'll need to join the output of each preprocessor and then feed that into our classifier.
# Join continuous transformations with categorical/ordinal transformations.
X_train_join = np.hstack((X_train_cont, X_train_feat_union))
X_val_join = np.hstack((X_val_cont, X_val_feat_union))
clf.fit(X_train_join, y_train)
clf.score(X_val_join, y_val)
0.8026905829596412
Another improvement!
Similar to the FeatureUnion, the ColumnTransformer allows us to execute multiple transformers in parallel.
The difference is that a FeatureUnion's transformers are all applied to the same input, while the ColumnTransformer lets us pick and choose which features are given to each transformer.
In our current example it wouldn't make sense to apply the same transformations to both the cont_cols and the cat_ord_cols, so we delegate the cont_pipe to the cont_cols and the cat_ord to the cat_ord_cols.
# Import ColumnTransformer.
from sklearn.compose import ColumnTransformer
# Define a column transformer with two transformers: the cont_pipe pipeline and the cat_ord feature union.
col_trf = ColumnTransformer(
transformers=[
("cont_pipe", cont_pipe, cont_cols),
("cat_ord", cat_ord, cat_ord_cols),
],
remainder="drop", # Drop features not used in the transformers.
n_jobs=-1,
)
We can drop the col_trf into a pipe as a component, just like we did with the cat_ord before.
# Define a pipeline to run the col_trf, and then classify.
pipe = Pipeline(
steps=[
("col_trf", col_trf),
("clf", LogisticRegression(random_state=0)),
],
)
pipe.fit(X_train, y_train)
pipe.score(X_val, y_val)
0.8026905829596412
We now have a pipe with an internal Pipeline and FeatureUnion running next to each other.
pipe
Pipeline(steps=[('col_trf', ColumnTransformer(n_jobs=-1, transformers=[('cont_pipe', Pipeline(steps=[('scale', MinMaxScaler()), ('impute', SimpleImputer())]), Index(['Age', 'Fare'], dtype='object')), ('cat_ord', FeatureUnion(transformer_list=[('ohe', OneHotEncoder(drop='first', max_categories=5, sparse_output=False)), ('tgt', TargetEncoder(random_state=0))]), ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch'])])), ('clf', LogisticRegression(random_state=0))])
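If you ever wonder exactly which engineered columns reach the classifier, the fitted col_trf can report them. A quick sketch; get_feature_names_out needs a reasonably recent scikit-learn, and all of the transformers used here implement it.
# List the engineered feature names produced by the column transformer.
pipe.named_steps["col_trf"].get_feature_names_out()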
Custom Transformer
Building a machine learning model involves creativity. Sometimes we need to engineer our own features using methods that don't exist out-of-the-box. Take our Name feature for instance.
X_train.Name.sample(n=5, random_state=0)
PassengerId
44     Laroche, Miss. Simonne Marie Anne Andree
257              Thorne, Mrs. Gertrude Maybelle
232                    Larsson, Mr. Bengt Edvin
213                      Perkin, Mr. John Henry
290                        Connolly, Miss. Kate
Name: Name, dtype: object
With a little research, you'll find that the titles in the passengers' names give us an idea of their age. This is useful because some of our passengers' ages were missing. We could extract the title from the name, and then estimate the missing ages using the known ages and the titles. To do that we'll need to write a custom function to get the titles.
import re
def get_title(
    text: str,
    title_pattern: str = r"Mrs?|Miss|Master",
) -> str | None:
    """Get a passenger's title if present.

    If more than one title is found, return the title with
    the least number of characters.
    If no title is found, return None.

    The default title_pattern will detect:
    - Mr
    - Mrs
    - Miss
    - Master
    """
    possible_titles: set[str] = set(re.findall(pattern=title_pattern, string=text))
    titles: list[str] = sorted(possible_titles, key=len)
    if titles:
        return titles.pop(0)
    return None
# Assert function extracts expected title.
assert get_title("Turpin, Mr. William John Robert") == "Mr"
# Assert function returns nothing if no title present.
assert get_title("Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)") is None
# Assert function returns title with least number of characters.
assert get_title("Mr. and Mrs. Smith") == "Mr"
The get_title function works well, but vectorizing it will allow us to provide an array of names as input instead of a single string.
# Vectorize get_title allowing input to be array-like.
# Note that output dtypes will all be the same (None -> "None")
get_title_vec = np.vectorize(get_title)
np.unique(get_title_vec(X_train.Name), return_counts=True)
(array(['Master', 'Miss', 'Mr', 'Mrs', 'None'], dtype='<U6'), array([ 33, 133, 384, 98, 20], dtype=int64))
We assign the extracted titles as a column in our data so we can use them to impute age.
# Get titles from names.
X_train_title = X_train.assign(Title=get_title_vec(X_train.Name))
X_val_title = X_val.assign(Title=get_title_vec(X_val.Name))
We can't use the titles directly because our imputer will only accept numeric values.
We'll one-hot-encode them, dropping the "None" title since it's implied when all of the other titles equal zero.
Taking advantage of the tools we've learned, we'll use a ColumnTransformer to keep age and title together, dropping everything else.
# One-hot-encode title and join with age.
age_trf = ColumnTransformer(
transformers=[
("age", "passthrough", ["Age"]),
("ohe", OneHotEncoder(drop=["None"], sparse_output=False), ["Title"]),
],
remainder="drop",
)
Next we define a knn_impute_pipe to process title and age, then feed the results into the KNNImputer.
# Import KNNImputer.
from sklearn.impute import KNNImputer
# Impute missing values (age) using title and age.
knn_impute_pipe = Pipeline(
steps=[
("age_trf", age_trf),
("knn_impute", KNNImputer()),
],
)
To continue preprocessing fare the same way, we'll need to separate it from age with a ColumnTransformer.
# Separate fare from age in the preprocessing steps.
col_trf = ColumnTransformer(
transformers=[
("fare", cont_pipe, ["Fare"]),
("age", knn_impute_pipe, ["Age", "Title"]),
("cat_ord", cat_ord, cat_ord_cols),
],
n_jobs=-1,
)
Lastly we define our pipe and see if anything changed.
pipe = Pipeline(
steps=[
("col_trf", col_trf),
("clf", LogisticRegression(random_state=0)),
],
)
pipe.fit(X_train_title, y_train)
pipe.score(X_val_title, y_val)
0.8071748878923767
A tiny improvement, but I'll take it.
All of the above works, but it required us to create a new data set to hold the title column.
We could have appended the field to our original, but I like to leave the original in its raw state so I can track changes through the pipeline.
That leaves us to figure out how to get the get_title_vec function into our pipe.
Introducing the FunctionTransformer.
# Import FunctionTransformer.
from sklearn.preprocessing import FunctionTransformer
Converting a function is simple: set the func argument to the function you want to convert and you're done.
The caveat is that your function should be vectorized, i.e., able to handle arrays as input and to return an array with the same shape as the input.
# Convert get_title_vec into an sklearn transformer
title_func = FunctionTransformer(func=get_title_vec)
We can now define a Pipeline to extract titles.
# Pipeline to get titles, then one-hot-encode.
title_pipe = Pipeline(
steps=[
("title_func", title_func),
("ohe", OneHotEncoder(drop=["None"], sparse_output=False)),
],
)
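Before nesting it any deeper, it's worth sanity-checking title_pipe on its own. A quick sketch using the training names; the double brackets keep the input two-dimensional, which is what the encoder expects.
# Run the title pipeline in isolation: extract titles, then one-hot-encode.
title_demo = title_pipe.fit_transform(X_train[["Name"]])
title_demo.shape  # (rows, one column per title except the dropped "None")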
Pass title_pipe into a ColumnTransformer to keep age and name together.
# Pipeline to get titles and passthrough age.
age_title_trf = ColumnTransformer(
transformers=[
("title_pipe", title_pipe, ["Name"]),
("age", "passthrough", ["Age"])
],
remainder="drop",
)
Create a Pipeline to impute age using the output of our age_title_trf.
# Pipeline to impute age given ages of neighbors with given titles.
age_pipe = Pipeline(
steps=[
("age_title_trf", age_title_trf),
("impute_knn", KNNImputer()),
],
)
Combine all our preprocessing transformers into a single ColumnTransformer.
# Separate fare from age in the preprocessing steps.
col_trf = ColumnTransformer(
transformers=[
("fare", cont_pipe, ["Fare"]),
("age", age_pipe, ["Age", "Name"]),
("cat_ord", cat_ord, cat_ord_cols),
],
remainder="drop",
)
And set col_trf as the first step in our final pipe.
pipe = Pipeline(
steps=[
("col_trf", col_trf),
("clf", LogisticRegression(random_state=0)),
],
)
pipe.fit(X_train, y_train)
pipe.score(X_val, y_val)
0.8071748878923767
pipe
Pipeline(steps=[('col_trf', ColumnTransformer(transformers=[('fare', Pipeline(steps=[('scale', MinMaxScaler()), ('impute', SimpleImputer())]), ['Fare']), ('age', Pipeline(steps=[('age_title_trf', ColumnTransformer(transformers=[('title_pipe', Pipeline(steps=[('title_func', FunctionTransformer(func=<numpy.vectorize object at 0x000002A1831A9450>)), ('ohe', OneHotEnc... sparse_output=False))]), ['Name']), ('age', 'passthrough', ['Age'])])), ('impute_knn', KNNImputer())]), ['Age', 'Name']), ('cat_ord', FeatureUnion(transformer_list=[('ohe', OneHotEncoder(drop='first', max_categories=5, sparse_output=False)), ('tgt', TargetEncoder(random_state=0))]), ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch'])])), ('clf', LogisticRegression(random_state=0))])
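Two closing perks of keeping everything in one estimator, sketched below: predictions come straight from the raw frame, and the whole thing persists as a single artifact. The filename is illustrative; joblib ships as a scikit-learn dependency.
# Import joblib for model persistence.
import joblib
# Predict straight from the raw validation frame -- preprocessing included.
pipe.predict(X_val)[:10]
# Persist the preprocessing and the model together as one artifact.
joblib.dump(pipe, "titanic_pipe.joblib")
restored = joblib.load("titanic_pipe.joblib")
restored.score(X_val, y_val)  # Matches the score above.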
Conclusion
To wrap this up I'd like to draw some parallels to The Zen of Python.
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
- I find scikit-learn to be beautiful, but that's my subjective opinion. You may think it's the ugliest machine learning software to ever exist, and that's okay.
- As a beginner I didn't find the scikit-learn docs to be as explicit as I'd want (where are the type hints?). I hope this post reduces some of the implicitness.
- I think scikit-learn's Pipeline and friends are simple, but my path to understanding them was complex.
- Our final pipe is both nested and dense, but at least it's readable!
- Don't use all the mentioned features if it's not practical. Don't be a purist.
- Test your code before using it. It may seem silly, but if I hadn't tested my get_title function I wouldn't have found instances where the original function failed. Do it. It will save you time in the end.
- Think through (write down?) your process before translating it to code. This will also save you time.
- Machine learning is an iterative process. What you build first will probably be your worst.
- Commit clean, tested code knowing that there will be a version two.
- Keep it simple, stupid.
I hope this post has shed some light on the Pipeline, FeatureUnion, ColumnTransformer, and FunctionTransformer.
If you have any questions, comments, or concerns, please post a comment below.
Thanks for reading!!!