Contributing
As I was working to keep my promise from a previous entry,
I came across a scenario that I thought was worth a blog post.
I was using the mlxtend package to show how one might perform a basket analysis on question tags when I discovered that a feature I expected to exist didn't.
I'll elaborate.
The Missing Feature
I connected to the API as I had previously written about and pulled questions.
from os import getenv
from stackapi import StackAPI
key = getenv("STACK_API_KEY")
SITE = StackAPI("stackoverflow", key=key)
questions = SITE.fetch("questions")
questions["items"][0]
{'tags': ['html', 'css', 'flexbox', 'responsive-design', 'centering'], 'owner': {'account_id': 26330658, 'reputation': 73, 'user_id': 19991177, 'user_type': 'registered', 'profile_image': 'https://www.gravatar.com/avatar/1379e1c185626a10b0ddac93c5326254?s=256&d=identicon&r=PG', 'display_name': 'TheNickster', 'link': 'https://stackoverflow.com/users/19991177/thenickster'}, 'is_answered': True, 'view_count': 18, 'answer_count': 2, 'score': 0, 'last_activity_date': 1711251161, 'creation_date': 1711235354, 'question_id': 78212821, 'content_license': 'CC BY-SA 4.0', 'link': 'https://stackoverflow.com/questions/78212821/how-do-i-center-score-text-for-a-basketball-scoreboard', 'title': 'How do I Center Score Text for a Basketball Scoreboard?'}
In the question items there's a field called "tags", which I want to use for the analysis.
The tags are presented as a list of strings.
To keep them tied to their questions and make analysis a bit easier,
I decided to convert the list of question items to a pandas.DataFrame.
import pandas as pd
# Configuration settings
pd.options.display.expand_frame_repr = False
pd.options.display.max_columns = 6
df = pd.DataFrame(questions["items"])
# Question Ids are unique to the row.
df = df.set_index("question_id")
# Results may vary as the most recent questions are returned each call.
print(df.tags.head())
question_id
78212821    [html, css, flexbox, responsive-design, center...
76143172    [php, symfony, twig]
35707320    [ruby-on-rails, mongodb, ruby-on-rails-4]
48057197    [php, apache, xampp, php-7.1, php-7.2]
49476559    [java, compiler-errors, java-9, java-module, m...
Name: tags, dtype: object
Preprocessing of the tags would be handled by the mlxtend library.
I chose to use the TransactionEncoder, which is similar to a OneHotEncoder,
but it converts item lists (think lists of lists; nested lists) into transaction data, rather than converting an array (one value per cell) into columns.
from mlxtend.preprocessing.transactionencoder import TransactionEncoder
encoder = TransactionEncoder()
tag_encodings = encoder.fit_transform(df.tags)
tag_encodings
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
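To make the idea of transaction data concrete, here is a minimal plain-Python sketch of the same kind of encoding. The toy_tags list is a made-up input for illustration, and this mimics the shape of the result rather than mlxtend's implementation: each distinct tag becomes a column, and each row records whether a question carries that tag.

```python
# Hypothetical toy input: one list of tags per question.
toy_tags = [
    ["html", "css"],
    ["css", "python"],
    ["python"],
]

# One column per distinct tag, in sorted order.
columns = sorted({tag for tags in toy_tags for tag in tags})

# One row per question; each cell says whether the question has that tag.
encoded = [[tag in tags for tag in columns] for tags in toy_tags]

print(columns)
for row in encoded:
    print(row)
```

With real data the result is mostly False, which is why the array above shows nothing else in its visible corners.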
The returned results are an array.
No problem with that.
But while browsing the example in the User Guide,
I noticed how they converted the array into a pandas.DataFrame.
tag_df = pd.DataFrame(
tag_encodings,
index=df.index, # I added the index to align with the input data.
columns=encoder.columns_,
)
print(tag_df.head())
              .net  .net-6.0  .net-attributes  ...  zooming    zsh  zustand
question_id                                    ...
78212821     False     False            False  ...    False  False    False
76143172     False     False            False  ...    False  False    False
35707320     False     False            False  ...    False  False    False
48057197     False     False            False  ...    False  False    False
49476559     False     False            False  ...    False  False    False

[5 rows x 900 columns]
There's nothing wrong with how this was done, but I wondered why the set_output method wasn't taken advantage of.
That's when I realized it's not exposed in mlxtend.
try:
    encoder = TransactionEncoder().set_output(transform="pandas")
except Exception as e:
    print(repr(e))
AttributeError("This 'TransactionEncoder' has no attribute 'set_output'")
"That's odd," I thought.
I'm pretty sure scikit-learn is a requirement for mlxtend.
Surely the supported version is at least 1.2?
After looking at the requirements.txt file,
I was relieved to see that the package did in fact support the newest version of scikit-learn.
But why didn't set_output work?
The reason wasn't obvious after digging through the TransactionEncoder's source code.
Switching to how set_output works in scikit-learn, I found what I was looking for in the documentation for the TransformerMixin class:
Mixin class for all transformers in scikit-learn.
This mixin defines the following functionality:
- a fit_transform method that delegates to fit and transform;
- a set_output method to output X as a specific container type.
If get_feature_names_out is defined, then BaseEstimator will automatically wrap transform and fit_transform to follow the set_output API. See the Developer API for set_output for details.
OneToOneFeatureMixin and ClassNamePrefixFeaturesOutMixin are helpful mixins for defining get_feature_names_out.
The current version of TransactionEncoder does inherit from scikit-learn's TransformerMixin,
but does not define the get_feature_names_out method.
Implementing the method would expose set_output and allow the TransactionEncoder to output a pandas.DataFrame directly.
I'm up for the challenge 😎.
New Issue (Feature)
If you haven't contributed to an open source project before, here are some general guidelines I like to follow:
- Check if a related issue has already been logged. Nobody wants to deal with closing duplicate tickets. Or worse, not closing them and having to deal with duplicate work that's already been completed.
- Read the package's contribution guidelines and code of conduct. If there's an existing process in place, follow it.
I usually run a few searches over the open issues with various keywords to see if anything comes up. For this particular issue I tried "set_output", "TransactionEncoder", and "get_feature_names_out". The first and third yielded no results, and the second returned a few issues, none of them related to the format of the output. I'm good to proceed.
mlxtend's issue template has five categories:
- Bug report
- Documentation improvement
- Feature request
- Other
- Usage question
Since the get_feature_names_out method doesn't exist in the TransactionEncoder,
I think this should be a feature request.
I started off with a title: "Integrate scikit-learn's set_output method into TransactionEncoder."
I want my feature request to be specific and small enough that it can be easily merged,
and I don't want it to break any preexisting code (though I do foresee a scikit-learn version bump).
Next, I need to fill out the following four sections:
- Describe the workflow you want to enable
- Describe your proposed solution
- Describe alternatives you've considered, if relevant
- Additional context
Here's what I put for each:
Describe the workflow you want to enable
In scikit-learn version 1.2,
the set_output API was introduced.
I would like to expose the API inside of the mlxtend.preprocessing.transactionencoder.TransactionEncoder class.
This would allow the user to set the output of :method:TransactionEncoder.fit_transform and :method:TransactionEncoder.transform to a pandas.DataFrame by default,
rather than having to manually create the object after transformation.
Describe your proposed solution
My proposed solution is to define the :method:get_feature_names_out in :class:TransactionEncoder as this is required to expose the :method:set_output.
See :class:TransformerMixin and Developer API for set_output for more details.
Describe alternatives you've considered, if relevant
Continue using the method described in the User Guide: convert the output of the transformer to a pandas.DataFrame manually.
Additional context
- This would require the minimum version of scikit-learn to increase from 1.0.2 to 1.2.2.
- I'm willing to take on the PR for this work.
Submit
After doing my due diligence, I submitted the feature request/issue.
You can keep tabs on it here 👉 Integrate scikit-learn's set_output method into TransactionEncoder.
While I wait for one of the package maintainers to green-light my request,
I'll scope out how difficult it will be to implement the get_feature_names_out method.
I should also see if I need to write or update any unit tests.
Catch you in part deuce ✌️.