Contributing
As I was working to keep my promise from a
previous entry,
I came across a scenario that I thought was worth a blog post.
I was using the
mlxtend
package to show how one might perform a basket analysis on question tags when I discovered that a feature I expected to exist didn't.
I'll elaborate.
The Missing Feature
I connected to the API as I had previously written about and pulled questions.
from os import getenv
from stackapi import StackAPI
key = getenv("STACK_API_KEY")
SITE = StackAPI("stackoverflow", key=key)
questions = SITE.fetch("questions")
questions["items"][0]
{'tags': ['html', 'css', 'flexbox', 'responsive-design', 'centering'],
'owner': {'account_id': 26330658,
'reputation': 73,
'user_id': 19991177,
'user_type': 'registered',
'profile_image': 'https://www.gravatar.com/avatar/1379e1c185626a10b0ddac93c5326254?s=256&d=identicon&r=PG',
'display_name': 'TheNickster',
'link': 'https://stackoverflow.com/users/19991177/thenickster'},
'is_answered': True,
'view_count': 18,
'answer_count': 2,
'score': 0,
'last_activity_date': 1711251161,
'creation_date': 1711235354,
'question_id': 78212821,
'content_license': 'CC BY-SA 4.0',
'link': 'https://stackoverflow.com/questions/78212821/how-do-i-center-score-text-for-a-basketball-scoreboard',
'title': 'How do I Center Score Text for a Basketball Scoreboard?'}
In the question items there's a field called "tags", which I want to use for the analysis.
The tags are presented as a list of strings.
To keep them tied to their questions and make analysis a bit easier,
I decided to convert the list of question items to a pandas.DataFrame.
import pandas as pd
# Configuration settings
pd.options.display.expand_frame_repr = False
pd.options.display.max_columns = 6
df = pd.DataFrame(questions["items"])
# Question Ids are unique to the row.
df = df.set_index("question_id")
# Results may vary as the most recent questions are returned each call.
print(df.tags.head())
question_id
78212821    [html, css, flexbox, responsive-design, center...
76143172                                 [php, symfony, twig]
35707320            [ruby-on-rails, mongodb, ruby-on-rails-4]
48057197               [php, apache, xampp, php-7.1, php-7.2]
49476559    [java, compiler-errors, java-9, java-module, m...
Name: tags, dtype: object
Preprocessing of the tags would be handled by the mlxtend library.
I chose to use the
TransactionEncoder, which is similar to a OneHotEncoder,
but for converting item lists (think lists of lists; nested lists) into transaction data rather than an array (one value per cell) into columns.
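To make the idea concrete, here's a minimal pure-Python sketch of what transaction encoding does conceptually. The function name `encode_transactions` is hypothetical and this is not mlxtend's implementation; the alphabetical column order is just an assumption of the sketch.

```python
def encode_transactions(transactions):
    """Return (columns, rows); rows[i][j] is True when item j is in transaction i."""
    # One column per unique item seen across every transaction (sorted here by assumption).
    columns = sorted({item for transaction in transactions for item in transaction})
    # One row per transaction, one boolean per column.
    rows = [[column in transaction for column in columns] for transaction in transactions]
    return columns, rows


columns, rows = encode_transactions([["php", "symfony", "twig"], ["php", "apache"]])
print(columns)  # ['apache', 'php', 'symfony', 'twig']
print(rows)     # [[False, True, True, True], [True, True, False, False]]
```

Each input row can have a different number of tags, which is what sets this apart from one-hot encoding a single value per cell.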
from mlxtend.preprocessing.transactionencoder import TransactionEncoder
encoder = TransactionEncoder()
tag_encodings = encoder.fit_transform(df.tags)
tag_encodings
array([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]])
The returned results are an array.
No problem with that.
But while browsing
the example in the User Guide,
I noticed how they converted the array into a
pandas.DataFrame.
tag_df = pd.DataFrame(
    tag_encodings,
    index=df.index,  # I added the index to align with the input data.
    columns=encoder.columns_,
)
print(tag_df.head())
              .net  .net-6.0  .net-attributes  ...  zooming    zsh  zustand
question_id                                    ...
78212821     False     False            False  ...    False  False    False
76143172     False     False            False  ...    False  False    False
35707320     False     False            False  ...    False  False    False
48057197     False     False            False  ...    False  False    False
49476559     False     False            False  ...    False  False    False

[5 rows x 900 columns]
There's nothing wrong with how this was done, but I wondered why the
set_output
method wasn't taken advantage of.
That's when I realized it's not exposed in mlxtend.
try:
    encoder = TransactionEncoder().set_output(transform="pandas")
except Exception as e:
    print(repr(e))
AttributeError("This 'TransactionEncoder' has no attribute 'set_output'")
"That's odd," I thought.
I'm pretty sure scikit-learn is a requirement for mlxtend.
Surely the supported version is at least 1.2?
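A quick way to check what's installed locally is sketched below, using only the standard library. The helper names `release_tuple` and `supports_set_output` are hypothetical, and the version parsing is deliberately rough (it treats a pre-release such as `1.2rc1` the same as `1.2`).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Turn '1.2.2' into (1, 2, 2), stopping at the first non-numeric piece."""
    parts = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def supports_set_output(version_string, minimum=(1, 2)):
    """set_output was introduced in scikit-learn 1.2, so compare against that."""
    return release_tuple(version_string) >= minimum


try:
    installed = version("scikit-learn")
    print(installed, supports_set_output(installed))
except PackageNotFoundError:
    print("scikit-learn is not installed in this environment")
```

This only tells me what my environment has; the versions a package declares support for still live in its requirements.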
After looking at the requirements.txt file,
I was relieved to see that the package did in fact use the newest version of scikit-learn.
But why didn't set_output work?
The reason wasn't obvious after digging through the TransactionEncoder's source code.
Switching to how set_output works in scikit-learn, I found what I was looking for in the documentation for the
TransformerMixin class:
Mixin class for all transformers in scikit-learn.
This mixin defines the following functionality:
- a fit_transform method that delegates to fit and transform;
- a set_output method to output X as a specific container type.
If get_feature_names_out is defined, then BaseEstimator will automatically wrap transform and fit_transform to follow the set_output API. See the Developer API for set_output for details.
OneToOneFeatureMixin and ClassNamePrefixFeaturesOutMixin are helpful mixins for defining get_feature_names_out.
The current version of TransactionEncoder does inherit from scikit-learn's TransformerMixin,
but does not define the get_feature_names_out method.
Implementing the method would allow the TransactionEncoder to output a pandas.DataFrame by default.
I'm up for the challenge 😎.
New Issue (Feature)
If you haven't contributed to an open source project before, here are some general guidelines I like to follow:
- Check if a related issue has already been logged. Nobody wants to deal with closing duplicate tickets. Or worse, not closing them and having to deal with duplicate work that's already been completed.
- Read the package's contribution guidelines and code of conduct. If there's an existing process in place, follow it.
I usually perform a few searches over the open issues with various keywords to see if anything comes up. For this particular issue I tried "set_output", "TransactionEncoder", and "get_feature_names_out". The first and third yielded no results, and the second returned a few issues, none of them related to the format of the output. I'm good to proceed.
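I ran these searches by hand, but the same keyword checks could be scripted. Below is a rough sketch that only builds query URLs for GitHub's issue search endpoint and does not send any requests. The helper name `issue_search_url` is hypothetical, and `rasbt/mlxtend` as the repository path is my assumption.

```python
from urllib.parse import urlencode


def issue_search_url(repo, keyword):
    """Build a GitHub issue-search URL for open issues mentioning a keyword."""
    query = f"repo:{repo} is:issue is:open {keyword}"
    return "https://api.github.com/search/issues?" + urlencode({"q": query})


for keyword in ("set_output", "TransactionEncoder", "get_feature_names_out"):
    # Assumed repository path; swap in whichever project you're contributing to.
    print(issue_search_url("rasbt/mlxtend", keyword))
```

Opening each URL (or fetching it with an HTTP client) returns the matching open issues as JSON, which makes it easy to repeat the check before filing.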
mlxtend's issue template offers five categories:
- Bug report
- Documentation improvement
- Feature request
- Other
- Usage question
Since the get_feature_names_out method doesn't exist in the TransactionEncoder,
I think this should be a feature request.
I started off with a title: "Integrate scikit-learn's set_output method into TransactionEncoder."
I want my feature request to be specific and small enough that it can be easily merged,
as well as not break any preexisting code (though I do foresee a scikit-learn version bump).
Next, I need to fill out the following four sections:
- Describe the workflow you want to enable
- Describe your proposed solution
- Describe alternatives you've considered, if relevant
- Additional context
Here's what I put for each:
Describe the workflow you want to enable
In scikit-learn version 1.2,
the set_output API was introduced.
I would like to expose the API inside of the
mlxtend.preprocessing.transactionencoder.TransactionEncoder class.
This would allow the user to set the output of :method:TransactionEncoder.fit_transform and :method:TransactionEncoder.transform to a
pandas.DataFrame by default,
rather than having to manually create the object after transformation.
Describe your proposed solution
My proposed solution is to define the
:method:get_feature_names_out
in :class:TransactionEncoder as this is required to expose the :method:set_output.
See :class:TransformerMixin
and Developer API for set_output
for more details.
Describe alternatives you've considered, if relevant
Continue using the method described in the
User Guide
—convert the output of the transformer to a pandas.DataFrame manually.
Additional context
- This would require the minimum version of scikit-learn to increase from 1.0.2 to 1.2.2.
- I'm willing to take on the PR for this work.
Submit
After doing my due diligence, I submitted the feature request/issue.
You can keep tabs on it here 👉
Integrate scikit-learn's set_output method into TransactionEncoder.
While I wait for one of the package maintainers to green-light my request,
I'll scope out how difficult it will be to implement the get_feature_names_out method.
I should also see if I need to write or update any unit tests.
Catch you in part deuce ✌️.