contributing

While working to keep my promise from a previous entry, I came across a scenario that I thought was worth a blog post. I was using the mlxtend package to show how one might perform a basket analysis on question tags when I discovered that a feature I expected to exist didn't. I'll elaborate.

The Missing Feature

I connected to the Stack Exchange API, as I had written about previously, and pulled questions.

In [1]:

from os import getenv

from stackapi import StackAPI


key = getenv("STACK_API_KEY")
SITE = StackAPI("stackoverflow", key=key)
questions = SITE.fetch("questions")

questions["items"][0]

Out[1]:

{'tags': ['html', 'css', 'flexbox', 'responsive-design', 'centering'],
 'owner': {'account_id': 26330658,
  'reputation': 73,
  'user_id': 19991177,
  'user_type': 'registered',
  'profile_image': 'https://www.gravatar.com/avatar/1379e1c185626a10b0ddac93c5326254?s=256&d=identicon&r=PG',
  'display_name': 'TheNickster',
  'link': 'https://stackoverflow.com/users/19991177/thenickster'},
 'is_answered': True,
 'view_count': 18,
 'answer_count': 2,
 'score': 0,
 'last_activity_date': 1711251161,
 'creation_date': 1711235354,
 'question_id': 78212821,
 'content_license': 'CC BY-SA 4.0',
 'link': 'https://stackoverflow.com/questions/78212821/how-do-i-center-score-text-for-a-basketball-scoreboard',
 'title': 'How do I Center Score Text for a Basketball Scoreboard?'}

In the question items there's a field called "tags", which I want to use for the analysis. The tags are presented as a list of strings. To keep them tied to their questions and make analysis a bit easier, I decided to convert the list of question items to a pandas.DataFrame.

In [2]:

import pandas as pd

# Configuration settings
pd.options.display.expand_frame_repr = False
pd.options.display.max_columns = 6


df = pd.DataFrame(questions["items"])
# Question Ids are unique to the row.
df = df.set_index("question_id")
# Results may vary as the most recent questions are returned each call.
print(df.tags.head())

question_id
78212821    [html, css, flexbox, responsive-design, center...
76143172                                 [php, symfony, twig]
35707320            [ruby-on-rails, mongodb, ruby-on-rails-4]
48057197               [php, apache, xampp, php-7.1, php-7.2]
49476559    [java, compiler-errors, java-9, java-module, m...
Name: tags, dtype: object

Preprocessing of the tags would be handled by the mlxtend library. I chose the TransactionEncoder, which is similar to a OneHotEncoder. The difference is the input: a OneHotEncoder expands an array with one value per cell into indicator columns, while the TransactionEncoder expands item lists (think lists of lists; nested lists) into one boolean column per unique item.

In [3]:

from mlxtend.preprocessing.transactionencoder import TransactionEncoder


encoder = TransactionEncoder()
tag_encodings = encoder.fit_transform(df.tags)
tag_encodings

Out[3]:

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
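To make the encoding concrete, here's a minimal pure-Python sketch of the idea on three made-up tag lists. It is not mlxtend's implementation, and `encode_transactions` is a hypothetical helper. It only illustrates that every unique item becomes a column and every list becomes a row of booleans.

```python
def encode_transactions(transactions):
    """Return (columns, rows) for a list of item lists.

    Hypothetical helper for illustration; not part of mlxtend.
    """
    # Every unique item across all transactions becomes a column.
    columns = sorted({item for items in transactions for item in items})
    rows = []
    for items in transactions:
        present = set(items)
        # One boolean per column: does this transaction contain the item?
        rows.append([col in present for col in columns])
    return columns, rows


sample_tags = [["html", "css"], ["php", "twig"], ["css"]]
columns, rows = encode_transactions(sample_tags)
print(columns)  # ['css', 'html', 'php', 'twig']
for row in rows:
    print(row)
```

With 900 distinct tags in the real data, almost every cell is False, which is why the printed array above looks so empty.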

fit_transform returns a NumPy array. No problem with that. But while browsing the example in the User Guide, I noticed how they converted the array into a pandas.DataFrame.

In [4]:

tag_df = pd.DataFrame(
    tag_encodings,
    index=df.index,  # I added the index to align with the input data.
    columns=encoder.columns_,
)
print(tag_df.head())

              .net  .net-6.0  .net-attributes  ...  zooming    zsh  zustand
question_id                                    ...                         
78212821     False     False            False  ...    False  False    False
76143172     False     False            False  ...    False  False    False
35707320     False     False            False  ...    False  False    False
48057197     False     False            False  ...    False  False    False
49476559     False     False            False  ...    False  False    False

[5 rows x 900 columns]

There's nothing wrong with how this was done, but I wondered why the set_output method wasn't taken advantage of. That's when I realized it's not exposed in mlxtend.

In [5]:

try:
    encoder = TransactionEncoder().set_output(transform="pandas")

except Exception as e:
    print(repr(e))

AttributeError("This 'TransactionEncoder' has no attribute 'set_output'")

"That's odd," I thought. I'm pretty sure scikit-learn is a requirement for mlxtend. Surely the supported version is at least 1.2, the release that introduced set_output?

After looking at the requirements.txt file, I was relieved to see that the package does in fact allow the newest version of scikit-learn (it only sets a minimum of 1.0.2). But why didn't set_output work?
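A quick way to rule out the version as the culprit is to check what's actually installed. Here's a small sketch using only the standard library; `supports_set_output` is a hypothetical helper of mine, and it only compares the leading major.minor numbers against 1.2.

```python
import re
from importlib import metadata


def supports_set_output(version: str) -> bool:
    """Return True if a scikit-learn version string is 1.2 or newer."""
    # Keep only the leading "major.minor" digits, e.g. "1.2.2" -> (1, 2).
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = (int(part) for part in match.groups())
    return (major, minor) >= (1, 2)


try:
    installed = metadata.version("scikit-learn")
    print(installed, supports_set_output(installed))
except metadata.PackageNotFoundError:
    print("scikit-learn is not installed in this environment")
```

If this prints True and set_output is still missing, the version isn't the problem and the cause has to be in the class itself.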

The reason wasn't obvious after digging through the TransactionEncoder's source code. When I switched to reading about how set_output works in scikit-learn, I found what I was looking for in the documentation for the TransformerMixin class:

Mixin class for all transformers in scikit-learn.

This mixin defines the following functionality:

a fit_transform method that delegates to fit and transform;

a set_output method to output X as a specific container type.

If get_feature_names_out is defined, then BaseEstimator will automatically wrap transform and fit_transform to follow the set_output API. See the Developer API for set_output for details.

OneToOneFeatureMixin and ClassNamePrefixFeaturesOutMixin are helpful mixins for defining get_feature_names_out.

The current version of TransactionEncoder does inherit from scikit-learn's TransformerMixin, but it does not define the get_feature_names_out method, so set_output is never exposed. Implementing the method would let users call set_output(transform="pandas") and get a pandas.DataFrame back from the TransactionEncoder without building it by hand. I'm up for the challenge 😎.

New Issue (Feature)

If you haven't contributed to an open source project before, here are some general guidelines I like to follow:

Check if a related issue has already been logged. Nobody wants to deal with closing duplicate tickets or, worse, leaving them open and having someone repeat work that's already been completed.
Read the package's contribution guidelines and code of conduct. If there's an existing process in place, follow it.

I usually perform a few searches over the open issues with various keywords to see if anything comes up. For this particular issue I tried "set_output", "TransactionEncoder", and "get_feature_names_out". The first and third yielded no results, and the second returned a few issues, none of them related to the format of the output. I'm good to proceed.
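I ran these searches in the browser, but the same keyword checks can be scripted. The sketch below only builds query URLs for GitHub's issue search endpoint and makes no network calls. `issue_search_url` is a hypothetical helper, and the repository slug is my assumption of where mlxtend's issues live.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/issues"
REPO = "rasbt/mlxtend"  # assumed repository slug


def issue_search_url(keyword: str, repo: str = REPO) -> str:
    """Build a search URL for open issues in `repo` that mention `keyword`."""
    query = f"{keyword} repo:{repo} is:issue is:open"
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query})}"


for keyword in ("set_output", "TransactionEncoder", "get_feature_names_out"):
    print(issue_search_url(keyword))
```

Dropping the is:open qualifier would also surface closed issues, which is worth a look in case the feature was proposed and declined before.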

mlxtend's issue template has five major categories:

Bug report
Documentation improvement
Feature request
Other
Usage question

Since the get_feature_names_out method doesn't exist in the TransactionEncoder, I think this should be a feature request.

I started off with a title: "Integrate scikit-learn's set_output method into TransactionEncoder." I want my feature request to be specific and small enough that it can be easily merged without breaking any preexisting code (though I do foresee a scikit-learn version bump).

Next, I need to fill out the following four sections:

Describe the workflow you want to enable
Describe your proposed solution
Describe alternatives you've considered, if relevant
Additional context

Here's what I put for each:

Describe the workflow you want to enable

In scikit-learn version 1.2, the set_output API was introduced. I would like to expose the API inside of the mlxtend.preprocessing.transactionencoder.TransactionEncoder class. This would allow the user to set the output of :method:TransactionEncoder.fit_transform and :method:TransactionEncoder.transform to a pandas.DataFrame by default, rather than having to manually create the object after transformation.

Describe your proposed solution

My proposed solution is to define the :method:get_feature_names_out in :class:TransactionEncoder as this is required to expose the :method:set_output. See :class:TransformerMixin and Developer API for set_output for more details.

Describe alternatives you've considered, if relevant

Continue using the method described in the User Guide: convert the output of the transformer to a pandas.DataFrame manually.

Additional context

This would require the minimum version of scikit-learn to increase from 1.0.2 to 1.2.2.
I'm willing to take on the PR for this work.

Submit

After doing my due diligence, I submitted the feature request/issue. You can keep tabs on it here 👉 Integrate scikit-learn's set_output method into TransactionEncoder. While I wait for one of the package maintainers to green-light my request, I'll scope out how difficult it will be to implement the get_feature_names_out method. I should also see if I need to write or update any unit tests. Catch you in part deuce ✌️.