basket-analysis-pt1

TL;DR¶

In this post I navigate the Stack Overflow API and begin collecting data for a basket analysis on question tags. My goal is to identify tag patterns in the questions answered by recipients of tag badges. By the end, I will have defined tag badges and shown how to gather the questions a user answered on their way to earning one.

Stack Overflow has this thing called reputation. It's a reward given to members of the site when they contribute well-written answers and/or questions. If you write a solid answer to a question, other users with at least 15 reputation can upvote your post, granting you 10 rep each time (up to the daily limit of 200). On the flip side, if you write a poor answer, users can downvote it, each time reducing your current rep by two, though your rep never drops below one.

Each time your reputation increases, you get closer to receiving a new privilege. You could think of them as checkpoints, with the last one granted at 25,000. At the time of writing, I'm only at 3,182 rep, with my next privilege (approve tag wiki edits) coming at 5,000.

Note: this is linked directly to my profile, so the values may change.

In addition to reputation, there are badges and tag badges. Those are represented by the bronze, silver, and gold numbers on my user profile. If we take reputation to be a measure of how well you answer and ask questions (and maybe how often if compared with time 🤔), then tag badges measure your area of expertise. Or maybe the area you're most interested in 🤷.

In November 2022 I earned a bronze python tag badge and a bronze pandas tag badge. Earning these motivated me to earn more, but I wasn't sure if I should focus only on silver tag badges, or if I could pick up another bronze badge along the way. The requirements for a bronze and silver tag badge are:

Bronze: Earn at least 100 total score for at least 20 non-community wiki answers in the tag

Silver: Earn at least 400 total score for at least 80 non-community wiki answers in the tag

On your profile activity page, Stack Overflow gives its recommendation on which tag badge you should try to earn next. With a total score of 47 (out of the required 100) across 73 answers (only 20 are required), mine is the bronze dataframe tag badge. Before writing this, I had been attempting to kill three birds with one stone: earn silver python and pandas tag badges and a bronze dataframe tag badge at the same time. My thought was that because pandas is a Python package and the dataframe is one of the primary objects in pandas, I would be able to find questions that had all three tags fairly often.
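The two requirements can be written as a small check. This is only a sketch: the helper name `earned_tag_badge` and the `REQUIREMENTS` table are made up for illustration, and the inputs are the numbers quoted from my profile rather than values fetched from the API.

```python
# Hypothetical helper: thresholds copied from the bronze and silver
# tag badge requirements listed in this post.
REQUIREMENTS = {
    "bronze": {"score": 100, "answers": 20},
    "silver": {"score": 400, "answers": 80},
}


def earned_tag_badge(rank: str, total_score: int, answer_count: int) -> bool:
    """Return True when both the score and the answer thresholds are met."""
    required = REQUIREMENTS[rank]
    return (
        total_score >= required["score"]
        and answer_count >= required["answers"]
    )


# My dataframe numbers: plenty of answers, not enough score.
print(earned_tag_badge("bronze", total_score=47, answer_count=73))
```

Both conditions have to hold, so this prints `False` for me: 73 answers is well past 20, but a score of 47 is short of 100.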

After some time, I've decided to reconsider my strategy as new questions with all three tags don't appear as frequently as I had hoped. I'm going to look at users who have already earned silver python or pandas tag badges, or a bronze dataframe badge. Maybe the questions they've answered will help guide me towards what badge(s) I should chase next.

Tag Badges¶

To start, I need to learn more about the tag badges I'm currently trying to earn. I establish a connection to the Stack Exchange API like I did in my post, Stack Overflow's API.

In [1]:

from os import getenv

import pandas as pd
from stackapi import StackAPI

# Configuration settings.
pd.options.display.expand_frame_repr = False
pd.options.display.max_columns = 4

# For connecting to the API.
key = getenv("STACK_API_KEY")
SITE = StackAPI(name="stackoverflow", key=key)

Exploring the API docs, I found an endpoint for fetching tag badges: /badges/tags. Looking at the page I can see that I need to include the inname parameter to filter by tag badge name, and both the min and max parameters to limit rank. I define a collection of tuples—one for each badge—and iterate over them, fetching and filtering the results to the items I'm interested in.

In [2]:

tags = (("python", "silver"), ("pandas", "silver"), ("dataframe", "bronze"))
tag_ids = {}
for tag, rank in tags:
    results = SITE.fetch(endpoint="badges/tags", inname=tag, max=rank, min=rank)
    items = results["items"]
    df = pd.DataFrame.from_records(data=items, index="badge_id")
    tag_ids[tag] = df.loc[df.name.eq(tag)].index.to_list()

print(tag_ids)

{'python': [50], 'pandas': [2426], 'dataframe': [5915]}

I extract the badge_id value for each badge to help me limit my API calls going forward.

Badge Recipients¶

Badges by themselves don't do a lot for me; I need to find users who have earned them. Back on the API docs there is an endpoint for fetching badge recipients: /badges/{ids}/recipients. The ids parameter expects a badge_id. We can use the three badge_ids we fetched in the previous section, iterating over them and extending our list of recipients.

In [3]:

recipients = []
# Each value in `tag_ids` is a list of badge ids, which is the form `ids` expects.
for ids in tag_ids.values():
    results = SITE.fetch(endpoint="badges/{ids}/recipients", ids=ids)
    items = results["items"]
    recipients.extend(items)

recipients = pd.json_normalize(data=recipients)
recipients = recipients.set_index(keys="user.user_id")

print(recipients.name.value_counts())

name
python       500
dataframe    435
pandas       193
Name: count, dtype: int64

We fetched:

500 recipients of the silver python tag badge
435 of the bronze dataframe tag badge
193 of the silver pandas tag badge

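Note that the maximum number of items returned by one fetch is 500 (by default, StackAPI requests up to 5 pages of 100 items). The dataframe and pandas counts fall below that cap, so only 435 and 193 users have earned the bronze dataframe and silver pandas tag badges, respectively. The python count sits exactly at the cap, so those 500 recipients are likely only part of the full list. We can see how many users earned each badge by looking at the award_count field returned by the /badges/tags endpoint.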
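One way to spot a truncated list is to compare the number of recipients fetched with each badge's award_count. The sketch below assumes badge items shaped like the ones from /badges/tags (a `name` and an `award_count`); the helper name `truncated_badges` is hypothetical and the numbers are placeholders, not real API output.

```python
def truncated_badges(badges, fetched_counts):
    """Return the names of badges with fewer fetched recipients than awards."""
    return [
        badge["name"]
        for badge in badges
        if fetched_counts.get(badge["name"], 0) < badge["award_count"]
    ]


# Placeholder values for illustration only, not real API output.
badges = [
    {"name": "python", "award_count": 9999},
    {"name": "pandas", "award_count": 193},
]
fetched_counts = {"python": 500, "pandas": 193}

print(truncated_badges(badges, fetched_counts))
```

With these placeholders only python is reported. In the notebook, the counts could come from something like `recipients.name.value_counts().to_dict()`.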

Reverse Engineering¶

There isn't an endpoint that returns the questions whose answers contributed to a user earning a tag badge. We'll have to do a little grunt work, starting with determining when a user earned their silver pandas tag badge. The /users/{ids}/timeline endpoint returns a user's public actions and accomplishments in chronological order. We can use that to get the date and time a user earned a tag badge.

To make sure we get all timeline events, we'll increment the page parameter by 5 on each iteration. Each call fetches up to 5 pages of 100 items, so this gets us up to 500 items each time, and we stop when items comes back empty.

In [4]:

from functools import partial

# The first silver pandas tag badge recipient.
pd_users = recipients.loc[recipients.name.eq("pandas")]
user = pd_users.index[0]

fetch_timeline = partial(SITE.fetch, endpoint="users/{ids}/timeline")
page = 1
timeline = []
# `ids` is passed as a list, the form used in StackAPI's examples.
results = fetch_timeline(ids=[user], page=page)
items = results["items"]
while items:
    # Repeat until no `items` are returned.
    timeline.extend(items)
    # Each fetch covers 5 pages, so skip ahead by 5.
    page += 5
    results = fetch_timeline(ids=[user], page=page)
    items = results["items"]

timeline = pd.json_normalize(timeline)
badge_date = timeline.loc[
    timeline.timeline_type.eq("badge")
    & timeline.badge_id.isin(tag_ids["pandas"]),
    "creation_date"
].iloc[0]

print(badge_date)

1709611287

Per the docs,

All dates in the API are in unix epoch time, which is the number of seconds since midnight UTC January 1st, 1970.

We can convert the badge_date to a more human-readable format using the datetime module.

In [5]:

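from datetime import datetime as dt, timezone

# utcfromtimestamp is deprecated as of Python 3.12; build an aware UTC datetime instead.
badge_dt = dt.fromtimestamp(badge_date, tz=timezone.utc)
human_readable_badge_date = badge_dt.strftime("%Y-%m-%d %H:%M:%S")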

print(human_readable_badge_date)

2024-03-05 04:01:27
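Date parameters sent to the API are epoch seconds too, so it can be handy to go the other way. Here is a small sketch using only the standard library; the variable names are mine, and the date is the badge date shown in this post.

```python
from datetime import datetime, timezone

# Build a timezone-aware datetime; a naive one would be read as local time.
moment = datetime(2024, 3, 5, 4, 1, 27, tzinfo=timezone.utc)
epoch_seconds = int(moment.timestamp())

print(epoch_seconds)
```

This prints `1709611287`, the same value the timeline returned for the badge.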

User Answers¶

With the badge_date, we can limit our answers to those posted by the user on or before the badge_date. The endpoint to find answers written by a specific user is /users/{ids}/answers.

In [6]:

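fetch_answers = partial(
    SITE.fetch,
    endpoint="users/{ids}/answers",
    sort="creation",
    # With sort="creation", `max` is the latest creation date to include.
    max=badge_date,
)
page = 1
answers = []
# `ids` is passed as a list, the form used in StackAPI's examples.
results = fetch_answers(ids=[user], page=page)
items = results["items"]
while items:
    # Repeat until no `items` are returned.
    answers.extend(items)
    # Each fetch covers 5 pages, so skip ahead by 5.
    page += 5
    results = fetch_answers(ids=[user], page=page)
    items = results["items"]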

answers = pd.json_normalize(answers)
answers = answers.set_index(keys="answer_id")

print(answers.index.nunique())

This particular user wrote 894 answers between their account creation and the badge_date!
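The paging loop for answers is the same one used for the timeline, so it could be pulled into a helper. This is a rough sketch, not the final version: `fetch_all` and `fake_fetch` are names I made up, and `fake_fetch` stands in for `SITE.fetch` so the example runs without hitting the API.

```python
def fetch_all(fetch, step=5, **params):
    """Call `fetch` repeatedly, advancing `page` by `step`, until no items return."""
    items = []
    page = 1
    while True:
        batch = fetch(page=page, **params)["items"]
        if not batch:
            break
        items.extend(batch)
        page += step
    return items


def fake_fetch(page, **params):
    """Offline stand-in for SITE.fetch: 12 items, one per page, 5 pages per call."""
    data = list(range(12))
    start = page - 1
    return {"items": data[start:start + 5]}


print(len(fetch_all(fake_fetch)))
```

The stand-in serves 5, 5, then 2 items before returning an empty list, so this prints `12`. Against the real API, the call might look like `fetch_all(fetch_answers, ids=[user])`.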

Questions¶

The final piece of the data collection process: the questions. A question's tags control how an answer and its score contribute to earning a tag badge. To get the questions for our user's answers, we use the /questions/{ids} endpoint.

It's worth noting that a user can write multiple answers to the same question. This doesn't appear to be against Stack Exchange policy (see Are multiple answers by the same user acceptable?), so it's best to remove duplicates before fetching results to avoid requesting the same question twice.

In [7]:

# Convert to set as some users may provide multiple answers to the same question.
question_ids = [*set(answers.question_id)]

questions = []
# The endpoint only accepts 100 ids at a time, so we increment in batches of 100.
for i in range(0, len(question_ids), 100):
    ids = question_ids[i:i + 100]
    results = SITE.fetch(endpoint="questions/{ids}", ids=ids)
    items = results["items"]
    questions.extend(items)

questions = pd.json_normalize(questions)
questions = questions.set_index("question_id")

If you'll recall, we collected all answers posted by the user on or before the badge_date. This includes answers to questions unrelated to the silver pandas tag badge. We limit to questions of interest by keeping only those whose tags include "pandas".

Each value in questions.tags is a list of tag names, so we apply the operator.contains function to each one to test for membership.

In [8]:

from operator import contains

contains_pandas = lambda a: contains(a, "pandas")
pd_questions = questions.loc[questions.tags.apply(contains_pandas)]
pd_tags = pd_questions.tags

print(pd_tags.index.nunique())

This user wrote answers to 346 questions tagged with pandas.
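Because each tags value is a list, operator.contains matches a whole tag rather than part of one. The sketch below shows the difference on a few made-up tag lists shaped like the `tags` field of a question.

```python
from operator import contains

# Made-up tag lists, shaped like the `tags` field on a question.
tag_lists = [
    ["python", "pandas", "dataframe"],
    ["python", "pandas-groupby"],
    ["r", "ggplot2"],
]

# List membership is an exact match on the whole tag.
exact = [tags for tags in tag_lists if contains(tags, "pandas")]

# A substring test would also catch tags that merely mention pandas.
loose = [tags for tags in tag_lists if any("pandas" in tag for tag in tags)]

print(len(exact), len(loose))
```

This prints `1 2`. The exact match is the one I want here, since only questions carrying the pandas tag itself count towards the pandas tag badge.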

Conclusion¶

Getting the question tags for a single user wasn't exactly a walk in the park. Before doing this for the other recipients, it'd be best to convert some of this logic into reusable functions. We'll work on that in the next post, but feel free to try it out on your own 🙂.