Basket Analysis, Part I: Navigating the API
TL;DR¶
In this post I navigate the Stack Overflow API and begin collecting data for a basket analysis on question tags. My goal is to identify patterns in the tags of questions answered by recipients of tag badges. By the end, I will have defined tag badges and shown how to gather the questions a user answered on their way to earning a tag badge.
Stack Overflow has this thing called reputation. It's like a reward given to members of the site when they contribute well-written answers and/or questions. If you write a solid answer to a question, other users with a minimum reputation can upvote your post, granting you 10 rep each time (up to the daily limit of 200). On the flip side, if you write a poor answer, users can downvote it, each time reducing your current rep by two—all the way down to one rep.
Each time your reputation increases, you get closer to receiving a new privilege. You could think of them as checkpoints, with the last one granted at 25,000. At the time of writing, I'm only at 3,182 rep, with my next privilege (approve tag wiki edits) coming at 5,000.
In addition to reputation, there are badges and tag badges. Those are represented by the bronze, silver, and gold numbers on my user profile. If we take reputation to be a measure of how well you answer and ask questions (and maybe how often if compared with time 🤔), then tag badges measure your area of expertise. Or maybe the area you're most interested in 🤷.
In November 2022 I earned a bronze python tag badge and a bronze pandas tag badge. Earning these motivated me to earn more, but I wasn't sure if I should focus only on silver tag badges, or if I could pick up another bronze badge along the way.
The requirements for a bronze and silver tag badge are:
Bronze: Earn at least 100 total score for at least 20 non-community wiki answers in the tag
Silver: Earn at least 400 total score for at least 80 non-community wiki answers in the tag
On your profile activity page, Stack Overflow gives its recommendation on which tag badge you should try to earn next. With a total score of 47 (out of 100) and 73 answers (against the 20 required), mine is the bronze dataframe tag badge.
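The two thresholds are independent, which is why 73 answers alone are not enough. Below is a minimal sketch of that check; the function names `badge_progress` and `is_earned` are hypothetical helpers, and the thresholds are copied from the requirements above.

```python
# Thresholds copied from the bronze and silver requirements listed above.
REQUIREMENTS = {
    "bronze": {"score": 100, "answers": 20},
    "silver": {"score": 400, "answers": 80},
}


def badge_progress(total_score: int, answer_count: int, rank: str) -> dict:
    """Report what is still missing for a tag badge of the given rank."""
    needed = REQUIREMENTS[rank]
    return {
        "score_remaining": max(needed["score"] - total_score, 0),
        "answers_remaining": max(needed["answers"] - answer_count, 0),
    }


def is_earned(total_score: int, answer_count: int, rank: str) -> bool:
    """Both thresholds must be met; either one alone is not enough."""
    progress = badge_progress(total_score, answer_count, rank)
    return progress["score_remaining"] == 0 and progress["answers_remaining"] == 0


# A score of 47 across 73 answers clears the answer count but not the score.
print(badge_progress(47, 73, "bronze"))  # {'score_remaining': 53, 'answers_remaining': 0}
print(is_earned(47, 73, "bronze"))       # False
```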
Before writing this, I had been attempting to kill three birds with one stone: earn silver python and pandas tag badges and a bronze dataframe tag badge at the same time. My thought was that because pandas is a python package and dataframe is one of the primary objects in pandas, I would be able to find questions that had all three tags fairly often.
After some time, I've decided to reconsider my strategy as new questions with all three tags don't appear as frequently as I had hoped. I'm going to look at users who have already earned silver python or pandas tag badges, or a bronze dataframe badge. Maybe the questions they've answered will help guide me towards what badge(s) I should chase next.
Tag Badges¶
To start, I need to learn more about the tag badges I'm currently trying to earn. I establish a connection to the Stack Exchange API like I did in my post, Stack Overflow's API.
from os import getenv
import pandas as pd
from stackapi import StackAPI
# Configuration settings.
pd.options.display.expand_frame_repr = False
pd.options.display.max_columns = 4
# For connecting to the API.
key = getenv("STACK_API_KEY")
SITE = StackAPI(name="stackoverflow", key=key)
Exploring the API docs, I found an endpoint for fetching tag badges: /badges/tags. Looking at the page I can see that I need to include the inname parameter to filter by tag badge name, and both the min and max parameters to limit rank. I define a collection of tuples, one for each badge, and iterate over them, fetching and filtering the results to the items I'm interested in.
tags = (("python", "silver"), ("pandas", "silver"), ("dataframe", "bronze"))
tag_ids = {}
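for tag, rank in tags:
    # Setting min and max to the same rank returns only badges of that rank.
    results = SITE.fetch(endpoint="badges/tags", inname=tag, max=rank, min=rank)
    items = results["items"]
    df = pd.DataFrame.from_records(data=items, index="badge_id")
    # Keep only the badge whose name matches the tag exactly.
    tag_ids[tag] = df.loc[df.name.eq(tag)].index.to_list()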
print(tag_ids)
{'python': [50], 'pandas': [2426], 'dataframe': [5915]}
I extract the badge_id value for each badge to help me limit my API calls going forward.
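The exact-name filter in the loop matters if inname matches on substrings, which is my reading of the docs and an assumption here. The sketch below shows the idea with plain dictionaries. Only badge_id 50 comes from the output above; the other ids and the `exact_badge_ids` helper are made up for illustration.

```python
# Hypothetical response items. If `inname` matches substrings, a search for
# "python" could also return badges for related tags. Only badge_id 50 is
# taken from the real output; the other two ids are invented.
items = [
    {"badge_id": 50, "name": "python", "rank": "silver"},
    {"badge_id": 900001, "name": "python-3.x", "rank": "silver"},
    {"badge_id": 900002, "name": "ipython", "rank": "silver"},
]


def exact_badge_ids(items: list[dict], tag: str) -> list[int]:
    """Keep only badges whose name equals the tag exactly."""
    return [item["badge_id"] for item in items if item["name"] == tag]


print(exact_badge_ids(items, "python"))  # [50]
```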
Badge Recipients¶
Badges by themselves don't do a lot for me; I need to find users who have earned them.
Back on the API docs there is an endpoint for fetching badge recipients: /badges/{ids}/recipients. The ids parameter expects a badge_id. We can use the three badge_id values we fetched in the previous section, iterating over them and extending our list of recipients.
recipients = []
for tag, ids in tag_ids.items():
    results = SITE.fetch(endpoint="badges/{ids}/recipients", ids=ids)
    items = results["items"]
    recipients.extend(items)
recipients = pd.json_normalize(data=recipients)
recipients = recipients.set_index(keys="user.user_id")
print(recipients.name.value_counts())
name
python       500
dataframe    435
pandas       193
Name: count, dtype: int64
We fetched:
- 500 recipients of the silver python tag badge
- 435 of the bronze dataframe tag badge
- 193 of the silver pandas tag badge
Note that the max number of result items returned in one fetch is 500. This means only 435 and 193 users have earned the bronze dataframe and silver pandas tag badges, respectively, while the list of silver python recipients has probably been cut off at the limit. We can see how many users earned each badge by looking at the award_count field returned by the /badges/tags endpoint.
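A quick way to spot a truncated fetch is to compare the number of recipients fetched against award_count. The sketch below assumes award_count has already been read from the badge items. The fetched counts come from the output above, but the award_count values are hypothetical placeholders, and `is_truncated` is an illustrative helper.

```python
def is_truncated(fetched: int, award_count: int) -> bool:
    """True when fewer recipients were fetched than the badge has awarded."""
    return fetched < award_count


# Fetched counts come from the output above. The award_count values are
# hypothetical placeholders, not real figures from the API.
fetched_counts = {"python": 500, "dataframe": 435, "pandas": 193}
award_counts = {"python": 12000, "dataframe": 435, "pandas": 193}

for name, fetched in fetched_counts.items():
    status = "truncated" if is_truncated(fetched, award_counts[name]) else "complete"
    print(f"{name}: fetched {fetched} of {award_counts[name]} ({status})")
```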
Reverse Engineering¶
There isn't an endpoint to return the questions a user wrote answers for that contributed to them earning a tag badge.
We'll have to do a little grunt work, starting with determining when a user earned their silver pandas tag badge. The /users/{ids}/timeline endpoint will return users' public actions and accomplishments in chronological order. We can use that to get the date and time a user earned a tag badge. To make sure we get all timeline events, we'll increment the page parameter by 5 on each iteration. This will fetch us 500 items each time, stopping if items is empty.
from functools import partial
# The first silver pandas tag badge recipient.
pd_users = recipients.loc[recipients.name.eq("pandas")]
user = pd_users.index[0]
fetch_timeline = partial(SITE.fetch, endpoint="users/{ids}/timeline")
page = 1
timeline = []
results = fetch_timeline(ids=user, page=page)
items = results["items"]
while items:
    # Repeat until no `items` are returned
    timeline.extend(items)
    page += 5
    results = fetch_timeline(ids=user, page=page)
    items = results["items"]
timeline = pd.json_normalize(timeline)
# Select the creation date of the event that awarded the silver pandas badge.
badge_date = timeline.loc[
    timeline.timeline_type.eq("badge")
    & timeline.badge_id.isin(tag_ids["pandas"]),
    "creation_date"
].iloc[0]
print(badge_date)
1709611287
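The lookup above raises an IndexError if a user's timeline has no matching badge event, which could come up once the same steps run over many users. Here is a minimal sketch of a lookup that returns None instead, written against plain dictionaries. The `find_badge_date` helper and the sample events are hypothetical; only the field names and the two values 2426 and 1709611287 come from this post.

```python
from typing import Optional


def find_badge_date(timeline: list[dict], badge_ids: list[int]) -> Optional[int]:
    """Return the creation_date of the first matching badge event, if any."""
    for event in timeline:
        if event.get("timeline_type") == "badge" and event.get("badge_id") in badge_ids:
            return event["creation_date"]
    return None


# Hypothetical timeline items shaped like the fields used above.
sample = [
    {"timeline_type": "answered", "creation_date": 1709600000},
    {"timeline_type": "badge", "badge_id": 2426, "creation_date": 1709611287},
]

print(find_badge_date(sample, [2426]))  # 1709611287
print(find_badge_date(sample, [50]))    # None
```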
Per the docs, "All dates in the API are in unix epoch time, which is the number of seconds since midnight UTC January 1st, 1970."
We can convert the badge_date to a more human-readable format using the datetime module.
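from datetime import datetime as dt, timezone
# utcfromtimestamp() is deprecated as of Python 3.12, so pass an explicit UTC timezone.
human_readable_badge_date = dt.fromtimestamp(badge_date, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")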
print(human_readable_badge_date)
2024-03-05 04:01:27
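Going the other way is handy when a date has to be passed back to the API as a parameter. The sketch below round-trips the value using only the standard library; `to_epoch` and `to_text` are hypothetical helper names.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H:%M:%S"


def to_epoch(text: str) -> int:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as UTC and return epoch seconds."""
    parsed = datetime.strptime(text, FORMAT)
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def to_text(epoch: int) -> str:
    """Format epoch seconds as a UTC 'YYYY-MM-DD HH:MM:SS' string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(FORMAT)


print(to_epoch("2024-03-05 04:01:27"))  # 1709611287
print(to_text(1709611287))              # 2024-03-05 04:01:27
```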
User Answers¶
With the badge_date, we can limit the answers to those posted by the user up to and including the badge_date. The endpoint to find answers written by a specific user is /users/{ids}/answers.
fetch_answers = partial(
    SITE.fetch,
    endpoint="users/{ids}/answers",
    sort="creation",
    # With sort="creation", max caps the creation date of returned answers.
    max=badge_date,
)
page = 1
answers = []
results = fetch_answers(ids=user, page=page)
items = results["items"]
while items:
    # Repeat until no `items` are returned
    answers.extend(items)
    page += 5
    results = fetch_answers(ids=user, page=page)
    items = results["items"]
answers = pd.json_normalize(answers)
answers = answers.set_index(keys="answer_id")
print(answers.index.nunique())
894
This particular user wrote 894 answers between their account creation and the badge_date!
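The same paging loop has now appeared twice, once for the timeline and once for the answers. As a rough sketch of how it might be pulled into one place, the hypothetical `fetch_all` below accepts any callable that takes a page argument and returns a dictionary with an "items" key. `fake_fetch` is a stand-in for SITE.fetch so the example runs without network access.

```python
from typing import Callable


def fetch_all(fetch: Callable[..., dict], step: int = 5, **params) -> list:
    """Call `fetch` with an advancing `page` until it returns no items."""
    collected = []
    page = 1
    while True:
        items = fetch(page=page, **params)["items"]
        if not items:
            return collected
        collected.extend(items)
        page += step


# A stand-in for SITE.fetch. It pretends the API holds 12 items, one per
# page, and that each call returns up to 5 pages starting at `page`.
def fake_fetch(page: int, **params) -> dict:
    data = list(range(12))
    start = page - 1
    return {"items": data[start:start + 5]}


print(len(fetch_all(fake_fetch)))  # 12
```

With the real client, a call along the lines of fetch_all(SITE.fetch, endpoint="users/{ids}/answers", ids=user, sort="creation", max=badge_date) should behave like the loop above, though I haven't run it that way here.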
Questions¶
The final piece of the data collection process: the questions. A question's tags control how an answer and its score contribute to earning a tag badge. To get the questions for our user's answers, we use the /questions/{ids} endpoint.
It's worth noting that a user can write multiple answers to the same question. This doesn't appear to be against Stack Exchange policy—see Are multiple answers by the same user acceptable? —so it's best to remove duplicates before fetching results to avoid running the same query.
# Convert to set as some users may provide multiple answers to the same question.
question_ids = [*set(answers.question_id)]
questions = []
# The endpoint only accepts 100 ids at a time, so we increment in batches of 100.
for i in range(0, len(question_ids), 100):
    ids = question_ids[i:i + 100]
    results = SITE.fetch(endpoint="questions/{ids}", ids=ids)
    items = results["items"]
    questions.extend(items)
questions = pd.json_normalize(questions)
questions = questions.set_index("question_id")
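The de-duplication and batching can be isolated from the API call, which makes the 100-id limit easy to check on its own. This is a minimal sketch; `batched_ids` is a hypothetical helper and the sample ids are invented.

```python
from typing import Iterable, Iterator


def batched_ids(ids: Iterable[int], size: int = 100) -> Iterator[list]:
    """Yield de-duplicated ids in lists of at most `size`."""
    unique = sorted(set(ids))
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


# 250 hypothetical question ids, with the first 50 repeated.
sample_ids = list(range(250)) + list(range(50))
print([len(batch) for batch in batched_ids(sample_ids)])  # [100, 100, 50]
```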
If you'll recall, we collected all answers posted by the user before the badge_date. This includes answers to questions unrelated to the silver pandas tag badge. We limit to questions of interest by filtering the tags values to those that contain the string "pandas". questions.tags is a pandas.Series whose values are lists of tag names, so we apply the operator.contains function to each value.
from operator import contains
contains_pandas = lambda a: contains(a, "pandas")
pd_questions = questions.loc[questions.tags.apply(contains_pandas)]
pd_tags = pd_questions.tags
print(pd_tags.index.nunique())
346
This user wrote answers to 346 questions tagged with pandas.
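One detail worth spelling out: on a list, operator.contains tests membership, so only the exact pandas tag counts. Related tags such as pandas-groupby do not match, as they would with a substring search. The tag lists below are hypothetical examples shaped like the values in questions.tags.

```python
from operator import contains

# Hypothetical tag lists shaped like the values in `questions.tags`.
tag_lists = [
    ["python", "pandas", "dataframe"],
    ["python", "pandas-groupby"],
    ["python", "numpy"],
]

# On a list, `contains` tests membership, so "pandas-groupby" does not match.
print([contains(tags, "pandas") for tags in tag_lists])            # [True, False, False]
# On a joined string it tests for a substring, so "pandas-groupby" does match.
print([contains(" ".join(tags), "pandas") for tags in tag_lists])  # [True, True, False]
```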
Conclusion¶
Getting the question tags for a single user wasn't exactly a walk in the park. Before doing this for the other recipients, it'd be best to convert some of this logic into reusable functions. We'll work on that in the next post, but feel free to try it out on your own 🙂.