I thought it’d be cool to add my most recent blog post to my GitHub profile using a custom GitHub action and workflow. To do that, I’d need to know which post was most recently published along with its title, URL, etc. After looking at the GoodReads workflow on my profile, I figured I could get that information from my blog’s RSS feed.

RSS is a web feed that allows users and applications to access updates to websites in a standardized, computer-readable format.

—Wikipedia

Accessing my blog’s RSS feed is as simple as adding /feed.xml to the end of its URL. Check it out here 👉 https://it176131.github.io/feed.xml.

Why .xml and not .rss? It turns out that RSS is an XML-based format, so in theory either extension could work. However, .rss doesn’t work on the URL, so .xml it is.

pydantic-xml

I mentioned using pydantic in a previous post for parsing and validating JSON files. If you explore the library’s main page, you’ll hopefully come across a link to the awesome-pydantic repo which contains a list of projects that use pydantic. Under the Utilities section you’ll find a package called pydantic-xml, which extends pydantic to allow parsing and validation of XML.

I don’t have much experience parsing XML, but I do know how to use pydantic. How hard could it be to transfer my pydantic knowledge to pydantic-xml?

XML != JSON

Validating JSON with pydantic requires you to define a class, then supply the JSON as an argument. pydantic-xml is similar, but there are some gotchas.

Namespaces

They’re for allowing multiple markup languages to be combined, without having to worry about conflicts of element and attribute names.

What are XML namespaces for?

The first couple of lines in my blog’s XML look like this:

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    ...
</feed>

Based on what I saw in the pydantic-xml Quickstart, I’d expect the model class to be defined like so:

import httpx
from httpx import Response
from pydantic_xml.model import BaseXmlModel
from rich.console import Console


class Feed(BaseXmlModel):
    """Validate the RSS feed/XML from my blog."""
    
    ...


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)

But no. This raises the following error:

pydantic_xml.errors.ParsingError: root element not found (actual: {http://www.w3.org/2005/Atom}feed, expected: Feed)

Confused, but not beaten, I read on. After trying the example code a few times and altering my own XML, I found that I needed to either:

  1. make my “F” lowercase in the class definition to match the XML tag name, i.e. Feed ➡️ feed, or
  2. add tag="feed" to the Feed signature line.

I chose the latter as I prefer uppercased class names.

import httpx
from httpx import Response
from pydantic_xml.model import BaseXmlModel
from rich.console import Console


- class Feed(BaseXmlModel):
+ class Feed(BaseXmlModel, tag="feed"):
    """Validate the RSS feed/XML from my blog."""
    
    ...


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)

Running my updated code, I was greeted with another error 😣:

pydantic_xml.errors.ParsingError: root element not found (actual: {http://www.w3.org/2005/Atom}feed, expected: feed)

This one appears identical to the previous, but with one small difference—the expected value is now “feed” instead of “Feed”. That means adding tag="feed" correctly told the underlying parser to look for a “feed” tag, but for some reason it can’t find it.

Looking at the actual tag in the error message, {http://www.w3.org/2005/Atom}feed, I noticed that the URL is assigned to an attribute called xmlns, which stands for XML namespace. Searching for “namespace” in the pydantic-xml docs, I found the missing link: I need to include the tag’s namespace in my class signature.

+ from typing import Final

import httpx
from httpx import Response
from pydantic_xml.model import BaseXmlModel
from rich.console import Console

+ NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


- class Feed(BaseXmlModel, tag="feed"):
+ class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP):
    """Validate the RSS feed/XML from my blog."""
    
    ...


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)  # >>> Feed()

[!NOTE]

The key in the NSMAP is an empty string. This sets a default namespace for a model and its sub-fields.

pydantic-xml Default namespace

That modification allowed me to parse the first couple lines of the XML. Now to get to my entries.
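The {http://www.w3.org/2005/Atom}feed name from those error messages isn’t unique to pydantic-xml. Here’s a minimal sketch using the standard library’s xml.etree.ElementTree, which isn’t part of my original code. The trimmed ATOM_XML bytes and the "atom" prefix are stand-ins I made up for illustration. It shows a default namespace being folded into every tag name the same way:

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the real feed (assumption: same default namespace).
ATOM_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>/feed.xml</id>
</feed>"""

# The "atom" prefix is arbitrary; only the URI has to match the document.
NS = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring(ATOM_XML)

# The parser reports the namespace as part of the tag name.
print(root.tag)  # {http://www.w3.org/2005/Atom}feed

# A bare tag name doesn't match a namespaced element...
print(root.find("id"))  # None

# ...but a namespace-qualified name does.
print(root.find("atom:id", NS).text)  # /feed.xml
```

Seen this way, the nsmap argument is just the model’s way of saying which namespace its tag names live in.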

Order Matters

Accessing fields in JSON is similar to interacting with a Python dict:

some_dict = {"a": 0, "b": 1}
print(some_dict["b"])  # >>> 1

Because of this, you can ignore the fields you don’t care about when defining a pydantic class:

from pydantic.main import BaseModel


class Model(BaseModel):
    """Demo model."""
    
    b: int


if __name__ == "__main__":
    some_dict = {"a": 0, "b": 1}
    model = Model(**some_dict)
    print(model)  # >>> Model(b=1)

This doesn’t work (by default) when defining a pydantic-xml class.

These are the next few lines in my XML, from the <feed> tag up through the first <entry> tag:

[!NOTE]

The <content> and <summary> descriptions were shortened for brevity.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator>
    <link href="/feed.xml" rel="self" type="application/atom+xml"/>
    <link href="/" rel="alternate" type="text/html"/>
    <updated>2024-12-13T02:39:38+00:00</updated>
    <id>/feed.xml</id>
    <title type="html">My Blog</title>
    <subtitle>Where I write things...</subtitle>
    <author>
        <name>Ian Thompson</name>
    </author>
    <entry>
        <title type="html">isort + git: Cleaner Import Statements for Those Who Don’t Like pre-commit</title>
        <link href="/2024/12/12/isort.html" rel="alternate" type="text/html"
              title="isort + git: Cleaner Import Statements for Those Who Don’t Like pre-commit"/>
        <published>2024-12-12T00:00:00+00:00</published>
        <updated>2024-12-12T00:00:00+00:00</updated>
        <id>/2024/12/12/isort</id>
        <content type="html" xml:base="/2024/12/12/isort.html">...</content>
        <author>
            <name>Ian Thompson</name>
        </author>
        <summary type="html">...</summary>
    </entry>
    ...
</feed>

If I were to define the Feed.entry attribute like I would in pydantic:

from typing import Final

import httpx
from httpx import Response
from pydantic_xml.model import BaseXmlModel
from rich.console import Console

NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


# NOTE -- we have to declare the _same_ `nsmap` for our `Entry` class as
# we did in the `Feed` class, otherwise we'll run into the same errors
# from before.
class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP):
    """A blog post entry from the RSS feed."""

    ...


class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP):
    """Validate the RSS feed/XML from my blog."""
    
    entry: Entry


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)

I’d get the following error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for Feed
entry
  [line -1]: Field required [type=missing, input_value={}, input_type=dict]

This is a pydantic ValidationError, which I’m quite familiar with, but it’s not immediately obvious how to fix it, let alone why it was raised. Through some trial and error, I found that adding an attribute for the first element after <feed>, i.e., <generator>, and removing the entry attribute results in a successful run:

from typing import Final

import httpx
from httpx import Response
- from pydantic_xml.model import BaseXmlModel
+ from pydantic_xml.model import BaseXmlModel, element
from rich.console import Console

NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


# NOTE -- we have to declare the _same_ `nsmap` for our `Entry` class as
# we did in the `Feed` class, otherwise we'll run into the same errors
# from before.
class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP):
    """A blog post entry from the RSS feed."""

    ...


class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP):
    """Validate the RSS feed/XML from my blog."""
    
+     # We define `generator` to be a dictionary element to capture its 
+     # attribute keys and values.
+     generator: dict[str, str] = element()
-     entry: Entry


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)  # >>> Feed(generator={'uri': 'https://jekyllrb.com/', 'version': '3.10.0'})

But why? Because of how the pydantic-xml model searches for its subfields. According to the pydantic-xml docs, there are three search modes: strict (the default), ordered, and unordered.

Both the strict and ordered search methods require the model’s subfields to mirror the order in the XML document, but the latter offers a bit more flexibility by allowing “unknown” fields to be skipped. Or in our case, fields we don’t care about. That means setting search_mode="ordered" in our Feed signature should allow us to skip all the way to our Entry subfield.

from typing import Final

import httpx
from httpx import Response
+ from pydantic_xml.model import BaseXmlModel
- from pydantic_xml.model import BaseXmlModel, element
from rich.console import Console

NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


# NOTE -- we have to declare the _same_ `nsmap` for our `Entry` class as
# we did in the `Feed` class, otherwise we'll run into the same errors
# from before.
class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP):
    """A blog post entry from the RSS feed."""

    ...


- class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP):
+ class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP, search_mode="ordered"):
    """Validate the RSS feed/XML from my blog."""

-     # We define `generator` to be a dictionary element to capture its 
-     # attribute keys and values.
-     generator: dict[str, str] = element()
+     entry: Entry


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)  # >>> Feed(entry=Entry())

And it does.

The downside to this is we lose everything between the <feed> and <entry> tags. In pydantic, we’d be able to see everything else in model_extra if we set extra="allow" in the model_config. Unfortunately, this doesn’t work in pydantic-xml.

Duplicate Field Names

This is the next entry in my XML, minus the content and summary.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator>
    <link href="/feed.xml" rel="self" type="application/atom+xml"/>
    <link href="/" rel="alternate" type="text/html"/>
    <updated>2024-12-13T02:39:38+00:00</updated>
    <id>/feed.xml</id>
    <title type="html">My Blog</title>
    <subtitle>Where I write things...</subtitle>
    <author>
        <name>Ian Thompson</name>
    </author>
    <entry>
        <title type="html">isort + git: Cleaner Import Statements for Those Who Don’t Like pre-commit</title>
        <link href="/2024/12/12/isort.html" rel="alternate" type="text/html"
              title="isort + git: Cleaner Import Statements for Those Who Don’t Like pre-commit"/>
        <published>2024-12-12T00:00:00+00:00</published>
        <updated>2024-12-12T00:00:00+00:00</updated>
        <id>/2024/12/12/isort</id>
        <content type="html" xml:base="/2024/12/12/isort.html">...</content>
        <author>
            <name>Ian Thompson</name>
        </author>
        <summary type="html">...</summary>
    </entry>
    <entry>
        <title type="html">PyCharm: Projects &amp;amp; Environments</title>
        <link href="/2024/12/03/pycharm-projects-envs.html" rel="alternate" type="text/html"
              title="PyCharm: Projects &amp;amp; Environments"/>
        <published>2024-12-03T00:00:00+00:00</published>
        <updated>2024-12-03T00:00:00+00:00</updated>
        <id>/2024/12/03/pycharm-projects-envs</id>
        <content type="html" xml:base="/2024/12/03/pycharm-projects-envs.html">...</content>
        <author>
            <name>Ian Thompson</name>
        </author>
        <summary type="html">...</summary>
    </entry>
</feed>

You’ll notice that each <entry> tag is at the same level. In JSON, this would be the equivalent of having multiple fields with the same name:

{
  "Field": "Value1",
  "Field": "Value2"
}

And validating this with pydantic would yield some interesting results:

from typing import Annotated

from pydantic.fields import Field
from pydantic.main import BaseModel
from rich.console import Console


class Model(BaseModel):
    field: Annotated[str, Field(alias="Field")]
    field: Annotated[str, Field(alias="Field")]


if __name__ == "__main__":
    json_as_python_dict = {
        "Field": "Value1",
        "Field": "Value2",
    }
    model = Model(**json_as_python_dict)
    console = Console()
    console.print(model)  # >>> Model(field='Value2')

While having multiple keys with the same name in JSON is technically allowed, it’s not good practice. And using Python to read the JSON resolves it to a dict which can’t have duplicate keys. In fact, defining a dictionary with duplicate keys is roughly equivalent to merging two dict objects with the same key(s). This results in a “last seen wins” value-assignment to the duplicate key (see PEP 584 for more details).
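The “last seen wins” behavior is easy to confirm with the standard library’s json module. This is a quick sketch of my own, not something from pydantic. The object_pairs_hook argument shows that both pairs are present in the raw JSON before they collapse into a dict:

```python
import json

RAW = '{"Field": "Value1", "Field": "Value2"}'

# By default, the pairs are loaded into a dict, so the later value wins.
parsed = json.loads(RAW)
print(parsed)  # {'Field': 'Value2'}

# object_pairs_hook receives every (key, value) pair in document order,
# so keeping them as a list preserves the duplicate key.
pairs = json.loads(RAW, object_pairs_hook=list)
print(pairs)  # [('Field', 'Value1'), ('Field', 'Value2')]
```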

What about XML? Duplicate tag names are allowed; how does pydantic-xml handle them? It turns out that by simply differentiating the attribute names, e.g. “entry1” and “entry2,” the pydantic-xml model will assign the tags in the order it discovers them.

from datetime import datetime
from typing import Final

import httpx
from httpx import Response
from pydantic_xml.model import BaseXmlModel, element
from rich.console import Console

NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP):
    """A blog post entry from the RSS feed."""

    # NOTE -- I'm validating some of the entry subfields to 
    # differentiate from other entries.
    title: str = element()
    published: datetime = element()
    updated: datetime = element()


class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP, search_mode="ordered"):
    """Validate the RSS feed/XML from my blog."""

    entry1: Entry
    entry2: Entry


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)

Feed(
    entry1=Entry(
        title='isort + git: Cleaner Import Statements for Those Who Don’t Like 
pre-commit',
        published=datetime.datetime(2024, 12, 12, 0, 0, tzinfo=TzInfo(UTC)),
        updated=datetime.datetime(2024, 12, 12, 0, 0, tzinfo=TzInfo(UTC))
    ),
    entry2=Entry(
        title='PyCharm: Projects &amp; Environments',
        published=datetime.datetime(2024, 12, 3, 0, 0, tzinfo=TzInfo(UTC)),
        updated=datetime.datetime(2024, 12, 3, 0, 0, tzinfo=TzInfo(UTC))
    )
)

That’s kind of convenient, but what if I don’t know how many <entry> tags are in the XML? The pydantic-xml model has that covered with a feature called homogeneous collections.

from datetime import datetime
from typing import Final

import httpx
from httpx import Response
from pydantic_xml.model import BaseXmlModel, element
from rich.console import Console

NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP):
    """A blog post entry from the RSS feed."""

    # NOTE -- I'm validating some of the entry subfields to 
    # differentiate from other entries.
    title: str = element()
    published: datetime = element()
    updated: datetime = element()


class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP, search_mode="ordered"):
    """Validate the RSS feed/XML from my blog."""

-     entry1: Entry
-     entry2: Entry
+     entries: list[Entry]


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)

Feed(
    entries=[
        Entry(
            title='isort + git: Cleaner Import Statements for Those Who Don’t 
Like pre-commit',
            published=datetime.datetime(2024, 12, 12, 0, 0, 
tzinfo=TzInfo(UTC)),
            updated=datetime.datetime(2024, 12, 12, 0, 0, tzinfo=TzInfo(UTC))
        ),
        Entry(
            title='PyCharm: Projects &amp; Environments',
            published=datetime.datetime(2024, 12, 3, 0, 0, tzinfo=TzInfo(UTC)),
            updated=datetime.datetime(2024, 12, 3, 0, 0, tzinfo=TzInfo(UTC))
        )
    ]
)

This is probably my favorite feature in pydantic-xml.
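For comparison, here’s a rough standard-library sketch of the same “collect every <entry>” idea, using made-up XML and a hypothetical read_entries helper. It isn’t how pydantic-xml does it internally. It also picks the newest post by its published date rather than relying on feed order:

```python
from datetime import datetime
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}

# Made-up feed with the entries deliberately out of date order.
XML = b"""<feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
        <title>Older post</title>
        <published>2024-12-03T00:00:00+00:00</published>
    </entry>
    <entry>
        <title>Newer post</title>
        <published>2024-12-12T00:00:00+00:00</published>
    </entry>
</feed>"""


def read_entries(xml: bytes) -> list[tuple[str, datetime]]:
    """Return (title, published) for every <entry> under the root."""
    root = ET.fromstring(xml)
    entries = []
    for entry in root.findall("atom:entry", NS):
        title = entry.findtext("atom:title", namespaces=NS)
        published = entry.findtext("atom:published", namespaces=NS)
        entries.append((title, datetime.fromisoformat(published)))
    return entries


entries = read_entries(XML)
latest = max(entries, key=lambda item: item[1])
print(len(entries))  # 2
print(latest[0])  # Newer post
```

The list[Entry] field gets the same result with one line, plus validation of each entry.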

Bonus Features

As I wrap up my blog’s RSS feed class, I’d like to highlight a few features of pydantic-xml. First, you don’t have to create a new model class for every XML subfield. For example, here is how I’d validate the <author> tag with a model:

from datetime import datetime
from typing import Final

import httpx
from httpx import Response
from pydantic_xml.model import BaseXmlModel, element
from rich.console import Console

NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


class Author(BaseXmlModel, tag="author", nsmap=NSMAP):
    """A blog post author from the RSS feed."""

    name: str = element(tag="name")


class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP, search_mode="ordered"):
    """A blog post entry from the RSS feed."""

    title: str = element()
    published: datetime = element()
    updated: datetime = element()
    author: Author


class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP, search_mode="ordered"):
    """Validate the RSS feed/XML from my blog."""

    entries: list[Entry] = element()


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)

And here is how I’d do it without an Author model:

from datetime import datetime
from typing import Final

import httpx
from httpx import Response
- from pydantic_xml.model import BaseXmlModel, element
+ from pydantic_xml.model import BaseXmlModel, element, wrapped
from rich.console import Console

NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}
- 
- 
- class Author(BaseXmlModel, tag="author", nsmap=NSMAP):
-     """A blog post author from the RSS feed."""
- 
-     name: str = element(tag="name")


class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP, search_mode="ordered"):
    """A blog post entry from the RSS feed."""

    title: str = element()
    published: datetime = element()
    updated: datetime = element()
-     author: Author
+     author: str = wrapped(path="author", entity=element(tag="name"))


class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP, search_mode="ordered"):
    """Validate the RSS feed/XML from my blog."""

    entries: list[Entry] = element()


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)

This second approach uses the wrapped function. When a field doesn’t have a lot of important attributes that I’d like to validate, e.g. the <author> tag, I’d use wrapped instead of defining a new model class.

[!NOTE]

This changes the type of Entry.author from Author to str:

Feed(
    entries=[
        Entry(
            ...,
-             author=Author(name='Ian Thompson')
+             author='Ian Thompson'
        )
    ]
)

A more direct/less verbose way of using wrapped would be to only supply an argument to the path parameter:

from datetime import datetime
from typing import Final

import httpx
from httpx import Response
from pydantic_xml.model import BaseXmlModel, element, wrapped
from rich.console import Console

NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP, search_mode="ordered"):
    """A blog post entry from the RSS feed."""

    title: str = element()
    published: datetime = element()
    updated: datetime = element()
-     author: str = wrapped(path="author", entity=element(tag="name"))
+     author: str = wrapped(path="author/name")


class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP, search_mode="ordered"):
    """Validate the RSS feed/XML from my blog."""

    entries: list[Entry] = element()


if __name__ == "__main__":
    BLOG_URL = "https://it176131.github.io/feed.xml"
    resp: Response = httpx.get(url=BLOG_URL)
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model)

I see the entity parameter as more useful when accessing tag attributes, rather than tags themselves.
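If the path syntax looks familiar, it’s because it reads much like the element paths in xml.etree.ElementTree. As a loose analogy (a sketch of my own with a trimmed entry, not pydantic-xml’s implementation), a nested tag can be reached with a path, while an attribute needs one extra step:

```python
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed stand-in for a single <entry> from the feed.
ENTRY_XML = b"""<entry xmlns="http://www.w3.org/2005/Atom">
    <link href="/2024/12/12/isort.html" rel="alternate" type="text/html"/>
    <author>
        <name>Ian Thompson</name>
    </author>
</entry>"""

entry = ET.fromstring(ENTRY_XML)

# Nested tag text: a path is enough, like wrapped(path="author/name").
author = entry.findtext("atom:author/atom:name", namespaces=NS)
print(author)  # Ian Thompson

# Attribute value: find the element, then ask for the attribute, which is
# the role an attribute entity plays inside wrapped.
href = entry.find("atom:link", NS).get("href")
print(href)  # /2024/12/12/isort.html
```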

Closing

And with that I leave you with my final Feed model. I hope this article shed some light on pydantic-xml and encouraged you to try it out for yourself. Happy validating!

from datetime import datetime
from typing import Final

import httpx
from httpx import Response
from pydantic.networks import HttpUrl
from pydantic_xml.model import (
    attr, BaseXmlModel, computed_element, element, wrapped
)
from rich.console import Console

BLOG_URL = "https://it176131.github.io"
NSMAP: Final[dict[str, str]] = {"": "http://www.w3.org/2005/Atom"}


class Entry(BaseXmlModel, tag="entry", nsmap=NSMAP, search_mode="ordered"):
    """A blog post entry from the RSS feed."""

    title: str = element()
    relative_url: str = wrapped(path="link", entity=attr(name="href"))
    published: datetime = element()
    updated: datetime = element()
    author: str = wrapped(path="author/name")

    @computed_element
    def link(self: "Entry") -> HttpUrl:
        """Resolve <entry.link[href]> to full URL."""
        return HttpUrl(url=f"{BLOG_URL}{self.relative_url}")


class Feed(BaseXmlModel, tag="feed", nsmap=NSMAP, search_mode="ordered"):
    """Validate the RSS feed/XML from my blog."""

    # We limit to the first <entry> from the RSS feed as it is the most
    # recently published.
    entry: Entry


if __name__ == "__main__":
    resp: Response = httpx.get(url=f"{BLOG_URL}/feed.xml")
    xml: bytes = resp.content
    console = Console()
    model = Feed.from_xml(source=xml)
    console.print(model.model_dump_json(indent=2))

{
  "entry": {
    "title": "isort + git: Cleaner Import Statements for Those Who Don’t Like pre-commit",
    "relative_url": "/2024/12/12/isort.html",
    "published": "2024-12-12T00:00:00Z",
    "updated": "2024-12-12T00:00:00Z",
    "author": "Ian Thompson",
    "link": "https://it176131.github.io/2024/12/12/isort.html"
  }
}
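From here, turning the validated entry into something I can drop into my profile is mostly string handling. Below is one possible sketch with a hypothetical to_markdown_link helper; the Markdown output format is my assumption, not something the feed dictates. It uses urljoin instead of string concatenation and decodes the HTML entities that show up in titles marked type="html":

```python
from html import unescape
from urllib.parse import urljoin

BLOG_URL = "https://it176131.github.io"


def to_markdown_link(title: str, relative_url: str) -> str:
    """Build a Markdown link from an entry's title and relative URL."""
    # Titles arrive HTML-escaped, so decode entities such as &amp; first.
    text = unescape(title)
    # urljoin resolves the root-relative path against the blog's base URL.
    return f"[{text}]({urljoin(BLOG_URL, relative_url)})"


link = to_markdown_link(
    title="PyCharm: Projects &amp; Environments",
    relative_url="/2024/12/03/pycharm-projects-envs.html",
)
print(link)
# [PyCharm: Projects & Environments](https://it176131.github.io/2024/12/03/pycharm-projects-envs.html)
```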