spacy-extensions

Recently I was working with spaCy and wanted to break a Doc object up into its paragraphs. I thought this to be very similar to the existing SentenceRecognizer and Sentencizer implementations and figured someone must have already done this. After quite a bit of searching, I didn't find any promising results on the modeling side, but did come across this gist:

Simple. Straightforward. The only thing I'd like more is if I could reference the paragraphs of a Doc via an attribute or property. Something akin to Doc.sents. Lucky for me, the spaCy devs thought of this and made it easy to do.

Extensions¶

Tony Talks Iamtonytalks GIFfrom Tony Talks GIFs

Per the spaCy docs:

spaCy allows you to set any custom attributes and methods on the Doc, Span and Token, which become available as Doc._, Span._ and Token._—for example, Token._.my_attr. This lets you store additional information relevant to your application, add new features and functionality to spaCy, and implement your own models trained with other machine learning libraries. It also lets you take advantage of spaCy’s data structures and the Doc object as the “single source of truth”.

There are three main types of extensions, which can be defined using the Doc.set_extension, Span.set_extension and Token.set_extension methods.

I'm interested in extracting paragraphs from a Doc, so I'll use the Doc.set_extension method. To have the extension use the paragraphs function from the gist, we need to supply it as an argument to the getter parameter. This is known as a property extension. From the docs:

Property extensions. Define a getter and an optional setter function. If no setter is provided, the extension is immutable. Since the getter and setter functions are only called when you retrieve the attribute, you can also access values of previously added attribute extensions. For example, a Doc getter can average over Token attributes. For Span extensions, you’ll almost always want to use a property—otherwise, you’d have to write to every possible Span in the Doc to set up the values correctly.

In [1]:

from typing import Generator

import spacy
from spacy.tokens.doc import Doc
from spacy.tokens.span import Span


# I changed the parameter name `document` to `doc`
# added type hints, and added some whitespace.
def paragraphs(doc: Doc) -> Generator[Span, None, None]:
    start = 0
    for token in doc:
        if token.is_space and token.text.count("\n") > 1:
            yield doc[start:token.i]
            start = token.i

    yield doc[start:]


# We set the `paras` extension globally.
# This means _all_ `Doc` objects will have 
# a `_.paras` attribute.
Doc.set_extension(name="paras", getter=paragraphs)
blank = spacy.blank("en")

# Some example text with two paragraphs.
text = """This is a sentence. This is a second sentence. Here is a third.

This is the start of a new paragraph. This is the end of the paragraph."""
doc = blank(text=text)

# Iterate and print each paragraph in `doc`,
# extracted using the logic defined in the 
# `paragraph` function.
paras = doc._.paras
print(*enumerate(paras), sep="\n")

(0, This is a sentence. This is a second sentence. Here is a third.)
(1, 

This is the start of a new paragraph. This is the end of the paragraph.)

It's not as beautiful as I want—I'd like to strip the newlines from each paragraph—but it gets the job done.

And I'd be remiss if I didn't show how to remove the _.paras attribute (though you shouldn't have to because it's a generator and not adding much in terms of memory).

In [2]:

# Note the semicolon (;) to suppress the output.
Doc.remove_extension("paras");

Hopefully this has shed some light on the set_extension method(s). Thanks for reading!