SpaCy: Extensions
Recently I was working with spaCy and wanted to break a Doc object up into its paragraphs.
I thought this to be very similar to the existing SentenceRecognizer and Sentencizer implementations and figured someone must have already done this.
After quite a bit of searching, I didn't find any promising results on the modeling side, but did come across this gist:
Extensions¶
Per the spaCy docs:
spaCy allows you to set any custom attributes and methods on the
Doc,SpanandToken, which become available asDoc._,Span._andToken._—for example,Token._.my_attr. This lets you store additional information relevant to your application, add new features and functionality to spaCy, and implement your own models trained with other machine learning libraries. It also lets you take advantage of spaCy’s data structures and theDocobject as the “single source of truth”.There are three main types of extensions, which can be defined using the
Doc.set_extension,Span.set_extensionandToken.set_extensionmethods.
I'm interested in extracting paragraphs from a Doc, so I'll use the Doc.set_extension method.
To have the extension use the paragraphs function from the gist, we need to supply it as an argument to the getter parameter.
This is known as a property extension.
From the docs:
Property extensions. Define a getter and an optional setter function. If no setter is provided, the extension is immutable. Since the getter and setter functions are only called when you retrieve the attribute, you can also access values of previously added attribute extensions. For example, a
Docgetter can average overTokenattributes. ForSpanextensions, you’ll almost always want to use a property—otherwise, you’d have to write to every possibleSpanin theDocto set up the values correctly.
from typing import Generator
import spacy
from spacy.tokens.doc import Doc
from spacy.tokens.span import Span
# I changed the parameter name `document` to `doc`
# added type hints, and added some whitespace.
def paragraphs(doc: Doc) -> Generator[Span, None, None]:
    start = 0
    for token in doc:
        if token.is_space and token.text.count("\n") > 1:
            yield doc[start:token.i]
            start = token.i
    yield doc[start:]
# We set the `paras` extension globally.
# This means _all_ `Doc` objects will have 
# a `_.paras` attribute.
Doc.set_extension(name="paras", getter=paragraphs)
blank = spacy.blank("en")
# Some example text with two paragraphs.
text = """This is a sentence. This is a second sentence. Here is a third.
This is the start of a new paragraph. This is the end of the paragraph."""
doc = blank(text=text)
# Iterate and print each paragraph in `doc`,
# extracted using the logic defined in the 
# `paragraph` function.
paras = doc._.paras
print(*enumerate(paras), sep="\n")
(0, This is a sentence. This is a second sentence. Here is a third.) (1, This is the start of a new paragraph. This is the end of the paragraph.)
It's not as beautiful as I want—I'd like to strip the newlines from each paragraph—but it gets the job done.
And I'd be remiss if I didn't show how to remove the _.paras attribute (though you shouldn't have to because it's a generator and not adding much in terms of memory).
# Note the semicolon (;) to suppress the output.
Doc.remove_extension("paras");
Hopefully this has shed some light on the set_extension method(s).
Thanks for reading!