spaCy: Extensions
Recently I was working with spaCy and wanted to break a `Doc` object up into its paragraphs. I thought this would be very similar to the existing `SentenceRecognizer` and `Sentencizer` implementations and figured someone must have already done this. After quite a bit of searching, I didn't find any promising results on the modeling side, but I did come across this gist:
Extensions
Per the spaCy docs:

spaCy allows you to set any custom attributes and methods on the `Doc`, `Span` and `Token`, which become available as `Doc._`, `Span._` and `Token._` — for example, `Token._.my_attr`. This lets you store additional information relevant to your application, add new features and functionality to spaCy, and implement your own models trained with other machine learning libraries. It also lets you take advantage of spaCy's data structures and the `Doc` object as the "single source of truth".

There are three main types of extensions, which can be defined using the `Doc.set_extension`, `Span.set_extension` and `Token.set_extension` methods.
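As a quick illustration of the underscore namespace the docs describe (a minimal sketch; the attribute name `my_attr` mirrors their example, and the blank pipeline is just for demonstration), a default-valued attribute extension on `Token` looks like this:

import spacy
from spacy.tokens import Token

# Register a custom attribute with a default value, then read and write it
# through the underscore (`._`) namespace.
Token.set_extension("my_attr", default=False)

nlp = spacy.blank("en")
doc = nlp("Hello world")
doc[0]._.my_attr = True
print(doc[0]._.my_attr, doc[1]._.my_attr)  # True False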
I'm interested in extracting paragraphs from a `Doc`, so I'll use the `Doc.set_extension` method. To have the extension use the `paragraphs` function from the gist, we need to supply it as the argument to the `getter` parameter. This is known as a property extension. From the docs:
Property extensions. Define a getter and an optional setter function. If no setter is provided, the extension is immutable. Since the getter and setter functions are only called when you retrieve the attribute, you can also access values of previously added attribute extensions. For example, a `Doc` getter can average over `Token` attributes. For `Span` extensions, you'll almost always want to use a property — otherwise, you'd have to write to every possible `Span` in the `Doc` to set up the values correctly.
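To make that "getter can average over `Token` attributes" idea concrete, here's a small sketch (the attribute name `avg_token_len` is my own, not from the gist or the docs):

import spacy
from spacy.tokens import Doc

# A property extension whose getter averages a per-token value
# (character length) lazily, each time the attribute is accessed.
Doc.set_extension(
    "avg_token_len",
    getter=lambda doc: sum(len(token) for token in doc) / max(len(doc), 1),
)

nlp = spacy.blank("en")
example = nlp("A short example sentence.")
print(example._.avg_token_len)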
from typing import Generator
import spacy
from spacy.tokens.doc import Doc
from spacy.tokens.span import Span
# I changed the parameter name `document` to `doc`,
# added type hints, and added some whitespace.
def paragraphs(doc: Doc) -> Generator[Span, None, None]:
    start = 0
    for token in doc:
        if token.is_space and token.text.count("\n") > 1:
            yield doc[start:token.i]
            start = token.i
    yield doc[start:]
# We set the `paras` extension globally.
# This means _all_ `Doc` objects will have
# a `_.paras` attribute.
Doc.set_extension(name="paras", getter=paragraphs)
blank = spacy.blank("en")
# Some example text with two paragraphs,
# separated by a blank line (two newlines).
text = """This is a sentence. This is a second sentence. Here is a third.

This is the start of a new paragraph. This is the end of the paragraph."""
doc = blank(text=text)
# Iterate and print each paragraph in `doc`,
# extracted using the logic defined in the
# `paragraphs` function.
paras = doc._.paras
print(*enumerate(paras), sep="\n")
(0, This is a sentence. This is a second sentence. Here is a third.)
(1, This is the start of a new paragraph. This is the end of the paragraph.)
It's not as beautiful as I want—I'd like to strip the newlines from each paragraph—but it gets the job done.
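If I wanted that cleaner output, one option is a variant of the gist's function (a sketch, with a name I made up) that skips the paragraph-break whitespace token so it doesn't lead the next `Span`:

# Hypothetical variant: skip the paragraph-break whitespace token so the
# next paragraph starts at the first real token after the blank line.
def paragraphs_stripped(doc: Doc) -> Generator[Span, None, None]:
    start = 0
    for token in doc:
        if token.is_space and token.text.count("\n") > 1:
            yield doc[start:token.i]
            start = token.i + 1  # skip the whitespace token itself
    yield doc[start:]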
And I'd be remiss if I didn't show how to remove the `_.paras` attribute (though you shouldn't have to, because the getter returns a generator and isn't adding much in terms of memory).
# Note the semicolon (;) to suppress the output.
Doc.remove_extension("paras");
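One more practical note: if the registration cell gets re-run, `Doc.set_extension` raises an error for a name that's already registered (unless you pass `force=True`). A small guard avoids that:

# Only register the extension if it isn't already registered; passing
# `force=True` to `set_extension` is the other way around the error.
if not Doc.has_extension("paras"):
    Doc.set_extension(name="paras", getter=paragraphs)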
Hopefully this has shed some light on the `set_extension` method(s).
Thanks for reading!