spaCy: Listening Components
I've been working with `spacy` more and more over the years, and I thought it'd be a good idea to write about pieces of the configuration system. There are mentions of it throughout the docs and in some of the `spacy` 3.0 videos, but I have yet to find a super detailed breakdown of what's going on (the closest being this blog). This post will hopefully shed some light on the components that share or listen to previous components in the pipeline.
Let's start with a brief demo of `spacy`.
Install `spacy` and the `en_core_web_sm` model if you want to follow along:
$ pip install spacy
$ python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hi, my name is Ian and this is my blog.")
print(doc)
Hi, my name is Ian and this is my blog.
Nothing fancy on the surface, but this `doc` object that we've created is the product of sending our string of characters through a pipeline of models, or as `spacy` likes to call them, components. We can view the pipeline components via the `nlp.pipeline` property.
nlp.pipeline
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x25ec707bf50>), ('tagger', <spacy.pipeline.tagger.Tagger at 0x25ec7224290>), ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x25ec6f81540>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler at 0x25ec70c8b90>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x25ec6f05050>), ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x25ec6f81230>)]
And we can get more component information with `nlp.analyze_pipes`, such as what each component assigns, what it requires, its scoring metrics, whether it retokenizes, and the order in which the components perform their annotations.
# note the semicolon (;) to reduce output after the table.
nlp.analyze_pipes(pretty=True);
============================= Pipeline Overview =============================

#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False
1   tagger            token.tag                        tag_acc            False
2   parser            token.dep                        dep_uas            False
                      token.head                       dep_las
                      token.is_sent_start              dep_las_per_type
                      doc.sents                        sents_p
                                                       sents_r
                                                       sents_f
3   attribute_ruler                                                       False
4   lemmatizer        token.lemma                      lemma_acc          False
5   ner               doc.ents                         ents_f             False
                      token.ent_iob                    ents_p
                      token.ent_type                   ents_r
                                                       ents_per_type

✔ No problems found.
Notice the first component, `tok2vec`. This component is responsible for mapping tokens to vectors, i.e., creating an embedding layer, and making them available for later components to use via the `doc.tensor` attribute.
Note, this is not the same as a `tokenizer`.
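Since `doc.tensor` carries that shared output, here's a quick peek at it on our example `doc`. The exact shape is my expectation for `en_core_web_sm` (one row per token, 96-wide vectors), so treat the numbers as approximate rather than output from the original post.
# ``doc.tensor`` has one row per token; the width comes from the ``tok2vec`` model.
doc.tensor.shape
# e.g. (12, 96) for the 12 tokens in our example sentence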
In the `en_core_web_sm` pipeline, we can see that the `tagger` and `parser` components both use the `tok2vec` component's output by viewing `tok2vec.listening_components`.
tok2vec = nlp.get_pipe("tok2vec")
tok2vec.listening_components
['tagger', 'parser']
On the flip side, we can see which components use a `tok2vec` model by checking their configurations via `nlp.get_pipe_config`.
[
    name for name in nlp.pipe_names
    if (model := nlp.get_pipe_config(name).get("model")) is not None
    and model.get("tok2vec") is not None
]
['tagger', 'parser', 'ner']
The `tagger` and `parser` are both present as expected, but so is the `ner` component, which has its own `tok2vec` layer, separate from the `tok2vec` component at the beginning of the `nlp.pipeline`.
ner_tok2vec = nlp.get_pipe_config("ner")["model"]["tok2vec"]
ner_tok2vec
{'@architectures': 'spacy.Tok2Vec.v2', 'embed': {'@architectures': 'spacy.MultiHashEmbed.v2', 'width': 96, 'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE'], 'rows': [5000, 1000, 2500, 2500], 'include_static_vectors': False}, 'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2', 'width': 96, 'depth': 4, 'window_size': 1, 'maxout_pieces': 3}}
This is an example of an independent component: it can stand alone without a `tok2vec` component being present in the pipeline.
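To make that concrete, here's a small, hedged check: if we temporarily disable the shared `tok2vec` along with everything that depends on it, the `ner` component should still produce entities, because it embeds tokens itself. The expected output is my guess for this model, not something captured from the original post.
# Run only the independent ``ner`` component; the shared ``tok2vec`` and its
# listeners are disabled for the duration of the ``with`` block.
with nlp.select_pipes(disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    print(nlp("Hi, my name is Ian and this is my blog.").ents)
# e.g. (Ian,)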
The `tagger` and `parser` components both listen to or share the `tok2vec` component's output in the `nlp.pipeline`.
tagger_tok2vec = nlp.get_pipe_config("tagger")["model"]["tok2vec"]
tagger_tok2vec
{'@architectures': 'spacy.Tok2VecListener.v1', 'width': '${components.tok2vec.model.encode:width}', 'upstream': 'tok2vec'}
parser_tok2vec = nlp.get_pipe_config("parser")["model"]["tok2vec"]
parser_tok2vec
{'@architectures': 'spacy.Tok2VecListener.v1', 'width': '${components.tok2vec.model.encode:width}', 'upstream': 'tok2vec'}
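We can also see the listening relationship on the live model objects, not just in the config. A quick sanity check, with the caveat that I'm assuming the built-in tagger architecture exposes its embedding layer under Thinc's "tok2vec" ref (which is what `get_ref` looks up), and the printed name is my expectation.
# The ``tagger``'s embedding layer should be a listener rather than a full
# ``Tok2Vec`` network.
tagger = nlp.get_pipe("tagger")
print(type(tagger.model.get_ref("tok2vec")).__name__)
# e.g. 'Tok2VecListener'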
Listening to/sharing an upstream component has some pros and cons, including speed and flexibility (see my Stack Overflow answer for an experiment). Sometimes sharing a component can help boost later components' metrics, and other times it's easier to have something more independent.
Most trainable components require a `tok2vec` layer, so when it comes to adding components to a pipeline, we have options:
- We could add a component with its own `tok2vec`, similar to the `ner` component.
- We could add a component and have it listen to the existing `tok2vec` layer.
- We could add both a component and a new `tok2vec` component for it to listen to, separate from the existing one (uncommon).
Here is an example of the first option: we'll add a `senter` component (not to be confused with the disabled `senter` component that comes pretrained) and view its `tok2vec` setup.
Note, you could do this with a custom component as well, assuming it's registered/in the environment.
# Enable, then remove the existing ``senter`` component (disabled by default).
nlp.enable_pipe("senter")
nlp.remove_pipe("senter")
# Adding new ``senter`` model.
nlp.add_pipe("senter", after="parser")
# View ``senter`` tok2vec config.
senter_tok2vec = nlp.get_pipe_config("senter")["model"]["tok2vec"]
senter_tok2vec
{'@architectures': 'spacy.HashEmbedCNN.v2', 'pretrained_vectors': None, 'width': 12, 'depth': 1, 'embed_size': 2000, 'window_size': 1, 'maxout_pieces': 2, 'subword_features': True}
Pretty easy to do, as the `senter` component factory comes with its own `tok2vec` layer. If we wanted something more like the second option, we'd need to include a config telling `spacy` that we want the `senter` to listen to the existing `tok2vec` component.
from confection import Config # For interpolating the ``nlp.config``.
# Extracting width from ``tagger``'s interpolated config because it listens to ``tok2vec``.
inter_config = Config(nlp.config).interpolate()
width = inter_config["components"]["tagger"]["model"]["tok2vec"]["width"]
senter_config = {
    "model": {
        "tok2vec": {
            "@architectures": "spacy.Tok2VecListener.v1",
            "width": width,
            "upstream": "tok2vec",
        }
    },
}
# Before adding ``senter`` with listener.
tok2vec.listening_components
['tagger', 'parser']
# Removing existing ``senter`` model without listener.
nlp.remove_pipe("senter")
# Adding new ``senter`` model with listener.
nlp.add_pipe("senter", after="parser", config=senter_config)
# After adding ``senter`` with listener.
tok2vec.listening_components
['tagger', 'parser', 'senter']
What's with the nested dictionaries? Why not use a method or a more object-oriented approach? These are questions I asked myself when the config system was first introduced. Since then I've grown used to it, not because it's easier, but because it's more maintainable (especially when you use it the way it was designed to be used, i.e., not in a notebook).
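If you do want to move from notebook tinkering toward that intended workflow, one option is to dump the assembled config to disk and manage it as a file from then on. The filename here is just a hypothetical choice.
# Write the pipeline's full config out so it can be versioned and reused
# (e.g. with ``spacy train``) instead of living only in the notebook.
nlp.config.to_disk("senter_config.cfg")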
Because I consider the third option uncommon, I'm not going to walk through it in the same detail. But if you wanted to try it for yourself, you'd follow these steps (there's a rough sketch after the list):
- Add a secondary `tok2vec` layer with a different name (something like `tok2vec.secondary`).
- Add a component via the `nlp.add_pipe` method and modify its config to point at `tok2vec.secondary` instead of `tok2vec` in the `upstream` field.
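Here's a rough, untested sketch of those steps. Everything about it is my own guess at the wiring: `tok2vec_small` and `senter_small` are hypothetical names (I'm avoiding the dotted name purely to keep things config-friendly), and the width of 96 assumes the default `tok2vec` factory settings.
# Work on a fresh pipeline so we don't disturb the one we've been modifying.
nlp2 = spacy.load("en_core_web_sm")
# Step 1: add a secondary ``tok2vec`` layer under its own name.
nlp2.add_pipe("tok2vec", name="tok2vec_small", first=True)
# Step 2: add a component whose listener points at the new layer via ``upstream``.
nlp2.add_pipe(
    "senter",
    name="senter_small",
    after="parser",
    config={
        "model": {
            "tok2vec": {
                "@architectures": "spacy.Tok2VecListener.v1",
                "width": 96,  # must match the secondary ``tok2vec``'s output width
                "upstream": "tok2vec_small",
            }
        }
    },
)
# Each ``tok2vec`` should now report its own listeners.
print(nlp2.get_pipe("tok2vec").listening_components)        # e.g. ['tagger', 'parser']
print(nlp2.get_pipe("tok2vec_small").listening_components)  # e.g. ['senter_small']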
If you want to look into what's going on under the hood, I've tracked down the `nlp.add_pipe` source code as well as additional documentation specific to the "listener" components. Please have a gander and drop a comment if you'd like to discuss further.
We've walked through adding independent and listener components; how do we take an existing listener component and make it independent? Rolling with the notebook approach first, we would use the `nlp.replace_listeners` method.
# Before making the listening ``senter`` component independent.
nlp.get_pipe_config("senter")["model"]["tok2vec"]
{'@architectures': 'spacy.Tok2VecListener.v1', 'width': 96, 'upstream': 'tok2vec'}
nlp.replace_listeners(
    tok2vec_name="tok2vec",
    pipe_name="senter",
    # Each listener is referenced by its dot-path within the component's config.
    listeners=["model.tok2vec"],
)
# After making the listening ``senter`` component independent.
nlp.get_pipe_config("senter")["model"]["tok2vec"]
{'@architectures': 'spacy.Tok2Vec.v2', 'embed': {'@architectures': 'spacy.MultiHashEmbed.v2', 'width': '${components.tok2vec.model.encode:width}', 'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE', 'SPACY', 'IS_SPACE'], 'rows': [5000, 1000, 2500, 2500, 50, 50], 'include_static_vectors': False}, 'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2', 'width': 96, 'depth': 4, 'window_size': 1, 'maxout_pieces': 3}}
Almost as easy as adding an independent component to the pipeline!
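As a quick check (my expectation, not output captured from the original post), the `senter` should no longer show up among the shared layer's listeners after the replacement:
# ``senter`` now has its own copy of the ``tok2vec`` model, so only the
# original listeners should remain.
tok2vec.listening_components
# e.g. ['tagger', 'parser']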
Now, normally you'd only make a component independent if you were going to freeze it. For example, say you wanted the `en_core_web_sm` pipeline's `tagger` component to annotate some text in a new pipeline, but didn't want to change its underlying weights. Because of all the settings that need to be handled, I recommend doing this via the config.
from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
# Create a new pipeline with an independent ``tagger`` from the ``en_core_web_sm`` model,
# and a new ``parser`` that will listen to the pipeline's ``tok2vec`` layer.
new_config = {
    "nlp": {
        "pipeline": ["tok2vec", "tagger", "parser"]
    },
    "components": {
        "tok2vec": {
            "factory": "tok2vec",
            "model": DEFAULT_TOK2VEC_MODEL,
        },
        "tagger": {
            "source": "en_core_web_sm",
            "replace_listeners": ["model.tok2vec"],
        },
        "parser": {
            "factory": "parser",
            "model": {
                "tok2vec": {
                    "@architectures": "spacy.Tok2VecListener.v1",
                    "width": "${components.tok2vec.model:width}",
                    "upstream": "tok2vec",
                }
            }
        }
    },
    "training": {
        "frozen_components": ["tagger"],
        "annotating_components": ["tagger"]
    },
}
new_nlp = spacy.blank("en", config=new_config)
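A couple of hedged sanity checks on the assembled pipeline; the outputs shown are what I'd expect, not captured results. The sourced `tagger` should now carry its own copy of the `en_core_web_sm` embedding layer, and only the `parser` should be listening to the new `tok2vec`.
# The ``tagger``'s listener was replaced at source time, so its config should
# contain a full ``Tok2Vec`` architecture rather than a listener.
print(new_nlp.get_pipe_config("tagger")["model"]["tok2vec"]["@architectures"])
# e.g. 'spacy.Tok2Vec.v2'
# Only the ``parser`` listens to the fresh ``tok2vec`` layer.
print(new_nlp.get_pipe("tok2vec").listening_components)
# e.g. ['parser']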
And there you have it: a more detailed explanation of how to add an independent component, add a listening component, and make an existing listening component independent. Please leave a comment if you have any questions or would like me to drill deeper into another part of the `spacy` config system.