spaCy: Listening Components
I've been working with spacy more and more over the years, and I thought it'd be a good idea to write about pieces of the configuration system. There are mentions of it throughout the docs and in some of the spacy 3.0 videos, but I have yet to find a super detailed breakdown of what's going on—the closest being this blog. This post will hopefully shed some light on the components that share or listen to previous components in the pipeline.
Let's start with a brief demo of spacy.
Install spacy and the en_core_web_sm model if you want to follow along:

$ pip install spacy
$ python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Hi, my name is Ian and this is my blog.")
print(doc)
Hi, my name is Ian and this is my blog.
Nothing fancy on the surface, but this doc object that we've created is the product of sending our string of characters through a pipeline of models, or as spacy likes to call them, components. We can view the pipeline components via the nlp.pipeline property.
nlp.pipeline
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x25ec707bf50>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x25ec7224290>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x25ec6f81540>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x25ec70c8b90>),
('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x25ec6f05050>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x25ec6f81230>)]
And we can get more component information with nlp.analyze_pipes, such as what each component assigns, what it requires, its scoring metrics, whether it retokenizes, and in what order the components perform their annotations.
# note the semicolon (;) to reduce output after the table.
nlp.analyze_pipes(pretty=True);
============================= Pipeline Overview =============================

#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False
1   tagger            token.tag                        tag_acc            False
2   parser            token.dep                        dep_uas            False
                      token.head                       dep_las
                      token.is_sent_start              dep_las_per_type
                      doc.sents                        sents_p
                                                       sents_r
                                                       sents_f
3   attribute_ruler                                                       False
4   lemmatizer        token.lemma                      lemma_acc          False
5   ner               doc.ents                         ents_f             False
                      token.ent_iob                    ents_p
                      token.ent_type                   ents_r
                                                       ents_per_type

✔ No problems found.
Notice the first component, tok2vec. This component is responsible for mapping tokens to vectors, i.e., creating an embedding layer, and making them available for later components to use via the doc.tensor attribute.
Note, this is not the same as a tokenizer.
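As a quick illustration (the exact shape depends on the model version and the input text), the shared embeddings are sitting on the doc we created earlier:

# One row per token, with the tok2vec layer's width as the second dimension.
# For en_core_web_sm and our sentence, expect something like (12, 96).
print(doc.tensor.shape)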
In the en_core_web_sm pipeline, we can see that the tagger and parser components both use the tok2vec component's output by viewing tok2vec.listening_components.
tok2vec = nlp.get_pipe("tok2vec")
tok2vec.listening_components
['tagger', 'parser']
On the flip side, we can see which components use a tok2vec model by checking their configurations via nlp.get_pipe_config.
[
name for name in nlp.pipe_names
if (model := nlp.get_pipe_config(name).get("model")) is not None
and model.get("tok2vec") is not None
]
['tagger', 'parser', 'ner']
The tagger and parser are both present as expected, but so is the ner component, which has its own tok2vec layer, separate from the tok2vec component at the beginning of nlp.pipeline.
ner_tok2vec = nlp.get_pipe_config("ner")["model"]["tok2vec"]
ner_tok2vec
{'@architectures': 'spacy.Tok2Vec.v2',
'embed': {'@architectures': 'spacy.MultiHashEmbed.v2',
'width': 96,
'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE'],
'rows': [5000, 1000, 2500, 2500],
'include_static_vectors': False},
'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2',
'width': 96,
'depth': 4,
'window_size': 1,
'maxout_pieces': 3}}
This is an example of an independent component—it can stand alone without a tok2vec component being present in the pipeline.
The tagger and parser components both listen to or share the tok2vec component's output in the nlp.pipeline.
tagger_tok2vec = nlp.get_pipe_config("tagger")["model"]["tok2vec"]
tagger_tok2vec
{'@architectures': 'spacy.Tok2VecListener.v1',
'width': '${components.tok2vec.model.encode:width}',
'upstream': 'tok2vec'}
parser_tok2vec = nlp.get_pipe_config("parser")["model"]["tok2vec"]
parser_tok2vec
{'@architectures': 'spacy.Tok2VecListener.v1',
'width': '${components.tok2vec.model.encode:width}',
'upstream': 'tok2vec'}
Listening to/sharing an upstream component has some pros and cons, including speed and flexibility (see my Stack Overflow answer for an experiment). Sometimes sharing a component can help boost later components' metrics, and other times it's easier to have something more independent.
Most trainable components require a tok2vec layer, so when it comes to adding components to a pipeline, we have options.
- We could add a component with its own tok2vec, similar to the ner component.
- We could add a component and have it listen to the existing tok2vec layer.
- We could add a component and have it listen to a new tok2vec component, separate from the existing one (uncommon).
Here is an example of the first option: we'll add a senter component (not to be confused with the disabled senter component that comes pretrained) and view its tok2vec setup.
Note, you could do this with a custom component as well, assuming it's registered and available in the environment.
# Enabling, then removing the existing ``senter`` component (disabled by default).
nlp.enable_pipe("senter")
nlp.remove_pipe("senter")
# Adding new ``senter`` model.
nlp.add_pipe("senter", after="parser")
# View ``senter`` tok2vec config.
senter_tok2vec = nlp.get_pipe_config("senter")["model"]["tok2vec"]
senter_tok2vec
{'@architectures': 'spacy.HashEmbedCNN.v2',
'pretrained_vectors': None,
'width': 12,
'depth': 1,
'embed_size': 2000,
'window_size': 1,
'maxout_pieces': 2,
'subword_features': True}
Pretty easy to do, as the senter component factory comes with its own tok2vec layer. If we wanted something more like the second option, we'd need to include a config telling spacy that we want the senter to listen to the existing tok2vec component.
from confection import Config # For interpolating the ``nlp.config``.
# Extracting width from ``tagger``'s interpolated config because it listens to ``tok2vec``.
inter_config = Config(nlp.config).interpolate()
width = inter_config["components"]["tagger"]["model"]["tok2vec"]["width"]
senter_config = {
"model": {
"tok2vec": {
"@architectures": "spacy.Tok2VecListener.v1",
"width": width,
"upstream": "tok2vec",
}
},
}
# Before adding ``senter`` with listener.
tok2vec.listening_components
['tagger', 'parser']
# Removing existing ``senter`` model without listener.
nlp.remove_pipe("senter")
# Adding new ``senter`` model with listener.
nlp.add_pipe("senter", after="parser", config=senter_config)
# After adding ``senter`` with listener.
tok2vec.listening_components
['tagger', 'parser', 'senter']
What's with the nested dictionaries? Why not use a method or more object-oriented approach? These are questions I asked myself when the config system was first introduced. Since then I've grown used to it not because it's easier, but because it's more maintainable (especially when you use it the way it was designed to be used, i.e., not in a notebook).
Because I consider the third option uncommon, I'm not going to walk through it in full. But if you wanted to try it for yourself, you'd follow these steps (roughly sketched after the list):
- Add a secondary tok2vec layer with a different name (something like tok2vec.secondary).
- Add a component via the nlp.add_pipe method and modify its config to point at tok2vec.secondary instead of tok2vec in the upstream field.
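A rough sketch of those two steps might look something like this (untested here; the component names tok2vec_secondary and senter_secondary are just illustrative placeholders):

# 1. Add a second ``tok2vec`` component under its own name.
nlp.add_pipe("tok2vec", name="tok2vec_secondary", after="tok2vec")

# 2. Add a component whose listener points at the new layer via ``upstream``.
secondary_senter_config = {
    "model": {
        "tok2vec": {
            "@architectures": "spacy.Tok2VecListener.v1",
            # ``width`` must match the output width of the upstream layer
            # (96 for the default ``tok2vec`` factory).
            "width": width,
            "upstream": "tok2vec_secondary",
        }
    },
}
nlp.add_pipe("senter", name="senter_secondary", config=secondary_senter_config)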
If you want to look into what's going on under the hood, I've tracked down the nlp.add_pipe source code as well as additional documentation specific to the "listener" components. Please have a gander and drop a comment if you'd like to discuss further.
We've walked through adding independent and listener components; how do we take an existing listener component and make it independent? Rolling with the notebook approach first, we would use the nlp.replace_listeners method.
# Before making the listening ``senter`` component independent.
nlp.get_pipe_config("senter")["model"]["tok2vec"]
{'@architectures': 'spacy.Tok2VecListener.v1',
'width': 96,
'upstream': 'tok2vec'}
nlp.replace_listeners(
tok2vec_name="tok2vec",
pipe_name="senter",
# Each ``listener`` is referenced with a TOML-like dotted path.
listeners=["model.tok2vec"]
)
# After making the listening ``senter`` component independent.
nlp.get_pipe_config("senter")["model"]["tok2vec"]
{'@architectures': 'spacy.Tok2Vec.v2',
'embed': {'@architectures': 'spacy.MultiHashEmbed.v2',
'width': '${components.tok2vec.model.encode:width}',
'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE', 'SPACY', 'IS_SPACE'],
'rows': [5000, 1000, 2500, 2500, 50, 50],
'include_static_vectors': False},
'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2',
'width': 96,
'depth': 4,
'window_size': 1,
'maxout_pieces': 3}}
Almost as easy as adding an independent component to the pipeline!
Now normally you'd only make a component independent if you were going to freeze it, for example if you wanted en_core_web_sm's tagger component to annotate text in a new pipeline without changing its underlying weights. Because of all the settings that need to be handled, I recommend doing this via the config.
from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
# Create a new pipeline with an independent ``tagger`` from the ``en_core_web_sm`` model,
# and a new ``parser`` that will listen to the pipeline's ``tok2vec`` layer.
new_config = {
"nlp": {
"pipeline": ["tok2vec", "tagger", "parser"]
},
"components": {
"tok2vec": {
"factory": "tok2vec",
"model": DEFAULT_TOK2VEC_MODEL,
},
"tagger": {
"source": "en_core_web_sm",
"replace_listeners": ["model.tok2vec"],
},
"parser": {
"factory": "parser",
"model": {
"tok2vec": {
"@architectures": "spacy.Tok2VecListener.v1",
"width": "${components.tok2vec.model:width}",
"upstream": "tok2vec",
}
}
}
},
"training": {
"frozen_components": ["tagger"],
"annotating_components": ["tagger"]
},
}
new_nlp = spacy.blank("en", config=new_config)
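As a quick, purely illustrative sanity check, you can confirm that the sourced tagger is no longer a listener in the new pipeline (the exact architecture string will depend on your installed en_core_web_sm version):

print(new_nlp.pipe_names)
# After ``replace_listeners``, the sourced ``tagger`` carries its own copy of
# the tok2vec model rather than a ``spacy.Tok2VecListener.v1`` reference.
print(new_nlp.get_pipe_config("tagger")["model"]["tok2vec"]["@architectures"])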
And there you have it. A more detailed explanation of how to add an independent component, add a listening component, and make an existing listening component independent. Please leave a comment if you have any questions or would like me to drill deeper into another part of the spacy config system.