Training a language model in spaCy v3
| Nico Lutz

1. Intro
On Feb 1 2021, explosion.ai introduced spaCy v3, a huge upgrade over the previous version, featuring new transformer-based pipelines and workflows. Naturally, some projects needed to be migrated to spaCy v3. This article shows, in tutorial-like steps, what needs to be done to create a new language model from scratch.
This article assumes basic knowledge of Python, spaCy, and standard NLP techniques.
2. Setup
Create a virtual env:
python -m venv .venv
source .venv/bin/activate
and install spaCy:
pip install spacy
3. Language Subclass
If you want to retrain an already existing language class, you can skip the following steps; otherwise keep on reading. For educational purposes we choose to train a new English model. To start, create a custom language subclass according to your needs.
Create a folder named custom_en and add an __init__.py defining the Defaults and Language subclass. In addition we need:
- Punctuation: punctuation characters (.?!)
- Stop words: words that appear frequently in that language (a, the, about, ...)
- Syntax iterators: functions that compute views of a Doc object based on its syntax, e.g. noun chunks
- Tokenizer exceptions: special-case rules for the tokenizer, for example contractions like “can’t” and abbreviations with punctuation, like “U.K.”
- ...
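Two of these modules are sketched below as minimal, illustrative subsets (real lists are much longer and their exact contents are up to you):

# stop_words.py -- a tiny illustrative subset; a real list has a few hundred entries
STOP_WORDS = set(
    """
a about above after again all am an and any are as at be the
""".split()
)

# tokenizer_exceptions.py -- special-case rules merged onto spaCy's base exceptions
from spacy.symbols import ORTH, NORM
from spacy.util import update_exc
from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS

_exc = {
    "can't": [{ORTH: "ca", NORM: "can"}, {ORTH: "n't", NORM: "not"}],
}
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)

With those modules in place, the __init__.py ties everything together: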
import spacy
from spacy.language import Language

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS


class CustomEnglishDefaults(Language.Defaults):
    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
    prefixes = TOKENIZER_PREFIXES
    infixes = TOKENIZER_INFIXES
    suffixes = TOKENIZER_SUFFIXES
    syntax_iterators = SYNTAX_ITERATORS
    stop_words = STOP_WORDS


@spacy.registry.languages("custom_en")
class CustomEnglish(Language):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults
Here we are using spaCy's new registry to create the custom Language subclass. Upon program start, the custom_en language class will be available.
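As a quick sanity check (a minimal sketch; it assumes the custom_en package above is importable and that its exceptions cover the usual English contractions):

# check_lang.py -- verify that the custom language is registered
import spacy
import custom_en  # noqa: F401 -- importing runs the @spacy.registry.languages decorator

nlp = spacy.blank("custom_en")               # blank pipeline for our custom language
doc = nlp("We can't visit the U.K. today.")
print([t.text for t in doc])                 # "U.K." stays intact, "can't" splits in two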
4. spaCy's training config and projects
Training machine learning models can result in a mess of bash scripts and files to store all the necessary hyper-parameters. spaCy introduces two new tools to reduce complexity during training. The first is the new training config system. Roughly speaking, one defines a config.cfg where each section describes a component and all its corresponding hyper-parameters. These sections can also refer to registered functions that load model architectures, optimizers, augmenters, etc. This makes it easy to integrate everything and have all the information in one place.
The second tool is spaCy projects, which lets you manage and share end-to-end spaCy workflows, such as asset and corpus management, training, and packaging. Basically, we define all the commands we need during our machine learning process in a project.yml file and reduce complexity further by calling one command instead of several.
An example config.cfg could look something like this:
[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"
vectors = null
vocab_data = null
init_tok2vec = null
[system]
seed = 0
gpu_allocator = null
[nlp]
lang = "custon_en"
pipeline = ["tok2vec","tagger","morphologizer","parser","ner", "senter"]
batch_size = 256
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE", "LEMMA"]
rows = [5000,2500,2500,2500,2500]
include_static_vectors = false
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3
...
As a starting point, spaCy provides a widget for creating a default.cfg. Their extensive documentation on architectures and hyper-parameters also helps a lot when defining each component.
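Alternatively, a base config can be generated from the command line; the pipeline below is just an example:

# generate a starter config without the web widget
python -m spacy init config configs/default.cfg --lang en --pipeline tagger,parser,ner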
In the beginning, the project.yml can be created from a provided template, and we then add our custom commands as the project goes along:
python -m spacy project clone pipelines/tagger_parser_ud
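After cloning, the declared assets can be fetched and individual commands run by name:

python -m spacy project assets               # download the assets listed in project.yml
python -m spacy project run preprocess       # run a single named command (defined below)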
5. Training data
In reality, training data is messy and incomplete; every machine learning engineer knows that here lies the heart of each trained model. This article tries to give a glimpse into spaCy's training process, so we cheat and take a fully labeled dataset from the Universal Dependencies treebanks.
We can use spaCy's convert command to turn these .conllu files into spaCy's binary training format. In the project.yml:
vars:
  config: 'default'
  lang: 'custom_en'
  treebank: 'UD_English-EWT'
  train_name: 'en_ewt-ud-train'

commands:
  - name: preprocess
    help: "Convert the data to spaCy's format or unzip"
    script:
      - 'mkdir -p corpus/${vars.treebank}'
      - 'python -m spacy convert assets/${vars.treebank}/${vars.train_name}.conllu corpus/${vars.treebank}/ --converter conllu --n-sents 10 --merge-subtokens'
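The convert command writes .spacy files, which are serialized DocBin collections. A minimal sketch to peek inside one (the path is assumed from the config above):

# inspect_corpus.py -- peek at a converted corpus file
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")                      # a vocab is needed to deserialize the docs
doc_bin = DocBin().from_disk("corpus/train.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
print(len(docs), "documents")
print(docs[0][:10])                          # the first tokens of the first document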
6. Training
Using spaCy projects, we can run:
python -m spacy project run train
and immediately start seeing losses and accuracies per epoch:
Running command: /Users/nico/Dev/custom_en/.venv/bin/python -m spacy train configs/default.cfg --output training/custom_en --gpu-id -1 --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --code ./custom_en/loader.py
ℹ Saving to output directory: training/custom_en
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2021-09-08 09:24:54,019] [INFO] Set up nlp object from config
[2021-09-08 09:24:54,037] [INFO] Pipeline: ["tok2vec","tagger","morphologizer","parser","ner", "senter"]
[2021-09-08 09:24:55,612] [INFO] Created vocabulary
[2021-09-08 09:25:26,151] [INFO] Finished initializing nlp object
[2021-09-08 09:26:06,812] [INFO] Initialized pipeline components: ["tok2vec","tagger","morphologizer","parser","ner", "senter"]
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ["tok2vec","tagger","morphologizer","parser","ner", "senter"]
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS TAGGER LOSS MORPH... LOSS PARSER LOSS NER LOSS SENTER TAG_ACC POS_ACC MORPH_ACC DEP_UAS DEP_LAS SENTS_F ENTS_F ENTS_P ENTS_R LEMMA_ACC SCORE
--- ------ ------------ ----------- ------------- ----------- -------- ----------- ------- ------- --------- ------- ------- ------- ------ ------ ------ --------- ------
0 0 0.00 175.06 184.99 361.41 97.53 93.00 30.69 28.60 25.11 18.29 7.28 0.30 0.00 0.00 0.00 59.07 0.25
0 1000 27032.84 31342.60 54016.47 104942.62 10414.39 3219.51 93.91 93.
...
Notice that we use the --code flag to read in our custom language class and make it available during the training process. After training, package the model and start using it.
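Packaging is a one-liner as well (a sketch: the paths follow the training output above, and --code bundles our custom language module into the package):

# build an installable pip package from the best model of the training run
python -m spacy package training/custom_en/model-best packages/ --name custom_en --version 0.0.1 --code ./custom_en/loader.py

The resulting package in packages/ can then be installed with pip and loaded like any other spaCy model.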
This is just a glimpse of the many capabilities of spaCy; we didn't even get to loading pre-trained vectors or developing custom components. Nonetheless, this article has turned into a love letter to spaCy. The library is so well written and documented that it is a blast to use during development. Every aspect of the machine learning workflow is thought through and tightly integrated into its architecture.