Training a language model in spaCy v3

On Feb 1, 2021, explosion.ai released spaCy v3, a major upgrade over the previous version that features new transformer-based pipelines and workflows. Naturally, some projects need to be migrated to spaCy v3. This article walks through, in tutorial-like steps, what it takes to create a new language model from scratch.

This article assumes basic knowledge of Python, spaCy and standard NLP techniques.

1. Setup

Create a virtual environment:

python -m venv .venv
source .venv/bin/activate

and install spaCy:

pip install spacy
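
Before moving on, it can help to confirm that the interpreter really runs inside the virtual environment and that a v3 release was installed. The following is a minimal sketch using only the standard library; the helper names in_virtualenv and installed_major_version are made up for this illustration.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


def installed_major_version(package: str):
    """Return the major version of an installed package, or None if missing."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return int(version.split(".")[0])


if __name__ == "__main__":
    print("virtual env active:", in_virtualenv())
    print("spaCy major version:", installed_major_version("spacy"))
```

If the second line prints None, the package was most likely installed into a different interpreter than the one in .venv.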

2. Language Subclass

If you want to retrain a model for a language class that already exists, you can skip the following steps; otherwise keep on reading. For educational purposes we choose to train a new English model. To start, create a custom language subclass according to your needs.

Create a folder named custom_en and add an __init__.py defining the Defaults and the Language subclass. In addition we need:

  • Punctuation: Punctuation chars (.?!)
  • Stopwords: Words that appear frequently in that language (a, the, about, …)
  • Syntax iterators: Functions that compute views of a Doc object based on its syntax, e.g. noun chunks
  • Tokenizer exceptions: Special-case rules for the tokenizer, for example, contractions like “can’t” and abbreviations with punctuation, like “U.K.”.

import spacy
from spacy.language import Language

from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
from .punctuation import TOKENIZER_SUFFIXES
from .stop_words import STOP_WORDS
from .syntax_iterators import SYNTAX_ITERATORS

class CustomEnglishDefaults(Language.Defaults):
    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
    prefixes = TOKENIZER_PREFIXES
    infixes = TOKENIZER_INFIXES
    suffixes = TOKENIZER_SUFFIXES
    syntax_iterators = SYNTAX_ITERATORS
    stop_words = STOP_WORDS

@spacy.registry.languages("custom_en")
class CustomEnglish(Language):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

Here we are using spaCy’s new registry to register the custom Language subclass. On program start, once this module has been imported, the custom_en language class is available.
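
To make the registry idea more concrete, here is a minimal sketch of the underlying pattern in plain Python. It is not spaCy's implementation: the Registry class and the languages object are hypothetical stand-ins for spacy.registry.languages, and the CustomEnglish class here is an empty placeholder rather than a real Language subclass.

```python
from typing import Callable, Dict


class Registry:
    """Maps string names to classes so they can be looked up by name."""

    def __init__(self) -> None:
        self._entries: Dict[str, type] = {}

    def register(self, name: str) -> Callable[[type], type]:
        # Used as a decorator: stores the class, then returns it unchanged.
        def decorator(cls: type) -> type:
            self._entries[name] = cls
            return cls

        return decorator

    def get(self, name: str) -> type:
        if name not in self._entries:
            raise KeyError(f"No entry registered under {name!r}")
        return self._entries[name]


languages = Registry()  # hypothetical stand-in for spacy.registry.languages


@languages.register("custom_en")
class CustomEnglish:  # placeholder, not a real spaCy Language subclass
    lang = "custom_en"


# A plain string, such as the lang value in a config, is enough to find the class.
language_class = languages.get("custom_en")
print(language_class.__name__, language_class.lang)
```

The decorator only runs when the module that contains it is imported. That is why the file defining the custom language has to be loaded before anything can look up "custom_en" by name.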

3. spaCy’s training config and projects

Training machine learning models can result in a mess of bash scripts and files that store all the necessary hyperparameters. spaCy v3 introduces two new tools to reduce complexity during training. The first is the new training config system. Roughly speaking, you define a config.cfg in which each section describes a component and its hyperparameters. These sections can also refer to registered functions to load model architectures, optimizers, augmenters, etc. This makes it easy to integrate everything and keep all the information in one place.

The second tool is spaCy projects, which lets you manage and share end-to-end spaCy workflows, such as asset and corpus management, training and packaging. Basically, we define all the commands we need during our machine learning process in a project.yaml file, and further reduce complexity by calling one command instead of several.

An example config.cfg could look something like this:

[paths]
train = "corpus/train.spacy"
dev = "corpus/dev.spacy"
vectors = null
vocab_data = null
init_tok2vec = null

[system]
seed = 0
gpu_allocator = null

[nlp]
lang = "custom_en"
pipeline = ["tok2vec","tagger","morphologizer","parser","ner", "senter"]
batch_size = 256
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE", "LEMMA"]
rows = [5000,2500,2500,2500,2500]
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

...
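
Two details of this format are worth spelling out: section names are dotted paths, and a value such as ${components.tok2vec.model.encode.width} refers to a value defined in another section. The sketch below illustrates both with the standard library only. It is a simplified stand-in, not the loader spaCy actually uses, and it assumes that every value is either a whole ${section.key} reference or a JSON-like literal; load_config is a name invented for this example.

```python
import configparser
import json
import re

CONFIG_TEXT = """
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
include_static_vectors = false

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
"""

# Matches a value that consists of exactly one "${section.key}" reference.
REFERENCE = re.compile(r"^\$\{(?P<section>.+)\.(?P<key>[^.]+)\}$")


def load_config(text: str) -> dict:
    """Parse the text into {section: {key: value}} with references resolved."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys exactly as written
    parser.read_string(text)

    def resolve(raw: str):
        match = REFERENCE.match(raw)
        if match:  # follow the reference to the section that defines the value
            return resolve(parser[match["section"]][match["key"]])
        return json.loads(raw)  # remaining values are JSON-like literals

    return {
        section: {key: resolve(value) for key, value in parser[section].items()}
        for section in parser.sections()
    }


config = load_config(CONFIG_TEXT)
embed = config["components.tok2vec.model.embed"]
print(embed["width"], embed["include_static_vectors"])
```

Defining the width once in the encode section and referencing it from the embed section keeps the two layers in sync when the value is changed.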

As a starting point, spaCy provides a widget for creating a default.cfg. Their extensive documentation on architectures and hyperparameters also helps a lot in defining each component.

To begin with, the project.yaml can be created from a provided template; we then add our custom commands as the project goes along.

python -m spacy project clone pipelines/tagger_parser_ud

4. Training data

In reality, training data is messy and incomplete, and every machine learning engineer knows that the data is the heart of any trained model. This article only tries to give a glimpse into spaCy’s training process, so we cheat and take a completely labeled dataset from the Universal Dependencies treebanks. We can use a helper command to convert these .conllu files into a trainable format.

In the project.yaml:

vars:
  config: "default"
  lang: "custom_en"
  treebank: "UD_English-EWT"

commands:
  - name: preprocess
    help: "Convert the data to spaCy's format or unzip"
    script:
      - "mkdir -p corpus/${vars.treebank}"
      - "python -m spacy convert assets/${vars.treebank}/ud/nno/${vars.train_name}.conllu corpus/${vars.treebank}/ --converter conllu --n-sents 10 --merge-subtokens"

...
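
For readers who have not seen the input format before: a .conllu file stores one token per line in ten tab-separated columns, with blank lines between sentences and comment lines starting with "#". The sketch below reads that structure with plain Python. It only illustrates the file layout and is not how spacy convert works internally; the sample sentence is hand-written for this example and not taken from UD_English-EWT.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in file order.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]

# Hand-written toy sentence for illustration only.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):  # sentence-level comment or metadata
            continue
        else:
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:  # the text may lack a trailing blank line
        sentences.append(current)
    return sentences


for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```

The head and deprel columns are what the parser learns from, while upos, xpos and feats feed the tagger and morphologizer.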

5. Training

Using spacy project we can run:

python -m spacy project run train

and immediately start seeing losses and accuracies per epoch:

Running command: /Users/nico/Dev/custom_en/.venv/bin/python -m spacy train configs/default.cfg --output training/custom_en --gpu-id -1 --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy --code ./custom_en/loader.py
ℹ Saving to output directory: training/custom_en
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2021-09-08 09:24:54,019] [INFO] Set up nlp object from config
[2021-09-08 09:24:54,037] [INFO] Pipeline: ["tok2vec","tagger","morphologizer","parser","ner", "senter"]
[2021-09-08 09:24:55,612] [INFO] Created vocabulary
[2021-09-08 09:25:26,151] [INFO] Finished initializing nlp object
[2021-09-08 09:26:06,812] [INFO] Initialized pipeline components: ["tok2vec","tagger","morphologizer","parser","ner", "senter"]
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ["tok2vec","tagger","morphologizer","parser","ner", "senter"]
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS PARSER  LOSS NER  LOSS SENTER  TAG_ACC  POS_ACC  MORPH_ACC  DEP_UAS  DEP_LAS  SENTS_F  ENTS_F  ENTS_P  ENTS_R  LEMMA_ACC  SCORE
---  ------  ------------  -----------  -------------  -----------  --------  -----------  -------  -------  ---------  -------  -------  -------  ------  ------  ------  ---------  ------
  0       0          0.00       175.06         184.99       361.41     97.53        93.00    30.69    28.60      25.11    18.29     7.28     0.30    0.00    0.00    0.00      59.07    0.25
  0    1000      27032.84     31342.60       54016.47    104942.62  10414.39      3219.51    93.91    93.

...
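
The DEP_UAS and DEP_LAS columns are the unlabeled and labeled attachment scores of the parser: the share of tokens that received the correct head, and the share that received both the correct head and the correct dependency label. The following sketch shows the textbook definition on a made-up four-token example. It is not spaCy's scorer, which may differ in details such as how punctuation is treated; attachment_scores is a name invented for this illustration.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for one aligned token sequence."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    # UAS counts tokens whose predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS additionally requires the dependency label to match.
    label_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * label_hits / total


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (1, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

By this definition LAS can never exceed UAS, which matches the first row of the table above, where DEP_LAS is lower than DEP_UAS.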

Notice that we use the --code flag to load our custom language class and make it available during the training process. After training, package the model and start using it.

This is just a glimpse of spaCy’s many capabilities; we haven’t even touched on loading pretrained vectors or developing custom components. Nonetheless, this article has turned into a love letter to spaCy. The library is so well written and documented that it is a blast to use during development. Every aspect of machine learning is well thought out and tightly integrated into its architecture.