
spaCy Learning Tutorial

What is spaCy?
spaCy is an industrial-strength natural language processing (NLP) library built for efficiency and production use. It supports tasks like tokenization, tagging, parsing, named entity recognition, and more.
import spacy
print("spaCy is ready for NLP tasks!")
      

History and development
Developed by Explosion AI, spaCy was released in 2015. It’s written in Python and Cython and designed for speed, scalability, and developer-friendliness.
# Project: https://spacy.io by Explosion AI
      

spaCy vs other NLP libraries
spaCy is faster and more production-focused compared to NLTK (research-oriented) and TextBlob (simpler API). It includes pretrained pipelines and deep learning integration.
# spaCy is to production what NLTK is to research
      

Installation and setup
spaCy is easy to install via pip. After installation, you must download a language model (e.g., English).
pip install spacy
python -m spacy download en_core_web_sm
      

Supported languages
spaCy supports many languages including English, French, German, Spanish, Portuguese, and more through pretrained language models.
nlp_fr = spacy.load("fr_core_news_sm")  # French pipeline (install with: python -m spacy download fr_core_news_sm)
doc = nlp_fr("Bonjour le monde!")
      

Core features overview
spaCy includes tokenization, part-of-speech tagging, dependency parsing, named entity recognition (NER), similarity checks, and rule-based matching.
doc = nlp("Apple is looking at buying U.K. startup.")
for ent in doc.ents:
    print(ent.text, ent.label_)
      

spaCy architecture basics
spaCy uses an efficient pipeline-based architecture where documents pass through components like tokenizer, tagger, parser, NER, and custom functions.
print(nlp.pipe_names)  # ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
      

Use cases and applications
spaCy is used in chatbots, search engines, information extraction, recommendation systems, and automated text understanding in production systems.
# Extract entities from resumes or support tickets
      

Community and resources
spaCy has a strong community and resources including documentation, GitHub examples, online courses, and a helpful forum for developers.
# Visit https://spacy.io/usage for docs and tutorials
      

Running your first spaCy script
Once installed, you can start using spaCy with a few lines of code to parse and analyze text.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("SpaCy makes NLP easy!")
for token in doc:
    print(token.text, token.pos_, token.dep_)
      

What is tokenization?
Tokenization is the process of breaking text into smaller units called tokens. Tokens are the foundation of NLP analysis and can be words, punctuation, or symbols.
doc = nlp("Let's learn spaCy.")
tokens = [token.text for token in doc]
print(tokens)
      

spaCy’s tokenizer features
spaCy's tokenizer handles punctuation, whitespace, prefixes, suffixes, and exceptions. It’s highly optimized for performance and language correctness.
doc = nlp("U.S.A. is a country.")
for token in doc:
    print(token.text)
      

Token attributes and methods
Tokens in spaCy come with attributes like `.text`, `.lemma_`, `.pos_`, `.is_alpha`, etc., allowing rich linguistic analysis.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_alpha)
      

Sentence segmentation
spaCy automatically detects sentence boundaries using built-in logic or custom rule-based segmentation.
for sent in doc.sents:
    print(sent.text)
      

Handling special cases (URLs, emojis)
spaCy handles URLs, hashtags, mentions, and emojis using special rules or can be extended using `tokenizer.add_special_case`.
from spacy.symbols import ORTH
special_case = [{ORTH: "¯\\_(ツ)_/¯"}]
nlp.tokenizer.add_special_case("¯\\_(ツ)_/¯", special_case)
doc = nlp("¯\\_(ツ)_/¯ is a shrug.")
print([token.text for token in doc])
      

Customizing tokenization rules
You can define custom prefix, suffix, or infix rules to tailor tokenization for your domain (legal, medical, etc.).
import re
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
custom_nlp = English()
# Treat "." and "," inside a token as infixes so they are split into separate tokens
infix_re = re.compile(r"[.,]")
custom_tokenizer = Tokenizer(custom_nlp.vocab, infix_finditer=infix_re.finditer)
doc = custom_tokenizer("Custom.tokenizer,loaded")
print([token.text for token in doc])  # ['Custom', '.', 'tokenizer', ',', 'loaded']
      

Tokenization performance tips
Tokenization itself is fast; for large texts, use `nlp.make_doc` to tokenize without running the rest of the pipeline, or disable components (e.g., `ner`) you don't need.
doc = nlp.make_doc("Just tokenize me.")
      

Working with Doc objects
The `Doc` object holds the tokenized text and linguistic annotations. It's spaCy's core data structure for processing text.
print(type(doc))  # <class 'spacy.tokens.doc.Doc'>
print(doc.text)
      

Accessing and modifying tokens
While tokens are read-only, you can add custom data using extensions or manipulate strings before processing.
for token in doc:
    print(token.text, token.is_stop, token.is_punct)
      

Tokenization pitfalls and solutions
spaCy may tokenize improperly in rare edge cases (e.g., domain-specific slang). Override rules or preprocess to improve accuracy.
# Use regex cleaning before passing to tokenizer
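A minimal sketch of that pre-cleaning idea: normalizing noisy punctuation with a regular expression before handing text to the tokenizer (the pattern and sample text are only illustrative).
import re
import spacy
nlp = spacy.load("en_core_web_sm")
raw = "Great product!!!   Totally   recommend it..."
# Illustrative cleanup: collapse repeated punctuation and whitespace before tokenizing
cleaned = re.sub(r"([!?.]){2,}", r"\1", raw)
cleaned = re.sub(r"\s{2,}", " ", cleaned)
doc = nlp(cleaned)
print([token.text for token in doc])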
      

Understanding POS tagging
Part-of-Speech (POS) tagging involves labeling each word in a sentence with its grammatical role, like noun or verb. It’s critical for parsing, entity recognition, and understanding sentence structure.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("SpaCy makes NLP tasks easy.")
for token in doc:
    print(token.text, token.pos_)
      

spaCy’s POS tagger overview
spaCy uses statistical models trained on labeled corpora for high-accuracy POS tagging. It offers both universal and detailed tags.
for token in doc:
    print(f"{token.text} - POS: {token.pos_}, Tag: {token.tag_}")
      

Universal POS tags vs language-specific
Universal tags are language-agnostic (e.g., NOUN, VERB), while fine-grained tags (e.g., NNS, VBD) provide detailed linguistic information based on language.
print(token.pos_, token.tag_)  # e.g., NOUN, NNS
      

Accessing POS tags in spaCy
spaCy exposes `.pos_`, `.tag_`, and `.dep_` attributes to access POS and dependency labels for each token.
for token in doc:
    print(token.text, token.pos_, token.dep_)
      

Fine-tuning POS taggers
You can fine-tune spaCy’s tagger on custom annotated corpora to improve performance on domain-specific texts like legal or medical documents.
# Use spaCy's training loop with POS-labeled examples
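As a rough illustration (the recommended route is spaCy's config-driven training), a tagger can be updated in Python with `Example` objects; the tiny tagged dataset below is made up.
import spacy
from spacy.training import Example
nlp_tag = spacy.blank("en")
tagger = nlp_tag.add_pipe("tagger")
# Made-up training data: one fine-grained tag per token
train_data = [("I love NLP", {"tags": ["PRP", "VBP", "NNP"]})]
examples = [Example.from_dict(nlp_tag.make_doc(t), ann) for t, ann in train_data]
optimizer = nlp_tag.initialize(lambda: examples)  # labels are inferred from the examples
for epoch in range(20):
    losses = {}
    nlp_tag.update(examples, sgd=optimizer, losses=losses)
print(losses)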
      

Evaluating POS tagging accuracy
Use accuracy metrics or compare predicted tags with a gold-standard test set to evaluate tagging performance.
# Calculate accuracy manually or use spaCy Scorer
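A minimal sketch of manual accuracy computation against a hand-written gold list (the gold tags below are assumptions for illustration).
# Hypothetical gold POS tags for "SpaCy makes NLP tasks easy."
gold_tags = ["PROPN", "VERB", "PROPN", "NOUN", "ADJ", "PUNCT"]
doc = nlp("SpaCy makes NLP tasks easy.")
correct = sum(1 for token, gold in zip(doc, gold_tags) if token.pos_ == gold)
print(f"POS accuracy: {correct / len(gold_tags):.2f}")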
      

Common POS tagging errors
Ambiguity and informal language often confuse taggers. For example, “book” can be a noun or verb depending on context.
text = "Let's book a room."
# "book" might be misclassified
      

Integrating POS tags in pipelines
POS tags are often used as features in downstream NLP tasks like parsing, classification, or entity recognition.
features = [(token.text, token.pos_) for token in doc]
      

Visualizing POS tags
Use `displaCy` to visually inspect POS and dependency relationships in a sentence.
from spacy import displacy
displacy.render(doc, style="dep")
      

Use cases of POS tagging
POS tagging is used in grammar checkers, question answering, translation, summarization, and entity recognition systems.
# Extract nouns from sentence
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
      

Introduction to NER
Named Entity Recognition (NER) detects entities like names, places, and organizations in text. It's crucial for extracting structured data from unstructured documents.
doc = nlp("Elon Musk founded SpaceX in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)
      

spaCy’s NER capabilities
spaCy’s pre-trained models support English, German, Spanish, etc. They recognize PERSON, ORG, LOC, DATE, and more.
print(spacy.explain("ORG"))  # 'Companies, agencies, institutions'
      

Pre-trained entity types
spaCy’s models can detect over 18 entity types like GPE, PRODUCT, MONEY, etc., based on context.
for ent in doc.ents:
    print(ent.text, ent.label_, spacy.explain(ent.label_))
      

Extracting entities from text
You can extract and filter specific types of entities like names, locations, or organizations programmatically.
locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
      

Customizing and extending NER
Use `EntityRuler` to add rule-based patterns or retrain the model with new annotated data for custom entities.
ruler = nlp.add_pipe("entity_ruler", before="ner")  # patterns take priority over the statistical NER
ruler.add_patterns([{"label": "SOFTWARE", "pattern": "ChatGPT"}])
      

Training custom NER models
Train spaCy’s NER on labeled examples with new entity types using spaCy’s training CLI or Python API.
# Use spaCy CLI: python -m spacy train config.cfg --paths.train ./train.spacy
      

Evaluating NER models
Evaluate models with precision, recall, and F1-score. spaCy’s `Scorer` utility helps automate this process.
from spacy.training import Example
example = Example.from_dict(doc, {"entities": [(0, 4, "PERSON")]})
      

Handling overlapping entities
spaCy does not support overlapping spans in the same entity stream, so complex layouts may need separate pipelines or rules.
# Consider using additional span groups or custom components
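One way to keep overlapping annotations, as a sketch: store them in `doc.spans`, which (unlike `doc.ents`) allows overlapping `Span` objects.
from spacy.tokens import Span
doc = nlp("New York City Hall is in New York")
# Overlapping spans are allowed in a span group, unlike in doc.ents
doc.spans["landmarks"] = [
    Span(doc, 0, 4, label="FACILITY"),  # "New York City Hall"
    Span(doc, 0, 2, label="GPE"),       # "New York"
]
print([(span.text, span.label_) for span in doc.spans["landmarks"]])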
      

Visualizing entities with displaCy
`displaCy` renders entities in a browser or Jupyter notebook, making it easy to debug and present results.
displacy.render(doc, style="ent", jupyter=True)
      

Applications of NER
NER is used in news mining, resume parsing, financial document analysis, and chatbot knowledge extraction.
# Extract person names for building user profiles
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
      

What is dependency parsing?
Dependency parsing analyzes grammatical structure by establishing relationships between "head" words and words which modify those heads. It helps in syntactic understanding of sentences.
# Example: Parsing a sentence
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("She enjoys reading books.")
for token in doc:
    print(token.text, "->", token.dep_, "->", token.head.text)
      

spaCy’s dependency parser
spaCy includes a fast and robust dependency parser trained on large corpora. It produces labeled directed trees representing syntactic structure.
doc = nlp("Cats chase mice.")
for token in doc:
    print(token.text, token.dep_, token.head.text)
      

Understanding syntactic dependencies
Dependencies like `nsubj`, `dobj`, `ROOT` define grammatical roles (subject, object, etc.), essential for linguistic analysis and information extraction.
# See dependency labels
for token in doc:
    print(token.text, token.dep_)
      

Accessing dependency labels
Dependency labels can be accessed via `.dep_` and `.head`. These labels guide relation extraction and sentence understanding.
print(doc[1].text, "is", doc[1].dep_, "of", doc[1].head.text)
      

Visualizing dependency trees
spaCy provides `displacy` for visualizing dependency trees directly in HTML or Jupyter notebooks.
from spacy import displacy
displacy.render(doc, style="dep", jupyter=True)
      

Training custom parsers
You can train your own dependency parser in spaCy for domain-specific tasks using labeled examples and the `DependencyParser` component.
# Requires annotations: (text, {"heads": [...], "deps": [...]})
      

Use cases of dependency parsing
It's used in information extraction, question answering, sentiment analysis, and semantic role labeling to grasp sentence structure.
# Extract subject-object pairs for relations
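A small sketch of relation extraction with dependency labels: collecting (subject, verb, object) triples from a parsed sentence.
doc = nlp("Alice emailed Bob and Carol visited Dave.")
triples = []
for token in doc:
    if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
        verb = token.head
        objects = [child for child in verb.children if child.dep_ in ("dobj", "obj")]
        for obj in objects:
            triples.append((token.text, verb.text, obj.text))
print(triples)  # e.g., [('Alice', 'emailed', 'Bob'), ('Carol', 'visited', 'Dave')]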
      

Troubleshooting parsing errors
Parsing errors often stem from ambiguous text or limited training data. Try rule-based fixes or retraining with new examples.
# Manually check token.head and token.dep_ for unexpected results
      

Combining parsing with other components
Dependency parsing is often used with NER, POS tagging, and text classification in full NLP pipelines.
doc = nlp("Alice emailed Bob")
for ent in doc.ents:
    print(ent.text, ent.label_)
      

Performance optimization
Use smaller models or limit components for faster inference. Disable unused components like NER if not needed.
nlp = spacy.load("en_core_web_sm", disable=["ner"])
      

What is lemmatization?
Lemmatization reduces words to their base or dictionary form (lemma). Unlike stemming, it uses vocabulary and grammar rules for accuracy.
doc = nlp("The striped cats are playing.")
print([token.lemma_ for token in doc])
      

spaCy’s lemmatizer features
spaCy includes a rule-based lemmatizer that handles various parts of speech, with accurate results for many languages.
for token in doc:
    print(token.text, "->", token.lemma_)
      

Morphological analysis basics
Morphology studies word forms and their grammatical features like tense, number, or case. It enriches token attributes.
print(doc[2].morph)  # Get morphology of token
      

Accessing lemma and morph attributes
Use `.lemma_` to get the lemma and `.morph` for morphological features. These help in normalization and linguistic tasks.
print(doc[3].text, doc[3].lemma_, doc[3].morph)
      

Language-specific morphology
spaCy supports language-specific morphological rules. For example, Spanish verbs are lemmatized using different rules than English.
# For Spanish: nlp = spacy.load("es_core_news_sm")
      

Customizing lemmatization rules
You can override or extend lemmatization rules using custom lookup tables or pipelines.
import spacy
from spacy.lookups import Lookups
lemma_nlp = spacy.blank("en")
lemmatizer = lemma_nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
lookups = Lookups()
lookups.add_table("lemma_lookup", {"better": "good"})
lemmatizer.initialize(lookups=lookups)
      

Using morphology in NLP tasks
Morphological features enhance POS tagging, sentiment analysis, and parsing by providing detailed grammatical context.
# Morph features like Number=Plur or Tense=Past are informative
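A small sketch of reading individual morphological features from `token.morph` to drive simple downstream logic (here, spotting past-tense verbs and plural nouns).
doc = nlp("The cats chased the mouse yesterday.")
for token in doc:
    tense = token.morph.get("Tense")    # e.g., ['Past'] for "chased"
    number = token.morph.get("Number")  # e.g., ['Plur'] for "cats"
    if "Past" in tense or "Plur" in number:
        print(token.text, token.pos_, token.morph)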
      

Integrating lemmatization in pipelines
Lemmatization is a standard step in NLP pipelines, placed after POS tagging to ensure contextual accuracy.
doc = nlp("She has gone")
print([token.lemma_ for token in doc])
      

Evaluating lemmatization accuracy
Evaluate with a labeled dataset comparing predicted lemmas to true lemmas using precision/recall metrics.
# Compare token.lemma_ to gold standard lemmas
      

Applications in information retrieval
Lemmatization improves search engine recall by grouping inflected forms (e.g., “run”, “running”, “ran”) into a single base form.
# Query normalization: convert "running" and "ran" to "run"
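A toy sketch of lemma-based query normalization: both the query and the documents are reduced to lemma sets so inflected forms still match.
def lemma_set(text):
    # Lowercased lemmas of content words only
    return {t.lemma_.lower() for t in nlp(text) if t.is_alpha and not t.is_stop}
documents = ["She was running in the park", "He bought new shoes"]
query_lemmas = lemma_set("ran")
for doc_text in documents:
    if query_lemmas & lemma_set(doc_text):
        print("Match:", doc_text)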
      

Introduction to text classification
Text classification assigns predefined categories to text data, useful for sentiment analysis, topic labeling, and spam detection. It’s a core NLP task that can be approached using supervised learning.
# Basic text classification overview
texts = ["I love AI", "I hate spam"]
labels = [1, 0]  # 1=positive, 0=negative
      

spaCy’s text categorizer component
spaCy offers a `TextCategorizer` pipeline component to train and predict text categories efficiently within its framework.
import spacy
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
      

Preparing data for classification
Data preparation includes tokenization, label encoding, and formatting data into spaCy’s `Doc` objects with category annotations.
train_data = [("I love this!", {"cats": {"POSITIVE": True, "NEGATIVE": False}})]
      

Training text classifiers
Train classifiers using spaCy’s built-in training loops or fine-tune pretrained models for improved accuracy.
# Minimal training loop (spaCy v3 API)
from spacy.training import Example
optimizer = nlp.initialize()
for i in range(10):
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)
      

Evaluating classification models
Use metrics like accuracy, precision, recall, and F1 score to evaluate classifier performance on validation data.
# Simple evaluation loop
doc = nlp("I love this!")
print(doc.cats)
      

Multi-label classification
Assign multiple categories per text by using the `textcat_multilabel` component, which allows several labels to be true for the same example.
{"cats": {"POSITIVE": True, "SPORTS": True}}
      

Handling imbalanced datasets
Techniques include oversampling, undersampling, or using weighted loss functions to balance training data.
# Use data augmentation or class weights in training
      

Customizing classification pipelines
Add preprocessing steps like lemmatization, stopword removal, or custom tokenization before classification.
# Insert custom pipeline components before textcat
      

Using pretrained classifiers
spaCy's core pretrained pipelines (e.g., `en_core_web_sm`) do not ship with a text classifier, but you can load one and add a `textcat` component on top, or install community-trained classification pipelines.
nlp = spacy.load("en_core_web_sm")
textcat = nlp.add_pipe("textcat")  # train this component on your labeled data
      

Deployment of classification models
Export models as packages, serve via APIs, or embed within applications for real-time classification.
nlp.to_disk("model_dir")
# Load with spacy.load("model_dir")
      

What is rule-based matching?
Rule-based matching searches for token patterns or phrases in text using deterministic rules, useful for entity recognition or custom tagging.
# Match "New York" phrase using rules
      

Using spaCy’s Matcher
spaCy’s `Matcher` matches token sequences defined by patterns such as token text, POS tags, or attributes.
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
matcher.add("GPE", [pattern])
doc = nlp("I live in New York")
matches = matcher(doc)
print(matches)
      

PhraseMatcher vs Matcher
`PhraseMatcher` matches exact phrases efficiently, while `Matcher` allows complex token attribute patterns.
from spacy.matcher import PhraseMatcher
phrases = [nlp(text) for text in ["New York", "San Francisco"]]
phrasematcher = PhraseMatcher(nlp.vocab)
phrasematcher.add("GPE", phrases)
      

Creating pattern rules
Define token sequences with constraints on text, POS, lemmas, or regular expressions to match relevant text segments.
pattern = [{"LEMMA": "buy"}, {"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matcher.add("BUY_PATTERN", [pattern])
      

Combining multiple patterns
Multiple patterns can be added under one matcher label to capture diverse text forms.
matcher.add("GREETINGS", [[{"LOWER": "hi"}], [{"LOWER": "hello"}]])
      

Using callbacks with Matcher
Callbacks can be attached to execute functions when patterns are matched for dynamic processing.
def on_match(matcher, doc, match_id, matches):
    print("Match found:", matches)
matcher.add("PATTERN", [pattern], on_match=on_match)
      

Performance considerations
Optimize by limiting patterns, avoiding overlaps, and precompiling frequently used matchers.
# Keep patterns concise and test extensively
      

Use cases for rule-based matching
Ideal for recognizing domain-specific phrases, dates, IDs, or tagging specific token patterns.
# Extract phone numbers, product codes using Matcher
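A sketch of one such use case: matching simple product codes with a single-token regex pattern (the "SKU" code format is invented for illustration).
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
# Single-token regex pattern for codes like "SKU12345" (format is illustrative)
matcher.add("PRODUCT_CODE", [[{"TEXT": {"REGEX": r"^SKU\d{5}$"}}]])
doc = nlp("Order SKU12345 was shipped yesterday.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # SKU12345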
      

Integrating matching with pipelines
Wrap matchers in custom pipeline components so pattern matching runs automatically whenever text is processed; a `Matcher` instance cannot be passed to `nlp.add_pipe` directly.
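A minimal sketch of that wrapping: a registered component applies the matcher to every `Doc` and stores the matched spans in a span group (the component name and span-group key are arbitrary).
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("GPE", [[{"LOWER": "new"}, {"LOWER": "york"}]])
@Language.component("match_component")
def match_component(doc):
    # Store matched spans in a span group instead of overwriting doc.ents
    doc.spans["matches"] = [doc[start:end] for _, start, end in matcher(doc)]
    return doc
nlp.add_pipe("match_component", last=True)
doc = nlp("I live in New York")
print([span.text for span in doc.spans["matches"]])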
      

Debugging and testing patterns
Validate pattern matches on test sentences, inspect spans, and adjust rules to reduce false positives.
# Test with doc and print matched spans
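A short sketch of inspecting matches on a test sentence, assuming the `Matcher` defined in the earlier sections: convert each match to a span and print its text along with the rule that fired.
doc = nlp("I live in New York and work in New Jersey.")
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # which pattern produced the match
    print(rule_name, "->", doc[start:end].text)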
      

Understanding spaCy pipelines
spaCy pipelines are sequences of processing components (tokenizer, tagger, parser) that process text step-by-step.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
      

Adding custom components
Custom components are Python functions or classes that can be added to pipelines for extra processing.
@spacy.Language.component("custom_component")
def custom_component(doc):
    print("Processing doc")
    return doc
nlp.add_pipe("custom_component", last=True)
      

Component ordering and dependencies
Order matters; components that depend on annotations from others must come later in the pipeline.
nlp.add_pipe("custom_component", before="ner")
      

Writing pipeline functions
Components receive `Doc` objects, modify or analyze them, then return the `Doc` for downstream processing.
def custom_component(doc):
    # add custom attributes or annotations
    return doc
      

Sharing data between components
Use `Doc.user_data` or custom extensions to share info between components.
from spacy.tokens import Doc
Doc.set_extension("is_custom", default=False)
      

Disabling/enabling components
Temporarily disable or enable components to speed up processing or troubleshoot.
with nlp.select_pipes(disable=["ner"]):
    doc = nlp("Test")
      

Pipeline optimization
Reduce overhead by disabling unused components and batching texts.
docs = nlp.pipe(texts, batch_size=20)
      

Saving and loading pipelines
Save custom pipelines to disk and reload later to preserve components and settings.
nlp.to_disk("my_model")
nlp2 = spacy.load("my_model")
      

Debugging custom pipelines
Use logging, print statements, and pipeline inspection tools to debug component behavior.
print(nlp.pipe_names)
      

Case studies
Examples include custom entity recognition, sentiment analysis modules, or preprocessing steps integrated into spaCy pipelines.
# Build a custom pipeline for financial document processing
      

Introduction to word vectors
Word vectors represent words as dense numeric arrays capturing semantic meaning. They allow models to understand similarity beyond exact word matching.
# Word vectors example using spaCy
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("dog cat banana")
print(doc[0].vector[:5])  # first 5 dims of 'dog' vector
      

spaCy’s vector support
spaCy includes pretrained word vectors in medium and large models, accessible for tokens, spans, and entire documents.
print(doc.vector[:5])  # document vector example
      

Accessing token and document vectors
Each token has a `.vector` attribute. The entire document also has `.vector` averaging token vectors.
for token in doc:
    print(token.text, token.vector[:3])
      

Calculating similarity scores
spaCy provides a `.similarity()` method to compute cosine similarity between tokens, spans, or docs.
print(doc[0].similarity(doc[1]))  # similarity between 'dog' and 'cat'
      

Using pretrained word embeddings
Use spaCy models like `en_core_web_md`, which include pretrained static word vectors for rich semantic representations.
# Already done by loading "en_core_web_md"
      

Custom word vectors
You can train and add your own vectors to spaCy pipelines to specialize on domain data.
# Train custom vectors with spacy vectors CLI (outside code)
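Beyond the CLI, vectors can also be set directly on the vocabulary, as a small sketch; the vectors below are random placeholders rather than real embeddings.
import numpy as np
import spacy
nlp_custom = spacy.blank("en")
# Placeholder 50-dimensional vectors; real workflows load trained embeddings instead
for word in ["ticker", "dividend"]:
    nlp_custom.vocab.set_vector(word, np.random.rand(50).astype("float32"))
print(nlp_custom.vocab["ticker"].vector[:5])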
      

Vector operations and applications
Vectors support addition, subtraction, averaging, useful for analogy tasks and clustering.
dog = doc[0].vector
cat = doc[1].vector
banana = doc[2].vector
result = dog - cat + banana
print(result[:5])
      

Integrating vectors with models
Vectors can serve as features for classifiers or parsers, improving semantic understanding.
# Use vectors as input in downstream ML models
      

Evaluating vector quality
Evaluate via intrinsic tasks (similarity benchmarks) or extrinsic tasks (performance gain in downstream tasks).
# Use datasets like WordSim-353 for evaluation
      

Real-world examples
Semantic search, document clustering, recommendation systems all leverage word vectors.
# Example: find the vocabulary entries closest to 'dog' (requires a model with vectors)
import numpy as np
keys, best_rows, scores = nlp.vocab.vectors.most_similar(np.asarray([doc[0].vector]), n=3)
print([nlp.vocab.strings[int(k)] for k in keys[0]])  # 'dog' itself plus its nearest neighbours
      

spaCy’s training framework
spaCy provides flexible APIs and CLI tools to train custom NLP models like NER, text classification, and dependency parsing.
# CLI command example:
# python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
      

Preparing training data
Training data must be converted to spaCy's binary `.spacy` format (a serialized DocBin); annotations are usually defined first as Python dicts with character offsets.
# Example annotation: (text, annotation dict with character offsets)
train_example = (
    "Apple is looking at buying U.K. startup.",
    {"entities": [(0, 5, "ORG")]},
)
      

Configuring training settings
The config file defines pipeline components, hyperparameters, and training loops for custom model training.
# Minimal config excerpt (generate a full config with: python -m spacy init config config.cfg)
[nlp]
lang = "en"
pipeline = ["ner"]

[training]
max_epochs = 20
dropout = 0.1
      

Training NER models
Named Entity Recognition models learn to detect entities from labeled text via supervised learning.
# Example Python snippet for training NER
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("ORG")

# Training loop omitted for brevity
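For completeness, a minimal in-Python training loop for the pipeline above (the config-driven CLI remains the recommended route); the single annotated example is made up.
import random
from spacy.training import Example
# Made-up annotated example: character offsets for an ORG entity
train_data = [("I work at Acme Corp", {"entities": [(10, 19, "ORG")]})]
examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in train_data]
optimizer = nlp.initialize(lambda: examples)
for epoch in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)
print(losses)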
      

Training text classifiers
Text classifiers categorize texts into classes such as sentiment or topic.
# Add textcat pipe
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
      

Training dependency parsers
Parsers learn syntactic structure, predicting head-dependent relations between words.
# Add parser pipe
parser = nlp.add_pipe("parser")
parser.add_label("nsubj")  # add dependency labels before training
      

Using spaCy’s CLI for training
The CLI simplifies training workflows, allowing easy config file editing and execution.
# python -m spacy train config.cfg --output ./model_output
      

Monitoring training progress
Training logs display losses and metrics to monitor model improvements.
# Logs print after each epoch in CLI
      

Evaluating trained models
Evaluate with precision, recall, F1 scores using built-in spaCy evaluation scripts.
# nlp.evaluate(...) method used in scripts
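A small sketch of `nlp.evaluate`, which takes a list of `Example` objects and returns a dict of metrics; it assumes the pipeline from the previous sections has been initialized and trained, and the gold annotation below is hypothetical.
from spacy.training import Example
# Hypothetical gold-annotated dev example
dev_data = [("I work at Acme Corp", {"entities": [(10, 19, "ORG")]})]
dev_examples = [Example.from_dict(nlp.make_doc(t), ann) for t, ann in dev_data]
scores = nlp.evaluate(dev_examples)
print(scores.get("ents_p"), scores.get("ents_r"), scores.get("ents_f"))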
      

Saving and sharing models
Export trained models to disk for reuse and sharing on platforms like Hugging Face Hub.
nlp.to_disk("./my_model")
# Load later
nlp2 = spacy.load("./my_model")
      

Saving spaCy models
spaCy models can be saved to disk using the `.to_disk()` method, preserving pipelines and weights.
nlp.to_disk("model_dir")
      

Loading models for inference
Load saved models easily for prediction without retraining.
import spacy
nlp = spacy.load("model_dir")
doc = nlp("This is a test.")
      

Versioning models
Track versions of models with semantic versioning and metadata for reproducibility.
# Use version tags in filenames or metadata files
      

Exporting pipelines
Export entire processing pipelines or individual components for reuse.
# Save only NER component
ner = nlp.get_pipe("ner")
ner.to_disk("ner_dir")
      

Sharing models on Hugging Face Hub
Publish models to Hugging Face Hub to facilitate community sharing and collaboration.
# Use huggingface_hub CLI or API
      

Model packaging best practices
Package models with environment specs, dependencies, and documentation for portability.
# Use requirements.txt and README.md alongside model files
      

Managing multiple models
Organize and maintain multiple model versions for different tasks or datasets.
# Use directory structure or model registry tools
      

Loading components separately
Load specific components when full pipeline is not required for faster inference.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.from_disk("ner_dir")
      

Model caching strategies
Cache loaded models in memory or disk to reduce load times in production.
# Use caching libraries or persistent server processes
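A small in-process caching sketch using `functools.lru_cache`, so repeated requests for the same pipeline name reuse the already-loaded object.
import functools
import spacy
@functools.lru_cache(maxsize=4)
def get_pipeline(name: str):
    # Loaded once per name, then served from the in-memory cache
    return spacy.load(name)
nlp = get_pipeline("en_core_web_sm")
nlp_again = get_pipeline("en_core_web_sm")  # cache hit, no reload
print(nlp is nlp_again)  # True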
      

Performance considerations
Optimize model size, lazy loading, and batch processing for efficient real-time use.
# Adjust batch size during inference for speed
      

Introduction to displaCy
displaCy is spaCy’s built-in visualization tool for NLP tasks, supporting dependency trees and named entity rendering in HTML or Jupyter notebooks.
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup.")
displacy.render(doc, style="dep")  # Render dependency tree
      

Visualizing dependency trees
displaCy shows syntactic structure with arcs between tokens indicating dependencies, useful for grammar analysis.
displacy.render(doc, style="dep")  # Default dependency parse visualization
      

Visualizing named entities
Named entities are highlighted with labels such as ORG or GPE, helping identify key info.
displacy.render(doc, style="ent")  # Highlight named entities
      

Customizing visualizations
Customize colors, labels, and rendering options by passing options dict.
options = {"colors": {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}}
displacy.render(doc, style="ent", options=options)
      

Serving visualizations in web apps
Use `displacy.serve()` to start a local web server and visualize in browser.
displacy.serve(doc, style="dep")
      

Exporting visualizations
Save HTML output from displaCy to embed or share visualizations.
html = displacy.render(doc, style="ent", page=True)
with open("ent.html", "w") as f:
    f.write(html)
      

Using displaCy in Jupyter notebooks
Render inline visualizations using `displacy.render()` with `jupyter=True`.
displacy.render(doc, style="dep", jupyter=True)
      

Interactive visualizations
displaCy output is standard HTML/SVG, so served visualizations can be opened in any browser and combined with your own CSS or JavaScript for interactive exploration.
# displacy.serve() hosts the visualization locally; open it in a browser to explore
      

Styling and themes
Customize font, colors, and layout via CSS or options to match branding.
# Modify options dict for colors and font size
      

Integrations and extensions
Extend displaCy to visualize custom entities or relations by modifying spaCy pipeline.
# Use custom extension attributes for visualization
      

Supported languages overview
spaCy supports multiple languages with dedicated models, covering tokenization, tagging, parsing, and NER.
# List installed pipelines and check their compatibility
python -m spacy validate
      

Language-specific models
Use models trained specifically for language syntax and vocabulary (e.g., `fr_core_news_sm` for French).
nlp_fr = spacy.load("fr_core_news_sm")
doc = nlp_fr("Ceci est une phrase en français.")
      

Tokenization differences by language
Tokenization rules differ per language; spaCy handles language-specific tokenizers.
print([token.text for token in doc])  # Tokenization respects French rules
      

Training models for new languages
Train custom models using spaCy's training API for unsupported languages or dialects.
# Create blank model and train on annotated data
nlp = spacy.blank("xx")  # blank multi-language model
      

Handling multilingual text
Detect language first, then route to appropriate spaCy model for processing.
from langdetect import detect
text = "Hola, ¿cómo estás?"
lang = detect(text)
print(lang)  # es for Spanish
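A sketch of the routing step, assuming the relevant pipelines have been downloaded and that `langdetect` language codes map cleanly onto them.
import spacy
from langdetect import detect
# Assumes these pipelines have been downloaded beforehand
pipelines = {
    "en": spacy.load("en_core_web_sm"),
    "es": spacy.load("es_core_news_sm"),
}
text = "Hola, ¿cómo estás?"
lang = detect(text)
nlp_for_text = pipelines.get(lang, pipelines["en"])  # fall back to English
doc = nlp_for_text(text)
print(lang, [token.text for token in doc])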
      

Language detection integration
Combine language detection libraries with spaCy pipelines for dynamic processing.
# Use langdetect or fasttext for detection before processing
      

Customizing pipelines for languages
Modify tokenization, tagging, and entity rules per language requirements.
# Add custom components for language-specific processing
      

Challenges in multilingual NLP
Issues include code-switching, lack of annotated data, and varying grammar complexity.
# Research needed; no simple code
      

Evaluating multilingual models
Use language-specific benchmarks and metrics to assess performance.
# Evaluate with accuracy, F1 on multilingual corpora
      

Real-world multilingual applications
Use cases include translation, cross-lingual search, and international chatbots.
# Integrate with translation APIs and spaCy pipelines
      

spaCy and PyTorch integration
spaCy pipelines can use PyTorch models as custom components for NER or text classification.
# Example: wrap PyTorch model as spaCy pipeline component
      

Using spaCy with TensorFlow
spaCy can preprocess data for TensorFlow models or integrate TensorFlow components.
# Use spaCy tokenizer + TensorFlow model input pipeline
      

Custom model layers in spaCy
Define custom neural layers compatible with spaCy’s Thinc library for advanced modeling.
# Create custom layer extending Thinc API
      

Combining spaCy with transformers
Use HuggingFace transformers inside spaCy pipelines to boost accuracy with pretrained language models.
import spacy
nlp = spacy.load("en_core_web_trf")  # transformer-backed pipeline (requires spacy-transformers)
      

Transfer learning in spaCy pipelines
Fine-tune pretrained models with task-specific data via spaCy’s training API.
# Train transformer-based pipeline on your dataset
      

Using pretrained embeddings
Integrate embeddings like GloVe or BERT for rich vector representations.
# spaCy's medium and large models ship with static word vectors
      

Fine-tuning transformer models
Adjust weights of transformer layers on your labeled data for better task-specific performance.
# Use HuggingFace Trainer API alongside spaCy preprocessing
      

Exporting models for DL frameworks
Exporting spaCy components to formats like ONNX or TorchScript is not built in; it generally means exporting the underlying model (e.g., the transformer) with third-party tooling.
# No built-in ONNX export; convert the underlying model with external tools
      

Hybrid NLP models
Combine rule-based, statistical, and deep learning components for flexible pipelines.
# spaCy pipelines allow custom components chaining
      

Case studies
Real-world examples include chatbots, document classification, and sentiment analysis with spaCy+DL.
# Research papers and open-source projects
      

Efficient pipeline design
Design NLP pipelines with minimal components and streamlined workflows to avoid redundant processing, reducing runtime and memory use.
# Build lean pipeline by disabling unnecessary components
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])
      

Using spaCy’s nlp.pipe for batching
Process texts in batches with `nlp.pipe` to improve speed and reduce overhead.
texts = ["Text one.", "Text two.", "Text three."]
for doc in nlp.pipe(texts, batch_size=50):
    print(doc.text)
      

Parallel processing with spaCy
Use `n_process` in `nlp.pipe` to run processing across multiple CPU cores.
for doc in nlp.pipe(texts, n_process=4):
    print(doc.text)
      

Memory management tips
Free unused objects, limit cache sizes, and use generators to avoid memory bloat in large pipelines.
import gc
gc.collect()
      

Speeding up tokenization
Use optimized tokenizers or customize rules to speed up tokenization, the first and often slowest step.
# Use simpler tokenizer or disable unnecessary tokenization extensions
      

Optimizing model size
Use smaller pretrained models or prune unused pipeline components for faster load times and less memory.
nlp = spacy.load("en_core_web_sm")  # smaller model
      

GPU acceleration options
Use spaCy with compatible GPU support to speed up deep learning components like neural pipelines.
import spacy
spacy.require_gpu()
nlp = spacy.load("en_core_web_trf")
      

Profiling spaCy applications
Use profiling tools like `cProfile`, or spaCy's `debug data` and `debug config` CLI commands, to find bottlenecks.
import cProfile
cProfile.run('nlp("Some large text")')
      

Handling large datasets
Stream process large corpora in batches and consider using disk-based storage for intermediate results.
for doc in nlp.pipe(large_texts, batch_size=1000):
    # process docs
    pass
      

Best practices for production
Use containerization, monitoring, optimized pipelines, and secure APIs to ensure reliable and scalable NLP services.
# Example: Docker container with FastAPI for model serving
      

Exporting models for production
Save trained spaCy models to disk using `nlp.to_disk()` for later deployment.
nlp.to_disk("model_dir")
nlp2 = spacy.load("model_dir")
      

Serving models with REST APIs
Wrap spaCy models in RESTful APIs using frameworks like Flask or FastAPI.
from fastapi import FastAPI
import spacy

app = FastAPI()
nlp = spacy.load("en_core_web_sm")

@app.post("/predict")
def predict(text: str):
    doc = nlp(text)
    return {"entities": [(ent.text, ent.label_) for ent in doc.ents]}
      

Using FastAPI with spaCy
FastAPI offers async capabilities and easy deployment, ideal for scalable NLP apps.
# See example above for basic FastAPI + spaCy integration
      

Containerizing NLP apps with Docker
Package NLP apps into Docker containers to ensure environment consistency and easy deployment.
# Dockerfile example
FROM python:3.10-slim
RUN pip install spacy fastapi uvicorn
RUN python -m spacy download en_core_web_sm
COPY . /app
WORKDIR /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
      

Cloud deployment strategies
Deploy using AWS Lambda, Google Cloud Run, or Azure Functions for serverless scalability.
# Deploy FastAPI app to cloud container services
      

Edge and mobile deployment
Use lightweight spaCy models or convert pipelines to ONNX format for mobile/edge devices.
# Export model with spacy to ONNX for edge inference
      

Monitoring and logging
Track API usage, latency, errors, and model performance using logging frameworks.
import logging
logging.basicConfig(level=logging.INFO)
      

Scaling inference services
Use load balancers, horizontal scaling, and asynchronous processing to handle traffic spikes.
# Kubernetes or cloud autoscaling example
      

Security considerations
Secure APIs with authentication, encryption, and input validation to prevent attacks.
# Use OAuth or API keys for access control
      

Continuous integration and deployment
Automate tests and deployments with CI/CD pipelines for reliable updates.
# GitHub Actions or Jenkins pipeline for spaCy app
      

Building a chatbot
Combine spaCy for intent detection with rule-based response logic to create conversational agents.
# Basic intent recognition example
doc = nlp("Book a flight")
if "book" in doc.text.lower():
    print("Booking intent detected")
      

Information extraction system
Extract structured data from text, such as dates, names, or product info, using entity recognition.
for ent in doc.ents:
    print(ent.text, ent.label_)
      

Text summarization tool
Use sentence ranking and keyword extraction to generate extractive summaries.
# Summarization via sentence scoring (simplified)
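A toy extractive summarizer as a sketch: score each sentence by the frequency of its content words and keep the top-scoring ones (assumes a loaded pipeline with sentence boundaries, e.g. en_core_web_sm).
from collections import Counter
def summarize(text, n_sentences=2):
    doc = nlp(text)
    # Frequency of content-word lemmas across the whole document
    freqs = Counter(t.lemma_.lower() for t in doc if t.is_alpha and not t.is_stop)
    scored = [
        (sum(freqs[t.lemma_.lower()] for t in sent if t.is_alpha), sent)
        for sent in doc.sents
    ]
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n_sentences]
    # Restore original order of the selected sentences
    return " ".join(sent.text for _, sent in sorted(top, key=lambda pair: pair[1].start))
print(summarize("spaCy is fast. spaCy parses text. Bananas are yellow. spaCy is popular."))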
      

Sentiment analysis pipeline
Integrate spaCy with sentiment tools like TextBlob or VADER for opinion mining.
from textblob import TextBlob
blob = TextBlob(doc.text)
print(blob.sentiment.polarity)
      

Document classification app
Build a classifier for categorizing documents into topics or types.
# Train textcat component and classify docs
      

Named entity recognition demo
Visualize detected entities in texts using spaCy’s displacy.
from spacy import displacy
displacy.render(doc, style="ent")
      

Resume parsing system
Extract candidate details like skills, education, and experience from resumes.
# Custom NER for resumes with spaCy
      

Social media analytics
Analyze trends, hashtags, and sentiment in tweets or posts.
# Preprocess social data, analyze with spaCy + sentiment
      

Custom search engine
Combine keyword matching and semantic similarity to build search applications.
# Use similarity queries with spaCy vectors
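A sketch of semantic ranking with document vectors; it assumes a model with word vectors such as `en_core_web_md` is installed, and the tiny corpus is made up.
import spacy
nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors
corpus = [
    "How to reset my password",
    "Best hiking trails near Denver",
    "Troubleshooting login problems",
]
corpus_docs = [nlp(text) for text in corpus]
query = nlp("I cannot sign in to my account")
ranked = sorted(corpus_docs, key=lambda d: query.similarity(d), reverse=True)
for doc in ranked:
    print(round(query.similarity(doc), 3), doc.text)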
      

End-to-end NLP application
Integrate all modules into a full pipeline, deploy as web or desktop app.
# Flask or FastAPI to serve NLP app
      

Overview of spaCy ecosystem
The spaCy ecosystem includes tools and libraries built around spaCy for data processing, model training, and deployment. It supports various NLP needs beyond the core library.
# Core spaCy + ecosystem libraries like Prodigy, Thinc
import spacy
nlp = spacy.load("en_core_web_sm")
      

Using spaCy Universe projects
spaCy Universe is a collection of community-maintained plugins and tools that extend spaCy functionality, including connectors, models, and utilities.
# Browse spaCy Universe for tools: https://spacy.io/universe
      

Integrating with Prodigy annotation tool
Prodigy is a commercial annotation tool designed to work seamlessly with spaCy for fast, active learning-based data labeling.
# Example CLI: prodigy ner.manual en_core_web_sm ./data.jsonl --label PERSON,ORG
      

spaCy transformers package
This package integrates transformer models like BERT with spaCy pipelines, enabling powerful contextual embeddings and improved accuracy.
import spacy
nlp = spacy.load("en_core_web_trf")  # requires the spacy-transformers package
      

Visualization tools beyond displaCy
There are additional visualization options such as spacy-streamlit and other third-party projects for interactive NLP exploration.
# Use spacy-streamlit to build interactive NLP dashboards
      

Language models from third parties
Third parties publish custom spaCy-compatible models for specific languages or domains, enriching spaCy’s reach.
# Install third-party models with pip, then load like core models
      

Extending spaCy with plugins
spaCy supports plugins for custom components, pipeline extensions, and training utilities to tailor workflows.
from spacy.language import Language
@Language.component("plugin_component")
def plugin_component(doc):
    # custom logic
    return doc
nlp.add_pipe("plugin_component", last=True)
      

Community resources and forums
Engage with spaCy’s active community through forums, GitHub discussions, and tutorials to solve problems and share knowledge.
# https://github.com/explosion/spaCy/discussions
      

Contribution guidelines
spaCy encourages open-source contributions with clear guidelines on coding style, testing, and documentation.
# See CONTRIBUTING.md in spaCy repo for details
      

Keeping up with updates
Follow spaCy’s release notes, newsletters, and social media to stay informed about new features and improvements.
# https://spacy.io/usage/releases
      

Transformer architectures in spaCy
SpaCy leverages transformer architectures like BERT and RoBERTa to provide contextualized embeddings, improving many NLP tasks significantly.
# Load a transformer-backed pipeline (requires the spacy-transformers package)
import spacy
nlp = spacy.load("en_core_web_trf")
      

Explainability in NLP models
Understanding why models make certain decisions is crucial. Techniques like attention visualization and SHAP values help interpret models.
# Use explainer libraries like Captum or SHAP with NLP models
      

Handling bias and fairness
NLP models can encode societal biases. Research focuses on detecting and mitigating biases to ensure fair AI applications.
# Analyze embeddings for bias using WEAT or similar tests
      

Semi-supervised learning with spaCy
Semi-supervised methods train models with limited labeled data and large unlabeled corpora, leveraging active learning or self-training.
# Use Prodigy for active learning loops to improve model with minimal labels
      

Active learning workflows
Active learning prioritizes labeling the most informative examples, reducing annotation effort and improving model quality.
# Select uncertain predictions for annotation
      

Self-supervised NLP tasks
Self-supervised learning uses raw text to create proxy tasks, enabling models to learn representations without manual labels.
# Examples: masked language modeling, next sentence prediction
      

Domain adaptation techniques
Adapting models to new domains involves fine-tuning on domain-specific data or using domain-adversarial training.
# Fine-tune base model on medical or legal texts
      

Multimodal NLP integration
Multimodal models combine text with images, audio, or video to build richer AI applications.
# Combine spaCy text embeddings with image features from CNNs
      

Latest research trends
Emerging research includes large language models, prompt engineering, and more efficient transformers.
# Explore papers on arXiv or conferences like ACL, NeurIPS
      

Future of spaCy and NLP
spaCy aims to remain a leading NLP library, integrating new research, expanding language support, and improving usability.
# Follow spaCy roadmap on GitHub for upcoming features