This content originally appeared on DEV Community and was authored by Shrijith Venkatramana
Hello, I'm Shrijith Venkatramana. I'm building LiveReview, a private AI code review tool that runs on your own LLM API key (OpenAI, Gemini, etc.) with highly competitive pricing, built for small teams. Do check it out and give it a try!
SpaCy is a powerful library for natural language processing in Python. It focuses on efficiency and production-ready features, making it ideal for developers who need reliable tools for text analysis. While large language models grab headlines, SpaCy handles core NLP tasks with speed and precision. This article dives into practical uses, complete with code examples you can run yourself.
Setting Up SpaCy for Quick Wins
To start using SpaCy, install it via pip and download a language model. The examples here use the small English model to keep things fast.
# Install SpaCy if needed: pip install spacy
# Download model: python -m spacy download en_core_web_sm
import spacy
# Load the model
nlp = spacy.load("en_core_web_sm")
# Example: Process a simple text
doc = nlp("SpaCy is a great NLP library.")
# Output the tokens
for token in doc:
    print(token.text)
# Expected output:
# SpaCy
# is
# a
# great
# NLP
# library
# .
This setup gives you access to pipelines for tokenization, tagging, and more. Check the SpaCy installation guide for details on other languages or larger models.
Key point: SpaCy models are pre-trained and lightweight, perfect for running on standard hardware without GPU needs.
Tokenizing Text with Precision
Tokenization breaks text into words or subwords. SpaCy handles punctuation, contractions, and special cases better than simple string splits.
Consider this example with complex text:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Dr. Smith visited Washington, D.C. on Jan. 1st, 2023. He said, 'It's amazing!'"
doc = nlp(text)
# Print tokens with their start and end positions
for token in doc:
    print(f"{token.text} ({token.idx}-{token.idx + len(token)})")
# Expected output:
# Dr. (0-3)
# Smith (4-9)
# visited (10-17)
# Washington (18-28)
# , (28-29)
# D.C. (30-34)
# on (35-37)
# Jan. (38-42)
# 1st (43-46)
# , (46-47)
# 2023 (48-52)
# . (52-53)
# He (54-56)
# said (57-61)
# , (61-62)
# ' (63-64)
# It (64-66)
# 's (66-68)
# amazing (69-76)
# ! (76-77)
# ' (77-78)
SpaCy preserves abbreviations like "Dr." and "D.C." as single tokens. This accuracy helps in downstream tasks.
Key point: Customize tokenization rules if your domain has unique patterns, like chemical formulas.
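One way to do that is tokenizer.add_special_case, which overrides the default rules for a single string; a minimal sketch using the classic "gimme" example from the spaCy docs:

```python
import spacy
from spacy.symbols import ORTH

# A blank pipeline is enough: special cases live on the tokenizer
nlp = spacy.blank("en")

# Tell the tokenizer to always split "gimme" into two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

doc = nlp("gimme that book")
print([token.text for token in doc])
# ['gim', 'me', 'that', 'book']
```

The same mechanism works for keeping domain terms together instead of splitting them.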
Tagging Parts of Speech Accurately
Part-of-speech (POS) tagging assigns labels like noun or verb to each token. SpaCy uses statistical models for this, providing both coarse and fine-grained tags.
Here's how to extract POS info:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
# Print token, POS, and detailed tag
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")
# Expected output:
# The: DET (DT)
# quick: ADJ (JJ)
# brown: ADJ (JJ)
# fox: NOUN (NN)
# jumps: VERB (VBZ)
# over: ADP (IN)
# the: DET (DT)
# lazy: ADJ (JJ)
# dog: NOUN (NN)
# .: PUNCT (.)
Use this for filtering nouns in search engines or analyzing sentence complexity.
Key point: POS tags follow Universal Dependencies standards, making them consistent across languages.
For more on POS, see the SpaCy POS documentation.
Spotting Named Entities in Text
Named Entity Recognition (NER) identifies entities like people, organizations, and dates. SpaCy excels here with pre-trained models.
Try this on news-like text:
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. was founded by Steve Jobs in California on April 1, 1976."
doc = nlp(text)
# Print entities with labels
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_} ({ent.start_char}-{ent.end_char})")
# Expected output:
# Apple Inc.: ORG (0-10)
# Steve Jobs: PERSON (26-36)
# California: GPE (40-50)
# April 1, 1976: DATE (54-67)
This is useful for extracting key facts from documents or building knowledge graphs.
Key point: Train custom NER models if default ones miss domain-specific entities, like medical terms.
Parsing Dependencies for Structure
Dependency parsing shows relationships between words, like subject-verb links. It creates a tree structure for sentences.
Visualize it with this code:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse around the house.")
# Print dependencies
for token in doc:
    print(f"{token.text} --> {token.dep_} --> {token.head.text}")
# Expected output:
# The --> det --> cat
# cat --> nsubj --> chased
# chased --> ROOT --> chased
# the --> det --> mouse
# mouse --> dobj --> chased
# around --> prep --> chased
# the --> det --> house
# house --> pobj --> around
# . --> punct --> chased
# To visualize, call displacy.render(doc, style="dep") in Jupyter,
# or displacy.serve(doc, style="dep") from a script, which starts a
# local web server (http://localhost:5000 by default) with the tree diagram
Use parsing for question answering or sentiment analysis on specific phrases.
Key point: Dependencies help detect complex structures, like passive voice.
Explore SpaCy's dependency parser.
Matching Patterns with Rules
SpaCy's Matcher lets you find patterns using rules, great for custom extractions without ML.
Example for phone numbers:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Define pattern for US phone numbers
pattern = [{"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "dddd"}]
matcher.add("PHONE_NUMBER", [pattern])
text = "Call me at 123-456-7890 or 987-654-3210."
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)
# Expected output:
# 123-456-7890
# 987-654-3210
Unlike regex, the Matcher operates on tokens and their linguistic attributes, so patterns stay readable and can reference things like shape or POS.
Key point: Combine with POS for advanced rules, like adjective-noun pairs.
Classifying Text Categories
SpaCy supports text classification via TextCategorizer. Add it to your pipeline for sentiment or topic labeling.
Build a simple classifier:
import spacy
from spacy.training import Example
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
# Add labels
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
# Train with examples (simplified; in practice, use more data)
train_examples = [
    Example.from_dict(nlp.make_doc("I love this product!"), {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    Example.from_dict(nlp.make_doc("This is terrible."), {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
# Initialize and train (basic loop)
optimizer = nlp.initialize()
for i in range(10):  # More iterations in real training
    for example in train_examples:
        nlp.update([example], sgd=optimizer)
# Test
doc = nlp("This movie was awesome.")
print(doc.cats)
# Expected output (after training):
# {'POSITIVE': 0.8, 'NEGATIVE': 0.2} # Approximate values
Fine-tune for spam detection or review analysis.
Key point: Integrate with scikit-learn for hybrid approaches.
Customizing Pipelines for Specific Needs
Tailor SpaCy's pipeline by adding or removing components. This keeps it lean for production.
Example: Create a custom pipeline for NER only.
import spacy
# Load base model
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
# Process text with tagging and parsing disabled; NER still runs
doc = nlp("Microsoft acquired GitHub in 2018.")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Expected output:
# Microsoft: ORG
# GitHub: ORG
# 2018: DATE
Add your own components, like a custom tokenizer for code snippets.
Key point: Use config files to version pipelines.
For advanced customization, refer to SpaCy's pipeline docs.
SpaCy offers a solid foundation for NLP projects where control and speed matter. Experiment with these features in your apps, and combine them for powerful workflows like entity linking or relation extraction. If you're building real-world tools, start small with the basics and scale up with custom training. This library keeps evolving, so check updates for new capabilities.

Shrijith Venkatramana | Sciencx (2025-09-16T11:28:31+00:00) Beyond LLMs: Awesome NLP Things One can Do With SpaCy. Retrieved from https://www.scien.cx/2025/09/16/beyond-llms-awesome-nlp-things-one-can-do-with-spacy/