spaCy


Installation

Install spaCy and download its models via pip (a GPU build is also available):


pip install spacy
python -m spacy download en_core_web_{sm,md,lg} # small, medium or large
python -m spacy validate
    

Pipeline components


import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Larry Page founded Google')    
for token in doc:
    print(token.text) # Larry\nPage\n...
    

Other attributes (demonstrated in the sketch below):

  1. Index: token.i
  2. Alphabetic or not: token.is_alpha
  3. Punctuation or not: token.is_punct
  4. Number-like or not: token.like_num
  5. Part-of-speech tag: token.pos_ (tag name), token.pos (integer ID)
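
A minimal sketch, assuming the nlp pipeline loaded above, that prints these attributes for each token (the sentence is just an illustration):

doc = nlp("Larry Page founded Google in 1998")
for token in doc:
    # index, text, lexical flags and coarse part-of-speech tag
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num, token.pos_)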

Named Entity Recognition (NER)

token.ent_type_


from spacy import displacy
displacy.render(doc, style="ent")
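
Named entities are also available as spans on doc.ents. A minimal sketch, assuming the nlp pipeline loaded above:

doc = nlp("Larry Page founded Google")
for ent in doc.ents:
    # entity text and label, e.g. PERSON or ORG
    print(ent.text, ent.label_)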
    

Dependency Parsing

token.dep_


doc = nlp("This is a sentence")
displacy.render(doc, style="dep")
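
The dependency label and syntactic head of each token can also be read off directly; a minimal sketch using the doc above:

for token in doc:
    # token text, its dependency label, and the head it attaches to
    print(token.text, token.dep_, token.head.text)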
    

Rule-based matching


from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "new"}, {"LOWER": "york"}]
matcher.add("CITIES", [pattern])  # spaCy v3 API: patterns are passed as a list
doc = nlp("I moved to New York last year")  # illustrative text containing the pattern
matches = matcher(doc)
for match_id, start, end in matches:
    # Get the matched span by slicing the Doc
    span = doc[start:end]
    print(span.text)  # New York
    

Phrase Matching


from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)
    

Dependency Matcher


from spacy.matcher import DependencyMatcher

# "[subject] ... initially founded"
pattern = [
  # anchor token: founded
  {
    "RIGHT_ID": "founded",
    "RIGHT_ATTRS": {"ORTH": "founded"}
  },
  # founded -> subject
  {
    "LEFT_ID": "founded",
    "REL_OP": ">",
    "RIGHT_ID": "subject",
    "RIGHT_ATTRS": {"DEP": "nsubj"}
  },
  # "founded" follows "initially"
  {
    "LEFT_ID": "founded",
    "REL_OP": ";",
    "RIGHT_ID": "initially",
    "RIGHT_ATTRS": {"ORTH": "initially"}
  }
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("FOUNDED", [pattern])
matches = matcher(doc)
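
Each match is a (match_id, token_ids) pair, with token_ids in the same order as the pattern dicts; a minimal sketch on an illustrative sentence that fits the pattern:

doc = nlp("Smith initially founded the company")
for match_id, token_ids in matcher(doc):
    # token_ids correspond to founded, subject, initially in the pattern
    print([doc[i].text for i in token_ids])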
    

Word vectors and similarity

Word vectors are only included in the medium and large models (en_core_web_md / en_core_web_lg); similarity between two docs is computed with doc1.similarity(doc2).
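
A minimal sketch, assuming en_core_web_md is installed:

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like cats")
doc2 = nlp("I love dogs")
# similarity is a float (roughly 0 to 1) based on averaged word vectors
print(doc1.similarity(doc2))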

Help function

spacy.explain('NNP')
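
spacy.explain also works for dependency labels and entity types; a couple of illustrative calls:

print(spacy.explain("NNP"))    # noun, proper singular
print(spacy.explain("nsubj"))  # nominal subject
print(spacy.explain("ORG"))    # Companies, agencies, institutions, etc.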

References


  1. YouTube -- Advanced NLP with spaCy - A free online course
  2. spaCy - Industrial-strength Natural Language Processing in Python
  3. How to Train spaCy to Autodetect New Entities (NER) [Complete Guide]
  4. Install spaCy
  5. spaCy Usage Documentation- Rule-based matching