Ultimate Guide to Mastering Named Entity Recognition in NLP

Learn Named Entity Recognition In NLP Using spaCy

Overview

Named entity recognition (NER) is an NLP technique for detecting and categorizing essential information in unstructured text. These pieces of information are called named entities and include person names, organization names, locations, medical codes, times, quantities, and monetary values. NER plays a vital role in many industries by automating information retrieval from unstructured text data. For example, voice assistants such as Apple’s Siri and Amazon’s Alexa use NER to help interpret and respond to user commands. In this article, we will explore named entity recognition in NLP through a practical application in Python.

Essential Libraries For Named Entity Recognition In NLP

We can use various open-source libraries for named entity recognition in NLP. Let us briefly discuss them one by one:

  • Stanford NER: Stanford NER is a Java-based library for named entity recognition, developed by the NLP research group at Stanford University. It offers pre-trained NER models for several languages.
  • AllenNLP: AllenNLP is a popular Python-based library for NLP tasks, including named entity recognition. However, AllenNLP may require more manual configuration and setup than spaCy.
  • NLTK: NLTK is a Python-based library that provides tools for various NLP tasks, including named entity recognition. Its built-in entity chunker (ne_chunk) uses a pre-trained classifier with a predefined set of entity types, and NLTK also supports rule-based chunking with regular expressions.
  • spaCy: spaCy is a popular Python-based library for NLP, including named entity recognition. Its NER component uses a transition-based neural network to identify and classify entities. We can use the pre-trained models provided by spaCy or train an NER model on a custom dataset.

The libraries discussed above take different approaches to named entity recognition. Among them, spaCy is a widely used choice, as it is easy to use and designed for production workloads.
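Some NER tooling relies on hand-written rules rather than trained models. As a rough illustration of that rule-based style, the sketch below uses only Python's `re` module. The two patterns and the `regex_ner` helper are hypothetical and deliberately simplistic; they are not taken from any of the libraries above.

```python
import re

# Hypothetical, deliberately simple patterns for two entity types.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?(?:\s(?:million|billion))?"),
    "DATE": re.compile(r"\b(?:1[89]|20)\d{2}\b"),  # four-digit years 1800-2099
}


def regex_ner(text):
    """Return (matched_text, label, start, end) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.group(), label, match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


sample = (
    "Apple is looking at buying U.K. startup for $1 billion. "
    "The United Nations was founded in 1945."
)
for entity_text, label, start, end in regex_ner(sample):
    print(entity_text, label, start, end)
```

Rules like these are easy to read and debug, but they miss anything the patterns do not anticipate (here, "Apple" and "U.K." go undetected), which is one reason learned models are commonly preferred for open-ended text.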

Why use Named Entity Recognition (NER)?

Named entity recognition has diverse applications across sectors such as finance, customer support, resume evaluation, space exploration, education, public health, cybersecurity, and environmental science. For example, HR departments may use NER to evaluate resumes by extracting important information about candidates, such as skills and experience. In cybersecurity, NER may be used to identify and classify security-related entities, thus improving threat detection. In short, named entity recognition is valuable wherever essential information must be extracted from unstructured data.

Named Entity Recognition using the Pre-trained Model of spaCy

spaCy is an open-source library with several pre-trained models for natural language processing, and we can use these models for named entity recognition. The table below lists the entity types recognized by spaCy's pre-trained English models.

Entity Type	Description
PERSON	People, including fictional.
NORP	Nationalities or religious or political groups.
FAC	Buildings, airports, highways, bridges, etc.
ORG	Companies, agencies, institutions, etc.
GPE	Countries, cities, states.
LOC	Non-GPE locations, mountain ranges, bodies of water.
PRODUCT	Objects, vehicles, foods, etc.
EVENT	Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART	Titles of books, songs, etc.
LAW	Named documents made into laws.
LANGUAGE	Any named language.
DATE	Absolute or relative dates or periods.
TIME	Times smaller than a day.
PERCENT	Percentage, including "%".
MONEY	Monetary values, including unit.
QUANTITY	Measurements, as of weight or distance.
ORDINAL	"first", "second", etc.
CARDINAL	Numerals that do not fall under another type.
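The descriptions in this table can also be looked up programmatically. The short sketch below uses `spacy.explain`, which returns a description string for a known label; the particular subset of labels chosen here is arbitrary.

```python
import spacy

# An arbitrary subset of the entity labels from the table above.
labels = ["PERSON", "NORP", "GPE", "MONEY", "CARDINAL"]

# spacy.explain looks the label up in spaCy's glossary; no model is required.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:10} {description}")
```

With a pre-trained pipeline loaded as `nlp`, the labels that model can actually predict should be available from `nlp.get_pipe("ner").labels`.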

spaCy provides pre-trained English models in several sizes:

  • en_core_web_sm: A small model that includes vocabulary, syntax, and entities. It is suitable when memory is constrained and is a reasonable starting point for NER tasks.
  • en_core_web_md: A medium-sized model that includes vocabulary, syntax, entities, and word vectors. It is suitable for a wide range of NLP tasks and is more accurate than en_core_web_sm.
  • en_core_web_lg: A large model that includes vocabulary, syntax, entities, and word vectors. It is suitable for demanding NLP tasks where accuracy is the priority.
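Which of these models is available depends on what has been downloaded. As a minimal sketch, the hypothetical helper `pick_model` below returns the first installed model from a preference list, using `spacy.util.is_package` to check for installation. The preference order (largest first) is an assumption, not a recommendation from spaCy.

```python
import spacy

# Assumed preference order: largest (most accurate) model first.
CANDIDATES = ("en_core_web_lg", "en_core_web_md", "en_core_web_sm")


def pick_model(candidates=CANDIDATES, is_installed=spacy.util.is_package):
    """Return the first candidate that is installed, or None if none is."""
    for name in candidates:
        if is_installed(name):
            return name
    return None


model_name = pick_model()
if model_name is None:
    print("No English model found; try: python -m spacy download en_core_web_sm")
else:
    print("Would load:", model_name)  # e.g. nlp = spacy.load(model_name)
```

Passing the installation check in as a parameter keeps the helper easy to test without any model on disk.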

NLP Pipeline in spaCy

The spaCy pipeline is a sequence of components (pipes) that process and transform text, and it forms the basis for named entity recognition in spaCy. A typical pre-trained pipeline includes several built-in components, such as a tokenizer, tagger, lemmatizer, parser, and entity recognizer. We can speed up NER by turning off components we do not need, such as the tagger and parser.
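The idea of an ordered list of components that can be switched off is easier to see in code. The sketch below uses a blank English pipeline with two rule-based components, an assumption made so that it runs without a downloaded model. With a pre-trained model, the analogous step would be `spacy.load("en_core_web_md", disable=["tagger", "parser"])`.

```python
import spacy

# A blank pipeline has only a tokenizer; components are added explicitly.
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")   # rule-based sentence splitter
nlp_demo.add_pipe("entity_ruler")  # rule-based entity matcher (no patterns yet)

# Components run in the order listed here.
print("Active:", nlp_demo.pipe_names)

# select_pipes temporarily disables components inside the `with` block.
with nlp_demo.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp_demo.pipe_names)
    print("Active inside block:", active_inside)

# Outside the block, the disabled component is restored.
active_after = list(nlp_demo.pipe_names)
print("Active after block:", active_after)
```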


How to implement the NER model in Python

The section below walks through implementing NER in Python using spaCy.

Install spaCy

Let us install spaCy using pip.

				
    !pip install spacy

Let us download the medium-sized English model for spaCy using the following command.

				
    !python -m spacy download en_core_web_md

Import spaCy And Load Language Model

				
    import spacy

    # Load the medium-sized English pipeline downloaded above
    nlp = spacy.load("en_core_web_md")
				
			

Process Text Using Language Model

Here, we process a sample text with the language model (en_core_web_md). The model first tokenizes the text and then annotates it with part-of-speech tags, dependencies, and named entities.

				
					
    text = "Apple is looking at buying U.K. startup for $1 billion. The Indian Space Research Organisation, headquartered in Bengaluru, is the national space agency of India. It operates under the Department of Space, which is directly overseen by the Prime Minister of India. The company XYZ is known for its innovative technology in the field of artificial intelligence. John Doe, an expert in machine learning, joined ABC Corporation last month. The United Nations is an international organization founded in 1945. It is currently made up of 193 Member States."
    doc = nlp(text)
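Tokenization, the first step of processing, can be inspected on its own. The sketch below uses a blank English pipeline (tokenizer only, an assumption made to avoid requiring the downloaded model) on the first sentence of the sample text; the names `nlp_blank` and `doc_blank` are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no trained components
sample = "Apple is looking at buying U.K. startup for $1 billion."
doc_blank = nlp_blank(sample)

# Each token is an object; .text gives its string form.
tokens = [token.text for token in doc_blank]
print(tokens)

# Tokenization is non-destructive: joining each token with its trailing
# whitespace reproduces the original string exactly.
rebuilt = "".join(token.text_with_ws for token in doc_blank)
print(rebuilt == sample)
```

Later components, including the entity recognizer, operate on these tokens, so entity boundaries always fall on token boundaries.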
				
			

Extract Named Entities

The code below iterates through the named entities identified in the previous steps and prints each entity along with its entity label.

				
    for ent in doc.ents:
        print(ent.text, ent.label_)

Output:

    Apple ORG
    U.K. GPE
    $1 billion MONEY
    The Indian Space Research Organisation ORG
    Bengaluru GPE
    India GPE
    the Department of Space ORG
    India GPE
    XYZ ORG
    John Doe PERSON
    ABC Corporation ORG
    last month DATE
    The United Nations ORG
    1945 DATE
    193 CARDINAL
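Once the entities are extracted, a quick summary by label is often useful. The sketch below works on the (text, label) pairs reported in the output above, copied by hand so that it runs without a model; in the article's flow, the same list could be built from `doc.ents`.

```python
from collections import Counter

# (text, label) pairs copied from the output above.
entities = [
    ("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY"),
    ("The Indian Space Research Organisation", "ORG"), ("Bengaluru", "GPE"),
    ("India", "GPE"), ("the Department of Space", "ORG"), ("India", "GPE"),
    ("XYZ", "ORG"), ("John Doe", "PERSON"), ("ABC Corporation", "ORG"),
    ("last month", "DATE"), ("The United Nations", "ORG"), ("1945", "DATE"),
    ("193", "CARDINAL"),
]

# Count how many entities were found for each label.
label_counts = Counter(label for _, label in entities)
for label, count in label_counts.most_common():
    print(label, count)

# Collect the distinct entity strings per label (duplicates such as the
# second "India" collapse into one entry).
unique_by_label = {}
for entity_text, label in entities:
    unique_by_label.setdefault(label, set()).add(entity_text)
```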
				
			

Visualize Named Entities Using displaCy

The named entities can also be visualized with displaCy, spaCy's built-in visualization tool. displaCy highlights each entity and its label inline in the text, which makes the result easier to inspect.

				
    from spacy import displacy

    # Highlight the entities of the already-processed doc in a Jupyter notebook
    displacy.render(doc, style="ent", jupyter=True)
				
			
Visualize Named Entity Recognition In NLP Using DisplaCy
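Outside a notebook, displaCy can return the markup as a string instead of displaying it. The sketch below relies on displaCy's manual mode, which renders a plain dictionary of text and entity offsets, so no model is needed. The `span` helper and the output file name `entities.html` are hypothetical, and the three entities are copied from the output shown earlier.

```python
from spacy import displacy

sample_text = "Apple is looking at buying U.K. startup for $1 billion."


def span(text, phrase, label):
    """Locate `phrase` in `text` and return a displaCy-style entity dict."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Manual-mode input: the text plus entity offsets in order of appearance.
example = {
    "text": sample_text,
    "ents": [
        span(sample_text, "Apple", "ORG"),
        span(sample_text, "U.K.", "GPE"),
        span(sample_text, "$1 billion", "MONEY"),
    ],
    "title": None,
}

# jupyter=False returns the HTML as a string; page=True wraps it in a full page.
html = displacy.render(example, style="ent", manual=True, jupyter=False, page=True)

with open("entities.html", "w", encoding="utf-8") as handle:
    handle.write(html)
```

Opening the saved file in a browser should show the same highlighted view that the notebook renders inline.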

Store The Named Entities In Tabular Format

We can also store the entities in tabular format as a pandas DataFrame using the code below.

				
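    import pandas as pd

    # Collect (text, label) pairs and load them into a DataFrame
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    df = pd.DataFrame(entities, columns=["Entity", "Label"])
    df  # in a notebook, the last expression displays the table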
				
			
				
					Output:
	Entity	Label
0	Apple	ORG
1	U.K.	GPE
2	$1 billion	MONEY
3	The Indian Space Research Organisation	ORG
4	Bengaluru	GPE
5	India	GPE
6	the Department of Space	ORG
7	India	GPE
8	XYZ	ORG
9	John Doe	PERSON
10	ABC Corporation	ORG
11	last month	DATE
12	The United Nations	ORG
13	1945	DATE
14	193	CARDINAL
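With the entities in a DataFrame, filtering and export take only a line or two each. The sketch below uses a handful of rows copied from the table above (named `df_sample` to keep it separate from the article's `df`) and writes the CSV to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# A few rows copied from the table above; in practice, reuse the existing df.
rows = [
    ("Apple", "ORG"),
    ("U.K.", "GPE"),
    ("$1 billion", "MONEY"),
    ("John Doe", "PERSON"),
    ("ABC Corporation", "ORG"),
]
df_sample = pd.DataFrame(rows, columns=["Entity", "Label"])

# Keep only the organizations.
orgs = df_sample[df_sample["Label"] == "ORG"]
print(orgs)

# Export to CSV; replace the buffer with a file path to write to disk.
buffer = io.StringIO()
df_sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```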
				
			

Visualize The Dependency Parse

We can also visualize the dependency parse of the text by calling displacy.render() with style="dep".

				
    # Reuse the doc created earlier; style="dep" draws the dependency arcs
    displacy.render(doc, style="dep", jupyter=True)
				
			
Visualize Dependency Parse Using DisplaCy

Limitations of the Pre-trained NER Model

  • The pre-trained models may perform poorly on domain-specific text (such as financial or medical data), as they are trained on general-purpose text.
  • Out of the box, the pre-trained models recognize only the fixed set of entity types listed earlier. Recognizing new entity types, or improving recognition of existing ones, requires further training or additional rule-based components.
  • The accuracy of a pre-trained model may be insufficient for some real-world use cases, leading to incorrect entity recognition.
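One lightweight way to cover domain-specific entities is to use hand-written rules alongside, or instead of, a statistical model. The sketch below adds spaCy's `entity_ruler` component to a blank English pipeline. The labels `DRUG` and `DOSAGE`, the patterns, and the example sentence are all hypothetical and chosen only to illustrate the medical case mentioned above.

```python
import spacy

# Blank pipeline plus a rule-based entity matcher; no trained model involved.
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Hypothetical domain patterns: token-level rules keyed by custom labels.
ruler.add_patterns([
    {"label": "DRUG", "pattern": [{"LOWER": "ibuprofen"}]},
    {"label": "DOSAGE", "pattern": [{"LIKE_NUM": True}, {"LOWER": "mg"}]},
])

doc_rules = nlp_rules("The patient was prescribed 200 mg of ibuprofen.")
found = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(found)
```

Rules like these only catch what their patterns describe, so they complement rather than replace training a custom model on annotated domain data.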

 

Conclusion

In this article, we learned how to apply a pre-trained spaCy model to identify and classify named entities in sample text. The named entity recognition component of spaCy successfully identified entities such as people, organizations, locations, and dates in the text.

However, the pre-trained spaCy models have several limitations, such as limited accuracy for specific use cases, limited domain-specific knowledge, and a fixed set of entity types. To address these, we can develop custom models for named entity recognition that may offer better accuracy and domain-specific relevance.
