11-07-2024, 06:12 PM
Practical Text Analytics using spaCy v3.0
Published 10/2024
Duration: 2h | .MP4 1280x720, 30 fps | AAC, 44100 Hz, 2ch | 1.15 GB
Genre: eLearning | Language: English
How to extract information WITHOUT building custom Machine Learning models
What you'll learn
Understand the spaCy document object
How spaCy pipelines work
How to use Rule-based Matching for Information Extraction
A system for practical, iterative Text Analytics using the itables library
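As a rough illustration of the first three points (not taken from the course material), here is a minimal sketch of a spaCy pipeline with one rule-based component and the Doc object it produces. The patterns and the example sentence are assumptions chosen for illustration, and a blank English pipeline is used so that no pretrained model has to be downloaded.

```python
import spacy

# A blank pipeline contains only the tokenizer: no trained components.
nlp = spacy.blank("en")

# Add a rule-based component; "entity_ruler" is a built-in spaCy v3 factory.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Explosion"},                # exact phrase
    {"label": "PRODUCT", "pattern": [{"LOWER": "spacy"}]},   # token pattern
])

# Calling nlp() runs the text through every component and returns a Doc.
doc = nlp("Explosion created spaCy.")

tokens = [token.text for token in doc]
ents = [(ent.text, ent.label_) for ent in doc.ents]

print(nlp.pipe_names)
print(tokens)
print(ents)
```

The Doc holds the tokens and any entity spans the pipeline components attached to it, so everything downstream reads from that one object.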
Requirements
Intermediate Knowledge of Python programming
Basic knowledge of the pandas dataframe library
Description
What is text analytics?
I like this definition: "Text analytics is the process of transforming unstructured text documents into usable, structured data. Text analysis works by breaking apart sentences and phrases into their components, and then evaluating each part's role and meaning using complex software rules and machine learning algorithms." [Source: Lexalytics website]
In spaCy, you can use machine learning algorithms in two ways:
1) pretrained models provided by spaCy and other organizations - for example, en_core_web_md, which I use in this course, is a pretrained model provided by Explosion, the company that created spaCy
2) custom machine learning models that you train on your own data - which are often referred to in the documentation as "statistical models"
Why not statistical models?
This is what the makers of spaCy say in their documentation:
"For complex tasks, it's usually better to train a statistical entity recognition model. However, statistical models require training data, so for many situations, rule-based approaches are more practical. This is especially true at the start of a project: you can use a rule-based approach as part of a data collection process, to help you "bootstrap" a statistical model.
Training a model is useful if you have some examples and you want your system to be able to generalize based on those examples. It works especially well if there are clues in the local context. For instance, if you're trying to detect person or company names, your application may benefit from a statistical named entity recognition model.
Rule-based systems are a good choice if there's a more or less finite number of examples that you want to find in the data, or if there's a very clear, structured pattern you can express with token rules or regular expressions. For instance, country names, IP addresses or URLs are things you might be able to handle well with a purely rule-based approach."
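To make the quoted examples concrete, here is a minimal sketch (my own, not from the course or the spaCy documentation) of how country names, IP addresses and URLs might be handled with spaCy's Matcher. The two-entry country list, the simple IPv4 regular expression and the example sentence are all assumptions for illustration; a real project would need a fuller list and stricter validation.

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer only: rule-based matching does not need a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Clear, structured patterns: a lexical flag and a regular expression.
matcher.add("URL", [[{"LIKE_URL": True}]])
matcher.add(
    "IP_ADDRESS",
    [[{"TEXT": {"REGEX": r"^(?:\d{1,3}\.){3}\d{1,3}$"}}]],
)

# A more or less finite set of examples: a (hypothetical, tiny) country list.
matcher.add(
    "COUNTRY",
    [
        [{"LOWER": "india"}],
        [{"LOWER": "new"}, {"LOWER": "zealand"}],
    ],
)

doc = nlp("The server at 192.168.0.1 in New Zealand mirrors https://example.com daily.")

# Each match is (match_id, start_token, end_token); look up the rule name.
found = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
print(found)
```

Note that the regular expression only checks the shape of the token; it would also accept out-of-range values such as 999.999.999.999.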
Just to clarify, I am not against developing statistical models, but as the documentation states quite clearly, it is often more practical to start with rule-based systems. One of my main aims in this course is to provide a solid understanding of what you can and cannot do using just a rule-based system. In fact, I use only one dataset in this entire course, so it is a lot easier for students to make this distinction.
When you combine a rule-based system with the data visualization technique I describe in this course, you will also gain a very good understanding of your dataset. You can then use this understanding to improve your statistical model if you choose to build one.
In my view, most people barely scratch the surface when it comes to using spaCy rules for text analytics. I hope this course will give them a lot of new insight into how to approach this task.
Who this course is for:
Data Science practitioners who want to use spaCy and Natural Language Processing
Anyone who has a spreadsheet where one of the columns holds a paragraph of text, and who wants to extract useful information from that text so it can be combined with the filters available on the OTHER columns (sort, less than, greater than, etc.) in spreadsheet tools like Excel and Airtable
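The spreadsheet scenario above could look roughly like the following sketch, which assumes a pandas DataFrame in place of the spreadsheet. The column names, the sample rows and the term list are hypothetical and are not the dataset used in the course.

```python
import pandas as pd
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Match a fixed list of terms, ignoring case (attr="LOWER").
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["battery", "screen", "delivery"]  # hypothetical terms of interest
matcher.add("TOPIC", [nlp.make_doc(term) for term in terms])

# Hypothetical data: one numeric column and one free-text column.
df = pd.DataFrame({
    "rating": [2, 5, 4],
    "comment": [
        "Battery died after two days, and the screen flickers.",
        "Fast delivery, works fine.",
        "Nothing to report.",
    ],
})

def extract_topics(text):
    """Return the sorted, lowercased terms found in one cell of text."""
    doc = nlp.make_doc(text)
    return sorted({doc[start:end].text.lower() for _, start, end in matcher(doc)})

# Turn the unstructured column into structured columns.
df["topics"] = df["comment"].apply(extract_topics)
df["n_topics"] = df["topics"].apply(len)

# The extracted columns can now be filtered together with the other columns.
low_rated_with_topics = df[(df["rating"] < 3) & (df["n_topics"] > 0)]
print(low_rated_with_topics[["rating", "topics"]])
```

Once the extracted values sit in their own columns, they behave like any other spreadsheet column for sorting and filtering.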