Here is a small post on what comes after the natural language processing pipeline (e.g. after using spaCy or NLTK): linking words to their actual meaning using a sense database (such as WordNet), or extracting basic knowledge from the sentence structure.
While this may seem pointless in 2024, more than a year after ChatGPT came out, I still think it's an interesting topic worth exploring.
The point of natural language processing is to either extract information from text or generate text from information.
Text, for a computer, is simply a chain of characters. In order to extract meaning from this text, we have to process it. There are two main options in Python: NLTK and spaCy.
The pipeline for NLTK will look something like this:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# resources: nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet'), nltk.download('averaged_perceptron_tagger')

text = "Hey! I'm going home."

# 1. tokenize the sentences ==> ['Hey!', "I'm going home."]
sentences = nltk.sent_tokenize(text)

# 2. tokenize the words ==> [['Hey', '!'], ['I', "'m", 'going', 'home', '.']]
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]

# 3. remove stopwords (e.g. 'the') and punctuation ==> [['Hey'], ['I', 'going', 'home']]
stop_words = set(stopwords.words('english'))  # don't shadow the imported module
for i, sentence in enumerate(sentences):
    sentences[i] = [word for word in sentence if word not in stop_words and word.isalnum()]

# 4. lemmatize the words ==> [['Hey'], ['I', 'going', 'home']]
lemmatizer = WordNetLemmatizer()
for i, sentence in enumerate(sentences):
    sentences[i] = [lemmatizer.lemmatize(word) for word in sentence]

# 5. part-of-speech tagging
for i, sentence in enumerate(sentences):
    sentences[i] = nltk.pos_tag(sentence)

# 6. parse the text - forget about it, you'd have to set up a context-free grammar
With spaCy, it's as short as this, while being more complete and faster (the pipeline comes pre-built):
import spacy
nlp = spacy.load('en_core_web_sm')
text = "Hey! I'm going home."
doc = nlp(text)
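Everything the NLTK pipeline computed step by step (tokens, stopword flags, lemmas, POS tags) is now available as attributes on each token of doc:

# inspect the pre-computed annotations on the doc from above
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop, token.is_punct)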
While early LLMs were fun to play with (try GPT-2), they were not very effective. However, the development of LLMs in recent years has opened many doors.
They are rather good at dealing with language and can be used to overcome many classic NLP problems.
This will be a tremendous help when processing text in the future. However, one must remember that they are, at their core, only predicting a sequence of tokens. They therefore have strong limitations to overcome (in reasoning, for example) and should be seen as one more tool to integrate into the NLP pipeline.
WordNet is a dictionary which stores the senses of words and the relationships between them.
In a regular dictionary, entries are based on orthography. When you look at a particular entry, you'll see the different possible meanings below it. See the example below (extract from Oxford Languages).
order
noun
1. the arrangement or disposition of people or things in relation to each other according to a particular sequence, pattern, or method.
2. an authoritative command or instruction.
verb
1. give an authoritative instruction to do something
2. request (something) to be made, supplied, or served
In a regular dictionary, you'll therefore find as many entries as there are words, and some entries will share the same meaning. This is not the case in WordNet, where 1 entry = 1 sense. When you look at a particular entry, you'll see its definition and the words corresponding to it, called lemmas.
orderliness.n.02
- definition = a condition of regular or proper arrangement
- lemmas = orderliness, order
command.n.01
- definition = an authoritative direction or instruction to do something
- lemmas = command, bid, bidding, dictation
order.v.01
- definition = give instructions to or direct somebody to do something with authority
- lemmas = order, tell, enjoin, say
order.v.02
- definition = make a request for something
- lemmas = order
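These entries can be queried directly with NLTK's WordNet interface (requires nltk.download('wordnet')); a quick sketch:

from nltk.corpus import wordnet

# all the senses that have 'order' among their lemmas
for synset in wordnet.synsets('order'):
    print(synset.name(), '=', synset.definition())

# a single sense and its lemmas
print(wordnet.synset('command.n.01').lemma_names())
# ['command', 'bid', 'bidding', 'dictation']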
The most interesting thing with WordNet is that the relations between senses have also been stored, especially the parent and child senses (hypernyms and hyponyms).
You can explore the tree of senses in WordNet here.
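In NLTK, these relations are available as hypernyms (parents) and hyponyms (children):

from nltk.corpus import wordnet

command = wordnet.synset('command.n.01')
print(command.hypernyms())  # parent senses (more general)
print(command.hyponyms())   # child senses (more specific)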
As you may have noticed, the ID of a sense has 3 components separated by dots; for command.n.01, these are the head lemma (command), the part of speech (n for noun), and the sense number (01).
There are 5 types of words in the WordNet database: nouns (n), verbs (v), adjectives (a), satellite adjectives (s), and adverbs (r).
The two different types of adjectives can be explained as such: head adjectives (a) are organized in pairs of direct antonyms (wet/dry for example), while satellite adjectives (s) are attached to a head adjective through a similarity relation (damp is a satellite of wet, for example).
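Both adjective types can be seen with NLTK, where pos() returns 'a' for a head adjective and 's' for a satellite:

from nltk.corpus import wordnet

for synset in wordnet.synsets('wet', pos=wordnet.ADJ):
    print(synset.name(), synset.pos())

# the satellites attached to the head adjective wet.a.01
print(wordnet.synset('wet.a.01').similar_tos())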
Another interesting thing is that roots appear in the graph. For example, all common nouns have the indirect parent entity.n.01 (proper nouns do not have parent/child senses). This is, however, not the case for the other POS (parts of speech): there are, for example, 559 root verbs out of a total of 13767 verbs.
entity.n.01 = perceived/known/inferred to have its own distinct existence (living or nonliving)
+- abstraction.n.06 = a general concept formed by extracting common features from specific examples
+- physical_entity.n.01 = an entity that has physical existence
+- thing.n.08 = an entity that is not named specifically
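Since a root is simply a sense without a parent, these numbers are easy to verify (the exact counts depend on the WordNet version bundled with NLTK):

from nltk.corpus import wordnet

verbs = list(wordnet.all_synsets(pos=wordnet.VERB))
roots = [synset for synset in verbs if not synset.hypernyms()]
print(len(roots), '/', len(verbs))  # e.g. 559 / 13767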
Unfortunately, figuring out the sense of a word from a sentence is quite difficult, but there are multiple methods to do this.
First, the Lesk algorithm, a basic statistical technique that counts the words the sentence shares with each sense's definition and annotated examples:
word = "bank"
sentence = "I deposited my money in the bank"
sense = nltk.wsd.lesk(sentence, word)
print(sense.definition())
# a container (usually with a slot in the top) for keeping money at home -- this is wrong
Second, the Babelfy algorithm, which relies on "semantic signatures" (essentially a more advanced Lesk):
Babelfy wrongly assigns "Mars" the sense of the month instead of the planet
Third, a machine learning approach, here using a zero-shot/few-shot classification tool:
# pip install classy-classification
import spacy
data = {
    "bank.n.01": [
        "sloping land (especially the slope beside a body of water)",
        "they pulled the canoe up on the bank",
        "he sat on the bank of the river and watched the currents",
    ],
    "depository_financial_institution.n.01": [
        "a financial institution that accepts deposits and channels the money into lending activities",
        "he cashed a check at the bank",
        "that bank holds the mortgage on my home",
    ],
    "bank.n.03": [
        "a long ridge or pile",
        "a huge bank of earth",
    ],
    "bank.n.04": [
        "an arrangement of similar objects in a row or in tiers",
        "he operated a bank of switches",
    ],
    "bank.n.05": [
        "a supply or stock held in reserve for future use (especially in emergencies)",
    ],
    "bank.n.06": [
        "the funds held by a gambling house or the dealer in some gambling games",
        "he tried to break the bank at Monte Carlo",
    ],
    "bank.n.07": [
        "a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force",
    ],
    "savings_bank.n.02": [
        "a container (usually with a slot in the top) for keeping money at home",
        "the coin bank was empty",
    ],
    "bank.n.09": [
        "a building in which the business of banking transacted",
        "the bank is on the corner of Nassau and Witherspoon",
    ],
    "bank.n.10": [
        "a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)",
        "the plane went into a steep bank",
    ],
}
# see github repo for examples on sentence-transformers and Huggingface
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe("classy_classification", config={ "data": data, "model": "spacy" })
print(sorted(nlp("I deposited my money in the bank")._.cats.items(), key=lambda x: x[1], reverse=True))
# 1. ('bank.n.01', 0.23873914196413948) -- incorrect
# 2. ('depository_financial_institution.n.01', 0.16675489906573987) -- correct
# 3. ('bank.n.07', 0.08833535400531817) -- incorrect
WordNet has also been extended to many other languages through the Open Multilingual Wordnet, so a sense can be translated directly (requires nltk.download('omw-1.4')):

from nltk.corpus import wordnet

print('French', wordnet.synset('test.n.05').lemma_names('fra'))
print('Italian', wordnet.synset('test.n.05').lemma_names('ita'))
print('Spanish', wordnet.synset('test.n.05').lemma_names('spa'))
print('Japanese', wordnet.synset('test.n.05').lemma_names('jpn'))
print('Arabic', wordnet.synset('test.n.05').lemma_names('arb'))
print('Greek', wordnet.synset('test.n.05').lemma_names('ell'))
print('Finnish', wordnet.synset('test.n.05').lemma_names('fin'))
# French ['essai', 'test', 'épreuve']
# Italian ['cimento', 'esperimento', 'test']
# Spanish ['ensayo', 'prueba']
# Japanese ['テスト', 'テスト作業', 'トライアル', 'トライヤル', '試', '試し', '試行', '試験', '験し']
# Arabic ['تجْرِبة', 'فحْص']
# Greek ['δοκιμή', 'τεστ']
# Finnish ['koe', 'testi']
Here is the list of "universal POS tags" (from universaldependencies.org); these are the POS tags used by spaCy but not by NLTK: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X.
Dependency tree and part of speech tags
Above is a dependency tree made using spaCy; it clearly shows the relationships between words. The full list of possible dependencies can be found in the clearnlp-guidelines; for example:
- NEG (negation modifier)
- NPMOD (noun phrase as adverbial modifier)
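The tree can also be walked programmatically, since every spaCy token keeps a reference to its syntactic head:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("I deposited my money in the bank")
for token in doc:
    print(token.text, token.dep_, '<-', token.head.text)

# to render the dependency tree in a browser:
# from spacy import displacy
# displacy.serve(doc, style='dep')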
Knowledge graphs are composed of 2 elements:
- nodes, which represent a concept or a named entity
- edges, which represent the relations between nodes (PartOf or UsedFor, for example)
Examples of knowledge graphs include ConceptNet, DBpedia, and Wikidata.
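As a minimal sketch (the triples below are made up for illustration), a knowledge graph is just a set of labeled edges, here using networkx:

import networkx as nx

graph = nx.DiGraph()
# (subject, relation, object) triples -- illustrative examples
graph.add_edge('wheel', 'car', relation='PartOf')
graph.add_edge('car', 'driving', relation='UsedFor')

# query: what is a car used for?
for _, obj, attrs in graph.out_edges('car', data=True):
    if attrs['relation'] == 'UsedFor':
        print(obj)  # driving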
I've been influenced by Scott Wlaschin's talk on domain-driven design (DDD): https://fsharpforfunandprofit.com/ddd. The talk is about a programming paradigm where you try to design your model based on how the real world works. This means there shouldn't be a language barrier when a developer discusses how the program works with a non-developer who has deep knowledge of the company's business.
The idea here is to create an object oriented model to be able to ask more complex questions easily without repeating oneself. For example, once the model learns about a new person, it knows it can ask questions such as "What is <person>'s job?", then ask whether this is a good place to work, etc.
from __future__ import annotations  # allow forward references (Job, Family, ...) in annotations
# FamilyStructure and TeamSize are defined elsewhere in the model

class Person:
    name: str = None
    gender: str = None
    age: int = None
    personality: str = None
    location: str = None
    job: Job = None
    family: Family = None
    friends: list[Person] = []

class Family:
    father: Person
    mother: Person
    brothers: list[Person]
    sisters: list[Person]
    daughters: list[Person]
    sons: list[Person]
    family_structure: FamilyStructure

class Company:
    name: str
    size: TeamSize
    location: str

class Job:
    job_name: str
    colleagues: list[Person]
    company: Company

class Hobby:
    name: str
    location: str
    frequency: str
    cohobbyists: list[Person]
from typing import Callable

general_questions: list[str] = [
    "Let's change focus a bit... Tell me about yourself.",
    "Let's change focus a bit... Tell me about your family.",
    "Let's change focus a bit... Tell me about your childhood.",
    "Let's change focus a bit... Tell me about your parents.",
    "Let's change focus a bit... Tell me about your partner.",
    "Let's change focus a bit... Tell me about your job.",
    "Let's change focus a bit... Tell me about your friends.",
]

user_conditional_questions: list[tuple[Callable, str]] = [
    (lambda user: user.age is None, 'How old are you?'),
    (lambda user: user.name is None, "What's your name?"),
    (lambda user: user.gender is None, 'How do you identify in terms of gender (e.g. male, female)?'),
    (lambda user: user.location is None, 'Where do you live?'),
    (lambda user: user.job is None, "What's your job?"),
]
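A minimal driver can then pick the next question from these two lists (next_question is a hypothetical helper, not part of the model above):

import random

def next_question(user: Person) -> str:
    # fill in the missing core facts first...
    for condition, question in user_conditional_questions:
        if condition(user):
            return question
    # ...then fall back to an open-ended prompt
    return random.choice(general_questions)

user = Person()
print(next_question(user))  # 'How old are you?'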