A simple explanation of tokenization.

Mohamed Bakrey Mahmoud
7 min read · Feb 16, 2023


What is Tokenization?

Tokenization is the process of dividing a sentence into words or pieces, where each piece is called a token. In other words, it separates each word so that it can be handled on its own: its meaning extracted, its part of speech identified, and so on.

The result is a list of separated units, and there are two main types of tokenizer:

Word tokenizer: splits a text into individual words.
Sentence tokenizer: splits a text into individual sentences.

The idea behind the word tokenizer is that it is close to a plain split, but with some differences. The split command separates text only on whitespace or on a given symbol, whereas a tokenizer splits based on the actual structure and meaning of the words. It also handles attached forms such as I'm or I'd, question marks, and so on.
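To make the contrast concrete, here is a minimal sketch (a hand-rolled regex, not any real tokenizer's rules) comparing a plain split with tokenizer-style rules that peel off contractions and punctuation while keeping abbreviations such as L.A. intact:

```python
import re

text = "I'm moving to L.A.!"

# Plain split: breaks only on whitespace, so the contraction and the
# trailing punctuation stay glued to their words.
print(text.split())               # ["I'm", 'moving', 'to', 'L.A.!']

# Tokenizer-style rules, tried in order: abbreviations first, then
# contraction suffixes, then plain words, then single punctuation marks.
pattern = r"[A-Za-z]\.(?:[A-Za-z]\.)+|'\w+|\w+|[^\w\s]"
print(re.findall(pattern, text))  # ['I', "'m", 'moving', 'to', 'L.A.', '!']
```

The alternation order matters: because the abbreviation pattern comes first, "L.A." is matched whole before the single-punctuation fallback can break it apart.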

Advantages of the tokenizer:

Separating attached words from each other.
Joining words that belong together, such as place names made of two or more words. In Los Angeles there is a space between the two words, but the tokenizer can recognize that together they name a single place.
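The joining step can be sketched with a small lookup of known multi-word names (the word list here is a made-up example, not a real resource):

```python
# Hypothetical gazetteer of multi-word names to merge.
MULTIWORD = {("Los", "Angeles"), ("New", "York")}

def merge_multiword(tokens):
    """Join adjacent tokens that together form a known multi-word name."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTIWORD:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2  # consume both halves of the name
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_multiword(["I", "love", "Los", "Angeles"]))
# ['I', 'love', 'Los Angeles']
```

Real pipelines do this with named-entity recognition rather than a fixed list, but the principle is the same: a span of several tokens can stand for one unit.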

Difficulties faced by the tokenizer:

Some languages are harder to tokenize than others, for example German, French, and Chinese.

In French, pronoun contractions and elisions (for example, j'ai) are a major obstacle to splitting and understanding a sentence, and even ordinary readers make mistakes with them.

German poses perhaps the greatest difficulty: its compound words can be extremely long, which makes them hard to read and hard to split.

In Chinese and Japanese, there are no spaces between words at all.

The max-match algorithm was introduced to address this: starting from the beginning of the text, it finds the longest prefix that forms a meaningful word, emits it, and then continues from the next character.
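A minimal implementation of the max-match idea (a greedy sketch over a toy dictionary; real systems use much larger vocabularies), which also shows how the greedy choice can go wrong in English:

```python
def max_match(text, vocab):
    """Greedy left-to-right segmentation: always take the longest
    dictionary word starting at the current position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):         # longest candidate first
            if text[i:j] in vocab or j == i + 1:  # fall back to one character
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"the", "theta", "table", "bled", "down", "own", "there"}
# The greedy choice of "theta" leads the rest of the segmentation astray:
print(max_match("thetabledownthere", vocab))
# ['theta', 'bled', 'own', 'there']  -- intended: the / table / down / there
```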

However, max-match also runs into serious problems and is not accurate enough, especially for English.

It follows its stages correctly, yet it can still produce a segmentation whose meaning differs from the intended one, depending on which matches it makes.

Tokenization itself proceeds in stages. If we have the sentence:
"We're moving to L.A.!"
it is first divided on spaces, then on brackets or quotation marks, then on apostrophes, then on punctuation marks, and so on.
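Those stages can be sketched as follows (hypothetical rules and a tiny exception list, not spaCy's actual tokenizer tables):

```python
import re

EXCEPTIONS = {"L.A."}                       # special cases kept whole
PREFIX = re.compile(r"^[\"'(\[]")           # opening quotes/brackets
SUFFIX = re.compile(r"[\"')\]!?,;:.]$")     # closing quotes/punctuation
CONTRACTION = re.compile(r"(?=(?:'re|'m|'s|'ll|'ve|'d|n't)$)")

def tokenize(text):
    tokens = []
    for chunk in text.split():              # stage 1: split on spaces
        prefixes, suffixes = [], []
        changed = True
        while changed and chunk not in EXCEPTIONS:
            changed = False
            if PREFIX.match(chunk):         # stage 2: peel opening quotes
                prefixes.append(chunk[0]); chunk = chunk[1:]; changed = True
            elif SUFFIX.search(chunk):      # stage 3: peel trailing punctuation
                suffixes.insert(0, chunk[-1]); chunk = chunk[:-1]; changed = True
        # stage 4: split off contraction suffixes at the apostrophe
        parts = [p for p in CONTRACTION.split(chunk) if p]
        tokens.extend(prefixes + parts + suffixes)
    return tokens

print(tokenize('"We\'re moving to L.A.!"'))
# ['"', 'We', "'re", 'moving', 'to', 'L.A.', '!', '"']
```

The exception list is what keeps "L.A." from losing its final period: peeling stops as soon as the remaining chunk matches a known special case, which mirrors how production tokenizers handle abbreviations.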

Implementation

First, install the spaCy library:

!pip install -U spacy

Import the library and load the small English model:

import spacy
nlp = spacy.load('en_core_web_sm')

First Example

doc = nlp('Tesla is looking at buying U.S. startup for $69 million')
for token in doc:
    print(token.text)
    # print(token.shape)
    print(token.shape_)
    print(token.is_alpha)
    # print(token.is_stop)
    print('---------------')

Output

Tesla
Xxxxx
True
---------------
is
xx
True
---------------
looking
xxxx
True
---------------
at
xx
True
---------------
buying
xxxx
True
---------------
U.S.
X.X.
False
---------------
startup
xxxx
True
---------------
for
xxx
True
---------------
$
$
False
---------------
69
dd
False
---------------
million
xxxx
True
---------------

Explanation
Here spaCy separates each word into its own token, reports the token's shape (uppercase letters shown as X, lowercase as x, digits as d), and checks whether the token consists entirely of alphabetic characters.
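The shape value can be approximated with a small helper (a sketch of the idea only; spaCy's own implementation differs in its details):

```python
import re

def word_shape(token):
    """Map uppercase letters to X, lowercase to x, digits to d,
    and truncate runs of the same character to at most four."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1{4,}", r"\1\1\1\1", shape)

print(word_shape("Tesla"))    # Xxxxx
print(word_shape("looking"))  # xxxx
print(word_shape("U.S."))     # X.X.
print(word_shape("69"))       # dd
```

Note the truncation rule: "looking" has seven lowercase letters, but its shape collapses to four x's, matching the output above.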

doc[0], doc[1], doc[2], doc[3], doc[4], doc[5], doc[6], doc[7]

Output:

(Tesla, is, looking, at, buying, U.S., startup, for)

The code above accesses each token separately by its index.

doc2 = nlp('''
Although commonly attributed to John Lennon from his song "Beautiful Boy",
the phrase "Life is what happens to us while we are making other plans" was written by cartoonist Allen Saunders and
published in Reader\'s Digest in 1957, when Lennon was 17.
''')
life_quote = doc2[19:31]
print(life_quote)

Output

Life is what happens to us while we are making other plans

Here a slice of the Doc was taken, extracting just the quotation from the full text.

mystring = '"We\'re moving to L.A.!"'
print(mystring)

Output

"We're moving to L.A.!"
Next, a text containing punctuation, a hyphenated word, an email address, and a URL:

doc4 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for token in doc4:
    print(token)

Output

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!
This longer example uses NLTK's word_tokenize instead of spaCy, so it needs its own imports:

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models, downloaded once

EXAMPLE_TEXT = '''
Thomas Gradgrind, sir. A man of realities. A man of facts and calculations. A man who proceeds upon the principle that two and two are four, and nothing over, and who is not to be talked into allowing for anything over. Thomas Gradgrind, sir—peremptorily Thomas—Thomas Gradgrind. With a rule and a pair of scales, and the multiplication table always in his pocket, sir, ready to weigh and measure any parcel of human nature, and tell you exactly what it comes to. It is a mere question of figures, a case of simple arithmetic. You might hope to get some other nonsensical belief into the head of George Gradgrind, or Augustus Gradgrind, or John Gradgrind, or Joseph Gradgrind (all supposititious, non-existent persons), but into the head of Thomas Gradgrind—no, sir!
In such terms Mr. Gradgrind always mentally introduced himself, whether to his private circle of acquaintance, or to the public in general. In such terms, no doubt, substituting the words ‘boys and girls,’ for ‘sir,’ Thomas Gradgrind now presented Thomas Gradgrind to the little pitchers before him, who were to be filled so full of facts.
Indeed, as he eagerly sparkled at them from the cellarage before mentioned, he seemed a kind of cannon loaded to the muzzle with facts, and prepared to blow them clean out of the regions of childhood at one discharge. He seemed a galvanizing apparatus, too, charged with a grim mechanical substitute for the tender young imaginations that were to be stormed away.
‘Girl number twenty,’ said Mr. Gradgrind, squarely pointing with his square forefinger, ‘I don’t know that girl. Who is that girl?’
'''
print(word_tokenize(EXAMPLE_TEXT))

Output

['Thomas', 'Gradgrind', ',', 'sir', '.', 'A', 'man', 'of', 'realities',
'.', 'A', 'man', 'of', 'facts', 'and', 'calculations', '.', 'A', 'man',
'who', 'proceeds', 'upon', 'the', 'principle', 'that', 'two', 'and', 'two',
'are', 'four', ',', 'and', 'nothing', 'over', ',', 'and', 'who', 'is', 'not',
'to', 'be', 'talked', 'into', 'allowing', 'for', 'anything', 'over', '.',
'Thomas', 'Gradgrind', ',', 'sir—peremptorily', 'Thomas—Thomas', 'Gradgrind',
'.', 'With', 'a', 'rule', 'and', 'a', 'pair', 'of', 'scales', ',', 'and',
'the', 'multiplication', 'table', 'always', 'in', 'his', 'pocket', ',',
'sir', ',', 'ready', 'to', 'weigh', 'and', 'measure', 'any', 'parcel',
'of', 'human', 'nature', ',', 'and', 'tell', 'you', 'exactly', 'what',
'it', 'comes', 'to', '.', 'It', 'is', 'a', 'mere', 'question', 'of',
'figures', ',', 'a', 'case', 'of', 'simple', 'arithmetic', '.', 'You',
'might', 'hope', 'to', 'get', 'some', 'other', 'nonsensical', 'belief',
'into', 'the', 'head', 'of', 'George', 'Gradgrind', ',', 'or', 'Augustus',
'Gradgrind', ',', 'or', 'John', 'Gradgrind', ',', 'or', 'Joseph',
'Gradgrind', '(', 'all', 'supposititious', ',', 'non-existent', 'persons', ')', ',', 'but', 'into', 'the', 'head', 'of', 'Thomas', 'Gradgrind—no', ',', 'sir', '!', 'In', 'such', 'terms', 'Mr.', 'Gradgrind', 'always', 'mentally', 'introduced', 'himself', ',', 'whether',
'to', 'his', 'private', 'circle', 'of', 'acquaintance', ',', 'or', 'to', 'the', 'public', 'in', 'general', '.',
'In', 'such', 'terms', ',', 'no', 'doubt', ',', 'substituting', 'the', 'words', '‘', 'boys', 'and', 'girls', ',', '’', 'for', '‘', 'sir', ',', '’', 'Thomas', 'Gradgrind',
'now', 'presented', 'Thomas', 'Gradgrind',
'to', 'the', 'little', 'pitchers', 'before', 'him', ',', 'who', 'were', 'to', 'be', 'filled', 'so', 'full', 'of', 'facts', '.', 'Indeed', ',', 'as', 'he', 'eagerly',
'sparkled', 'at', 'them', 'from', 'the', 'cellarage', 'before', 'mentioned', ',', 'he', 'seemed', 'a', 'kind', 'of', 'cannon', 'loaded', 'to', 'the', 'muzzle', 'with', 'facts', ',', 'and', 'prepared', 'to', 'blow', 'them', 'clean', 'out', 'of', 'the',
'regions', 'of', 'childhood', 'at', 'one', 'discharge', '.', 'He', 'seemed',
'a', 'galvanizing', 'apparatus', ',', 'too', ',', 'charged', 'with', 'a', 'grim', 'mechanical', 'substitute', 'for', 'the', 'tender', 'young', 'imaginations', 'that', 'were', 'to', 'be', 'stormed', 'away', '.', '‘', 'Girl', 'number', 'twenty', ',', '’', 'said', 'Mr.', 'Gradgrind', ',', 'squarely', 'pointing', 'with', 'his', 'square', 'forefinger', ',', '‘', 'I', 'don', '’', 't', 'know',
'that', 'girl', '.', 'Who', 'is', 'that', 'girl', '?', '’']

Summary

As we have seen, a tokenizer is a tool capable of separating all the words of an English text and of distinguishing between words, punctuation, and other units. That makes it one of the most important tools for tasks such as searching through texts and dividing and sorting their words.
