# Basic Text Manipulation and NLTK 


The `nltk` Python package provides many useful features for working with language, including corpora, tokenizers, and parsers. For details, see [https://www.nltk.org/](https://www.nltk.org/).

This notebook goes through some basics of nltk in combination with string manipulation.



`nltk` provides useful corpora that we can use as examples in this notebook. You can download data from nltk by calling `nltk.download()`. For this notebook, only the `book` package is necessary. It provides the Gutenberg corpus.

In [None]:
import nltk

In [None]:
nltk.download()

A corpus in the nltk package provides textual resources or annotations. 

The function `nltk.corpus.gutenberg.fileids()` returns all available file IDs of the Gutenberg corpus from the `book` package we downloaded. These are used to retrieve the texts:


In [None]:
nltk.corpus.gutenberg.fileids()

In [None]:
text = nltk.corpus.gutenberg.raw('carroll-alice.txt')
text[:2000]

It is also possible to load pre-tokenized text:

In [None]:
text = nltk.corpus.gutenberg.words('carroll-alice.txt')
text

Or sentences:

In [None]:
text = nltk.corpus.gutenberg.sents('carroll-alice.txt')
text

## Text manipulation

NLTK provides a lot of useful functions for basic text manipulation, like tokenizers or ngram generators. 

#### ngrams
There are functions for ngram generation: `nltk.bigrams`, `nltk.trigrams`, or the more general `nltk.ngrams`.

In [None]:
print("Bigrams:", list(nltk.bigrams(["Where", "is", "my", "money", "man", "?"])))

In [None]:
print("Trigrams:", list(nltk.trigrams(["Where", "is", "my", "money", "man", "?"])))

The more general `nltk.ngrams` can generate ngrams of higher order:

In [None]:
list(nltk.ngrams(["Where", "is", "my", "money", "man", "?"], n=5))

The function works at the level of Python sequences. That means you can get the contiguous subsequences of length `n` from any list:

In [None]:
list(nltk.ngrams([["first", "element"], ["second", "element"], ["third", "element"]], n=2))

or even from a string, which yields character ngrams:

In [None]:
list(nltk.ngrams("Where is my money man ?", n=3))[:5]
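
To make explicit what these functions compute, here is a minimal plain-Python sketch of ngram generation. The helper name `simple_ngrams` is hypothetical and this is not NLTK's implementation; it only assumes that the input supports slicing, such as a list or a string.

```python
from typing import List, Sequence, Tuple


def simple_ngrams(seq: Sequence, n: int) -> List[Tuple]:
    """Return all contiguous subsequences of length n as tuples (hypothetical helper)."""
    # Slide a window of width n over the sequence; inputs shorter than n yield an empty list.
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


tokens = ["Where", "is", "my", "money", "man", "?"]
print(simple_ngrams(tokens, 2))   # word bigrams
print(simple_ngrams("Where", 3))  # character trigrams
```

A sequence of length `k` produces `k - n + 1` ngrams, so the six tokens above give five bigrams.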

#### tokenizers
Two other very useful functions for working with text are the tokenizers `nltk.word_tokenize` and `nltk.sent_tokenize`, which split a string of text into words or sentences, respectively:

In [None]:
example = "This happened long before she met the Queen of Hearts. The "\
"rabbit-hole went straight on like a tunnel for some way, "\
"and then dipped suddenly down, so suddenly that Alice had not a moment "\
" to think about stopping herself before she found herself falling down "\
" a very deep well. Alice's heart skipped a beat."
        
print("Text: \n", example.split(" "))
print("\n")
print("Words: \n", nltk.word_tokenize(example))
print("\n")
print("Sentences: \n", nltk.sent_tokenize(example))
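
The difference between splitting on spaces and tokenizing can also be illustrated without NLTK. The sketch below uses a simple regular expression as a rough stand-in for a word tokenizer. This is an assumption made for illustration and does not reproduce the rules of `nltk.word_tokenize`, which for example treats contractions differently.

```python
import re

sentence = "Alice's heart skipped a beat."

# Splitting on spaces leaves punctuation attached to the neighbouring word.
print(sentence.split(" "))

# Rough stand-in for a tokenizer: runs of word characters, or single punctuation marks.
rough_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(rough_tokens)
```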

## Task 1: Cleaning texts 

Typically, not all symbols are relevant for processing, although this is highly task-specific.
Unwanted or unneeded symbols usually include:
 - escaped characters (\n\t\r)
 - repeated white spaces ("      " -> " ")
 - Sometimes punctuation

Therefore, the first step in language processing is often to inspect the text and clean the string so that it matches what the later processing steps expect.


For example, let's remove all characters from the string that are not contained in the alphabet or useful punctuation like `"!?.,; -"`.

A useful helper for this is the `string` module, which contains predefined constants like:

In [None]:
import string
print("Punctuation: ", string.punctuation)
print("Digits: ", string.digits)
print("Letters: ", string.ascii_letters)
print("Lowercase: ", string.ascii_lowercase)
print("Uppercase: ", string.ascii_uppercase)

In [None]:
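import re  # regular expression operations

text = nltk.corpus.gutenberg.raw('carroll-alice.txt')  # reload the raw text

def clean(s: str) -> str:
    # Keep ASCII letters and a small set of punctuation; replace everything else with a space
    valid = string.ascii_letters + "!?.,; -"
    s = "".join([c if c in valid else " " for c in s])
    # Collapse runs of whitespace into a single space (raw string avoids an invalid escape)
    s = re.sub(r"\s+", " ", s)
    return s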

print("Before: \n", [text[:1000]])
clean_text = clean(text)
print("\n\n After:\n",clean_text[:1000])
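
The effect of this kind of cleaning is easier to check on a short string. The sketch below is a regex-based variant with the same whitelist of characters; the name `clean_regex` and the sample string are assumptions made for illustration.

```python
import re


def clean_regex(s: str) -> str:
    """Hypothetical variant: keep ASCII letters and "!?.,; -", collapse whitespace."""
    # Replace every character outside the whitelist with a space.
    s = re.sub(r"[^A-Za-z!?.,; -]", " ", s)
    # Collapse runs of spaces into a single space.
    return re.sub(r"\s+", " ", s)


sample = "CHAPTER I.\n\tDown the   Rabbit-Hole [1865]"
print([clean_regex(sample)])
```

Note that a single leading or trailing space can remain after cleaning; calling `.strip()` on the result removes it if that matters for the task.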

## Task 2: Counting

On the lowest linguistic level, there are characters and symbols. These are the most basic units of a language. 
However, while many languages share their alphabets, statistical use can differ significantly.

To establish a statistical view on text, we usually start by counting occurrences.
NLTK has a useful class for this, `nltk.FreqDist`, which counts the items of any iterable (here, the characters of a string):

In [None]:
f = nltk.FreqDist(clean_text) # try with a word list as input, too!
f

In [None]:
f.most_common(5)

In [None]:
import matplotlib.pyplot as plt
f.plot()
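
Since `nltk.FreqDist` builds on Python's `collections.Counter`, the counting idea can be sketched with the standard library alone. The sample string below is an assumption made for illustration, and the relative frequency is computed by hand.

```python
from collections import Counter

sample = "the rabbit-hole went straight on"

# Count how often each character occurs.
counts = Counter(sample)
print(counts.most_common(3))

# Convert an absolute count into a relative frequency.
total = sum(counts.values())
print("relative frequency of 't':", counts["t"] / total)
```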

## Task 3 - Word Level 

The next level is the word level, where we distinguish tokens (every occurrence of a word) from types (distinct word forms).

To count the tokens in a text, we first need to split the text into tokens. We can use `nltk.tokenize.word_tokenize` for this, for example to calculate the average word length in a text:

In [None]:
def average_word_length(text: str) -> float:
    tokens = nltk.tokenize.word_tokenize(text.lower())
    lengths = [len(x) for x in tokens] 
    return sum(lengths)/len(lengths)

In [None]:
average_word_length(text)
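
The distinction between tokens and types can be made concrete with a short sketch. It splits on whitespace for simplicity instead of using a tokenizer, and the sample sentence is an assumption made for illustration.

```python
sample = "the cat sat on the mat and the dog sat too"

tokens = sample.split()   # every occurrence counts as a token
types = set(tokens)       # each distinct word form counts once

print("tokens:", len(tokens))
print("types:", len(types))
print("type-token ratio:", len(types) / len(tokens))
```

The ratio of types to tokens gives a rough impression of how varied the vocabulary of a text is.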

## Task 4 - Sentence level 

The next level of composition is the sentence level. We can use `nltk.tokenize.sent_tokenize` to split strings into sentences. Write a function to calculate the average number of tokens per sentence:

In [None]:
def average_token_per_sentence(text: str) -> float:
    sentences =  nltk.tokenize.sent_tokenize(text)
    lengths = [len(nltk.tokenize.word_tokenize(sentence)) for sentence in sentences]
    return sum(lengths) / len(lengths)

In [None]:
average_token_per_sentence(text)

## Task 5 - Part Of Speech


Part-of-speech tagging assigns a grammatical category (such as noun or verb) to each word in a sentence.

In [None]:
import nltk
tag = nltk.pos_tag(nltk.word_tokenize(text)) # try parameter tagset="universal"
tag[:10]

The default set of tags is taken from the Penn Treebank POS tag set and can be found here:

In [None]:
from IPython.display import IFrame    
IFrame("https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html", width=800, height=650)

In [None]:
IFrame("https://universaldependencies.org/u/pos/index.html", width=800, height=650)

## Task 6: String Similarity

NLTK provides various functions for similarity measures, including an implementation of the edit distance in `nltk.edit_distance`.

In [None]:
print("Mouse, House", nltk.edit_distance("Mouse", "House"))
print("Alice", "Alice's", nltk.edit_distance("Alice", "Alice's"))
print("Hausbedarfsladenkette, Mausbedarfsladenkette", nltk.edit_distance("Hausbedarfsladenkette", "Mausbedarfsladenkette"))
print("NLP is a lot of fun!", "NLP is a lot of work!", nltk.edit_distance("NLP is a lot of fun!", "NLP is a lot of work!"))
print("work", "workhorse", nltk.edit_distance("work", "workhorse"))
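
For reference, the edit distance (Levenshtein distance) is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal dynamic-programming sketch could look like this. The name `levenshtein` is hypothetical, unit costs are assumed for all three operations, and this is not NLTK's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs (hypothetical helper, not NLTK's code)."""
    # previous[j] is the distance between the prefix of a processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if the characters match
            ))
        previous = current
    return previous[-1]


print(levenshtein("Mouse", "House"))
print(levenshtein("work", "workhorse"))
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the current and the previous row.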

Another similarity measure for strings mentioned in the lecture was the Dice coefficient.

A good exercise for getting familiar with the basics of Python and nltk is to implement the Dice measure yourself.

Try to compare `("House", "Mouse")` and `("Alice", "Alice's")` using the dice-coefficient for trigrams and edit distance. What happens?

In [None]:
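def dice_score(w1: str, w2: str, n: int) -> float:
    # Compare the sets of character ngrams of both strings
    n1 = set(nltk.ngrams(w1, n))
    n2 = set(nltk.ngrams(w2, n))
    if not n1 and not n2:
        return 0.0  # both strings are shorter than n, so there is nothing to compare
    dice = 2 * len(n1.intersection(n2)) / (len(n1) + len(n2))
    return dice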

In [None]:
n = 3
print("Mouse, House", dice_score("Mouse", "House", n))
print("Alice", "Alice's", dice_score("Alice", "Alice's", n))
print("Hausbedarfsladenkette, Mausbedarfsladenkette", dice_score("Hausbedarfsladenkette", "Mausbedarfsladenkette",n))
print("NLP is a lot of fun!", "NLP is a lot of work!", dice_score("NLP is a lot of fun!", "NLP is a lot of work!", n))
print("work", "workhorse", dice_score("work", "workhorse", 3))
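
To see where such scores come from, the trigram sets can be inspected directly. The sketch below builds character trigrams with plain slicing instead of `nltk.ngrams` (an assumption made so that it runs without NLTK) and applies the formula 2 * |A ∩ B| / (|A| + |B|). The helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(word: str, n: int) -> set:
    """Set of character ngrams of a word (hypothetical helper)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


for w1, w2 in [("House", "Mouse"), ("Alice", "Alice's")]:
    a, b = char_ngrams(w1, 3), char_ngrams(w2, 3)
    shared = a & b
    dice = 2 * len(shared) / (len(a) + len(b))
    print(w1, w2, sorted(shared), round(dice, 3))
```

"House" and "Mouse" share two of their three trigrams each, while every trigram of "Alice" also occurs in "Alice's". The Dice score therefore rates the second pair as more similar, although its edit distance is larger.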