# Morphology

Morphology is the study of the structure and formation of words.

## Base Form Reduction

To compute meaningful statistics across inflected word forms in NLP, the forms must first be reduced to a common base form.

### Example 
- Sentence 1: Bernd liest ein Buch.
- Sentence 2: Gestern las Bernd eine Zeitung.
- Sentence 3: Lesen ist ein Hobby von Bernd.

Sentence-word matrix:

|    | Bernd | liest | ein | Buch | Gestern | las | eine | Zeitung | Lesen | ist | Hobby | von |
|----|-------|-------|-----|------|---------|-----|------|---------|-------|-----|-------|-----|
| S1 | 1     | 1     | 1   | 1    | 0       | 0   | 0    | 0       | 0     | 0   | 0     | 0   |
| S2 | 1     | 0     | 0   | 0    | 1       | 1   | 1    | 1       | 0     | 0   | 0     | 0   |
| S3 | 1     | 0     | 1   | 0    | 0       | 0   | 0    | 0       | 1     | 1   | 1     | 1   |
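The matrix above can be reproduced with a few lines of Python. This is a minimal sketch: the `tokenize` helper is a toy tokenizer written only for this example (whitespace split, trailing full stop removed), not part of any library.

```python
sentences = [
    "Bernd liest ein Buch.",
    "Gestern las Bernd eine Zeitung.",
    "Lesen ist ein Hobby von Bernd.",
]

def tokenize(sentence):
    """Toy tokenizer (assumption): split on whitespace, strip the full stop."""
    return [tok.strip(".") for tok in sentence.split()]

tokenized = [tokenize(s) for s in sentences]

# Vocabulary in order of first appearance, matching the column order of the table.
vocabulary = list(dict.fromkeys(tok for sent in tokenized for tok in sent))

# One row per sentence, one 0/1 entry per vocabulary word.
matrix = [[int(word in sent) for word in vocabulary] for sent in tokenized]

print(vocabulary)
for label, row in zip(("S1", "S2", "S3"), matrix):
    print(label, row)
```

Each row marks only whether a word occurs in the sentence, not how often, which is sufficient for the co-occurrence counts used in this example.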

Most frequent co-occurrences with "Bernd":

| Word  | Occurrences with "Bernd" |
|-------|--------------------------|
| ein   | 2                        |
| eine  | 1                        |
| Hobby | 1                        |
| liest | 1                        |
| ...   | 1                        |


After **base form reduction**:

- Sentence 1: bernd lesen einen buch.
- Sentence 2: gestern lesen bernd einen zeitung.
- Sentence 3: lesen sein einen hobby von bernd.


|    | bernd | lesen | einen | buch | gestern | zeitung | sein | hobby | von |
|----|-------|-------|------ |------|---------|---------|------|-------|-----|
| S1 | 1     | 1     | 1     | 1    | 0       | 0       | 0    | 0     | 0   |
| S2 | 1     | 1     | 1     | 0    | 1       | 1       | 0    | 0     | 0   |
| S3 | 1     | 1     | 1     | 0    | 0       | 0       | 1    | 1     | 1   |


Most frequent co-occurrences with "Bernd":

| Word   | Occurrences with "Bernd" |
|------- |--------------------------|
| lesen  | 3                        |
| einen  | 3                        |
| buch   | 1                        |
| ...    | ...                      |
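The effect of the reduction on the co-occurrence counts can be sketched as follows. The `LEMMAS` table is a hand-written lookup that covers only this example and mirrors the base forms used in the sentences above; it is an assumption for illustration, not the output of a real lemmatizer.

```python
from collections import Counter

tokenized = [
    ["Bernd", "liest", "ein", "Buch"],
    ["Gestern", "las", "Bernd", "eine", "Zeitung"],
    ["Lesen", "ist", "ein", "Hobby", "von", "Bernd"],
]

# Hypothetical lookup table for this example only.
# Forms missing from the table are simply lowercased.
LEMMAS = {"liest": "lesen", "las": "lesen", "ein": "einen", "eine": "einen", "ist": "sein"}

def reduce_token(token):
    token = token.lower()
    return LEMMAS.get(token, token)

def cooccurrences(sents, target):
    """Count in how many sentences each word appears together with `target`."""
    counts = Counter()
    for sent in sents:
        types = set(sent)
        if target in types:
            counts.update(types - {target})
    return counts

raw = cooccurrences(tokenized, "Bernd")
reduced = cooccurrences([[reduce_token(t) for t in s] for s in tokenized], "bernd")

print(raw.most_common(3))
print(reduced.most_common(3))
```

Without reduction, the three forms of "lesen" are counted separately and each co-occurs with "Bernd" only once; after reduction their counts are pooled.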





### Question 1:
Why is it necessary to perform base form reduction?



### Question 2:
What errors can occur during base form reduction?



### Question 3:
What are the different approaches?




### Question 4:
What is the difference between stemming and lemmatization?



## Lemmatization and Stemming with nltk

NLTK implements both **stemming** and **lemmatization**.

**Stemming** returns the word stem of any word. It is context-insensitive, so tokens with different meanings can be mapped to the same word stem.


**Lemmatization** tries to return the lemma (dictionary form) of a word. This depends on the word's grammatical context and function, such as its part of speech.
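The context insensitivity of stemming can be illustrated with a small sketch. The word group below is chosen for illustration and is not part of the original exercise; it only uses `PorterStemmer`, which needs no additional NLTK data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words with different meanings that share a prefix before their suffixes.
words = ["universe", "university", "universal"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")

# True when every word in the group was reduced to the same stem.
print(len(set(stems.values())) == 1)
```

A stemmer strips suffixes by rule and has no notion of meaning, so unrelated words can end up sharing a stem.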

In [None]:
import nltk

stemmer = nltk.stem.PorterStemmer()  # create the stemmer once and reuse it
for word in ["wrote", "writings", "written", "writes", "write"]:
    print(stemmer.stem(word))

In [None]:
nltk.download("wordnet")
nltk.download("omw-1.4")

lemmatizer = nltk.stem.WordNetLemmatizer()  # create the lemmatizer once and reuse it
for word in ["wrote", "writings", "written", "writes", "write"]:
    print(lemmatizer.lemmatize(word, pos="v"))  # pos="v": treat each word as a verb

In [None]:
print(nltk.stem.WordNetLemmatizer().lemmatize("writing", pos="n")) # if the word is a noun
print(nltk.stem.WordNetLemmatizer().lemmatize("writing", pos="v")) # if the word is a verb

## Programming Task

We use Herman Melville's *Moby Dick* as our corpus. It can be downloaded from the Gutenberg text collection via the *nltk* package.

- As a baseline for base form reduction, use the first 4 letters of each word.
- Use the Porter stemmer as an example of stemming.
- Use the **spaCy** library as an example of lemmatization.
- Calculate the ratio of the number of distinct base forms to the number of types and compare the 3 approaches.
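The ratio asked for in the last item can be sketched on a toy token list before applying it to the full corpus. The helper names `base_form_ratio` and `first_four_letters` and the token list are made up for this illustration.

```python
def base_form_ratio(tokens, reducer):
    """Number of distinct base forms divided by the number of distinct types."""
    types = set(tokens)
    base_forms = {reducer(token) for token in types}
    return len(base_forms) / len(types)

def first_four_letters(word):
    # Baseline reduction: slicing leaves words shorter than 4 letters unchanged.
    return word[:4]

# Toy token list (assumption), standing in for the tokenized corpus.
tokens = ["whale", "whales", "whaling", "ship", "ships", "sea"]
ratio = base_form_ratio(tokens, first_four_letters)
print(ratio)
```

A ratio close to 1 means the method merges few types, while a low ratio indicates aggressive merging, which may also conflate unrelated words.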

In [None]:
import nltk
from nltk.stem import PorterStemmer

#!conda install -y spacy
#!python -m spacy download en_core_web_sm
import spacy

#nltk.download("gutenberg") # run once to fetch the Gutenberg corpus
text = nltk.corpus.gutenberg.raw("melville-moby_dick.txt")
tokenized_text = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
sentences = nltk.corpus.gutenberg.sents("melville-moby_dick.txt")

In [None]:
nlp = spacy.load("en_core_web_sm") # choose spacy model to use (needs to be downloaded first, see cell above)
doc = nlp(" ".join(sentences[80]))

# spacy Doc example
for t in doc:
    print(t.text, t.lemma_, t.pos_, t.tag_, t.is_stop)

### Baseline: First Four Letters (4-Grams)

In [None]:
baseform_four_grams = [word[:4] for word in tokenized_text] # slicing leaves words shorter than 4 letters unchanged

### Porter Stemmer

In [None]:
ps = PorterStemmer()
baseform_porter_stemmer = [ps.stem(w) for w in tokenized_text]

### spaCy Lemmatization

In [None]:
baseform_spacy_lemmatization = [token.lemma_ for sen in sentences for token in nlp(" ".join(sen))] # this might take a bit longer

### Ratio of Distinct Base Forms to Distinct Types

In [None]:
number_of_types = len(set(tokenized_text))

print("4 grams:")
print(len(set(baseform_four_grams))/number_of_types)

print("Porter Stemmer:")
print(len(set(baseform_porter_stemmer))/number_of_types)

print("spaCy Lemmatization:")
print(len(set(baseform_spacy_lemmatization))/number_of_types)