# Comparing Corpora

The lecture introduced domain-specific languages, also called sublanguages.
Domain specificity shows up in several aspects of language, such as syntax, vocabulary, and phrasing.

This exercise explores that claim by comparing subcorpora in terms of frequencies.

In [None]:
import nltk
import string

Download the Brown corpus.

_"The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial"_

_--_ ( https://www.nltk.org/book/ch02.html ) 


To compare the vocabulary and phrases of the corpora with each other, count n-gram frequencies for every category. NLTK already provides pre-tokenized and cleaned text for this. Use `nltk.corpus.brown.words(categories=...)` to load the tokens of a category.

The approach is to count frequencies in each corpus and compare the top k items of each corpus with each other to spot differences. This gives an exploratory view of the concept of sublanguage.
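The top-k comparison can be tried on toy data before running it on the Brown corpus. The following is a minimal sketch using only the standard library; the two token lists and the helper name `top_k` are made up for illustration and are not part of NLTK.

```python
from collections import Counter

def top_k(tokens, k):
    """Return the k most frequent lowercased tokens, most frequent first."""
    counts = Counter(t.lower() for t in tokens)
    return [word for word, _ in counts.most_common(k)]

# Hypothetical toy "corpora"; the exercise itself uses the Brown categories.
toy = {
    "news": "the mayor said the council said the budget".split(),
    "fiction": "she said she saw the moon and she said".split(),
}

top = {genre: top_k(tokens, 2) for genre, tokens in toy.items()}
print(top)

# Items that appear in one top-k list but not in the other hint at differences.
print(set(top["news"]) - set(top["fiction"]))
print(set(top["fiction"]) - set(top["news"]))
```

Items shared by both lists (here `said`) say little about the difference between the corpora, while items unique to one list are candidates for genre-specific vocabulary.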

In [None]:
nltk.download("brown")
nltk.download("stopwords")  # needed later for stopword filtering

In [None]:
nltk.corpus.brown.categories()

For example, use the following categories:

In [None]:
genres = ["news", "religion", "fiction", "humor"]

You can use the `words` function to retrieve the tokens of a category:

In [None]:
list(nltk.corpus.brown.words(categories="news"))

# Exercise

Compare the corpora in terms of:

- Type-token ratio
- Vocabulary (most frequent words or n-grams)
- Most frequent syntactic structures (use `nltk.pos_tag` to generate POS tags, then look at the most frequent POS-tag n-grams)

Which problems arise if we only count frequencies and look at the most frequent n-grams? How can they be alleviated (hint: use the list `nltk.corpus.stopwords.words("english")`)?
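One way to see the problem is to compare the top words of a short text with and without function words. The sketch below uses a tiny hand-picked stopword set (`TOY_STOPWORDS`, an assumption made for illustration) so that it runs without any NLTK data; in the exercise the NLTK stopword list would take its place.

```python
from collections import Counter

# Tiny hand-picked stopword set, for illustration only.
TOY_STOPWORDS = {"the", "of", "and", "a", "in", "to", "."}

def top_words(tokens, k, stopwords=frozenset()):
    """Return the k most frequent lowercased tokens that are not stopwords."""
    counts = Counter(t.lower() for t in tokens if t.lower() not in stopwords)
    return [word for word, _ in counts.most_common(k)]

tokens = ("The jury of the court praised the work of the jury "
          "and the work of the city .").split()

raw = top_words(tokens, 2)                      # dominated by function words
filtered = top_words(tokens, 2, TOY_STOPWORDS)  # content words surface
print(raw)
print(filtered)
```

Without filtering, the most frequent items are function words that look the same in every genre; after filtering, the content words that actually distinguish texts move to the top.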

In [None]:
texts = {genre: list(nltk.corpus.brown.words(categories=genre)) for genre in genres}

In [None]:
for genre, text in texts.items():
    print(f"{genre}: {text[:50]}\n")

#### Type Token Ratio

In [None]:
def ttr(text):
    """Return the type-token ratio of a token list as a percentage."""
    tokens = [x.lower() for x in text]
    types = set(tokens)
    return 100 * len(types) / len(tokens)

In [None]:
def print_ttrs(texts):
    ttrs = [(k, ttr(v)) for k, v in texts.items()]
    print(ttrs)

In [None]:
print_ttrs(texts)

In [None]:
# check corpora sizes
for genre in texts:
    print(f"{genre}: {len(texts[genre])}")
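The corpus sizes matter because the plain type-token ratio depends on text length: the longer a text, the more tokens repeat types that have already been seen, so the ratio drops. The sketch below illustrates this on toy data and shows that averaging the ratio over fixed-size chunks removes the length effect. The helper `chunked_ttr` is a hypothetical name, and dropping a trailing partial chunk is one possible convention, assumed here for simplicity.

```python
def ttr(tokens):
    """Type-token ratio as a fraction between 0 and 1."""
    tokens = [t.lower() for t in tokens]
    return len(set(tokens)) / len(tokens)

def chunked_ttr(tokens, chunk_size):
    """Mean TTR over consecutive full chunks of chunk_size tokens.

    Assumes the text contains at least chunk_size tokens.
    """
    full = len(tokens) - len(tokens) % chunk_size  # ignore a trailing partial chunk
    chunks = [tokens[i:i + chunk_size] for i in range(0, full, chunk_size)]
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)

shorter = "a b c d a b".split()  # 4 types, 6 tokens
longer = shorter * 3             # the same 4 types, 18 tokens

print(ttr(shorter), ttr(longer))                        # ratio drops with length
print(chunked_ttr(shorter, 6), chunked_ttr(longer, 6))  # chunked ratio is stable
```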

#### Standardised Type Token Ratio

In [None]:
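def text_chunks(text, chunk_size):
    """Yield consecutive slices of at most chunk_size tokens."""
    for i in range(0, len(text), chunk_size):
        yield text[i:i+chunk_size]

def sttr(text, chunk_size):
    """Return the mean type-token ratio (as a fraction) over chunks of chunk_size tokens."""
    # Chunk the lowercased tokens so that case is ignored, as in ttr().
    tokens = [x.lower() for x in text]
    chunks = list(text_chunks(tokens, chunk_size))
    # Note: the last chunk may be shorter than chunk_size, which tends to raise its ratio.
    ratios = [len(set(chunk)) / len(chunk) for chunk in chunks]
    return sum(ratios) / len(chunks)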

In [None]:
def print_sttr(texts, n):
    sttrs = [(k, sttr(v, n)) for k,v in texts.items()]
    return sttrs 

In [None]:
print_sttr(texts, 1000)

#### N-Gram Frequencies

In [None]:
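def preprocess(text, n):
    """Return lowercased n-grams that contain no stopword or punctuation token."""
    text = [x.lower() for x in text]
    ngrams = list(nltk.ngrams(text, n))

    # Requires the NLTK "stopwords" resource: nltk.download("stopwords")
    stopwords = set(nltk.corpus.stopwords.words("english")) | set(string.punctuation) | {"``", '"', "''", "--"}
    # Drop every n-gram in which at least one token is a stopword or punctuation.
    ngrams = [x for x in ngrams if all(w not in stopwords for w in x)]

    return ngrams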

In [None]:
#!pip install pandas
import pandas as pd

In [None]:
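def print_freqs(texts, n, topk=100):
    # most_common(topk) yields (ngram, count) pairs sorted by descending count.
    word_freqs = {k: [ngram for ngram, _ in nltk.FreqDist(preprocess(v, n)).most_common(topk)] for k, v in texts.items()}
    return pd.DataFrame.from_dict(word_freqs)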

In [None]:
print_freqs(texts, 1)

In [None]:
print_freqs(texts, 2)

In [None]:
print_freqs(texts, 3)

In [None]:
print_freqs(texts, 4)

#### POS Patterns

In [None]:
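def print_pos_pattern_frequencies(texts, n, topk=100):
    # nltk.pos_tag needs the NLTK perceptron tagger resource, fetched via
    # nltk.download (the resource name depends on the NLTK version).
    # Tagging a whole category can take a while.
    pos_tags = {k: [x[1] for x in nltk.pos_tag(v)] for k, v in texts.items()}
    pos_tags_ngrams = {k: nltk.ngrams(v, n) for k, v in pos_tags.items()}
    frequencies = {k: nltk.FreqDist(v) for k, v in pos_tags_ngrams.items()}
    # Keep only the tag n-grams, most frequent first.
    compare = {k: [ngram for ngram, _ in v.most_common(topk)] for k, v in frequencies.items()}
    return pd.DataFrame.from_dict(compare)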
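To make the idea of a POS-tag n-gram concrete, here is a minimal sketch on a hand-tagged toy sentence. The tags are Penn Treebank-style labels assigned by hand for illustration, `tag_ngrams` is a hypothetical helper, and only the standard library is used so that the example runs without the tagger.

```python
from collections import Counter

def tag_ngrams(tagged_tokens, n):
    """Count n-grams over the tag sequence of (word, tag) pairs."""
    tags = [tag for _, tag in tagged_tokens]
    # zip over n shifted copies of the tag list yields all consecutive n-grams.
    return Counter(zip(*(tags[i:] for i in range(n))))

# Hand-tagged toy sentence (illustrative tags, not tagger output).
tagged = [("the", "DT"), ("old", "JJ"), ("man", "NN"), ("saw", "VBD"),
          ("the", "DT"), ("small", "JJ"), ("dog", "NN"), (".", ".")]

bigrams = tag_ngrams(tagged, 2)
trigrams = tag_ngrams(tagged, 3)
print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

The pattern determiner, adjective, noun occurs twice even though the words differ, which is why tag n-grams capture syntactic habits rather than vocabulary.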

In [None]:
print_pos_pattern_frequencies(texts, n=2, topk=10)

In [None]:
print_pos_pattern_frequencies(texts, n=3, topk=10)