|
- We use the Flesch-Kincaid Grade Level metric to measure written complexity.
- We count the number of phonemes present as a measure of spoken complexity.
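As a reference, below is a minimal sketch of how these two complexity metrics can be computed. It assumes the textstat package for the Flesch-Kincaid score and NLTK's CMU Pronouncing Dictionary for phoneme counts; neither library is prescribed by the original pipeline.

import nltk
import textstat  # assumed helper library for readability metrics
from nltk.corpus import cmudict

nltk.download('cmudict')
phoneme_dict = cmudict.dict()

def written_complexity(text):
    # Flesch-Kincaid Grade Level: higher means harder to read
    return textstat.flesch_kincaid_grade(text)

def phoneme_count(word):
    # Number of phonemes in the first CMU dictionary pronunciation;
    # falls back to character length for out-of-vocabulary words
    prons = phoneme_dict.get(word.lower())
    return len(prons[0]) if prons else len(word)

print(written_complexity("Photosynthesis converts light energy into chemical energy."))
print(phoneme_count("photosynthesis"))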
Approach for Definition Extraction/Generation
Definitions consist of two parts: the definiendum and the definiens. The definiendum is the element that is to be defined, and the definiens provides its meaning. In well-written simple text, the definiendum and definiens are often connected by a verb or a punctuation mark. We approach the task of definition extraction/generation for a given glossary word under a given context as a 3-step pipeline (Rule-based Mining -> WordNet-based Selection -> GPT-2 based Generation) with an exit option at each step. Let's discuss each of them in detail -
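To make the exit option concrete, here is a minimal control-flow sketch of the cascade. The helper functions and the similarity threshold are placeholders for the components described in the following sections, not code from the original pipeline.

SIMILARITY_THRESHOLD = 0.5  # assumed cut-off; tune for your data

def get_definition(word, text):
    # Step 1: rule-based mining over the source text
    definition = rule_based_mining(word, text)          # hypothetical helper
    if definition:
        return definition
    # Step 2: WordNet-based selection, kept only if it looks reliable
    definition, score = wordnet_selection(word, text)   # hypothetical helper
    if definition and score >= SIMILARITY_THRESHOLD:
        return definition
    # Step 3: fall back to the fine-tuned GPT-2 generator
    return gpt2_generate(word)                          # hypothetical helper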
Rule-based Mining
In rule-based mining, we define certain grammatical constructs for extracting definition structures from the text for a given keyword. Some of the patterns we use are, for example, "X is defined as Y" and "X is a Y". Here, X is the glossary word or definiendum, and Y is expected to be the meaning or definiens. We implement this step with regular expression patterns.
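A minimal sketch of such a pattern matcher is shown below. The exact patterns and the single-sentence capture window are illustrative assumptions, not the post's full rule set.

import re

# Illustrative rule set: each pattern captures a candidate definiens (Y)
# that follows the definiendum (X) via a connecting verb phrase.
PATTERNS = [
    r"{x}\s+is\s+defined\s+as\s+([^.]+)\.",
    r"{x}\s+is\s+an?\s+([^.]+)\.",
    r"{x}\s+refers\s+to\s+([^.]+)\.",
]

def mine_definition(word, text):
    for template in PATTERNS:
        pattern = template.format(x=re.escape(word))
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1).strip()
    return None  # exit option: fall through to WordNet-based selection

print(mine_definition("entropy", "Entropy is defined as a measure of disorder."))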
WordNet-based Selection
The next step in the pipeline is to use a WordNet-based selection strategy for extracting relevant definitions for a given glossary word. The below figure illustrates the entire process in detail -
WordNet-based Definition Extraction
We start by finding the set of k senses (a sense is essentially a meaning) for every glossary word found in the previous step, using the WordNet library. We also extract the first context from the text where this glossary word occurs as a potential definition. The context is defined by setting a window of K words around the glossary word. The hypothesis behind selecting only the first context (marked in violet) is that the author of the book/text is likely to define or explain the word as early as possible and then re-use it later as needed. This hypothesis mostly holds for longer texts like books and novels, which is reflective of our dataset.
For each of the k unique senses of a given glossary word, we extract the definition and related examples and compute their cosine similarity with the first-context text. This helps disambiguate the word and choose the most appropriate sense/meaning/definiens. As a design choice, one can either select the top sense as per the similarity score, or select nothing at all and fall back to the 3rd step of the definition extraction/generation pipeline.
In the below image, we present definitions (column new_def) based on the WordNet selection scheme.
Definitions extracted as per the WordNet selection method
Code for Extracting Definitions from WordNet
We start by fetching the first occurring context of the glossary word from the text.
import codecs
import os
import glob

import pandas as pd
import nltk
nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader

def get_context(c):
    # Returns the first concordance window (left context + word + right context)
    # of the word `c` in the nltk.Text object `text` built in the loop below.
    try:
        result = text.concordance_list(c)[0]
        left_of_query = ' '.join(result.left)
        query = result.query
        right_of_query = ' '.join(result.right)
        return left_of_query + ' ' + query + ' ' + right_of_query
    except Exception:
        return ''

generated_dfs = []
BASE_DIR = 'data'

for book in glob.glob('book/*'):
    book_name = book.split('/')[-1].split('.')[0]
    try:
        DATA_DIR = codecs.open('book/' + book_name + '.txt',
                               'rb', encoding='utf-8').readlines()
        true_data = pd.read_csv('data/' + book_name + '.csv', sep='\t')
        full_data = ' '.join([i.lower().strip() for i in DATA_DIR if len(i.strip()) > 1])
        tokens = nltk.word_tokenize(full_data)
        text = nltk.Text(tokens)
        true_data['firstcontext'] = true_data['word'].map(lambda k: get_context(k))
        generated_dfs.append(true_data)
    except Exception as e:
        pass

final_df = pd.concat(generated_dfs, axis=0)
final_df = final_df[final_df['firstcontext'] != '']
final_df = final_df[['word', 'def', 'firstcontext']].reset_index()
final_df.head(5)
Getting 1st context for Glossary words extracted from the previous step (chunking+relevance pipeline)
The below image shows the output data frame from the above snippet -
DataFrame with First Context from Text
Next, we load word vectors using gensim KeyedVectors. We also define sentence representation as the average of vectors of the words present in the sentence.
import numpy as np
import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.models import KeyedVectors

filepath = "GoogleNews-vectors-negative300.bin"
wv_from_bin = KeyedVectors.load_word2vec_format(filepath, binary=True)

# Build an embeddings index: word -> 300-dimensional vector
ei = {}
for word, vector in zip(wv_from_bin.index_to_key, wv_from_bin.vectors):
    coefs = np.asarray(vector, dtype='float32')
    ei[word] = coefs
def avg_feature_vector(sentence, model, num_features):
    # Sentence representation: the average of the vectors of its words
    words = sentence.split()
    # Feature vector is initialized as an array of zeros
    feature_vec = np.zeros((num_features,), dtype='float32')
    n_words = 0
    for word in words:
        if word in model:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if n_words > 0:
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec
Sentence representation with Word2Vec
Next, we concatenate the definition of each sense and examples present in the WordNet library and calculate the semantic relatedness between this and the first context of the glossary word from the text. Lastly, we pick the one that has maximum similarity as the candidate definition.
from scipy.spatial import distance
from nltk.corpus import wordnet
nltk.download('wordnet')

def similarity(s1, s2):
    s1_afv = avg_feature_vector(s1, model=ei, num_features=300)
    s2_afv = avg_feature_vector(s2, model=ei, num_features=300)
    # distance.cosine returns a distance, so convert it to a similarity
    cos = 1 - distance.cosine(s1_afv, s2_afv)
    return cos

final_df['new_def'] = ''
final_df['match'] = 0.0

for idx in range(final_df.shape[0]):
    fs = final_df.iloc[idx]['firstcontext']
    w = final_df.iloc[idx]['word']
    defi = final_df.iloc[idx]['def']
    syns = wordnet.synsets(w)
    s_dic = {}
    for sense in syns:
        d, ex = sense.definition(), sense.examples()
        sense_def = d + ' ' + ' '.join(ex)
        score = similarity(sense_def, fs)
        s_dic[d] = score
    if not s_dic:
        continue  # no WordNet senses found for this word
    # Pick the sense whose definition+examples are most similar to the first context
    s_sort = sorted(s_dic.items(), key=lambda k: k[1], reverse=True)[0]
    final_df.loc[idx, 'new_def'] = s_sort[0]
    final_df.loc[idx, 'match'] = s_sort[1]
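As mentioned earlier, one can also decline to accept the top-ranked sense and instead defer the word to the GPT-2 step. A minimal sketch of that exit option is shown below; the threshold value is an assumption and would need tuning.

FALLBACK_THRESHOLD = 0.5  # assumed cut-off; tune on held-out examples

# Keep the WordNet definition only when the best sense is similar enough to the
# first context; otherwise defer the word to the GPT-2 generation step.
needs_generation = final_df['match'] < FALLBACK_THRESHOLD
final_df.loc[needs_generation, 'new_def'] = None
words_for_gpt2 = final_df.loc[needs_generation, 'word'].tolist()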
GPT-2 based Generation
This is the final step in our definition extraction/generation pipeline. Here, we fine-tune a medium-sized, pre-trained GPT-2 model on an openly available definitions dataset from the Urban Dictionary. We pick phrases and their related definitions from the 2.5 million data samples present in the dataset. For fine-tuning, we format our data records with special tokens that help our GPT-2 model act as a conditional language generation model based on a specific prefix text.
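Each training record follows the template below, where the placeholders stand for a word and its definition from the dataset:

<|startoftext|> {word} <DEFINE> {definition} <|endoftext|>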
Here, <|startoftext|> and <|endoftext|> are special tokens marking the start and end of the text sequence, and <DEFINE> is the prompt token that tells the model to start generating a definition for the given word.
We start by loading the Urban Dictionary dataset of words and definitions, as shown below -
import pandas as pd

train = pd.read_csv('urbandict-word-defs.csv',
                    nrows=100000,
                    error_bad_lines=False)

new_train = train[['word', 'definition']]
new_train['word'] = new_train.word.str.lower()
new_train['definition'] = new_train.definition.str.lower()
Loading subset of UrbanDictionary dataset
Next, we select the appropriate device and load the relevant GPT-2 tokenizer and model -
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import numpy as np
import os
from tqdm import tqdm
import logging
logging.getLogger().setLevel(logging.CRITICAL)
import warnings
warnings.filterwarnings('ignore')
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')
Next, we define the dataset class for appropriately formatting each input example. Since we are using an autoregressive model to generate text conditioned on a prefix, we define a trigger token <DEFINE> that separates the word from its associated definition. We also add the start- and end-of-text tokens to each input example so the model learns where a record begins and ends. We then create a data loader over the dataset with a batch size of 4 and shuffling enabled, making the model robust to any ordering patterns that might exist in the original dataset.
from torch.utils.data import Dataset, DataLoader
import os
import json
import csv

class GlossaryDataset(Dataset):
    def __init__(self, dataframe):
        super().__init__()
        self.data_list = []
        self.end_of_text_token = "<|endoftext|>"
        self.start_of_text_token = "<|startoftext|>"
        for i in range(dataframe.shape[0]):
            # Format each record as: <|startoftext|> word <DEFINE> definition <|endoftext|>
            data_str = (
                f"{self.start_of_text_token} {dataframe.iloc[i]['word']} "
                f"<DEFINE> {dataframe.iloc[i]['definition']} "
                f"{self.end_of_text_token}"
            )
            self.data_list.append(data_str)

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, item):
        return self.data_list[item]

dataset = GlossaryDataset(dataframe=new_train)
data_loader = DataLoader(dataset, batch_size=4, shuffle=True)
Next, we define our optimizer, scheduler, and other parameters.
from transformers import AdamW, get_linear_schedule_with_warmup

EPOCHS = 10
LEARNING_RATE = 2e-5
WARMUP_STEPS = 100  # assumed value; not specified in the original post

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
model = model.to(device)
model.train()

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
# A linear warmup/decay schedule is assumed here, since the training loop below calls scheduler.step()
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=EPOCHS * len(data_loader))
Finally, we write our training loop for performing forward and backward passes, and execute training.
for epoch in range(EPOCHS):
    print(f'Running epoch {epoch}')
    for idx, sample in enumerate(data_loader):
        # Encode the first formatted record of the batch and move it to the training device
        sample_tsr = torch.tensor(tokenizer.encode(sample[0]))
        sample_tsr = sample_tsr.unsqueeze(0).to(device)
        # Language-modeling loss: the labels are the inputs shifted internally by the model
        outputs = model(sample_tsr, labels=sample_tsr)
        loss = outputs[0]
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        model.zero_grad()
Training loop for GPT-2 Definition generation
In the below image, we present some of the results from our GPT-2 based definition generation at inference time.
Definition generated from our fine-tuned GPT-2 language model
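The inference step itself is not shown in the snippets above. A minimal generation sketch might look like the following, where the decoding parameters (top_k, top_p, temperature, max_length) are assumed values rather than settings from the original post:

def generate_definition(word, max_length=60):
    # Condition the fine-tuned model on the same prefix format used during training
    model.eval()
    prompt = f"<|startoftext|> {word} <DEFINE>"
    input_ids = torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).to(device)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            do_sample=True,          # sampling parameters below are assumed values
            top_k=50,
            top_p=0.95,
            temperature=0.8,
            pad_token_id=tokenizer.eos_token_id,
        )
    text = tokenizer.decode(output[0], skip_special_tokens=False)
    # Keep only the part after the <DEFINE> prompt token and before <|endoftext|>
    return text.split('<DEFINE>')[-1].split('<|endoftext|>')[0].strip()

print(generate_definition('gradient descent'))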
Concluding thoughts
In this blog, we discussed a seed approach for extracting glossaries and related definitions using natural language processing techniques, mostly in an unsupervised fashion. We believe this can lay the foundation for more sophisticated and robust pipelines. We also think that evaluating such a task only on objective measures such as precision and recall is not entirely justified; it should ultimately be evaluated by humans, because the author of a book also considers the audience, demographics, culture, and so on that they are targeting when compiling a glossary list.
I hope you enjoyed reading this article. Thank you!