Bag of Words Example
Bag of Words
Bag of words is one of the most basic ways to represent text: it is simply a word counter. Let us look at an example:
1) John likes to watch movies. Mary likes movies too.
2) John also likes to watch football games.
If we count the occurrences of each word:
BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};
OK, let us see how to implement it in Python.
# the dataset we use is called 20 Newsgroups; it consists of 20 groups of short texts
# scikit-learn has a built-in loader for it, just like iris etc.
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train',remove=('headers', 'footers', 'quotes'), shuffle=True, random_state=42)
from pprint import pprint
# here we will only work with the data section, because bag of words does not need labels
pprint(list(newsgroups_train))
The first thing bag-of-words does is build a vocabulary of every word in the corpus and assign a one-hot vector to each of them.
Now that every word is a very large vector with a single 1 in it, we represent each sentence as a vector that has a 1 (or the word count) in the position of every word present in the sentence and a 0 everywhere else.
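To make that concrete, here is a minimal sketch with NumPy, assuming a hand-built vocabulary for the two toy sentences above (the vocabulary and helper names are just illustrative):
import numpy as np
vocab = sorted({'john', 'likes', 'to', 'watch', 'movies', 'mary', 'too', 'also', 'football', 'games'})
word_to_index = {word: i for i, word in enumerate(vocab)}
def one_hot(word):
    # all zeros except a single 1 at the word's index
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec
# a sentence vector is the sum of the one-hot vectors of its words
sentence = 'john likes to watch movies mary likes movies too'.split()
print(dict(zip(vocab, sum(one_hot(w) for w in sentence))))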
from itertools import islice
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))
from sklearn.feature_extraction.text import CountVectorizer
corpus = newsgroups_train.data
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus)  # keep the result sparse; converting it to dense would use a huge amount of memory
print( take(10, vectorizer.vocabulary_.items()) )
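Once fitted, the same vectorizer can encode unseen text; words it never saw during fitting are simply ignored. A quick sketch (the sample sentence is arbitrary):
new_doc = ['I like to watch football games']
new_vec = vectorizer.transform(new_doc)
print(new_vec.shape)  # (1, size of the learned vocabulary)
print(new_vec.nnz)    # number of non-zero entries, i.e. vocabulary words found in the text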
# to demonstrate, let us consider this small corpus
# it has 26 unique words (the default tokenizer drops single-letter tokens like 'a'), so each word vector has 26 dimensions, and every sentence is the sum of the one-hot vectors of its words
corpus = [
'All my cats in a row',
'When my cat sits down, she looks like a Furby toy!',
'The cats from outer space',
'Sunshine loves to sit like this for some reason.'
]
vectorizer = CountVectorizer()
print( vectorizer.fit_transform(corpus).todense() )
print( vectorizer.vocabulary_ )
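To read the matrix more easily, we can map each column back to its word. A small sketch (get_feature_names_out is the newer scikit-learn API; older versions use get_feature_names):
X = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out()
for row, sentence in zip(X.toarray(), corpus):
    present = {feature_names[i]: int(count) for i, count in enumerate(row) if count}
    print(sentence, '->', present)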
Preprocessing
In the code above, the model treats "cat" and "cats" as different words, and our bag of words contains many useless words.
Stopwords
Stopwords are common words like "a", "is", "the" that do not contribute much to the meaning of a sentence.
Stemming
Stemming is the process of removing prefixes and suffixes from words; doing this maps "study" and "studying" to a single word. Bear in mind that lemmatization usually gives better results, but it is harder to implement on your own.
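A quick illustration of both ideas, assuming the NLTK stopword list has already been downloaded:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
print(stemmer.stem('studying'), stemmer.stem('studies'))  # both reduce to the same stem
print('the' in stopwords.words('english'))                # True: 'the' is a stop word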
# we use the stopwords list from NLTK
import nltk
# download the stop words list if you do not have it: nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
def process(input_text):
    # Create a regular expression tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    # Create a Snowball stemmer
    stemmer = SnowballStemmer('english')
    # Get the list of stop words
    stop_words = stopwords.words('english')
    # Tokenize the input string
    tokens = tokenizer.tokenize(input_text.lower())
    # Remove the stop words
    tokens = [x for x in tokens if x not in stop_words]
    # Perform stemming on the tokenized words
    tokens_stemmed = [stemmer.stem(x) for x in tokens]
    return ' '.join(tokens_stemmed)
for index, item in enumerate(newsgroups_train.data):
    newsgroups_train.data[index] = process(item)
len(newsgroups_train.data)
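As a quick sanity check, we can run process() on a made-up sentence and see that stop words disappear and the remaining words are stemmed:
sample = 'The cats were studying their stopwords'
print(process(sample))  # something like: cat studi stopword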
import numpy as np
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
print( X_train.shape )
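To confirm the sparse matrix is usable downstream, here is a minimal sketch that feeds it to a simple classifier (MultinomialNB is just an illustrative choice, not part of the bag-of-words model itself):
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
clf = MultinomialNB()
scores = cross_val_score(clf, X_train, newsgroups_train.target, cv=3)
print(scores.mean())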
The downsides of Bag of words
- The vector assigned to each word is arbitrary: there is no pattern between two semantically similar words and their corresponding vectors.
- Word counts are not normalized: if we want to summarize a document with our BoW model, we would pick the words with the highest counts, and those are mostly common words that are probably repeated in other documents too.
- Sparseness: each word vector is a very long vector with a single 1 and all other entries zero, which is memory-inefficient.
nima moradi, 21/7/2019