Phase 2 in Weka
Let’s now move to Weka for classification, but let us first get the unigrams from R.
Create a new script from File -> New Script, and save it as Unigrams.R.
Copy the code section below into Unigrams.R. It is similar to Bigrams.R.
Code Block: Unigrams.R
#create a vector of all the packages you want to load
packs <- c("tm","tau","qdap","RWeka", "wordcloud")
lapply(packs, require, character.only = TRUE)
#load the dataset as a matrix
inputText = read.csv(file = "pathToFile/AirlineTweets.csv", header = TRUE, sep = ",")
input=as.matrix(inputText)
# remove retweet entities
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", inputText[,1])
# remove at people e.g. @ArabWic
some_txt = gsub("@\\w+", "", some_txt)
# remove punctuation
some_txt = gsub("[[:punct:]]", "", some_txt)
# remove numbers
some_txt = gsub("[[:digit:]]", "", some_txt)
# remove html links
some_txt = gsub("http\\w+", "", some_txt)
# remove unnecessary spaces
some_txt = gsub("[ \t]{2,}", " ", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)
#convert some_txt into a corpus and remove stopwords and punctuations
corp=Corpus(VectorSource(some_txt))
corp=as.VCorpus(corp)
corp=tm_map(corp, removeWords, stopwords('english'))
corp <- tm_map(corp, removePunctuation)
#create the unigrams and Term document matrix and a wordcloud
UnigramTokenizer <- function(y) NGramTokenizer(y, Weka_control(min = 1, max = 1))
tdm <- TermDocumentMatrix(corp, control = list(tokenize = UnigramTokenizer))
#in newer versions of tm, inspect() only prints a preview, so build the matrix explicitly
m = as.matrix(tdm)
library(wordcloud)
set.seed(1234)
v = sort(rowSums(m), decreasing = TRUE)
#you can change the frequency of the terms to a number of your choice.
wordcloud(names(v), v, min.freq = 20)
#prepare to write the csv file
DF <- as.data.frame(m, stringsAsFactors = FALSE)
nrow(DF)
DF=as.matrix(DF)
tdf=as.data.frame(t(DF))
tdf=cbind(tdf,input[,2])
len=ncol(tdf)
header=c(1:(len-1))
header=c(header,"Class")
colnames(tdf)=header
write.csv(tdf, file = "pathToFile/Unigram.csv",row.names=FALSE)
Make sure to change the path to the file.
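Before leaving R, you can sanity-check the cleaning pipeline on a single tweet. The sample text below is made up purely for illustration; it just exercises the same `gsub` chain used in the script:

```r
sample <- "RT @ArabWic: Loved flight #123 http://t.co/abc  great crew!"
sample <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", sample)  # retweet entities
sample <- gsub("@\\w+", "", sample)                        # @mentions
sample <- gsub("[[:punct:]]", "", sample)                  # punctuation
sample <- gsub("[[:digit:]]", "", sample)                  # numbers
sample <- gsub("http\\w+", "", sample)                     # links
sample <- gsub("[ \t]{2,}", " ", sample)                   # collapse runs of spaces
sample <- gsub("^\\s+|\\s+$", "", sample)                  # trim leading/trailing space
sample                                                     # "Loved flight great crew"
```

Note that the link is stripped only after punctuation removal has already collapsed `http://t.co/abc` into a single `http…` word, which is why the order of the substitutions matters.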
Say bye to R, and hello to Weka! :)
FUN: Worried you have Alzheimer’s? Take this test to find out.
> alzheimer_test(char1 = c("9", "O", "M", "I", "F","D"),
char2 = c("6", "C", "N", "T", "E", "O"), nr = 5, nc = 10, seed = NULL)
Applying Filters
Let us start by opening Weka! You will get the following screen.
Click on the explorer tab.
You will see the following screen.
Click on the “Open File” button. This will open a small window.
Browse to the folder where the Unigram.csv file is located.
Select CSV as the option in the “Files Of Type” dropdown list box (shown below).
Wait for the file to load. Once loaded you will see all of the attributes under the “Attributes” area (shown below).
Click on attribute 1. You can see the minimum/maximum value along with other details under “Selected Attribute”. You can see that the type is “numeric”.
Let us now convert the class variable to Nominal. Click on Choose -> filters -> unsupervised -> attributes (scroll down) -> StringToNominal (shown below).
We can stick with the default settings, as the filter is applied to the last column. You can now select the last attribute from the list of attributes; its type will be “nominal”.
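As an aside, if you ever want to do this conversion back in R before writing the CSV, `as.factor` has the same effect as StringToNominal on a data-frame column. A minimal sketch with a made-up two-row frame (the column names and values are invented for illustration):

```r
# Toy frame standing in for the unigram table
df <- data.frame(term1 = c(1, 0),
                 Class = c("positive", "negative"),
                 stringsAsFactors = FALSE)
df$Class <- as.factor(df$Class)   # string -> nominal, like StringToNominal
levels(df$Class)                  # "negative" "positive"
```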
Classification in Weka
You can run various classifiers from Weka.
To start, click on “Classify” tab.
Choose any classifier you want from the list that opens when you click on the “Choose” button. Leave it at the default for the first time.
You can choose between Cross-validation and Percentage split. Leave it at the default for the first time.
Click on “Start” to run the classifier.
This will run the ZeroR classifier by default. Why not choose other classifiers?
You will find bayes -> NaiveBayes, functions -> SMO (an SVM), lazy -> IBk (k-NN), and a whole lot of other classifiers when you click on the Choose button. We will use the J48 tree classifier in the next steps.
So far we have been using cross-validation. Let us now try a percentage split instead, with 90% of the data in the train set and 10% in the test set.
Click on Percentage Split and set it to 90%.
Click on More Options button (just below it).
In the pop-up window, check the option Preserve Order for % split.
Choose the J48 classifier from Choose -> trees -> J48.
Start the classifier.
Congratulations! Your model just achieved an accuracy of ~85%!!
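If you prefer to stay in R, the RWeka package we loaded earlier exposes the same J48 classifier. Here is a minimal sketch of the ordered 90/10 split and J48 run; the file path is the same placeholder used above, and your exact accuracy will differ from run to run:

```r
library(RWeka)

data <- read.csv("pathToFile/Unigram.csv")
data$Class <- as.factor(data$Class)      # nominal class, as in Weka

split <- floor(0.9 * nrow(data))         # 90% train, order preserved
train <- data[1:split, ]
test  <- data[(split + 1):nrow(data), ]

fit  <- J48(Class ~ ., data = train)     # trees -> J48
pred <- predict(fit, newdata = test)
mean(pred == test$Class)                 # accuracy on the held-out 10%
```

This mirrors the Explorer workflow: the factor conversion plays the role of StringToNominal, and taking the first 90% of rows matches the “Preserve order for % split” option.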
Visualize the Tree
Right-click on the J48 entry under the “Result list” column.
Click on Visualize tree.
In the pop-up window that appears, you can see an extremely cluttered tree.
Right-click anywhere on the window and select Auto Scale, Fit to Screen, or Center on Top Node.
You can now view the tree (similar to the one below).