Introduction
Although data science projects often employ large amounts of numeric data, some projects examine patterns within text and require a different set of tools. In this code-through tutorial, we are going to explore several R packages that enable researchers to analyze qualitative data sets and discover cool patterns. We are also going to create two word clouds based on movie plot summaries. In order to complete this tutorial, you will need access to R or RStudio. If you are not familiar with either of these programs or would like a refresher on the basics, check out the first steps of my previous code-through here. I walk you through the entire set-up process, beginning with the software. If you're all set to go with R and RStudio, then proceed below:
Set-Up
Library of Packages
Before we begin analyzing our data and creating word clouds, we first need to load a few packages into our 'library'. Specifically, we need the following packages:
library(dplyr) # helps organize our data
library(kableExtra) # creates elegant tables for output
library(quanteda) # processes our textual data for analysis
library(wordcloud2) # creates the word clouds
You can easily install each of these packages with the following code, replacing nameofpackage with the name of the package:
install.packages("nameofpackage")
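If you would rather not install the packages one at a time, the following is a minimal sketch of one way to find out which of them are still missing. The helper name `missing_packages` is hypothetical (it is not part of any package), and the actual install call is left commented out because it needs an internet connection.

```r
# Return the packages from `pkgs` that are not yet installed.
# `installed` defaults to the names of everything in the current library.
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

needed <- c("dplyr", "kableExtra", "quanteda", "wordcloud2")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install whatever is missing (requires an internet connection):
# if (length(to_install) > 0) install.packages(to_install)
```

`install.packages()` accepts a character vector, so one call can install several packages at once.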
Data
Now that we have our packages installed, the next thing that we need is our dataset. For the purposes of this code-through we are going to be using a dataset of over 5000 movies from IMDB that we will access from data.world.
We will click the blue Explore this dataset button and then scroll down to select the IMDBdata_MainData.csv dataset.
Next, we will click the download button where we will select the download URL for R.
Data.World provides us with the exact code we need to get started and import our data. Therefore, we will copy and paste it into a code block as follows:
df <- read.csv("https://query.data.world/s/rr46ndg7fyne54q7oonmvzxbaxg3zn", header = TRUE, stringsAsFactors = FALSE)
Preview Dataset
A quick peek at the column names of the dataset reveals the various fields available for us to use to explore the 5,000+ movies in the set.
colnames(df)
[1] "Title" "Year" "Rated" "Released"
[5] "Runtime" "Genre" "Director" "Writer"
[9] "Actors" "Plot" "Language" "Country"
[13] "Awards" "Poster" "Ratings.Source" "Ratings.Value"
[17] "Metascore" "imdbRating" "imdbVotes" "imdbID"
[21] "Type" "DVD" "BoxOffice" "Production"
[25] "Website" "Response" "tomatoURL"
For ease of reference, we will change all of these column names to lowercase with the following function:
colnames(df) <- tolower(colnames(df))
colnames(df)
[1] "title" "year" "rated" "released"
[5] "runtime" "genre" "director" "writer"
[9] "actors" "plot" "language" "country"
[13] "awards" "poster" "ratings.source" "ratings.value"
[17] "metascore" "imdbrating" "imdbvotes" "imdbid"
[21] "type" "dvd" "boxoffice" "production"
[25] "website" "response" "tomatourl"
For our analysis, we will focus specifically on the movie titles, genres, and plots, so we will create a smaller dataset with only these variables:
dat <- df[c("title", "genre", "plot")]
A preview of this dataset gives us a glimpse into what we're working with:
head(dat) %>% kable() %>% kable_styling()
title | genre | plot |
---|---|---|
Code Name: K.O.Z. | Crime, Mystery | A look at the 17-25 December 2013 corruption scandal in Turkey, from the viewpoint of the Erdogan government. |
Saving Christmas | Comedy, Family | Kirk is enjoying the annual Christmas party extravaganza thrown by his sister until he realizes he needs to help out Christian, his brother-in-law, who has a bad case of the bah-humbugs. ... |
Superbabies: Baby Geniuses 2 | Comedy, Family, Sci-Fi | A group of smart-talking toddlers find themselves at the center of a media mogul's experiment to crack the code to baby talk. The toddlers must race against time for the sake of babies everywhere. |
Daniel der Zauberer | Comedy, Crime, Fantasy | Evil assassins want to kill Daniel Kublbock, the third runner up for the German Idols. |
Manos: The Hands of Fate | Horror | A family gets lost on the road and stumbles upon a hidden, underground, devil-worshiping cult led by the fearsome Master and his servant Torgo. |
Pledge This! | Comedy | At South Beach University, a beautiful sorority president takes in a group of unconventional freshman girls seeking acceptance into her house. |
Now we will create an even smaller dataset of only romance movies using the grep() function, which allows us to search through text for specific keywords. Here, we will search for all movies that list romance as one of their genres, whether on its own or alongside others:
grep(pattern = "romance", x = dat$genre, value = TRUE, ignore.case = TRUE) %>% head() %>% kable() %>% kable_styling()
x |
---|
Horror, Romance, Thriller |
Comedy, Romance, Sport |
Comedy, Romance |
Comedy, Musical, Romance |
Drama, Romance, Thriller |
Drama, Music, Romance |
Using the grepl() function, we will find all 927 movies that belong to the romance genre:
romance <- grepl("romance", dat$genre, ignore.case = TRUE)
sum(romance)
[1] 927
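The difference between grep() and grepl() is easiest to see on a tiny example. The sketch below uses a made-up vector of genre strings (not the IMDB data) to show what each call returns.

```r
genres <- c("Comedy, Romance", "Horror", "Drama, Romance, Thriller", "Crime, Mystery")

# grep() with value = TRUE returns the matching strings themselves
matches <- grep("romance", genres, value = TRUE, ignore.case = TRUE)

# grep() without value = TRUE returns the positions of the matches
positions <- grep("romance", genres, ignore.case = TRUE)

# grepl() returns one TRUE/FALSE per element, which is what row subsetting needs
is_romance <- grepl("romance", genres, ignore.case = TRUE)

print(matches)
print(positions)
print(is_romance)
print(sum(is_romance))  # TRUE counts as 1, so this is the number of matches
```

Because grepl() always returns a logical vector as long as its input, it can be used directly to pick rows out of a data frame.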
We will use this logical vector to segment out the romance movies:
dat.romance <- dat[romance, c("title", "genre", "plot")]
dat.romance %>% head(10) %>% kable() %>% kable_styling()
 | title | genre | plot |
---|---|---|---|
9 | Birdemic: Shock and Terror | Horror, Romance, Thriller | A horde of mutated birds descends upon the quiet town of Half Moon Bay, California. With the death toll rising, Two citizens manage to fight back, but will they survive Birdemic? |
10 | Dream.net | Comedy, Romance, Sport | Regina, the once popular girl has to make new friends at her new, conservative school. Problems arrive when she becomes enemies with Lívia, the school's queen bee, and falls in love with ... |
14 | The Hottie & the Nottie | Comedy, Romance | A woman agrees to go on a date with a man only if he finds a suitor for her unattractive best friend. |
16 | From Justin to Kelly | Comedy, Musical, Romance | A waitress from Texas and a college student from Pennsylvania meet during spring break in Fort Lauderdale, Florida and come together through their shared love of singing. |
23 | Ben & Arthur | Drama, Romance, Thriller | A pair of recently married gay men are threatened by one of the partners' brother, a religious fanatic who plots to murder them after being ostracized by his church. |
32 | Glitter | Drama, Music, Romance | A young singer dates a disc jockey who helps her get into the music business, but their relationship become complicated as she ascends to super stardom. |
36 | Space Mutiny | Action, Adventure, Romance | A pilot is the only hope to stop the mutiny of a spacecraft by its security crew, who plot to sell the crew of the ship into slavery. |
51 | Gigli | Comedy, Crime, Romance | The violent story about how a criminal lesbian, a tough-guy hit-man with a heart of gold, and a mentally challenged man came to be best friends through a hostage. |
80 | A Story About Love | Romance | Two young people stand on a street corner in a run-down part of New York, kissing. Despite the lawlessness of the district they are left unmolested. A short distance away walk Maria and ... |
97 | The Bat People | Horror, Romance | After being bitten by a bat in a cave, a doctor undergoes an accelerating transformation into a man-bat, which ruins his vacation and causes considerable distress for his wife. |
Now that we have a collection of romance movie titles, we will focus on examining the plots of the movies to see if there are any similarities or common keywords across the summaries:
corp.romance <- corpus(dat.romance, docid_field = "title", text_field = "plot")
corp.romance
Corpus consisting of 927 documents and 1 docvar.
corp.romance[1:5] %>% kable() %>% kable_styling()
 | x |
---|---|
Birdemic: Shock and Terror | A horde of mutated birds descends upon the quiet town of Half Moon Bay, California. With the death toll rising, Two citizens manage to fight back, but will they survive Birdemic? |
Dream.net | Regina, the once popular girl has to make new friends at her new, conservative school. Problems arrive when she becomes enemies with Lívia, the school's queen bee, and falls in love with ... |
The Hottie & the Nottie | A woman agrees to go on a date with a man only if he finds a suitor for her unattractive best friend. |
From Justin to Kelly | A waitress from Texas and a college student from Pennsylvania meet during spring break in Fort Lauderdale, Florida and come together through their shared love of singing. |
Ben & Arthur | A pair of recently married gay men are threatened by one of the partners' brother, a religious fanatic who plots to murder them after being ostracized by his church. |
# summarize corpus
summary(corp.romance)[1:10, ] %>% kable() %>% kable_styling()
Text | Types | Tokens | Sentences | genre |
---|---|---|---|---|
Birdemic: Shock and Terror | 32 | 36 | 2 | Horror, Romance, Thriller |
Dream.net | 31 | 40 | 2 | Comedy, Romance, Sport |
The Hottie & the Nottie | 21 | 23 | 1 | Comedy, Romance |
From Justin to Kelly | 27 | 29 | 1 | Comedy, Musical, Romance |
Ben & Arthur | 30 | 32 | 1 | Drama, Romance, Thriller |
Glitter | 28 | 28 | 1 | Drama, Music, Romance |
Space Mutiny | 24 | 30 | 1 | Action, Adventure, Romance |
Gigli | 27 | 32 | 1 | Comedy, Crime, Romance |
A Story About Love | 32 | 39 | 3 | Romance |
The Bat People | 27 | 32 | 1 | Horror, Romance |
# Process text
# drop any sentences that contain fewer than 1 token (i.e., empty sentences) from the plot summaries
corp.romance <- corpus_trim(corp.romance, what = "sentences", min_ntoken = 1)
corp.romance
Corpus consisting of 927 documents and 1 docvar.
# remove punctuation
tokens.romance <- tokens(corp.romance, what = "word", remove_punct = TRUE)
head(tokens.romance)
tokens from 6 documents.
Birdemic: Shock and Terror :
[1] "A" "horde" "of" "mutated" "birds"
[6] "descends" "upon" "the" "quiet" "town"
[11] "of" "Half" "Moon" "Bay" "California"
[16] "With" "the" "death" "toll" "rising"
[21] "Two" "citizens" "manage" "to" "fight"
[26] "back" "but" "will" "they" "survive"
[31] "Birdemic"
Dream.net :
[1] "Regina" "the" "once" "popular" "girl"
[6] "has" "to" "make" "new" "friends"
[11] "at" "her" "new" "conservative" "school"
[16] "Problems" "arrive" "when" "she" "becomes"
[21] "enemies" "with" "Lívia" "the" "school's"
[26] "queen" "bee" "and" "falls" "in"
[31] "love" "with"
The Hottie & the Nottie :
[1] "A" "woman" "agrees" "to" "go"
[6] "on" "a" "date" "with" "a"
[11] "man" "only" "if" "he" "finds"
[16] "a" "suitor" "for" "her" "unattractive"
[21] "best" "friend"
From Justin to Kelly :
[1] "A" "waitress" "from" "Texas" "and"
[6] "a" "college" "student" "from" "Pennsylvania"
[11] "meet" "during" "spring" "break" "in"
[16] "Fort" "Lauderdale" "Florida" "and" "come"
[21] "together" "through" "their" "shared" "love"
[26] "of" "singing"
Ben & Arthur :
[1] "A" "pair" "of" "recently" "married"
[6] "gay" "men" "are" "threatened" "by"
[11] "one" "of" "the" "partners" "brother"
[16] "a" "religious" "fanatic" "who" "plots"
[21] "to" "murder" "them" "after" "being"
[26] "ostracized" "by" "his" "church"
Glitter :
[1] "A" "young" "singer" "dates" "a"
[6] "disc" "jockey" "who" "helps" "her"
[11] "get" "into" "the" "music" "business"
[16] "but" "their" "relationship" "become" "complicated"
[21] "as" "she" "ascends" "to" "super"
[26] "stardom"
# convert to lower case
tokens.romance <- tokens_tolower(tokens.romance, keep_acronyms = TRUE)
head(tokens.romance)
tokens from 6 documents.
Birdemic: Shock and Terror :
[1] "a" "horde" "of" "mutated" "birds"
[6] "descends" "upon" "the" "quiet" "town"
[11] "of" "half" "moon" "bay" "california"
[16] "with" "the" "death" "toll" "rising"
[21] "two" "citizens" "manage" "to" "fight"
[26] "back" "but" "will" "they" "survive"
[31] "birdemic"
Dream.net :
[1] "regina" "the" "once" "popular" "girl"
[6] "has" "to" "make" "new" "friends"
[11] "at" "her" "new" "conservative" "school"
[16] "problems" "arrive" "when" "she" "becomes"
[21] "enemies" "with" "lívia" "the" "school's"
[26] "queen" "bee" "and" "falls" "in"
[31] "love" "with"
The Hottie & the Nottie :
[1] "a" "woman" "agrees" "to" "go"
[6] "on" "a" "date" "with" "a"
[11] "man" "only" "if" "he" "finds"
[16] "a" "suitor" "for" "her" "unattractive"
[21] "best" "friend"
From Justin to Kelly :
[1] "a" "waitress" "from" "texas" "and"
[6] "a" "college" "student" "from" "pennsylvania"
[11] "meet" "during" "spring" "break" "in"
[16] "fort" "lauderdale" "florida" "and" "come"
[21] "together" "through" "their" "shared" "love"
[26] "of" "singing"
Ben & Arthur :
[1] "a" "pair" "of" "recently" "married"
[6] "gay" "men" "are" "threatened" "by"
[11] "one" "of" "the" "partners" "brother"
[16] "a" "religious" "fanatic" "who" "plots"
[21] "to" "murder" "them" "after" "being"
[26] "ostracized" "by" "his" "church"
Glitter :
[1] "a" "young" "singer" "dates" "a"
[6] "disc" "jockey" "who" "helps" "her"
[11] "get" "into" "the" "music" "business"
[16] "but" "their" "relationship" "become" "complicated"
[21] "as" "she" "ascends" "to" "super"
[26] "stardom"
In order to make the data set more precise, we will remove stop words (very common words such as articles, pronouns, and prepositions) from the data set:
tokens.romance <- tokens_remove(tokens.romance, c(stopwords("english"), "nbsp"), padding = FALSE)
head(tokens.romance)
tokens from 6 documents.
Birdemic: Shock and Terror :
[1] "horde" "mutated" "birds" "descends" "upon"
[6] "quiet" "town" "half" "moon" "bay"
[11] "california" "death" "toll" "rising" "two"
[16] "citizens" "manage" "fight" "back" "survive"
[21] "birdemic"
Dream.net :
[1] "regina" "popular" "girl" "make" "new"
[6] "friends" "new" "conservative" "school" "problems"
[11] "arrive" "becomes" "enemies" "lívia" "school's"
[16] "queen" "bee" "falls" "love"
The Hottie & the Nottie :
[1] "woman" "agrees" "go" "date" "man"
[6] "finds" "suitor" "unattractive" "best" "friend"
From Justin to Kelly :
[1] "waitress" "texas" "college" "student" "pennsylvania"
[6] "meet" "spring" "break" "fort" "lauderdale"
[11] "florida" "come" "together" "shared" "love"
[16] "singing"
Ben & Arthur :
[1] "pair" "recently" "married" "gay" "men"
[6] "threatened" "one" "partners" "brother" "religious"
[11] "fanatic" "plots" "murder" "ostracized" "church"
Glitter :
[1] "young" "singer" "dates" "disc" "jockey"
[6] "helps" "get" "music" "business" "relationship"
[11] "become" "complicated" "ascends" "super" "stardom"
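To get a feel for what the cleaning steps so far are doing (stripping punctuation, splitting into words, lower-casing, and dropping stop words), here is a rough base-R sketch on a single shortened sentence. It is not how quanteda works internally, and the stop-word list `toy_stopwords` is a hypothetical, much shorter stand-in for stopwords("english").

```r
plot_text <- "A woman agrees to go on a date with a man only if he finds a suitor."

# Hypothetical, very short stop-word list for illustration only;
# quanteda's stopwords("english") is much longer.
toy_stopwords <- c("a", "to", "on", "with", "only", "if", "he")

no_punct <- gsub("[[:punct:]]", "", plot_text)   # 1. strip punctuation
words <- strsplit(no_punct, "\\s+")[[1]]         # 2. split on whitespace
words <- tolower(words)                          # 3. convert to lower case
kept <- words[!words %in% toy_stopwords]         # 4. drop stop words

print(kept)
```

The order matters: lower-casing before the stop-word step means that "A" at the start of the sentence is caught by the lower-case entry "a" in the list.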
In order to consolidate similar words, we will stem the data set:
# stem the words in the token list:
tokens.romance <- tokens_wordstem(tokens.romance)
head(tokens.romance)
tokens from 6 documents.
Birdemic: Shock and Terror :
[1] "hord" "mutat" "bird" "descend" "upon"
[6] "quiet" "town" "half" "moon" "bay"
[11] "california" "death" "toll" "rise" "two"
[16] "citizen" "manag" "fight" "back" "surviv"
[21] "birdem"
Dream.net :
[1] "regina" "popular" "girl" "make" "new" "friend" "new"
[8] "conserv" "school" "problem" "arriv" "becom" "enemi" "lívia"
[15] "school" "queen" "bee" "fall" "love"
The Hottie & the Nottie :
[1] "woman" "agre" "go" "date" "man" "find"
[7] "suitor" "unattract" "best" "friend"
From Justin to Kelly :
[1] "waitress" "texa" "colleg" "student" "pennsylvania"
[6] "meet" "spring" "break" "fort" "lauderdal"
[11] "florida" "come" "togeth" "share" "love"
[16] "sing"
Ben & Arthur :
[1] "pair" "recent" "marri" "gay" "men" "threaten"
[7] "one" "partner" "brother" "religi" "fanat" "plot"
[13] "murder" "ostrac" "church"
Glitter :
[1] "young" "singer" "date" "disc" "jockey"
[6] "help" "get" "music" "busi" "relationship"
[11] "becom" "complic" "ascend" "super" "stardom"
We can see which words commonly occur together:
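# find frequently co-occurring words (typically compound words)
# note: in quanteda 3.0 and later, textstat_frequency() is provided by the quanteda.textstats package
ngram.romance <- tokens_ngrams(tokens.romance, n = 2) %>% dfm()
ngram.romance %>% textstat_frequency(n = 10) %>% kable() %>% kable_styling()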
feature | frequency | rank | docfreq | group |
---|---|---|---|---|
fall_love | 57 | 1 | 57 | all |
new_york | 37 | 2 | 36 | all |
young_man | 32 | 3 | 32 | all |
young_woman | 28 | 4 | 28 | all |
high_school | 28 | 4 | 27 | all |
best_friend | 27 | 6 | 26 | all |
world_war | 14 | 7 | 14 | all |
york_citi | 12 | 8 | 12 | all |
true_love | 10 | 9 | 10 | all |
war_ii | 9 | 10 | 9 | all |
ngram.romance3 <- tokens_ngrams(tokens.romance, n = 3) %>% dfm()
ngram.romance3 %>% textstat_frequency(n = 10) %>% kable() %>% kable_styling()
feature | frequency | rank | docfreq | group |
---|---|---|---|---|
new_york_citi | 12 | 1 | 12 | all |
world_war_ii | 9 | 2 | 9 | all |
two_best_friend | 5 | 3 | 5 | all |
woman_fall_love | 4 | 4 | 4 | all |
fall_love_woman | 4 | 4 | 4 | all |
meet_fall_love | 3 | 6 | 3 | all |
drama_center_around | 3 | 6 | 3 | all |
high_school_crush | 3 | 6 | 3 | all |
young_woman_find | 3 | 6 | 3 | all |
experi_chang_live | 3 | 6 | 3 | all |
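The two- and three-word features in these tables are simply runs of neighbouring tokens joined with an underscore. As an illustration only (this is a base-R sketch, not quanteda's implementation), the hypothetical helper `make_ngrams` below builds them from a short vector of stemmed tokens.

```r
# Join every run of n consecutive tokens with an underscore.
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens <- c("meet", "fall", "love", "new", "york", "citi")
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```

Because stop words were removed before the n-grams were built, a phrase such as "fall in love" shows up as fall_love rather than as a three-word feature.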
Finally we can see which words are most common in the romance movie plot summaries:
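# note: the tokens were already stemmed above, and the stem argument of dfm() is deprecated in quanteda 3.0 and later
tokens.romance %>% dfm(stem = TRUE) %>% topfeatures() %>% kable() %>% kable_styling()
 | x |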
---|---|
love | 182 |
young | 146 |
woman | 144 |
man | 138 |
life | 112 |
fall | 93 |
find | 93 |
friend | 91 |
two | 90 |
new | 88 |
Unsurprisingly(!!) love is the number one most frequently used word followed by young, woman, and man.
Word Clouds
Now that we’ve previewed our data step-by-step, we can create a word cloud. We will repeat some of our previous steps in a more automated format using the tm package. If you do not have it installed, once again you can install it via install.packages("tm").
library(tm) # package for processing text
Romance Movies
For our first word cloud, we will create a new object containing only the plots from our romance movies:
# Create a vector containing only the text
romance.plot <- dat.romance$plot
Then we will create a new Corpus consisting solely of the plot:
# Create a corpus
romance.docs <- Corpus(VectorSource(romance.plot))
romance.docs
<<SimpleCorpus>> Metadata: corpus specific: 1, document level (indexed): 0 Content: documents: 927
Afterwards, we will go through the same kinds of steps that we did above to clean our data set:
romance.docs <- romance.docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
romance.docs <- tm_map(romance.docs, content_transformer(tolower))
romance.docs <- tm_map(romance.docs, removeWords, stopwords("english"))
And once again find our most frequently used words:
romance.dtm <- TermDocumentMatrix(romance.docs)
romance.matrix <- as.matrix(romance.dtm)
romance.words <- sort(rowSums(romance.matrix), decreasing = TRUE)
romance.df <- data.frame(word = names(romance.words), freq = romance.words)
romance.df %>% head(10) %>% kable() %>% kable_styling()
 | word | freq |
---|---|---|
love | love | 174 |
young | young | 146 |
woman | woman | 133 |
man | man | 132 |
life | life | 110 |
two | two | 90 |
new | new | 88 |
friends | friends | 62 |
one | one | 62 |
family | family | 62 |
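A term-document matrix has one row per word and one column per document, so summing across each row gives a word's total frequency. The sketch below rebuilds that idea in base R on three made-up "documents"; it is only an illustration of the structure, not of how tm builds its matrix.

```r
# Three tiny, made-up documents that have already been tokenized
docs <- list(
  c("love", "young", "woman"),
  c("young", "man", "love", "love"),
  c("man", "life")
)

vocab <- sort(unique(unlist(docs)))

# One row per term, one column per document; each cell holds a count
tdm <- vapply(docs,
              function(d) as.integer(table(factor(d, levels = vocab))),
              integer(length(vocab)))
rownames(tdm) <- vocab
colnames(tdm) <- paste0("doc", seq_along(docs))

# Row sums give each term's total frequency across all documents
freq <- sort(rowSums(tdm), decreasing = TRUE)
word_df <- data.frame(word = names(freq), freq = freq)

print(tdm)
print(word_df)
```

The data frame keeps the words as row names as well as in the word column, which is why each word appears twice in the printed table above.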
Finally, we will create a color palette and voilà! We’ve created a word cloud based on over 900 romance movie plots:
pallete <- mutate(romance.df,
  color = cut(freq,
    breaks = c(0, 40, 80, 120, 160, Inf),
    labels = c("#FFCAE5", "#BF1168", "#EA72C4", "#FF167F", "#A40000"), # each hex code needs its leading "#"
    include.lowest = TRUE))
romance.word.cloud <- wordcloud2(data = pallete, color = pallete$color)
romance.word.cloud
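The coloring relies on cut(), which sorts each frequency into an interval and returns the label attached to that interval. Here is a small sketch with a handful of frequencies; the five-color scale `shades` is a hypothetical choice for illustration and is ordered from lightest to darkest.

```r
freq <- c(174, 146, 88, 62, 12)

# Hypothetical five-color scale, lightest to darkest
shades <- c("#FFCAE5", "#EA72C4", "#FF167F", "#BF1168", "#A40000")

# cut() assigns each frequency to one of the intervals
# [0,40], (40,80], (80,120], (120,160], (160,Inf]
bins <- cut(freq,
            breaks = c(0, 40, 80, 120, 160, Inf),
            labels = shades,
            include.lowest = TRUE)

# cut() returns a factor; as.character() recovers the plain hex strings
colours <- as.character(bins)
print(data.frame(freq, colour = colours))
```

With n break points there are n - 1 intervals, so the number of labels must always be one fewer than the number of breaks.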
If we would like to save our widget, we can save it as an HTML, PNG, or PDF file.
# save it in html
library(htmlwidgets)
library(webshot)
saveWidget(romance.word.cloud, "romancewordcloud.html", selfcontained = FALSE)
# and in png or pdf (webshot needs PhantomJS; run webshot::install_phantomjs() once if it is missing)
webshot("romancewordcloud.html", "romancewordcloud.png", delay = 60, vwidth = 960, vheight = 480)
Crime Movies
Finally, for a comparison word cloud, we will create a second data set for crime movies to find out how keywords for this genre differ from romance movies:
grep(pattern = "crime", x = dat$genre, value = TRUE, ignore.case = TRUE) %>% head() %>% kable() %>% kable_styling()
x |
---|
Crime, Mystery |
Comedy, Crime, Fantasy |
Action, Crime, Drama |
Comedy, Crime, Romance |
Crime, Drama, Music |
Comedy, Crime, Family |
We find that there are 930 movies with crime listed as a genre:
crime <- grepl("crime", dat$genre, ignore.case = TRUE)
sum(crime)
[1] 930
Once again, we will use this logical vector to segment out the crime movies:
dat.crime <- dat[crime, c("title", "genre", "plot")]
dat.crime %>% head(10) %>% kable() %>% kable_styling()
 | title | genre | plot |
---|---|---|---|
1 | Code Name: K.O.Z. | Crime, Mystery | A look at the 17-25 December 2013 corruption scandal in Turkey, from the viewpoint of the Erdogan government. |
4 | Daniel der Zauberer | Comedy, Crime, Fantasy | Evil assassins want to kill Daniel Kublbock, the third runner up for the German Idols. |
46 | Final Justice | Action, Crime, Drama | Homicidal Sheriff Thomas Jefferson Geronimo is tasked with escorting a mobster to Malta; when the prisoner escapes, Geronimo goes rogue to catch him. |
51 | Gigli | Comedy, Crime, Romance | The violent story about how a criminal lesbian, a tough-guy hit-man with a heart of gold, and a mentally challenged man came to be best friends through a hostage. |
63 | Girl in Gold Boots | Crime, Drama, Music | A young woman leaves her job as a waitress and travels to Los Angeles, where she strives to become the top star in the glamorous world of go-go dancing. |
69 | Baby Geniuses | Comedy, Crime, Family | Scientist hold talking, super-intelligent babies captive, but things take a turn for the worse when a mix-up occurs between a baby genius and its twin. |
75 | Tees Maar Khan | Comedy, Crime | Posing as a movie producer, a conman attempts to trick an entire village into helping him rob a treasure-laden train. |
86 | Mitchell | Action, Crime, Drama | A sleazy, incompetent detective tries to simultaneously take down heroin dealers and a socialite who murdered a burglar. |
96 | I Accuse My Parents | Crime, Drama | Young man goes to work for gangsters to impress his nightclub-singer girlfriend. |
99 | Ghosts Can't Do It | Comedy, Crime, Fantasy | Elderly Scott kills himself after a heart attack wrecks his body, but then comes back as a ghost and convinces his loving young hot wife Kate to pick and kill a young man in order for Scott to possess his body and be with her again. |
Now we extract the crime plots:
# Create a vector containing only the text
crime.plot <- dat.crime$plot
Create a crime corpus:
# Create a corpus
crime.docs <- Corpus(VectorSource(crime.plot))
crime.docs
<<SimpleCorpus>> Metadata: corpus specific: 1, document level (indexed): 0 Content: documents: 930
crime.docs <- crime.docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
crime.docs <- tm_map(crime.docs, content_transformer(tolower))
crime.docs <- tm_map(crime.docs, removeWords, stopwords("english"))
crime.dtm <- TermDocumentMatrix(crime.docs)
crime.matrix <- as.matrix(crime.dtm)
crime.words <- sort(rowSums(crime.matrix), decreasing = TRUE)
crime.df <- data.frame(word = names(crime.words), freq = crime.words)
crime.df %>% head(10) %>% kable() %>% kable_styling()
 | word | freq |
---|---|---|
man | man | 109 |
young | young | 89 |
two | two | 86 |
police | police | 83 |
one | one | 77 |
life | life | 69 |
murder | murder | 60 |
new | new | 59 |
crime | crime | 57 |
drug | drug | 52 |
For crime movies, we find that our most frequent word is man, and young comes in second place again. The third most frequently used word here is 'two', but police is not far behind in fourth place, as would be expected in a crime movie. Now we create a new color palette and just like that we have a second word cloud based on over 900 crime movie plots:
crime.pallete <- mutate(crime.df,
  color = cut(freq,
    breaks = c(0, 25, 50, 75, 100, Inf),
    labels = c("#C9D1D6", "#616D7A", "#5B83AD", "#1E1E1E", "#13365B"),
    include.lowest = TRUE))
crime.word.cloud <- wordcloud2(data = crime.pallete, color = crime.pallete$color)
crime.word.cloud
# save it in html
library(htmlwidgets)
library(webshot)
saveWidget(crime.word.cloud, "crimewordcloud.html", selfcontained = FALSE)
# and in png or pdf
webshot("crimewordcloud.html", "crimewordcloud.png", delay = 60, vwidth = 960, vheight = 480)
Comparison and Conclusion:
Now that we have completed this tutorial, we have been able to discover some similarities in individual words between the two word clouds: young, man, two, life, new, and one all appear among the ten most frequent words for both genres.
However, the collective data from both sets of plots clearly displays how some of the same keywords can carry very different meanings between genres. We have also learned two methods for processing data: the first method, using grep(), grepl(), and the quanteda package, is quite a bit longer than using the tm package, but it has the perk of letting us see our data at every stage. By contrast, tm does not provide as much insight into the textual analysis, but for quick projects like the word clouds, it makes the process much easier!
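The overlap between the two genres can also be checked in code. The sketch below copies the ten most frequent words from the two tm frequency tables shown earlier into plain vectors (typed in by hand here rather than pulled from romance.df and crime.df) and compares them with base R's set functions.

```r
# Top ten words from the romance and crime frequency tables above
romance_top <- c("love", "young", "woman", "man", "life",
                 "two", "new", "friends", "one", "family")
crime_top <- c("man", "young", "two", "police", "one",
               "life", "murder", "new", "crime", "drug")

shared <- intersect(romance_top, crime_top)        # in both top tens
only_romance <- setdiff(romance_top, crime_top)    # romance only
only_crime <- setdiff(crime_top, romance_top)      # crime only

print(shared)
print(only_romance)
print(only_crime)
```

With the full data frames loaded, the same comparison could be run on head(romance.df$word, 10) and head(crime.df$word, 10).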
Resources
Now that you have created your first word cloud, I encourage you to explore the features and try it out yourself! The following links contain tutorials and resources to guide you each step of the way:
- Create a word cloud with r. This tutorial from ‘towards data science’ by Celine Van den Rul served as the inspiration for this code-through and includes great step-by-step details on all of the features of wordcloud2() as well as its more basic predecessor wordcloud(). Both packages have unique benefits!
- Wordcloud. This article focuses on the first wordcloud() and has lots of great insight.
- R Graph Gallery: Wordcloud2. The R Graph Gallery has excellent tutorials on various data visualization strategies in R and is a wonderful guide for wordcloud2.
- CRAN also has wonderful resources for wordcloud2 available for view here in PDF form and here in website form.
- Webshot is a neat package that allows us to create static saved images of the ever-changing word clouds that we create.
- Coolors is a great site to find colors for your word cloud palettes.
- Last, but certainly not least, Data.World is an excellent place to find data on just about any subject and the pre-configured R URL download links are especially helpful!
About The Author
This code-through tutorial was created by Courtney Stowers for CPP 527 Data Science 2 in the Master of Science in Program Evaluation and Data Analytics program at Arizona State University. She can be contacted at: courtney.stowers@asu.edu.