Cinema and Clouds

Introduction

Although data science projects often rely on large amounts of numeric data, some projects examine patterns within text and require a different set of tools. In this code-through tutorial, we are going to explore several R packages that enable researchers to analyze qualitative data sets and discover patterns in them. We are also going to create two word clouds based on movie plot summaries. To complete this tutorial, you will need access to R or RStudio. If you are not familiar with either of these tools or would like a refresher on the basics, check out the first steps of my previous code-through here. I walk you through the entire set-up process, beginning with the software. If you're all set to go with R and RStudio, then proceed below:

Set-Up

Library of Packages

Before we begin analyzing our data and creating word clouds, we first need to load a few packages into our 'library'. Specifically, we need the following packages:

library(dplyr)  # helps organize our data
library(kableExtra)  # creates elegant tables for output
library(quanteda)  # processes our textual data for analysis
library(wordcloud2)  # creates the wordclouds

If you are missing any of these packages, you can install each one by passing its name to install.packages():

install.packages("nameofpackage")
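
If several packages are missing, they can also be installed in a single call. The snippet below is a minimal sketch rather than part of the original code-through: missing_packages() is a hypothetical helper name, and the install call itself is left commented out so that nothing is installed by accident.

```r
# Packages used in this tutorial
pkgs <- c("dplyr", "kableExtra", "quanteda", "wordcloud2")

# Hypothetical helper: return the entries of `pkgs` that are not yet installed
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Uncomment to install only what is missing:
# install.packages(missing_packages(pkgs))
```

Checking first avoids re-downloading packages that are already available.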

Data

Now that we have our packages installed, the next thing that we need is our dataset. For the purposes of this code-through we are going to be using a dataset of over 5000 movies from IMDB that we will access from data.world.

We will click the blue Explore this dataset button and then scroll down to select the IMDBdata_MainData.csv dataset. Next, we will click the download button where we will select the download URL for R.

Data.World provides us with the exact code we need to import our data, so we will copy and paste it into a code block as follows:

df <- read.csv("https://query.data.world/s/rr46ndg7fyne54q7oonmvzxbaxg3zn", header = TRUE,
    stringsAsFactors = FALSE)

Preview Dataset

A quick peek at the column names of the dataset reveals the various fields available for us to use to explore the 5,000+ movies in the set.

colnames(df)

 [1] "Title"          "Year"           "Rated"          "Released"      
 [5] "Runtime"        "Genre"          "Director"       "Writer"        
 [9] "Actors"         "Plot"           "Language"       "Country"       
[13] "Awards"         "Poster"         "Ratings.Source" "Ratings.Value" 
[17] "Metascore"      "imdbRating"     "imdbVotes"      "imdbID"        
[21] "Type"           "DVD"            "BoxOffice"      "Production"    
[25] "Website"        "Response"       "tomatoURL"

For ease of reference, we will convert all of these column names to lowercase with the tolower() function:

colnames(df) <- tolower(colnames(df))
colnames(df)

 [1] "title"          "year"           "rated"          "released"      
 [5] "runtime"        "genre"          "director"       "writer"        
 [9] "actors"         "plot"           "language"       "country"       
[13] "awards"         "poster"         "ratings.source" "ratings.value" 
[17] "metascore"      "imdbrating"     "imdbvotes"      "imdbid"        
[21] "type"           "dvd"            "boxoffice"      "production"    
[25] "website"        "response"       "tomatourl"

For our analysis, we will focus specifically on the movie titles, genres, and plots, so we will create a smaller dataset with only these variables:

dat <- df[c("title", "genre", "plot")]

A preview of this dataset gives us a glimpse into what we're working with:

head(dat) %>% kable() %>% kable_styling()

title	genre	plot
Code Name: K.O.Z.	Crime, Mystery	A look at the 17-25 December 2013 corruption scandal in Turkey, from the viewpoint of the Erdogan government.
Saving Christmas	Comedy, Family	Kirk is enjoying the annual Christmas party extravaganza thrown by his sister until he realizes he needs to help out Christian, his brother-in-law, who has a bad case of the bah-humbugs. ...
Superbabies: Baby Geniuses 2	Comedy, Family, Sci-Fi	A group of smart-talking toddlers find themselves at the center of a media mogul's experiment to crack the code to baby talk. The toddlers must race against time for the sake of babies everywhere.
Daniel der Zauberer	Comedy, Crime, Fantasy	Evil assassins want to kill Daniel Kublbock, the third runner up for the German Idols.
Manos: The Hands of Fate	Horror	A family gets lost on the road and stumbles upon a hidden, underground, devil-worshiping cult led by the fearsome Master and his servant Torgo.
Pledge This!	Comedy	At South Beach University, a beautiful sorority president takes in a group of unconventional freshman girls seeking acceptance into her house.

Next, we will narrow the dataset down to romance movies only. The grep() function lets us search through text for specific keywords. Here, we search for all movies that list romance as their only genre or as one of several genres, and preview the first few matches:

grep(pattern = "romance", x = dat$genre, value = TRUE, ignore.case = TRUE) %>% head() %>%
    kable() %>% kable_styling()

x
Horror, Romance, Thriller
Comedy, Romance, Sport
Comedy, Romance
Comedy, Musical, Romance
Drama, Romance, Thriller
Drama, Music, Romance

The grepl() function returns TRUE or FALSE for each movie rather than the matching text, so summing its result tells us that 927 movies belong to the romance genre:

romance <- grepl("romance", dat$genre, ignore.case = TRUE)
sum(romance)

[1] 927
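
To make the difference between the two functions concrete, here is a small sketch on a made-up genre vector (the four genre strings and the toy titles are illustrative assumptions, not rows from the IMDB data):

```r
# Toy data standing in for dat$title and dat$genre
toy <- data.frame(
  title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  genre = c("Comedy, Romance", "Horror", "Drama, Romance, Thriller", "Crime, Mystery"),
  stringsAsFactors = FALSE
)

# grep(value = TRUE) returns the matching strings themselves
matches <- grep("romance", toy$genre, value = TRUE, ignore.case = TRUE)

# grepl() returns one TRUE/FALSE per element, which is handy for counting and subsetting
is_romance <- grepl("romance", toy$genre, ignore.case = TRUE)
n_romance <- sum(is_romance)          # TRUE counts as 1, FALSE as 0
toy_romance <- toy[is_romance, ]      # keep only the rows flagged TRUE
```

Because the logical vector has one entry per row, it can be reused directly as a row filter on the data frame.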

We will use this logical vector to segment out the romance movies:

dat.romance <- dat[romance, c("title", "genre", "plot")]
dat.romance %>% head(10) %>% kable() %>% kable_styling()

	title	genre	plot
9	Birdemic: Shock and Terror	Horror, Romance, Thriller	A horde of mutated birds descends upon the quiet town of Half Moon Bay, California. With the death toll rising, Two citizens manage to fight back, but will they survive Birdemic?
10	Dream.net	Comedy, Romance, Sport	Regina, the once popular girl has to make new friends at her new, conservative school. Problems arrive when she becomes enemies with Lívia, the school's queen bee, and falls in love with ...
14	The Hottie & the Nottie	Comedy, Romance	A woman agrees to go on a date with a man only if he finds a suitor for her unattractive best friend.
16	From Justin to Kelly	Comedy, Musical, Romance	A waitress from Texas and a college student from Pennsylvania meet during spring break in Fort Lauderdale, Florida and come together through their shared love of singing.
23	Ben & Arthur	Drama, Romance, Thriller	A pair of recently married gay men are threatened by one of the partners' brother, a religious fanatic who plots to murder them after being ostracized by his church.
32	Glitter	Drama, Music, Romance	A young singer dates a disc jockey who helps her get into the music business, but their relationship become complicated as she ascends to super stardom.
36	Space Mutiny	Action, Adventure, Romance	A pilot is the only hope to stop the mutiny of a spacecraft by its security crew, who plot to sell the crew of the ship into slavery.
51	Gigli	Comedy, Crime, Romance	The violent story about how a criminal lesbian, a tough-guy hit-man with a heart of gold, and a mentally challenged man came to be best friends through a hostage.
80	A Story About Love	Romance	Two young people stand on a street corner in a run-down part of New York, kissing. Despite the lawlessness of the district they are left unmolested. A short distance away walk Maria and ...
97	The Bat People	Horror, Romance	After being bitten by a bat in a cave, a doctor undergoes an accelerating transformation into a man-bat, which ruins his vacation and causes considerable distress for his wife.

Now that we have a collection of romance movie titles, we will focus on examining the plots of the movies to see if there are any similarities or common keywords across the summaries:

corp.romance <- corpus(dat.romance, docid_field = "title", text_field = "plot")
corp.romance

Corpus consisting of 927 documents and 1 docvar.

corp.romance[1:5] %>% kable() %>% kable_styling()

	x
Birdemic: Shock and Terror	A horde of mutated birds descends upon the quiet town of Half Moon Bay, California. With the death toll rising, Two citizens manage to fight back, but will they survive Birdemic?
Dream.net	Regina, the once popular girl has to make new friends at her new, conservative school. Problems arrive when she becomes enemies with Lívia, the school's queen bee, and falls in love with ...
The Hottie & the Nottie	A woman agrees to go on a date with a man only if he finds a suitor for her unattractive best friend.
From Justin to Kelly	A waitress from Texas and a college student from Pennsylvania meet during spring break in Fort Lauderdale, Florida and come together through their shared love of singing.
Ben & Arthur	A pair of recently married gay men are threatened by one of the partners' brother, a religious fanatic who plots to murder them after being ostracized by his church.

# summarize corpus
summary(corp.romance)[1:10, ] %>% kable() %>% kable_styling()

Text	Types	Tokens	Sentences	genre
Birdemic: Shock and Terror	32	36	2	Horror, Romance, Thriller
Dream.net	31	40	2	Comedy, Romance, Sport
The Hottie & the Nottie	21	23	1	Comedy, Romance
From Justin to Kelly	27	29	1	Comedy, Musical, Romance
Ben & Arthur	30	32	1	Drama, Romance, Thriller
Glitter	28	28	1	Drama, Music, Romance
Space Mutiny	24	30	1	Action, Adventure, Romance
Gigli	27	32	1	Comedy, Crime, Romance
A Story About Love	32	39	3	Romance
The Bat People	27	32	1	Horror, Romance
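
In this summary, Tokens counts every word and punctuation mark in a plot, while Types counts only the distinct ones. The base R sketch below illustrates the idea on one plot summary. The regular expression is a crude stand-in chosen for illustration and is not quanteda's actual tokenizer, so counts may differ on text with more complex punctuation.

```r
plot_text <- "A woman agrees to go on a date with a man only if he finds a suitor for her unattractive best friend."

# Crude tokenizer (an assumption for illustration): runs of letters/digits/apostrophes,
# or a single punctuation mark
toks <- regmatches(plot_text, gregexpr("[[:alnum:]']+|[[:punct:]]", plot_text))[[1]]

n_tokens <- length(toks)          # every token, repeats included
n_types  <- length(unique(toks))  # distinct tokens only ("A" and "a" count separately)
```

The word "a" appears three times in lowercase, which is why the number of types is smaller than the number of tokens.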

# Process text

# drop sentences with fewer than one token (in practice, empty ones) from the plot summaries
corp.romance <- corpus_trim(corp.romance, what = "sentences", min_ntoken = 1)
corp.romance

Corpus consisting of 927 documents and 1 docvar.

# remove punctuation
tokens.romance <- tokens(corp.romance, what = "word", remove_punct = TRUE)
head(tokens.romance)

tokens from 6 documents.
Birdemic: Shock and Terror :
 [1] "A"          "horde"      "of"         "mutated"    "birds"     
 [6] "descends"   "upon"       "the"        "quiet"      "town"      
[11] "of"         "Half"       "Moon"       "Bay"        "California"
[16] "With"       "the"        "death"      "toll"       "rising"    
[21] "Two"        "citizens"   "manage"     "to"         "fight"     
[26] "back"       "but"        "will"       "they"       "survive"   
[31] "Birdemic"  

Dream.net :
 [1] "Regina"       "the"          "once"         "popular"      "girl"        
 [6] "has"          "to"           "make"         "new"          "friends"     
[11] "at"           "her"          "new"          "conservative" "school"      
[16] "Problems"     "arrive"       "when"         "she"          "becomes"     
[21] "enemies"      "with"         "Lívia"        "the"          "school's"    
[26] "queen"        "bee"          "and"          "falls"        "in"          
[31] "love"         "with"        

The Hottie & the Nottie :
 [1] "A"            "woman"        "agrees"       "to"           "go"          
 [6] "on"           "a"            "date"         "with"         "a"           
[11] "man"          "only"         "if"           "he"           "finds"       
[16] "a"            "suitor"       "for"          "her"          "unattractive"
[21] "best"         "friend"      

From Justin to Kelly :
 [1] "A"            "waitress"     "from"         "Texas"        "and"         
 [6] "a"            "college"      "student"      "from"         "Pennsylvania"
[11] "meet"         "during"       "spring"       "break"        "in"          
[16] "Fort"         "Lauderdale"   "Florida"      "and"          "come"        
[21] "together"     "through"      "their"        "shared"       "love"        
[26] "of"           "singing"     

Ben & Arthur :
 [1] "A"          "pair"       "of"         "recently"   "married"   
 [6] "gay"        "men"        "are"        "threatened" "by"        
[11] "one"        "of"         "the"        "partners"   "brother"   
[16] "a"          "religious"  "fanatic"    "who"        "plots"     
[21] "to"         "murder"     "them"       "after"      "being"     
[26] "ostracized" "by"         "his"        "church"    

Glitter :
 [1] "A"            "young"        "singer"       "dates"        "a"           
 [6] "disc"         "jockey"       "who"          "helps"        "her"         
[11] "get"          "into"         "the"          "music"        "business"    
[16] "but"          "their"        "relationship" "become"       "complicated" 
[21] "as"           "she"          "ascends"      "to"           "super"       
[26] "stardom"

# convert to lower case
tokens.romance <- tokens_tolower(tokens.romance, keep_acronyms = TRUE)
head(tokens.romance)

tokens from 6 documents.
Birdemic: Shock and Terror :
 [1] "a"          "horde"      "of"         "mutated"    "birds"     
 [6] "descends"   "upon"       "the"        "quiet"      "town"      
[11] "of"         "half"       "moon"       "bay"        "california"
[16] "with"       "the"        "death"      "toll"       "rising"    
[21] "two"        "citizens"   "manage"     "to"         "fight"     
[26] "back"       "but"        "will"       "they"       "survive"   
[31] "birdemic"  

Dream.net :
 [1] "regina"       "the"          "once"         "popular"      "girl"        
 [6] "has"          "to"           "make"         "new"          "friends"     
[11] "at"           "her"          "new"          "conservative" "school"      
[16] "problems"     "arrive"       "when"         "she"          "becomes"     
[21] "enemies"      "with"         "lívia"        "the"          "school's"    
[26] "queen"        "bee"          "and"          "falls"        "in"          
[31] "love"         "with"        

The Hottie & the Nottie :
 [1] "a"            "woman"        "agrees"       "to"           "go"          
 [6] "on"           "a"            "date"         "with"         "a"           
[11] "man"          "only"         "if"           "he"           "finds"       
[16] "a"            "suitor"       "for"          "her"          "unattractive"
[21] "best"         "friend"      

From Justin to Kelly :
 [1] "a"            "waitress"     "from"         "texas"        "and"         
 [6] "a"            "college"      "student"      "from"         "pennsylvania"
[11] "meet"         "during"       "spring"       "break"        "in"          
[16] "fort"         "lauderdale"   "florida"      "and"          "come"        
[21] "together"     "through"      "their"        "shared"       "love"        
[26] "of"           "singing"     

Ben & Arthur :
 [1] "a"          "pair"       "of"         "recently"   "married"   
 [6] "gay"        "men"        "are"        "threatened" "by"        
[11] "one"        "of"         "the"        "partners"   "brother"   
[16] "a"          "religious"  "fanatic"    "who"        "plots"     
[21] "to"         "murder"     "them"       "after"      "being"     
[26] "ostracized" "by"         "his"        "church"    

Glitter :
 [1] "a"            "young"        "singer"       "dates"        "a"           
 [6] "disc"         "jockey"       "who"          "helps"        "her"         
[11] "get"          "into"         "the"          "music"        "business"    
[16] "but"          "their"        "relationship" "become"       "complicated" 
[21] "as"           "she"          "ascends"      "to"           "super"       
[26] "stardom"

To focus on the meaningful words, we will remove English stopwords (common words such as articles, pronouns, and prepositions) along with the stray token "nbsp":

tokens.romance <- tokens_remove(tokens.romance, c(stopwords("english"), "nbsp"),
    padding = FALSE)
head(tokens.romance)

tokens from 6 documents.
Birdemic: Shock and Terror :
 [1] "horde"      "mutated"    "birds"      "descends"   "upon"      
 [6] "quiet"      "town"       "half"       "moon"       "bay"       
[11] "california" "death"      "toll"       "rising"     "two"       
[16] "citizens"   "manage"     "fight"      "back"       "survive"   
[21] "birdemic"  

Dream.net :
 [1] "regina"       "popular"      "girl"         "make"         "new"         
 [6] "friends"      "new"          "conservative" "school"       "problems"    
[11] "arrive"       "becomes"      "enemies"      "lívia"        "school's"    
[16] "queen"        "bee"          "falls"        "love"        

The Hottie & the Nottie :
 [1] "woman"        "agrees"       "go"           "date"         "man"         
 [6] "finds"        "suitor"       "unattractive" "best"         "friend"      

From Justin to Kelly :
 [1] "waitress"     "texas"        "college"      "student"      "pennsylvania"
 [6] "meet"         "spring"       "break"        "fort"         "lauderdale"  
[11] "florida"      "come"         "together"     "shared"       "love"        
[16] "singing"     

Ben & Arthur :
 [1] "pair"       "recently"   "married"    "gay"        "men"       
 [6] "threatened" "one"        "partners"   "brother"    "religious" 
[11] "fanatic"    "plots"      "murder"     "ostracized" "church"    

Glitter :
 [1] "young"        "singer"       "dates"        "disc"         "jockey"      
 [6] "helps"        "get"          "music"        "business"     "relationship"
[11] "become"       "complicated"  "ascends"      "super"        "stardom"
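
Stopword removal boils down to dropping every token that appears in a reference list. The sketch below shows this in base R for one plot; the short toy_stopwords vector is a hand-picked assumption for illustration and is far smaller than the list returned by stopwords("english").

```r
# Lower-cased tokens for "The Hottie & the Nottie", as shown earlier
tokens_lower <- c("a", "woman", "agrees", "to", "go", "on", "a", "date", "with", "a",
                  "man", "only", "if", "he", "finds", "a", "suitor", "for", "her",
                  "unattractive", "best", "friend")

# Toy stopword list (assumption): just the stopwords that occur in this plot
toy_stopwords <- c("a", "to", "on", "with", "only", "if", "he", "for", "her")

# Keep only the tokens that are NOT in the stopword list
kept <- tokens_lower[!tokens_lower %in% toy_stopwords]
```

The ten remaining tokens are the ones that carry the content of the summary.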

In order to consolidate similar words, we will stem the data set:

# stem the words in the token list:
tokens.romance <- tokens_wordstem(tokens.romance)
head(tokens.romance)

tokens from 6 documents.
Birdemic: Shock and Terror :
 [1] "hord"       "mutat"      "bird"       "descend"    "upon"      
 [6] "quiet"      "town"       "half"       "moon"       "bay"       
[11] "california" "death"      "toll"       "rise"       "two"       
[16] "citizen"    "manag"      "fight"      "back"       "surviv"    
[21] "birdem"    

Dream.net :
 [1] "regina"  "popular" "girl"    "make"    "new"     "friend"  "new"    
 [8] "conserv" "school"  "problem" "arriv"   "becom"   "enemi"   "lívia"  
[15] "school"  "queen"   "bee"     "fall"    "love"   

The Hottie & the Nottie :
 [1] "woman"     "agre"      "go"        "date"      "man"       "find"     
 [7] "suitor"    "unattract" "best"      "friend"   

From Justin to Kelly :
 [1] "waitress"     "texa"         "colleg"       "student"      "pennsylvania"
 [6] "meet"         "spring"       "break"        "fort"         "lauderdal"   
[11] "florida"      "come"         "togeth"       "share"        "love"        
[16] "sing"        

Ben & Arthur :
 [1] "pair"     "recent"   "marri"    "gay"      "men"      "threaten"
 [7] "one"      "partner"  "brother"  "religi"   "fanat"    "plot"    
[13] "murder"   "ostrac"   "church"  

Glitter :
 [1] "young"        "singer"       "date"         "disc"         "jockey"      
 [6] "help"         "get"          "music"        "busi"         "relationship"
[11] "becom"        "complic"      "ascend"       "super"        "stardom"
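
Stemming trims words down to a common root so that variants such as "friend" and "friends" are counted together. As a small sketch, assuming quanteda is installed, char_wordstem() applies the same kind of stemming to a plain character vector (the four words are taken from the token lists above):

```r
library(quanteda)

words <- c("friends", "falls", "singing", "dates")

# Stem each word with the English stemmer
stems <- char_wordstem(words, language = "english")
stems
```

Note that stems are not always dictionary words, which is why entries such as "citi" and "marri" appear in the output above.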

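We can see which words commonly occur together by counting two-word sequences (bigrams) and then three-word sequences (trigrams):

# find frequently co-occurring words (typically compound words)
# note: in quanteda 3.0 and later, textstat_frequency() lives in the quanteda.textstats package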
ngram.romance <- tokens_ngrams(tokens.romance, n = 2) %>% dfm()
ngram.romance %>% textstat_frequency(n = 10) %>% kable() %>% kable_styling()

feature	frequency	rank	docfreq	group
fall_love	57	1	57	all
new_york	37	2	36	all
young_man	32	3	32	all
young_woman	28	4	28	all
high_school	28	4	27	all
best_friend	27	6	26	all
world_war	14	7	14	all
york_citi	12	8	12	all
true_love	10	9	10	all
war_ii	9	10	9	all

ngram.romance3 <- tokens_ngrams(tokens.romance, n = 3) %>% dfm()
ngram.romance3 %>% textstat_frequency(n = 10) %>% kable() %>% kable_styling()

feature	frequency	rank	docfreq	group
new_york_citi	12	1	12	all
world_war_ii	9	2	9	all
two_best_friend	5	3	5	all
woman_fall_love	4	4	4	all
fall_love_woman	4	4	4	all
meet_fall_love	3	6	3	all
drama_center_around	3	6	3	all
high_school_crush	3	6	3	all
young_woman_find	3	6	3	all
experi_chang_live	3	6	3	all
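
An n-gram is simply a run of adjacent tokens joined together. The base R sketch below builds and counts bigrams for two made-up token vectors; the documents and the make_bigrams() helper are illustrative assumptions, and tokens_ngrams() does this work for us in the real analysis.

```r
# Two toy documents of already-stemmed tokens (assumption for illustration)
docs <- list(
  c("meet", "fall", "love"),
  c("woman", "fall", "love", "new", "york")
)

# Hypothetical helper: pair each token with the one that follows it
make_bigrams <- function(x) paste(head(x, -1), tail(x, -1), sep = "_")

# Build bigrams per document, pool them, and count how often each occurs
all_bigrams <- unlist(lapply(docs, make_bigrams))
counts <- sort(table(all_bigrams), decreasing = TRUE)
```

Bigrams are built within each document separately, so the last word of one plot is never paired with the first word of the next.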

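Finally, we can see which words are most common in the romance movie plot summaries. The tokens were already stemmed in the previous step, so no further stemming is needed when building the document-feature matrix:

tokens.romance %>% dfm() %>% topfeatures() %>% kable() %>% kable_styling()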

	x
love	182
young	146
woman	144
man	138
life	112
fall	93
find	93
friend	91
two	90
new	88

Unsurprisingly, love is the most frequently used word, followed by young, woman, and man.

Word Clouds

Now that we’ve previewed our data step by step, we can create a word cloud. We will repeat some of our previous steps in a more automated format using the tm package. If you do not have it installed, you can install it with install.packages("tm"):

library(tm)  # package for processing text

Romance Movies

For our first word cloud, we will create a new object containing only the plots from romance movies:

# Create a vector containing only the text
romance.plot <- dat.romance$plot

Then we will create a new Corpus consisting solely of the plot:

# Create a corpus
romance.docs <- Corpus(VectorSource(romance.plot))
romance.docs

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 927

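Afterwards, we will clean the text with steps similar to the ones above: removing numbers, punctuation, and extra whitespace, converting everything to lowercase, and removing stopwords. Note that this pipeline does not stem the words, so the counts will differ slightly from the earlier results: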

romance.docs <- romance.docs %>% tm_map(removeNumbers) %>% tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
romance.docs <- tm_map(romance.docs, content_transformer(tolower))
romance.docs <- tm_map(romance.docs, removeWords, stopwords("english"))

And once again find our most frequently used words:

romance.dtm <- TermDocumentMatrix(romance.docs)
romance.matrix <- as.matrix(romance.dtm)
romance.words <- sort(rowSums(romance.matrix), decreasing = TRUE)
romance.df <- data.frame(word = names(romance.words), freq = romance.words)
romance.df %>% head(10) %>% kable() %>% kable_styling()

	word	freq
love	love	174
young	young	146
woman	woman	133
man	man	132
life	life	110
two	two	90
new	new	88
friends	friends	62
one	one	62
family	family	62
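
A term-document matrix has one row per word and one column per document, so summing across each row gives a word's total frequency. The base R sketch below builds a tiny matrix by hand to show what rowSums() is doing; the two toy documents are assumptions for illustration, and TermDocumentMatrix() builds the real matrix for us.

```r
# Two toy documents, already cleaned and split into words (assumption)
docs <- list(
  doc1 = c("young", "woman", "love"),
  doc2 = c("young", "man", "love", "love")
)

# All distinct terms across the documents
terms <- sort(unique(unlist(docs)))

# Count each term in each document: rows are terms, columns are documents
tdm <- sapply(docs, function(d) table(factor(d, levels = terms)))

# Total frequency per term, most frequent first
freq <- sort(rowSums(tdm), decreasing = TRUE)
```

The names of the sorted vector are the words and its values are the counts, which is exactly what the data.frame() call above unpacks into the word and freq columns.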

Finally, we will create a color palette and voila! We’ve created a word cloud based on over 900 romance movie plots:

pallete <- mutate(romance.df, color = cut(freq, breaks = c(0, 40, 80, 120, 160, Inf),
    labels = c("#FFCAE5", "#BF1168", "#EA72C4", "#FF167F", "#A40000"), include.lowest = TRUE))

romance.word.cloud <- wordcloud2(data = pallete, color = pallete$color)
romance.word.cloud
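
The cut() call is what ties each word's color to its frequency: it sorts every frequency into one of five bins and labels the bin with a hex color. Here is a minimal sketch using a handful of made-up frequencies (the values in freq are assumptions chosen to land in different bins):

```r
# Toy frequencies, one per word (assumption for illustration)
freq <- c(174, 146, 90, 62, 12)

# Bins are (0,40], (40,80], (80,120], (120,160], (160,Inf]; each gets one color label
bins <- cut(freq,
            breaks = c(0, 40, 80, 120, 160, Inf),
            labels = c("#FFCAE5", "#BF1168", "#EA72C4", "#FF167F", "#A40000"),
            include.lowest = TRUE)

# cut() returns a factor; convert to plain strings to get the hex codes
colors <- as.character(bins)
```

Every label needs a leading "#" to be read as a hex color, and there must be exactly one label per bin, that is, one fewer than the number of breaks.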

If we would like to save our widget, we can save it as an HTML file and then capture it as a PNG or PDF image.

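# save it as an HTML file
library(htmlwidgets)
library(webshot)  # requires PhantomJS; install it once with webshot::install_phantomjs()

saveWidget(romance.word.cloud, "romancewordcloud.html", selfcontained = FALSE)

# then capture the HTML as a PNG (or use a .pdf file name for a PDF);
# the delay gives the word cloud time to render before the screenshot is taken
webshot("romancewordcloud.html", "romancewordcloud.png", delay = 60, vwidth = 960,
    vheight = 480)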

Crime Movies

Finally, for a comparison word cloud, we will create a second data set for crime movies to find out how keywords for this genre differ from romance movies:

grep(pattern = "crime", x = dat$genre, value = TRUE, ignore.case = TRUE) %>% head() %>%
    kable() %>% kable_styling()

x
Crime, Mystery
Comedy, Crime, Fantasy
Action, Crime, Drama
Comedy, Crime, Romance
Crime, Drama, Music
Comedy, Crime, Family

We find that there are 930 movies with crime listed as a genre:

crime <- grepl("crime", dat$genre, ignore.case = TRUE)
sum(crime)

[1] 930

Once again, we will use this logical vector to segment out the crime movies:

dat.crime <- dat[crime, c("title", "genre", "plot")]
dat.crime %>% head(10) %>% kable() %>% kable_styling()

	title	genre	plot
1	Code Name: K.O.Z.	Crime, Mystery	A look at the 17-25 December 2013 corruption scandal in Turkey, from the viewpoint of the Erdogan government.
4	Daniel der Zauberer	Comedy, Crime, Fantasy	Evil assassins want to kill Daniel Kublbock, the third runner up for the German Idols.
46	Final Justice	Action, Crime, Drama	Homicidal Sheriff Thomas Jefferson Geronimo is tasked with escorting a mobster to Malta; when the prisoner escapes, Geronimo goes rogue to catch him.
51	Gigli	Comedy, Crime, Romance	The violent story about how a criminal lesbian, a tough-guy hit-man with a heart of gold, and a mentally challenged man came to be best friends through a hostage.
63	Girl in Gold Boots	Crime, Drama, Music	A young woman leaves her job as a waitress and travels to Los Angeles, where she strives to become the top star in the glamorous world of go-go dancing.
69	Baby Geniuses	Comedy, Crime, Family	Scientist hold talking, super-intelligent babies captive, but things take a turn for the worse when a mix-up occurs between a baby genius and its twin.
75	Tees Maar Khan	Comedy, Crime	Posing as a movie producer, a conman attempts to trick an entire village into helping him rob a treasure-laden train.
86	Mitchell	Action, Crime, Drama	A sleazy, incompetent detective tries to simultaneously take down heroin dealers and a socialite who murdered a burglar.
96	I Accuse My Parents	Crime, Drama	Young man goes to work for gangsters to impress his nightclub-singer girlfriend.
99	Ghosts Can't Do It	Comedy, Crime, Fantasy	Elderly Scott kills himself after a heart attack wrecks his body, but then comes back as a ghost and convinces his loving young hot wife Kate to pick and kill a young man in order for Scott to possess his body and be with her again.

Now we extract the crime plots:

# Create a vector containing only the text
crime.plot <- dat.crime$plot

Create a crime corpus:

# Create a corpus
crime.docs <- Corpus(VectorSource(crime.plot))
crime.docs

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 930

crime.docs <- crime.docs %>% tm_map(removeNumbers) %>% tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
crime.docs <- tm_map(crime.docs, content_transformer(tolower))
crime.docs <- tm_map(crime.docs, removeWords, stopwords("english"))

crime.dtm <- TermDocumentMatrix(crime.docs)
crime.matrix <- as.matrix(crime.dtm)
crime.words <- sort(rowSums(crime.matrix), decreasing = TRUE)
crime.df <- data.frame(word = names(crime.words), freq = crime.words)
crime.df %>% head(10) %>% kable() %>% kable_styling()

	word	freq
man	man	109
young	young	89
two	two	86
police	police	83
one	one	77
life	life	69
murder	murder	60
new	new	59
crime	crime	57
drug	drug	52
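
Because the cleaning and counting steps are identical for both genres, they could also be wrapped in a function and reused. The sketch below assumes the tm package is installed; word_frequencies() is a hypothetical helper name, not a function from tm or from the original code-through, and the two example sentences are made up.

```r
library(tm)

# Hypothetical helper: clean a character vector of plots and count word frequencies
word_frequencies <- function(plots) {
  docs <- Corpus(VectorSource(plots))
  docs <- tm_map(docs, removeNumbers)
  docs <- tm_map(docs, removePunctuation)
  docs <- tm_map(docs, stripWhitespace)
  docs <- tm_map(docs, content_transformer(tolower))
  docs <- tm_map(docs, removeWords, tm::stopwords("english"))

  # Rows of the term-document matrix are words; sum across documents and sort
  counts <- sort(rowSums(as.matrix(TermDocumentMatrix(docs))), decreasing = TRUE)
  data.frame(word = names(counts), freq = unname(counts), stringsAsFactors = FALSE)
}

# Usage on two made-up plot summaries
toy.df <- word_frequencies(c("The police chase two thieves.",
                             "Two detectives help the police."))
```

With a helper like this, the romance and crime tables would each come from a single call, for example word_frequencies(dat.crime$plot).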

For crime movies, we find that the most frequent word is man, and young comes in second place again. The third most frequently used word here is 'two', but police is not far behind in fourth place, as would be expected in a crime movie. Now we create a new color palette, and just like that we have a second word cloud based on over 900 crime movie plots:

crime.pallete <- mutate(crime.df, color = cut(freq, breaks = c(0, 25, 50, 75, 100,
    Inf), labels = c("#C9D1D6", "#616D7A", "#5B83AD", "#1E1E1E", "#13365B"), include.lowest = TRUE))

crime.word.cloud <- wordcloud2(data = crime.pallete, color = crime.pallete$color)
crime.word.cloud

# save it in html
library("htmlwidgets")
library(webshot)

saveWidget(crime.word.cloud, "crimewordcloud.html", selfcontained = FALSE)

# and in png or pdf
webshot("crimewordcloud.html", "crimewordcloud.png", delay = 60, vwidth = 960, vheight = 480)

Comparison and Conclusion

Now that we have completed this tutorial, we have been able to discover some similarities in individual words between the two word clouds: young, man, life, two, new, and one all appear among the ten most frequent words for both genres.

However, the collective data from both sets of plots shows how some of the same keywords can carry very different meanings between genres. We have also learned two methods for processing text data. The first method, using grep(), grepl(), and the quanteda package, is quite a bit longer than the tm approach, but it has the perk of letting us see our data at every stage. In contrast, tm does not provide as much insight into the text along the way, but for quick projects like these word clouds, it makes the process much easier!
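
The overlap between the two genres can also be checked in code rather than by eye. The sketch below types in the two top-ten lists from the tables above as plain vectors; in a live session the same vectors could be taken from head(romance.df$word, 10) and head(crime.df$word, 10).

```r
# Top ten words per genre, copied from the frequency tables above
romance.top <- c("love", "young", "woman", "man", "life",
                 "two", "new", "friends", "one", "family")
crime.top   <- c("man", "young", "two", "police", "one",
                 "life", "murder", "new", "crime", "drug")

shared       <- intersect(romance.top, crime.top)  # words in both lists
romance.only <- setdiff(romance.top, crime.top)    # words unique to romance
crime.only   <- setdiff(crime.top, romance.top)    # words unique to crime
```

Six of the ten words are shared, while the words unique to each list (love, woman, friends, and family versus police, murder, crime, and drug) are the ones that give each genre its character.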

Resources

Now that you have created your first word cloud, I encourage you to explore the other features and try it out on your own data! The following links contain tutorials and resources to guide you each step of the way:

Create a word cloud with R: This tutorial from Towards Data Science by Celine Van den Rul served as the inspiration for this code-through and includes great step-by-step details on all of the features of wordcloud2() as well as its more basic predecessor wordcloud(). Both packages have unique benefits!
Wordcloud: This article focuses on the first wordcloud() and has lots of great insight.
R Graph Gallery: Wordcloud2: The R Graph Gallery has excellent tutorials on various data visualization strategies in R and is a wonderful guide for Wordcloud2.
CRAN also has wonderful resources for WordCloud2() available for view here in PDF form and here in website form.
Webshot is a neat package that allows us to create static saved images of the ever-changing word clouds that we create.
Coolors is a great site to find colors for your word cloud palettes.
Last, but certainly not least, Data.World is an excellent place to find data on just about any subject and the pre-configured R URL download links are especially helpful!

About The Author

This code-through tutorial was created by Courtney Stowers for CPP 527 Data Science 2 in the Master of Science in Program Evaluation and Data Analytics program at Arizona State University. She can be contacted at: courtney.stowers@asu.edu.