
4 Cool Things You Can Do with R

[Image: Screenshot of the R Console]

Over the past year, I have been seeking to learn more about the field of data science. Of course, as is clear if you’ve been following my blog recently, I have a deep personal interest in using data and statistics to answer questions and solve problems. But, more recently, I’ve determined that the trajectory of my career is also tending to align closely with data analysis. Therefore, I’ve been spending a good amount of time trying to learn the top programs and applications used by data scientists today–the most important of which is the topic of this post: R.

R is a statistical programming language. If that simple definition sounds daunting to you, good–you are my intended audience for this post. If your response to this definition is, “duh, tell me something I don’t know,” you probably won’t get anything out of this.

Let me state upfront: I'm an amateur and am still learning the best ways to use R. I have no background in programming whatsoever. There are probably easier ways to do the things I'm going to discuss, but this is where I am in the learning process right now. So, please, take what's useful and discard what's not.

Okay, so back to R. R is open-source software (that means that it's free and that it's always being refined by nerds to make it work better) that allows users to work with data in all sorts of ways. As I've played around with it, I know that what I've discovered merely scratches the surface of its capabilities. But I think these things may stimulate your curiosity enough to convince you that learning R is worth the time and effort–even for someone with no programming experience whatsoever.

If you have never used R, I would recommend a few tutorials and references to get you started.

Resources for Learning R Programming

Now, let’s see what R can do…

1) How to Scrape Data from the Web Using R

One of the most important features of any programming language is the capacity for web scraping. What this means is that the code is able to read webpages and extract certain information. The benefit of a web scraper (or web crawler, as it is also called) is that you don’t have to manually go to every web page to get your data. Moreover, if the data changes, all you have to do is run the code in order to get the new data.

R gets its web scraping capabilities from the "rvest" package. (A package in R is an add-on that extends the language with additional functions.) Find out more about the "rvest" package here.

So, why would you as a regular person want to use a web crawler? Maybe you want to keep track of the stocks in your portfolio. Maybe you’re looking for a home and you want to compare crime rates in various neighborhoods. Maybe you’re a sports fanatic and you want to monitor stats on your favorite players and teams. You can get all of this data and more from the web with R.

So, let’s scrape some data. Let’s say you’re in the retail business and want to get the headlines of what’s going on in the industry. You can use R to scrape the headlines from Fortune.com’s Retail directory. First, I’ll share the script–and then we’ll discuss what each part means…

install.packages("rvest")
library(rvest)
data <- read_html("http://www.fortune.com/retail")
headlines <- c(
headline1 <- data %>% html_node("#content li:nth-child(1) a") %>% html_text(),
headline2 <- data %>% html_node("#content li:nth-child(2) a") %>% html_text(),
headline3 <- data %>% html_node("#content li:nth-child(3) a") %>% html_text(),
headline4 <- data %>% html_node("#content li:nth-child(4) a") %>% html_text(),
headline5 <- data %>% html_node("#content li:nth-child(5) a") %>% html_text(),
headline6 <- data %>% html_node("#content li:nth-child(6) a") %>% html_text(),
headline7 <- data %>% html_node("#content li:nth-child(7) a") %>% html_text(),
headline8 <- data %>% html_node("#content li:nth-child(8) a") %>% html_text(),
headline9 <- data %>% html_node("#content li:nth-child(9) a") %>% html_text(),
headline10 <- data %>% html_node("#content li:nth-child(10) a") %>% html_text(),
headline11 <- data %>% html_node("#content li:nth-child(11) a") %>% html_text(),
headline12 <- data %>% html_node("#content li:nth-child(12) a") %>% html_text(),
headline13 <- data %>% html_node("#content li:nth-child(13) a") %>% html_text(),
headline14 <- data %>% html_node("#content li:nth-child(14) a") %>% html_text(),
headline15 <- data %>% html_node("#content li:nth-child(15) a") %>% html_text(),
headline16 <- data %>% html_node("#content li:nth-child(16) a") %>% html_text(),
headline17 <- data %>% html_node("#content li:nth-child(17) a") %>% html_text(),
headline18 <- data %>% html_node("#content li:nth-child(18) a") %>% html_text(),
headline19 <- data %>% html_node("#content li:nth-child(19) a") %>% html_text(),
headline20 <- data %>% html_node("#content li:nth-child(20) a") %>% html_text(),
headline21 <- data %>% html_node("#content li:nth-child(21) a") %>% html_text(),
headline22 <- data %>% html_node("#content li:nth-child(22) a") %>% html_text()
)
write.csv(headlines, file = "retailheadlines.csv")

So, the first thing you do when running an R script is install the appropriate packages and load their libraries–as shown in the first two lines of code. Once you've done this with the "rvest" package, you should be able to pull data from just about any website.

Next, you read the webpage or webpages from which you're going to pull the data. This is done with the read_html() function. For this example, we're interested in the URL "http://www.fortune.com/retail." So, we put that into the function and assign the result to an object we call "data." Once that line of code is run, "data" contains all of the HTML from the URL we are using.

Believe it or not, there are only two lines of code left after this. The last one begins with "write.csv…" The other is the big "headlines <- c(…)" block, which R treats as a single line of code and which accounts for everything else in the script. So, what is all of that junk?

In order to list the data we are pulling (in our case, headlines), we'll want to create a vector consisting of a single column with the data in it. Our vector in this case is "headlines," and we create the rows for this column of data with the c() function. Separated by commas, we then include a line of code for each headline we want to pull within the parentheses. In this case, there are 22 headlines, so we'll have 22 rows in our column.

For each row, we create an object that pulls a headline from the webpage. The first row we call "headline1." We create it by pulling, from the "data" object we created, the HTML element that contains the information we're seeking. IMPORTANT: there's a really cool tool called SelectorGadget that allows you to find the CSS selector you need on any given webpage. Bookmark this tool and use it with reckless abandon.

Once you know the selector, put it into the html_node() function. Then, you finish with the html_text() function to indicate that you are pulling the text out of that element. The "%>%" operator (the pipe) is placed between each function to pass the result of one step along to the next.
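If the pipe looks mysterious, it may help to see that it's just a tidier way of nesting function calls. These two lines do exactly the same thing (the second is the style used in the script above):

headline1 <- html_text(html_node(data, "#content li:nth-child(1) a"))
headline1 <- data %>% html_node("#content li:nth-child(1) a") %>% html_text()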

Once you've done this for each row in your c() function, you'll have pulled all of your headlines. The last line, "write.csv…," will export the list of headlines into a CSV file (which you can open in Excel) in your working directory.

Now, if you enter “headline1” into R, you’ll get the first headline. If you enter “headline4,” you’ll get the 4th headline. If you enter “headlines,” you’ll get the entire list. At the time of this writing, here are the headlines…

You may not find this particular set of information very useful, because you could simply go to the webpage and see the headlines. But you can easily customize this to pull data into this list from other websites or other sections of Fortune.com. This is just a basic, relatively simple script. If you want, and with a little effort upfront, you could create your own customized news feed.
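Incidentally, once you're comfortable with the basics, rvest can grab every matching element in one shot with html_nodes(), the plural of html_node(). Assuming the "#content li a" selector matches all of the headline links (I haven't verified that against the live page), a much shorter version of the script might look like this:

library(rvest)
data <- read_html("http://www.fortune.com/retail")
# html_nodes() returns every node matching the selector,
# so there's no need to write one line per headline.
headlines <- data %>% html_nodes("#content li a") %>% html_text()
write.csv(headlines, file = "retailheadlines.csv")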

2) How to Build a Word Cloud and Analyze Text Using R

Another cool thing that R can do is text analysis. The “tm” package includes all sorts of capabilities for analyzing text. You can even do sentiment analysis in R, but I haven’t quite figured it out yet. What I can do is use R to discover the most frequent words that occur in a text and build a word cloud with them.

And why would anyone want to do this? Well, maybe you're a writer and you want to see which words you're overusing. Maybe you're reading reviews on a product and you want to see what words people are using to describe it. Maybe you want to see what's trending on a popular news site. The possibilities are endless. So, let's get our hands dirty with a personal example.

As I mentioned in the introduction to this post, I’m trying to steer my career toward the direction of data science. I work for a company that is very data-oriented and plan to move into a more data-centric role at some point in the future. So, while I’m not looking for another job, job openings for data scientists and data analysts do provide useful direction for the skills I should be developing. So, I used R’s text analysis capabilities to discover what sorts of things people look for in data scientists.

Again, let’s start with the script and then we’ll discuss what each part means:

install.packages("rvest")
install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
library(rvest)
library(tm)
library(SnowballC)
library(wordcloud)
job1 <- read_html("http://jobview.monster.com/Data-Scientist-%E2%80%93-Mine-data-for-new-discoveries-Job-Wichita-KS-US-157806954.aspx?mescoid=1500152001001&jobPosition=1")
job2 <- read_html("http://jobview.monster.com/Business-Intelligence-Data-Scientist-Job-Knoxville-TN-US-156975894.aspx?mescoid=1500144001001&jobPosition=2")
job3 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Jupiter-FL-US-156298625.aspx?mescoid=1500152001001&jobPosition=3")
job4 <- read_html("http://jobview.monster.com/Senior-Data-Scientist-Large-Data-Sets-SQL-SAS-Job-San-Francisco-CA-US-157581393.aspx?mescoid=1500152001001&jobPosition=5")
job5 <- read_html("http://jobview.monster.com/Data-Scientist-Global-Healthcare-Company-Job-Boston-MA-US-156278410.aspx?mescoid=1500152001001&jobPosition=6")
job6 <- read_html("http://jobview.monster.com/Watson-Health-Data-Scientist-Analytics-Explorys-Job-Cleveland-OH-US-155179521.aspx?mescoid=1500152001001&jobPosition=7")
job7 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Irving-TX-US-157804748.aspx?mescoid=1500152001001&jobPosition=9")
job8 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Chicago-IL-US-157786798.aspx?mescoid=1500152001001&jobPosition=10")
job9 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Chicago-IL-US-157799266.aspx?mescoid=1500152001001&jobPosition=11")
job10 <- read_html("http://jobview.monster.com/Sr-Data-Scientist-Job-Tampa-Bay-FL-US-156414261.aspx?mescoid=1500152001001&jobPosition=12")
job11 <- read_html("http://jobview.monster.com/Senior-Data-Scientist-Pricing-Strategy-Job-Houston-TX-US-157704074.aspx?mescoid=1500152001001&jobPosition=13")
job12 <- read_html("http://jobview.monster.com/Python-Developer-Data-Scientist-Job-New-York-NY-US-157723650.aspx?mescoid=1500127001001&jobPosition=14")
job13 <- read_html("http://jobview.monster.com/Data-Scientist-Job-New-York-City-NY-US-155850149.aspx?mescoid=1500152001001&jobPosition=15")
job14 <- read_html("http://jobview.monster.com/Sr-Data-Scientist-stat-Prog-Python-R-Big-Data-Algorithms-Creative-Stimulating-Opportunity!-Job-Boston-MA-US-156847308.aspx?mescoid=1500152001001&jobPosition=17")
job15 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Boston-MA-US-157720122.aspx?mescoid=1500152001001&jobPosition=18")
job16 <- read_html("http://jobview.monster.com/Senior-Data-Scientist-Job-Philadelphia-PA-US-157463595.aspx?mescoid=1500152001001&jobPosition=1")
job17 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Chicago-IL-US-157403480.aspx?mescoid=1500152001001&jobPosition=3")
job18 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Omaha-NE-US-157357352.aspx?mescoid=1500152001001&jobPosition=4")
job19 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Cedar-Rapids-IA-US-157347856.aspx?mescoid=1500152001001&jobPosition=5")
job20 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Port-Washington-NY-US-157206309.aspx?mescoid=1500152001001&jobPosition=6")
job21 <- read_html("http://jobview.monster.com/Principal-Data-Scientist-Job-US-157206617.aspx?mescoid=1500152001001&jobPosition=7")
job22 <- read_html("http://jobview.monster.com/Analytic-Data-Scientist-Job-Dearborn-MI-US-157120444.aspx?mescoid=1500152001001&jobPosition=8")
job23 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Addison-TX-US-157050874.aspx?mescoid=1500152001001&jobPosition=10")
job24 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Cupertino-CA-US-156945952.aspx?mescoid=1500152001001&jobPosition=15")
job25 <- read_html("http://jobview.monster.com/Data-Scientist-Customer-Facing-Job-Boston-MA-US-156791046.aspx?mescoid=1500152001001&jobPosition=18")
job26 <- read_html("http://jobview.monster.com/Data-architect-Data-scientist-2801-Job-Wall-NJ-US-156737953.aspx?mescoid=1500142001001&jobPosition=20")
job27 <- read_html("http://jobview.monster.com/Sr-Data-Scientist-Job-New-York-City-NY-US-156671144.aspx?mescoid=1500152001001&jobPosition=21")
job28 <- read_html("http://jobview.monster.com/Machine-Learning-Specialist-Data-Scientist-Job-Waltham-MA-US-156300437.aspx?mescoid=4300761001001&jobPosition=3")
job29 <- read_html("http://jobview.monster.com/Principal-Data-Scientist-Job-US-157206617.aspx?mescoid=1500152001001&jobPosition=4")
job30 <- read_html("http://jobview.monster.com/Analytic-Data-Scientist-Job-Dearborn-MI-US-157120444.aspx?mescoid=1500152001001&jobPosition=6")
job31 <- read_html("http://jobview.monster.com/Big-Data-Scientist-Job-Addison-TX-US-157067573.aspx?mescoid=1500152001001&jobPosition=12")
job32 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Schaumburg-IL-US-157010535.aspx?mescoid=1500152001001&jobPosition=14")
job33 <- read_html("http://jobview.monster.com/Data-Scientist-Job-New-York-NY-US-156734707.aspx?mescoid=1500152001001&jobPosition=1")
job34 <- read_html("http://jobview.monster.com/Data-architect-Data-scientist-2801-Job-Wall-NJ-US-156737953.aspx?mescoid=1500142001001&jobPosition=2")
job35 <- read_html("http://jobview.monster.com/Junior-Data-Scientist-Job-Sunnyvale-CA-US-153360698.aspx?mescoid=1500152001001&jobPosition=9")
job36 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Columbia-MD-US-156294139.aspx?mescoid=1500152001001&jobPosition=12")
job37 <- read_html("http://jobview.monster.com/Data-Scientists-Analysts-Job-Dearborn-MI-US-156411132.aspx?mescoid=1500152001001&jobPosition=17")
job38 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Boston-MA-US-157401958.aspx?mescoid=1500152001001&jobPosition=15")
job39 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Arlington-VA-US-157677328.aspx?mescoid=1500152001001&jobPosition=17")
job40 <- read_html("http://jobview.monster.com/Data-Scientist-Job-Englewood-CO-US-156829911.aspx?mescoid=1500152001001&jobPosition=18")

jobs <- c(
job1_text <- tryCatch({job1 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job2_text <- tryCatch({job2 %>% html_node("#CJT-jobbody") %>% html_text()}, error=function(cond) {return(0)}),
job3_text <- tryCatch({job3 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job4_text <- tryCatch({job4 %>% html_node("#monsterAppliesContentHolder") %>% html_text()}, error=function(cond) {return(0)}),
job5_text <- tryCatch({job5 %>% html_node("#CJT_body") %>% html_text()}, error=function(cond) {return(0)}),
job6_text <- tryCatch({job6 %>% html_node("#innerBOX") %>% html_text()}, error=function(cond) {return(0)}),
job7_text <- tryCatch({job7 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job8_text <- tryCatch({job8 %>% html_node("#CJT-bodypanel") %>% html_text()}, error=function(cond) {return(0)}),
job9_text <- tryCatch({job9 %>% html_node("#CJT-jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job10_text <- tryCatch({job10 %>% html_node("#CJT_jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job11_text <- tryCatch({job11 %>% html_node("#CJT-leftpanel") %>% html_text()}, error=function(cond) {return(0)}),
job12_text <- tryCatch({job12 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job13_text <- tryCatch({job13 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job14_text <- tryCatch({job14 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job15_text <- tryCatch({job15 %>% html_node("#CJT_jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job16_text <- tryCatch({job16 %>% html_node("#CJT-desc") %>% html_text()}, error=function(cond) {return(0)}),
job17_text <- tryCatch({job17 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job18_text <- tryCatch({job18 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job19_text <- tryCatch({job19 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job20_text <- tryCatch({job20 %>% html_node("#CJT-jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job21_text <- tryCatch({job21 %>% html_node("#jobdesc") %>% html_text()}, error=function(cond) {return(0)}),
job22_text <- tryCatch({job22 %>% html_node("#jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job23_text <- tryCatch({job23 %>% html_node("#CJT-jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job24_text <- tryCatch({job24 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job25_text <- tryCatch({job25 %>% html_node("#jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job26_text <- tryCatch({job26 %>% html_node("#jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job27_text <- tryCatch({job27 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job28_text <- tryCatch({job28 %>% html_node("#CJT_jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job29_text <- tryCatch({job29 %>% html_node("#jobdesc") %>% html_text()}, error=function(cond) {return(0)}),
job30_text <- tryCatch({job30 %>% html_node("#jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job31_text <- tryCatch({job31 %>% html_node("#TrackingJobBody") %>% html_text()}, error=function(cond) {return(0)}),
job32_text <- tryCatch({job32 %>% html_node("#CJT-jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job33_text <- tryCatch({job33 %>% html_node("#jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job34_text <- tryCatch({job34 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job35_text <- tryCatch({job35 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job36_text <- tryCatch({job36 %>% html_node("#jobcopy") %>% html_text()}, error=function(cond) {return(0)}),
job37_text <- tryCatch({job37 %>% html_node("#bodycol") %>% html_text()}, error=function(cond) {return(0)}),
job38_text <- tryCatch({job38 %>% html_node("#CJT_jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job39_text <- tryCatch({job39 %>% html_node("#CJT_jobBodyContent") %>% html_text()}, error=function(cond) {return(0)}),
job40_text <- tryCatch({job40 %>% html_node("#cjt-rightpanel") %>% html_text()}, error=function(cond) {return(0)})
)
jobsCorpus <- Corpus(VectorSource(jobs))
jobsCorpus <- tm_map(jobsCorpus,PlainTextDocument)
jobsCorpus <- tm_map(jobsCorpus, removePunctuation)
jobsCorpus <- tm_map(jobsCorpus, removeWords, c('the', 'this', 'will', 'data', 'new', 'science', 'job', 'years', 'working', 'position', 'using', stopwords('english')))
wordcloud(jobsCorpus, max.words=50,random.order = FALSE)

dtm <- DocumentTermMatrix(jobsCorpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
topkeywords <- frequency[1:10]
topkeywords

For this example, I used R’s web scraping functionality to pull data from 40 different job postings for data scientists and data analysts on Monster.com. (Note: since these are temporary web pages by their very nature, they likely won’t be active by the time you read this–just in case you were trying to run the script yourself).

So, I load the packages and libraries that I’ll be using–the previously mentioned package for web scraping in addition to packages that will be used to build the word cloud and explore the text.

Next, I scrape the data. For each job posting (objects named job1, job2, etc.), I read the posting's webpage with read_html(). Then, I use the html_node() function to pull the particular data I want from each page (job1_text, job2_text, etc.) and put it into a single column of data that I give the name "jobs." Each of these calls is wrapped in tryCatch(), so that if a posting has been taken down or its selector can't be found, the script returns a 0 for that job instead of stopping altogether.
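By the way, since that tryCatch() pattern repeats 40 times, you could wrap it in a little helper function. This is just my own illustration (the safe_text name is made up, and it isn't part of the script above):

# Return the text of the node matching 'selector', or 0 if the page
# or the node can't be read (e.g., the posting has been taken down).
safe_text <- function(page, selector) {
  tryCatch(page %>% html_node(selector) %>% html_text(),
           error = function(cond) 0)
}
job1_text <- safe_text(job1, "#jobcopy")  # same result as the first line inside c()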

Next, I create the corpus object ("jobsCorpus") on which the word cloud will be based. To do this, I use the Corpus() function to convert the data into a usable format and then the tm_map() function to make it plain text, remove the punctuation, and remove the words that I don't want to include in my word cloud.
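Two other cleaning steps that the tm package supports, and that you may want to experiment with, are lowercasing and whitespace stripping. These aren't in my script above; they're just a sketch of what else is available:

jobsCorpus <- tm_map(jobsCorpus, content_transformer(tolower))  # make everything lowercase
jobsCorpus <- tm_map(jobsCorpus, stripWhitespace)               # collapse runs of spaces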

Finally, I apply the wordcloud() function to the "jobsCorpus" object I created, noting how many words I want in my word cloud. Then, like magic, I get the word cloud shown below.

[Image: Data Science Job Descriptions Word Cloud in R]

If you want to know which words occur most frequently in your text, you use the rest of the above script: it converts the corpus into a document-term matrix, adds up how many times each word appears, and sorts the totals in decreasing order. When you put the remainder of the script into the R console, you get the following output.

[Image: Word Frequency in R]

To change the number of words you see, change the range in the "frequency[1:10]" section. Right now, it's showing the top 10.
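For example, "frequency[1:25]" would show the top 25. And if you'd rather see the counts as a chart than as a printed list, base R's barplot() function will happily take the same vector (a quick sketch, not part of the original script):

barplot(frequency[1:10], las = 2, main = "Top 10 Words")  # las = 2 turns the labels sideways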

If you’re interested in a more thorough explanation of building a word cloud and analyzing text in R, check out this post. That’s pretty much where I learned how to do this. Pretty cool, huh?

3) How to Run a Regression Using R

Multiple regression analysis is the single most useful tool in all of statistics. When you want to find out what causes something to occur, you gather data on the variables that you think may be factors and compare that data against concurrent data on the outcome you're measuring. For example, about a year ago, I gathered daily data on my weight fluctuations and compared it to my daily consumption of carbs, fat, sugar, etc. in order to estimate what kind of diet caused me to gain and lose weight. This type of exercise is called running a regression.
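To make that concrete, here's roughly what the diet example would look like in R. (The data frame and column names here are hypothetical; this isn't my actual weight-tracking script.)

# Each row of diet_log is one day: that day's weight, carbs, fat, and sugar.
dietmodel <- lm(weight ~ carbs + fat + sugar, data = diet_log)
summary(dietmodel)  # shows which nutrients are associated with weight changes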

I’m not going to go into how regression works or how to interpret the results. I’m going to assume that either you’re already familiar with regression or that you’re going to drop what you’re doing right now to go learn it. Because, yeah, it’s that awesome.

So, let's run a regression using R. Suppose I'm going through my Facebook feed and I want to find out what kinds of posts I am most likely to interact with. What causes me to like, share, or comment on a post? For simplicity's sake, I considered only 4 different kinds of posts: educational, inspirational, humorous, and political. With these variables, I browsed my feed and categorized 200 posts where applicable–also tracking how I engaged with each post.

As has been our custom thus far, I’ll share my script and then discuss…

#####

##### Vector Creation #####

#####
like <- c(1,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,1,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,1,0,1,0,1,1,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0)
comment <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
share <- c(0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
humorous <- c(0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,1,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0)
political <- c(0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
educational <- c(0,0,0,0,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,0,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
inspirational <- c(1,1,1,0,1,0,0,1,0,0,1,0,1,1,0,0,0,0,0,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,0,0,1,0,1,1,0,0,1,0,0)
#####

##### Model Construction ######

#####
facebook <- data.frame(like,comment,share,humorous,political,educational,inspirational)
likemodel <- lm(facebook$like ~ facebook$humorous + facebook$political + facebook$educational + facebook$inspirational,data=facebook)

summary(likemodel)
commentmodel <- lm(facebook$comment ~ facebook$humorous + facebook$political + facebook$educational + facebook$inspirational,data=facebook)

summary(commentmodel)
sharemodel <- lm(facebook$share ~ facebook$humorous + facebook$political + facebook$educational + facebook$inspirational,data=facebook)

summary(sharemodel)

Okay, so that's a big one. But, remember, the elements inside of each c() function actually make up a single line of code. Each of these creates a column in which each comma-separated element (the "0"s and "1"s in this case) makes up a row. So, in this script, I placed a "1" in each row for which the column was applicable and a "0" in each row for which it wasn't. Each row represents a Facebook post. So, the first post I saw was an inspirational post that I "liked" but did not "comment" on or "share."

After I've created all of my columns, I use the data.frame() function to put them into a table of data that I can manipulate. I call my data frame simply "facebook."
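If you want to sanity-check the result before modeling, a couple of handy base R functions will show you what landed in the data frame:

head(facebook)  # the first six rows: one Facebook post per row
str(facebook)   # the structure: 200 observations of 7 numeric (0/1) columns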

(Side Note: if you're wondering what all of those "#" symbols are, no, I'm not trying to get this post trending on Twitter. As in many other programming languages, starting a line with the "#" symbol lets R know that you don't want it to run any of the content on that line. You use these lines to make notes to yourself about the code–and that's what I've done here.)

Now that I've gathered my data, I can run a regression. The function for running a regression in R is lm(y ~ x1 + x2 + x3 ..., data), where "y" is the column of data you are interested in (like, share, or comment), the "x"s are the factors you're measuring against it, and "data" is the data frame you're drawing from. The naming convention for pulling a column from a data frame is "dataframename$columnname," where the "$" separates the data frame from the column within it. And that gives us our regression equation.
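One small note: because the data argument already tells lm() which data frame to use, you can also write the formula with bare column names. This is equivalent to the "likemodel" line above, just a little easier to read:

likemodel <- lm(like ~ humorous + political + educational + inspirational, data = facebook)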

When you assign this function a name, it becomes a model. Then, you use the summary() function, with the name of the model inside of it to see the regression results. For this example, I created 3 models: one for likes, one for comments, and one for shares.

LIKE MODEL
[Image: summary(likemodel) regression output]

COMMENT MODEL
[Image: summary(commentmodel) regression output]

SHARE MODEL
[Image: summary(sharemodel) regression output]

Based on these results, I can conclude the following:

  • I “like” posts more often when they are either humorous or educational.
  • I “comment” on posts more often when they are educational.
  • I "share" posts more often when they are humorous, educational, or inspirational.
  • I don’t seem to care much for politics.

One interesting thing to note about running regressions in R is that, while Excel's regression tool is limited to 16 independent variables, R has no such limit. So, as many factors as you can think of and as much data as you can gather, R can help you process it.

4) How to Plot Data and Make Graphs Using R

One of the things for which R is most admired is its data visualization capability. To the human eye, patterns in data are sometimes difficult to see. Plotting the data helps us see what it's doing in ways that a spreadsheet of raw numbers cannot.

Because there are so many kinds of graphs and charts you can make in R, and also because I'm tired of writing, I'm only going to address the basic plot() function and show two visualizations: a scatter plot and a line graph.

For my example, I’m going to be comparing the two previous years’ weekly stock prices of Facebook with those of Twitter. I downloaded this data from Yahoo Finance into a CSV file. So, let’s take a look at the script and then discuss what it does…

stocks <- read.csv("stocks.csv")

plot(stocks$fb, stocks$twtr)

abline(lsfit(stocks$fb, stocks$twtr))

x11()

plot(c(1,103), c(0,100), type="n", xlab="Last Two Years by Week", ylab="Stock Price", main="Facebook and Twitter Stock Prices")

lines(seq(1,103), stocks$fb, type="l", col="blue", lwd=2.5, lty=1)

lines(seq(1,103), stocks$twtr, type="l", col="green", lwd=2.5, lty=1)

legend(5, 20, c("Facebook","Twitter"), lty=c(1,1), lwd=c(2.5,2.5), col=c("blue","green"), ncol=2, bty="n")

For this script, I neither pulled my data from the web (like I did for the word cloud) nor compiled my data in the script itself (like I did for the regression). Instead, I imported my data from a CSV file. If you're going to do data analysis in R, this is one of the more important things you'll need to know. Just use the function in the first line, replacing "stocks.csv" with whatever your CSV is called (and, of course, giving the object a name that makes sense for your data). You'll also need to make sure your file is saved in R's working directory–the default folder that R reads files from.
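If R complains that it can't find your file, the working directory is the first thing to check. Base R can show it and change it (the path below is only an example; point it wherever your CSV actually lives):

getwd()  # the folder R is currently reading files from
# setwd("C:/Users/yourname/Documents")  # uncomment and adjust to your own path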

To see what's in your CSV file, you can now simply enter "stocks" (no quotes) into the console and you'll see your columns of data. In this case, I have the last 103 weeks of stock prices for "fb" and "twtr," or Facebook and Twitter.

There are all kinds of things you can do with this data, but let’s talk about some basics. To see visually how Facebook’s stock price correlates with Twitter’s stock price, you can plot them against one another with the plot() function. That gives you the below scatter plot. You get the line from adding the abline() function after the plot has been created.


From this visual, you can see that Facebook’s stock price is negatively correlated with Twitter’s stock price. This relationship perhaps becomes even clearer when you do another plot.
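If you'd rather put a number on that relationship than eyeball it, base R's cor() function returns the correlation coefficient directly:

cor(stocks$fb, stocks$twtr)  # a negative value means the prices tend to move in opposite directions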

First, I want to note that the x11() function opens a new graphics window, so the next plot you create doesn't replace the old one. That way, you can see both plots side by side.
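(If x11() doesn't open a window on your system, dev.new() should; it's the device-independent version of the same idea in base R.)

dev.new()  # open a fresh graphics window without overwriting the current plot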

Okay, so now we’re going to do a longitudinal graph showing how both stock prices move together over time.

The plot() function in this case sets up the empty graph (that's what type="n" means) on which your stock prices will be plotted. Then, you use the lines() function to create a line for Facebook and a line for Twitter. Finally, you can create a legend with the legend() function. After all of this, you get the graph shown below.


With this visual, we can see that–particularly over the last 20 or so weeks–Twitter's stock price has been falling while Facebook's has been rising.

You can find out more about how to use the plot function in R by following this link. But, if you just Google “plot in R,” you should be able to find all sorts of helpful tips. In fact, I’ve learned most of what I know about R so far simply by googling questions that I have. There are a lot of forums and blogs that answer any question you can possibly think of. As with anything else, the best way to learn is by following your nose where it leads and inhaling everything you can along the way.

So, that’s all I’ve got on R programming for right now. Hopefully, as time passes, I’ll be able to share more of what I’ve learned and explain it in more concrete ways. This application is a really powerful tool for doing all sorts of stuff–I hope you’ll find it as useful as I have.

Stay curious,

Doug


Viewing all articles
Browse latest Browse all 3

Latest Images

Trending Articles





Latest Images