R-bloggers
R news and tutorials contributed by hundreds of R bloggers

R live class | Data Visualization and Dashboard with R | Nov 7-8 Milan

Mon, 10/30/2017 - 17:31

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

 

Data Visualization and Dashboard with R is our fourth course of the autumn term. It takes place on November 7-8 in a location close to Milano Lima.
This course will teach you how to build beautiful, effective and flexible plots using the most modern R tools for data visualization. You will then discover how to embed visualizations and tables in a powerful Shiny app, making your data easy to navigate and letting its insights emerge.
You should take this course if you have some knowledge of R and would like to deepen your knowledge of data visualization with R, covering both static visualization and dashboards.

Data Visualization and Dashboard with R: Outline

– ggplot2 grammar
– Creating plots with ggplot (Scatter Plot, Line Plot, Bar Plot, Histogram, Box Plot, Surface Plot)
– Customizing Plots (aesthetics, legend, axes, faceting and themes)
– Specialised visualisation tools: ggmap and ggally
– Basic shiny concepts
– The structure of a shiny app
– Shiny: the server side and the user side
– Understanding reactivity in shiny
– An overview of html widgets

Data Visualization and Dashboard with R is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

This course is limited to a maximum of 6 attendees.

Location

The course location is 550 m (a 7-minute walk) from Milano Centrale railway station and just 77 m (a 1-minute walk) from the Lima metro station.

Registration

If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

You can find an overview of all our courses here. Next dates will be:

  • November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
  • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange a tailor-made course together, picking the topics that are relevant to your organization and dropping the rest.

The post R live class | Data Visualization and Dashboard with R | Nov 7-8 Milan appeared first on Quantide – R training & consulting.



heatmaply: an R package for creating interactive cluster heatmaps for online publishing

Mon, 10/30/2017 - 15:07

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

This post on the heatmaply package is based on my recent paper in the journal Bioinformatics (a link to a stable DOI). The paper was published just last week, and since it is released as CC-BY, I am permitted (and delighted) to republish it here in full. My co-authors for this paper are Jonathan Sidi, Alan O’Callaghan, and Carson Sievert.

Summary: heatmaply is an R package for easily creating interactive cluster heatmaps that can be shared online as a stand-alone HTML file. Interactivity includes a tooltip display of values when hovering over cells, as well as the ability to zoom in to specific sections of the figure from the data matrix, the side dendrograms, or annotated labels.  Thanks to the synergistic relationship between heatmaply and other R packages, the user is empowered by a refined control over the statistical and visual aspects of the heatmap layout.

Availability: The heatmaply package is available under the GPL-2 Open Source license. It comes with a detailed vignette, and is freely available from: http://cran.r-project.org/package=heatmaply

Contact: Tal.Galili@math.tau.ac.il

Introduction

A cluster heatmap is a popular graphical method for visualizing high dimensional data. In it, a table of numbers is scaled and encoded as a tiled matrix of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms and extra columns of categorical annotation. The ongoing development of this iconic visualization, spanning over more than a century, has provided the foundation for one of the most widely used of all bioinformatics displays (Wilkinson and Friendly, 2009). When using the R language for statistical computing (R Core Team, 2016), there are many available packages for producing static heatmaps, such as: stats, gplots, heatmap3, fheatmap, pheatmap, and others. Recently released packages also allow for more complex layouts; these include gapmap, superheat, and ComplexHeatmap (Gu et al., 2016). The next evolutionary step has been to create interactive cluster heatmaps, and several solutions are already available. However, these solutions, such as the idendro R package (Sieger et al., 2017), are often focused on providing an interactive output that can be explored only on the researcher’s personal computer. Some solutions do exist for creating shareable interactive heatmaps. However, these are either dependent on a specific online provider, such as XCMS Online, or require JavaScript knowledge to operate, such as InCHlib. In practice, when publishing in academic journals, the reader is left with a static figure only (often in a png or pdf format).

To fill this gap, we have developed the heatmaply R package for easily creating a shareable HTML file that contains an interactive cluster heatmap. The interactivity is based on a client-side JavaScript code that is generated based on the user’s data, after running the following command:

install.packages("heatmaply") library(heatmaply) heatmaply(data, file = "my_heatmap.html")

The HTML file contains a publication-ready, interactive figure that allows the user to zoom in as well as see values when hovering over the cells. This self-contained HTML file can be made available to interested readers by uploading it to the researcher’s homepage or as supplementary material on the journal’s server. Concurrently, this interactive figure can be displayed in RStudio’s viewer pane, included in a Shiny application, or embedded in a knitr/RMarkdown HTML document.
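Since the returned figure is a plotly htmlwidget, a minimal sketch of the Shiny case looks like this (not from the paper; it assumes the standard plotly output/render pair):

library(shiny)
library(plotly)
library(heatmaply)

ui <- fluidPage(plotlyOutput("heat"))

server <- function(input, output) {
  output$heat <- renderPlotly({
    heatmaply(mtcars)   # heatmaply returns a plotly object, so renderPlotly can display it
  })
}

# shinyApp(ui, server)   # uncomment to run locally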

The rest of this paper offers guidelines for creating effective cluster heatmap visualization. Figure 1 demonstrates the suggestions from this section on data from project Tycho (van Panhuis et al., 2013), while the online supplementary information includes the interactive version, as well as several examples of using the package on real-world biological data.

Fig. 1. The (square root) number of people infected by measles in 50 US states, from 1928 to 2003. Vaccines were introduced in 1963. (Click the image in the original post for the online interactive version of the plot.)

An interactive version of the measles heatmap is embedded in the original post using an iframe: the measles_heatmaply.html file was uploaded to GitHub and embedded from there.

heatmaply – a simple example

The generation of cluster heatmaps is a subtle process (Gehlenborg and Wong, 2012; Weinstein, 2008), requiring the user to make many decisions along the way. The major decisions concern the data matrix and the dendrogram. The raw data often need to be transformed to a meaningful and comparable scale, and an appropriate color palette should be picked. The clustering of the data requires us to decide on a distance measure between the observations, a linkage function, and a rotation and coloring of branches that highlights interpretable clusters. Each such decision can have consequences on the patterns and interpretations that emerge. In this section, we go through some of the arguments of the heatmaply function, aiming to make it easy for the user to tune these important statistical and visual parameters. Our toy example visualizes the effect of vaccines on measles infection. The output is given in the static Fig. 1, while an interactive version is available online in the supplementary file “measles.html”. Both were created using:

heatmaply(x = sqrt(measles),
          color = viridis,   # the default
          Colv = NULL,
          hclust_method = "average",
          k_row = NA,
          # ...
          file = c("measles.html", "measles.png"))

The first argument of the function (x) accepts a matrix of the data. In the measles data, each row corresponds to a state, each column to a year (from 1928 to 2003), and each cell to the number of people infected with measles per 100,000 people. In this example, the data were scaled twice: first by reporting not the raw number of measles cases but the number per 100,000 people, making it easier to compare states, and second by taking the square root of the values. The square root was taken because all values in the data represent the same unit of measure but come from a right-tailed distribution of count data with some extreme observations. Taking the square root brings extreme observations closer to the rest and helps keep a single extreme observation from masking the general pattern. Other transformations to consider come from the Box-Cox or Yeo-Johnson families of power transformations. If each column of the data were to represent a different unit of measure, then leaving the values unchanged would often make the entire figure unusable, because the column with the largest range of values would take over most of the colors in the figure. Possible per-column transformations include the scale function, suitable for data that are roughly normal, while the normalize and percentize functions bring the data to a comparable 0-to-1 scale for each column. The normalize function preserves the shape of each column’s distribution by subtracting the minimum and dividing by the maximum of all observations in that column. The percentize function is similar to ranking, but with the simpler interpretation that each value is replaced by the percent of observations with that value or below; it uses the empirical cumulative distribution function of each variable on its own values. The sparseness of the dataset can be explored using is.na10.
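As a quick illustration of these per-column options, here is a minimal sketch using the built-in mtcars data (not the measles matrix from the paper); normalize() and percentize() are exported by heatmaply:

library(heatmaply)

# mtcars columns are on very different scales, so compare transformations:
heatmaply(scale(mtcars))       # column-wise z-scores (for roughly normal columns)
heatmaply(normalize(mtcars))   # each column mapped to [0, 1], shape preserved
heatmaply(percentize(mtcars))  # each value replaced by its empirical percentile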

Once the data are adequately scaled, it is important to choose a good color palette for the data. Other than being pretty, an ideal color palette should have three (somewhat conflicting) properties: (1) colorful, spanning as wide a palette as possible so as to make differences easy to see; (2) perceptually uniform, so that values close to each other have similar-appearing colors compared with values that are far apart, consistently across the range of values; and (3) robust to colorblindness, so that the above properties hold for people with common forms of colorblindness, and the palette prints well in grayscale. The default passed to the color argument in heatmaply is viridis, a sequential color palette that offers a good balance of these properties. A divergent color scale should be preferred when visualizing a correlation matrix, as it is important to make the low and high ends of the range visually distinct. A helpful divergent palette available in the package is cool_warm (other alternatives in the package include RdBu, BrBG, or RdYlBu, based on the RColorBrewer package). It is also advisable to set the limits argument to range from -1 to 1.
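A minimal sketch of that advice (assuming heatmaply's cool_warm palette function and its colors/limits arguments, as documented in the package):

library(heatmaply)

# A correlation matrix shown with a divergent palette and symmetric limits,
# so the negative and positive ends of the range are visually distinct.
heatmaply(cor(mtcars),
          colors = cool_warm(256),  # divergent palette shipped with heatmaply
          limits = c(-1, 1))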

Passing NULL to the Colv argument, in our example, removed the column dendrogram (since we wish to keep the order of the columns, relating to the years). The row dendrogram is automatically calculated using hclust with a Euclidean distance measure and the average linkage function. The user can choose an alternative clustering function (hclustfun), distance measure (dist_method), or linkage function (hclust_method), or can keep a dendrogram only in the rows/columns or none at all (through the dendrogram argument). The users can also supply their own dendrogram objects to the Rowv (or Colv) arguments. The preparation of the dendrograms can be made easier using the dendextend R package (Galili, 2015) for comparing and adjusting dendrograms. These choices are all left for the user to decide. Setting the k_col/k_row argument to NA makes the function search for the number of clusters (from 2 to 10) by which to color the branches of the dendrogram. The number picked is the one that yields the highest average silhouette coefficient (based on the find_k function from dendextend). Lastly, the heatmaply function uses the seriation package to find an “optimal” ordering of rows and columns (Hahsler et al., 2008). This is controlled using the seriation argument, where the default is “OLO” (optimal-leaf-order), which rotates the branches so that the sum of distances between adjacent leaves (labels) is minimized (i.e., it optimizes the Hamiltonian path length restricted by the dendrogram structure). The other arguments in the example were omitted since they are self-explanatory; the exact code is available in the supplementary material.
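For instance, a user-supplied row dendrogram might be built and adjusted with dendextend before being handed to heatmaply; the sketch below assumes the measles matrix from the example above is already loaded:

library(heatmaply)
library(dendextend)

# Build the row dendrogram explicitly, then adjust it with dendextend
row_dend <- as.dendrogram(hclust(dist(sqrt(measles)), method = "average"))
row_dend <- color_branches(row_dend, k = 3)   # color three row clusters

heatmaply(sqrt(measles),
          Rowv = row_dend,   # user-supplied dendrogram
          Colv = NULL)       # keep the chronological column order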

In order to make some of the above easier, we created the shinyHeatmaply package (available on CRAN), which offers a GUI to guide the researcher through heatmap construction, with the functionality to export the heatmap as an HTML file together with a summary of the parameter specifications needed to reproduce the heatmap with heatmaply. For a more detailed step-by-step demonstration of using heatmaply on biological datasets, you should explore the heatmaplyExamples package (at github.com/talgalili/heatmaplyExamples).

The biological examples are available and fully reproducible from within the package. You may also view them online; the HTML files there also include the R code for producing the figures.

Acknowledgements

The heatmaply package was made possible by leveraging many wonderful R packages, including ggplot2 (Wickham, 2009), plotly (Sievert et al., 2016), dendextend (Galili, 2015) and many others. We would also like to thank Yoav Benjamini, Madeline Bauer, and Marilyn Friedes for their helpful comments, as well as Joe Cheng for initiating the collaboration with Tal Galili on d3heatmap, which helped lay the foundation for heatmaply.

Funding: This work was supported in part by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project).  

Conflict of Interest: none declared.

References
  • Galili,T. (2015) dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31, 3718–3720.
  • Gehlenborg,N. and Wong,B. (2012) Points of view: Heat maps. Nat. Methods, 9, 213–213.
  • Gu,Z. et al. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32, 2847–2849.
  • Hahsler,M. et al. (2008) Getting Things in Order : An Introduction to the R Package seriation. J. Stat. Softw., 25, 1–27.
  • van Panhuis,W.G. et al. (2013) Contagious Diseases in the United States from 1888 to the Present. N. Engl. J. Med., 369, 2152–2158.
  • R Core Team (2016) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Sieger,T. et al. (2017) Interactive Dendrograms: The R Packages idendro and idendr0. J. Stat. Softw., 76.
  • Sievert,C. et al. (2016) plotly: Create Interactive Web Graphics via ‘plotly.js’.
  • Weinstein,J.N. (2008) BIOCHEMISTRY: A Postgenomic Visual Icon. Science, 319, 1772–1773.
  • Wickham,H. (2009) ggplot2: Elegant Graphics for Data Analysis. Springer, New York.
  • Wilkinson,L. and Friendly,M. (2009) The History of the Cluster Heat Map. Am. Stat., 63, 179–184.


Making a Shiny dashboard using ‘highcharter’ – Analyzing Inflation Rates

Mon, 10/30/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Shiny is an amazing R package that lets R developers and users build web apps using R itself. It lets R users analyze and visualize data and deploy their machine learning models directly in the form of a web app. The package lets you host standalone apps on a webpage, embed them in R Markdown documents, or build dashboards and forecasting applications. You can also extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions. Shiny lets us write both the client-side front-end code and the server-side code in R itself. More details on this package can be found here.

I recently learned Shiny and started developing a web application with it. Since then I have been in love with it and have been using it in every data science and analytics project. The syntax is easy to understand, and there are lots of good articles and documentation available for learning it. I personally had a background in developing full-stack web applications with HTML, CSS, JavaScript and other JS-based scripting languages, so I found the syntax easy to pick up.

This article is meant for people who have at least a basic understanding of how to use Shiny in R, or who have some experience developing web apps. Still, it is easy for a beginner or intermediate R user to learn Shiny and its syntax well enough to develop web apps.

In this article I am going to demonstrate how to make a dashboard using the shinydashboard package. shinydashboard is another package, built on top of Shiny, that is specifically designed for building dashboards in R and deploying them. More details on this package can be found here.

Secondly, I am going to demonstrate how to use the highcharter package, an R wrapper for the Highcharts JavaScript visualization library, to develop attractive plots. Its syntax is somewhat similar to the qplot function from the ggplot2 package in R, so if you have good experience with ggplot2 you will find it easy to use. Details about highcharter can be found here.

Analyzing Inflation Rates

In this article I am going to demonstrate how to use the shinydashboard and highcharter packages in R to analyze and visualize the inflation rates of major economies and other economic and regional trade unions.

What are inflation rates

The inflation rate is the general rate at which the prices of goods and services within an economy are rising, so that the purchasing power of the currency declines. High inflation is definitely not good for an economy because it reduces the value of money. In general, an economy's central bank works towards keeping inflation low and avoiding deflation.

Deflation is the opposite of inflation. Deflation occurs when the inflation rate becomes negative, i.e. falls below 0. Deflation is more harmful and dangerous for an economy because it means that the prices of goods and services are going to decrease. This sounds great for consumers, but what it actually signals is that the demand for goods and services has declined over a long period of time, which in turn indicates that a recession is on its way. That brings job losses, declining wages and a big hit to stock portfolios. Deflation slows an economy's growth: as prices fall, people defer (postpone) purchases in the hope of a better, lower-priced deal. Because of this, companies have to cut the prices of their goods and products, which in turn pushes down the wages of their employees.

I won't explain much more about these economic terms here; I leave it to the reader to explore them further. I am sure you will find the subject quite interesting.

Let's start with designing the UI in R

In Shiny you can write the UI and server-side code in a single file, but I prefer to write the client-side and server-side code in separate files for easier understanding and modularity, and to keep things from getting messy as the code grows.

#loading the packages
library(shinydashboard)
require(shiny)
require(highcharter)

#layout of the dashboard
#defining character vectors for select inputs
country <- c("India","United States","Mexico","Canada","China, People's Republic of",
             "Japan","Russian Federation","Germany","United Kingdom","European Union",
             "ASEAN-5","New Zealand","Australia","Netherlands","Luxembourg",
             "France","Qatar","United Arab Emirates","Saudi Arabia")

unions <- c("Major advanced economies (G7)","European Union","Emerging and Developing Europe",
            "ASEAN-5","Commonwealth of Independent States","Emerging and Developing Asia",
            "Latin America and the Caribbean",
            "Middle East, North Africa, Afghanistan, and Pakistan")

#function used to define the dashboard
dashboardPage(
  #defines header
  skin = "red",

  #header of the dashboard
  dashboardHeader(
    title = "Inflation Rates",
    dropdownMenu()
  ),

  #defines sidebar of the dashboard
  dashboardSidebar(
    sidebarMenu(
      #the sidebar menu items
      menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
      menuItem("About", tabName = "about", icon = icon("th")),
      menuItem("Trade Unions", tabName = "unions", icon = icon("signal")),
      menuItem("World", tabName = "world", icon = icon("globe"))
    )),

  #defines the body of the dashboard
  dashboardBody(
    #to add external CSS
    tags$head(
      tags$link(rel = "stylesheet", type = "text/css", href = "custom.css")
    ),

    tabItems(
      #First TAB Menu - dashboard; the first argument should be the 'tabName' value of the menuItem function
      tabItem(tabName = "dashboard",
              fluidRow(
                column(12,
                       #box() is similar to a 'div' element in HTML
                       box(
                         selectInput("country", label = "Select Country", choices = country),
                         width = 12)  # end box
                ),  #end column
                #box for plotting the time series plot
                column(12,
                       box(
                         #the function below defines a highcharter output plot which will be made on the server side
                         highchartOutput("hcontainer"),
                         width = "12")  #end box2
                ),  #end column
                br(),  #line break
                h4("Relative inflation rates time series plot", align = "center"),
                br(),
                column(12,
                       box(highchartOutput("hc2"), width = 12))
              ),  #end row
              h4("Made with love from", strong("Anish Singh Walia")),
              a("R code for this project", target = "_blank",
                href = "https://github.com/anishsingh20/Analzying-Inflation-Rates-Worldwide")
      ),

      #second tab menu - ABOUT
      tabItem(tabName = "about",
              h2("What is Inflation ?", style = "text-align:center"),
              br(),
              br(),
              box(width = 12, height = "400px",
                  p(style = "font-size:20px", strong("Inflation"), "rates are the general rate at which price of the goods and services within a particular economy are rising and the purchasing power of the currency is declining due to the higher priced goods. High inflation is definitely not good for an economy because it will always reduce the value for money. In general central banks of an economy tries to and work towards reducing the inflation rate and avoiding deflation."),
                  p(style = "font-size:20px", strong("Deflation"), "is opposite of inflation. Deflation occurs when the inflation rates become negative or are below 0. Deflation is more harmful and dangerous for an economy because it means that the prices of goods and services are going to decrease. Now, this sounds amazing for consumers like us. But what actually happens is that the demand for goods and services have declined over a long term of time. This directly indicates that a recession is on its way. This brings job losses, declining wages and a big hit to the stock portfolio. Deflation slows economy's growth. As prices fall, people defer (postpone) purchases in hope of a better lower price deal. Due to this companies and firms have to cut down the cost of their goods and products which directly affects the wages of the employees which have to be lowered.")
              )
      ),

      tabItem(tabName = "unions",
              h3("Time series of Inflation rates of Economic trade unions", align = "center"),
              fluidRow(
                column(12,
                       box(selectInput("region", label = "Select Economic Region", choices = unions), width = 12)
                ),
                box(highchartOutput("hc3"), width = 12)
              )  # end row
      ),

      tabItem(tabName = "world",
              h3("World's Inflation Rates", align = "center"),
              box(highchartOutput("hc4"), width = 12)
      )
    )  #end tabItems
  )  #end body
)  #end dashboard

In the above code, you can notice that the functions used to design the UI mirror HTML elements. Shiny lets us use them as R functions, with the HTML attributes passed as function arguments. It's hard to explain everything in the above code, but I encourage readers to check the documentation and help files of these functions and to look at the resources linked at the end.

The highchartOutput(id) function takes an id as its argument and defines a placeholder for a highcharter plot that will be built and rendered on the server side. Here we only design and define the UI and layout of the dashboard.
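For readers new to this pattern, here is a minimal, self-contained sketch (not the dashboard's own code) of how a highchartOutput() placeholder in the UI pairs with a renderHighchart() call on the server:

library(shiny)
library(highcharter)

ui <- fluidPage(
  highchartOutput("hcontainer")            # placeholder defined in the UI
)

server <- function(input, output) {
  output$hcontainer <- renderHighchart({   # plot built on the server side
    hchart(mtcars, "scatter", hcaes(x = wt, y = mpg))
  })
}

# shinyApp(ui, server)   # uncomment to run the app locally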

Let’s write the Server logic

All the server-side code lives in a separate file named server.R. The server side contains only the logic to reactively and dynamically build the various plots.

The inflation rates dataset is downloaded from the IMF (International Monetary Fund) website and is publicly available. I first had to do some preprocessing and transformation using the tidyr and dplyr packages to convert the dataset to the desired format.

require(shinydashboard)
require(ggplot2)
require(dplyr)
require(highcharter)  #to plot amazing time series plots
library(readxl)
require(tidyr)

#loading the dataset
inflation <- read_excel("inflation.xls")

year <- c(1980:2022)        #making a vector consisting of all years
year <- as.character(year)  #converting to character type to use in gather()

#converting the dataset to a long format
#(note: the assignments and chart calls marked "reconstructed" below were garbled when
# the post was extracted; they are plausible reconstructions, not the exact original)
inf <- inflation %>% gather(year, key = "Year", value = "InflationRate")  #reconstructed
inf <- na.omit(inf)  #omitting NA values
names(inf) <- c("region", "year", "inflation")
inf$year <- as.integer(inf$year)

India <- filter(inf, region == "India")
India$inflation <- as.numeric(India$inflation)
India$year <- as.numeric(India$year)
China <- filter(inf, region == "China, People's Republic of")
Ger   <- filter(inf, region == "Germany")
Japan <- filter(inf, region == "Japan")
US    <- filter(inf, region == "United States")
EU    <- filter(inf, region == "European Union")
UK    <- filter(inf, region == "United Kingdom")
Fr    <- filter(inf, region == "France")
uae   <- filter(inf, region == "United Arab Emirates")

#server function and logic
server <- function(input, output) {

  output$hcontainer <- renderHighchart({  #write all R code inside this
    #input$country extracts the select-input value from the UI; we then build
    #a data frame for the selected country
    df <- inf %>% filter(region == input$country)  #reconstructed
    df$inflation <- as.numeric(df$inflation)
    df$year <- as.numeric(df$year)
    hchart(df, "line", hcaes(x = year, y = inflation)) %>%  #reconstructed
      hc_exporting(enabled = TRUE) %>%  #to enable downloading the plot image
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>%  #designing the hoverable tooltip
      hc_title(text = "Time series plot of Inflation Rates", align = "center") %>%  #title
      hc_subtitle(text = "Data Source: IMF", align = "center") %>%  #subtitle of the plot
      hc_add_theme(hc_theme_elementary())  #adding theme
      #to add 3-d effects:
      #hc_chart(type = "column",
      #         options3d = list(enabled = TRUE, beta = 15, alpha = 15))
  })  # end hcontainer

  output$hc2 <- renderHighchart({
    highchart() %>%  #reconstructed
      hc_xAxis(categories = inf$year) %>%
      hc_add_series(name = "India", data = India$inflation) %>%
      hc_add_series(name = "USA", data = US$inflation) %>%
      hc_add_series(name = "UK", data = UK$inflation) %>%
      hc_add_series(name = "China", data = China$inflation) %>%
      hc_add_series(name = "Germany", data = Ger$inflation) %>%
      hc_add_series(name = "Japan", data = Japan$inflation) %>%
      #to add colors
      hc_colors(c("red", "blue", "green", "purple", "darkpink", "orange")) %>%
      hc_add_theme(hc_theme_elementary())
  })  # end hc2

  output$hc3 <- renderHighchart({
    union <- inf %>% filter(region == input$region)  #reconstructed
    union$year <- as.numeric(union$year)
    union$inflation <- as.numeric(union$inflation)
    hchart(union, "line", hcaes(x = year, y = inflation)) %>%  #reconstructed
      hc_exporting(enabled = TRUE) %>%
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>%
      hc_title(text = "Time series plot of Inflation Rates for Economic Unions", align = "center") %>%
      hc_subtitle(text = "Data Source: IMF", align = "center") %>%
      hc_add_theme(hc_theme_elementary())
  })  #end hc3

  output$hc4 <- renderHighchart({
    world <- inf %>% filter(region == "World")  #reconstructed
    world$year <- as.numeric(world$year)
    world$inflation <- as.numeric(world$inflation)
    hchart(world, "line", hcaes(x = year, y = inflation)) %>%  #reconstructed
      hc_exporting(enabled = TRUE) %>%
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>%
      hc_title(text = "Time series plot of Inflation Rates for World", align = "center") %>%
      hc_subtitle(text = "Data Source: IMF", align = "center") %>%
      hc_add_theme(hc_theme_elementary())
  })  #end hc4
}

In the above code we access each plot defined in the UI through its id, in the form output$id. We use the function renderHighchart({ ... }) to render the plots, writing all the R code that builds the output inside it. Inside this function, as you can see, is simply the R code that makes the plot.

The function hchart(df, type, hcaes(x, y), ...) works in much the same way as qplot. It takes the data, the type of plot, and the plot aesthetics as its arguments. Other elements such as the title and subtitle are piped onto it, much as we add extra layers and plotting options in ggplot. For more details read http://jkunst.com/highcharter/hchart.html
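A standalone sketch of that grammar, using built-in data rather than the inflation dataset:

library(highcharter)

hchart(mtcars, "scatter", hcaes(x = wt, y = mpg, group = cyl)) %>%
  hc_title(text = "Weight vs. miles per gallon") %>%
  hc_subtitle(text = "Data: mtcars") %>%
  hc_add_theme(hc_theme_elementary())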

Screenshots of the Dashboard:

Deploying a Shiny app is a matter of minutes. You just need an account on shinyapps.io, which offers infrastructure as a service (IaaS) to run and manage your application, and you can deploy the app directly from the RStudio IDE.
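A rough sketch of that deployment step (the account details below are placeholders, not real credentials):

# install.packages("rsconnect")
library(rsconnect)

# One-time account setup, using the token shown on your shinyapps.io dashboard
# setAccountInfo(name = "<account>", token = "<token>", secret = "<secret>")

deployApp(appDir = ".")   # deploys the Shiny app in the current directory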

Conclusion

This was just a short tutorial on how to make a dashboard in R using the shinydashboard and highcharter packages. The article is not meant to explain everything in detail, but rather to demonstrate how shinydashboard and highcharter can be used to build awesome dashboards and web apps. I will attach various resources below that helped me and that should help with learning to develop web apps in R.

Resources

Hope this article is motivating enough to at least get you started with learning and developing your own web apps in R and with using highcharter to visualize data. For any doubts or queries, feel free to drop a comment; I will be happy to help. Do like and share the article.
Happy coding!

    Related Post

    1. Time Series Analysis in R Part 2: Time Series Transformations
    2. Time Series Analysis in R Part 1: The Time Series Object
    3. Parsing Text for Emotion Terms: Analysis & Visualization Using R
    4. Using MongoDB with R
    5. Finding Optimal Number of Clusters


    Increasing Airline Customer Satisfaction

    Mon, 10/30/2017 - 02:38

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    Motivation

    The sheer size of the airline industry provides a reason to care about it: it affects not only millions of people directly (flyers, pilots, engineers, etcetera), but also millions more indirectly by the heft of its economic presence. In a December 2016 report, the International Air Transport Association (IATA) wrote:

    “While airline industry profits are expected to have reached a cyclical peak in 2016 of $35.6 billion, a soft landing in profitable territory is expected in 2017 with a net profit of $29.8 billion. 2017 is expected to be the eighth year in a row of aggregate airline profitability, illustrating the resilience to shocks that have been built into the industry structure. On average, airlines will retain $7.54 for every passenger carried.” (reference)

    As a resident of the US, the daughter of an airline pilot, and a semi-frequent flyer, I have a personal interest in the US airline industry. In looking at carriers by region in the same article mentioned above, the IATA concluded: “The strongest financial performance is being delivered by airlines in North America. Net post-tax profits will be the highest at $18.1 billion next year […]. The net margin for the region’s carriers is also expected to be the strongest at 8.5% with an average profit of $19.58/passenger.”

    Although the North American airline industry is strong, it must be ever-vigilant about keeping up with customer demands in order to maintain its continued growth and its continued position as industry leader across regions. Of course, success in this regard requires airlines to know what customers care about in the first place. Discovering what airline customers like and dislike about their flight experiences was the starting point for this project. To view the project in an interactive Shiny App, click here.

    Data 

    To understand more precisely which aspects of a flight shape customers’ opinions, I decided to scrape the website Skytrax, which collects customer-written reviews of flights for nearly every airline in operation. A typical review appears as follows:

    To get the data from posts of this kind into an analyzable format, I wrote a python script using Selenium to define a web scraping spider, the code for which can be found here on my Github. Of the hundreds of airlines available for reviewing on Skytrax, I limited the scope of my investigation to the eleven largest US-based companies: Alaska Airlines, Allegiant Air, American Airlines, Delta Air Lines, Frontier Airlines, Hawaiian Airlines, JetBlue Airways, Southwest Airlines, Spirit Airlines, United Airlines, and Virgin America. The variables I included in the scrape were:

    • airline: Airline with which the review-writer flew
    • overall: Overall airline rating (out of 10) given by the review-writer
    • author: Name of the review-writer
    • date: Date the review was written
    • customer_review: Text of the customer review
    • aircraft: Aircraft class/type on which the review-writer flew (possibilities too numerous to list; example: Boeing 737)
    • traveller_type: Type of traveller of the review-writer (Business, Couple Leisure, Family Leisure, Solo Leisure)
    • cabin: Type of cabin/class in which the review-writer flew (Business Class, Economy Class, First Class, Premium Economy)
    • route: Origin and destination of the flight (example: Chicago to Boston)
    • date_flown: Month and year in which the flight in question took place
    • seat_comfort: Rating (out of 5) of seat comfort
    • cabin_service: Rating (out of 5) of the inflight service
    • food_bev: Rating (out of 5) of the quality of the inflight food and beverages
    • entertainment: Rating (out of 5) of the availability and connectivity of the inflight entertainment; includes movies and wifi connectivity
    • ground_service: Rating (out of 5) of the service on the ground before and/or after the flight
    • value_for_money: Rating (out of 5) of the value of the airline against the cost of a ticket
    • recommended: Does the review-writer plan to recommend the airline to others (Yes or No)

    One detail of the data worth mentioning is that the variables coming from the “table” in the review (aircraft, type of traveller, etcetera) were not required fields for the review writer. This meant that different reviews could contain different subsets of variables from that table. This affected both the engineering and analysis aspects of this project. On the engineering side, the changing tables from review to review forced me to use Selenium rather than the much faster python package Scrapy. On the analysis side, reviews that did not contain all possible table fields would have missing values (NAs) for those omitted variables, and missingness was leveraged to help answer the more focused questions that follow.

     

    Question 1

    Out of seat comfort, cabin service, food and beverages, entertainment, and ground service, which aspect of a flight has the most influence on a customer's overall rating?

    This is a classic machine learning question that is easy to ask but difficult to answer, the difficulty lying in the potentially subtle interactions among the predictor variables. To attack this question in my project, I turned to random forests; this choice was motivated by the need to avoid machine learning techniques (such as linear regression) that rely on normality assumptions for the predictor variables. Given the biased nature of reviews, nearly all of the predictor variables in question here (seat comfort, cabin service, food and beverages, entertainment, ground service) were not normally distributed but rather bimodal, with peaks near ratings of 1 and 5.

    I used the randomForest() function from the R package “randomForest”,  which uses the non-parametric Breiman random forest algorithm to produce regression models. As a side perk, it estimates the importance of each predictor variable, relative to the others in question, in predicting the response variable. This output is what I used to determine which of my five variables was most important to the overall flight rating. The details of the Breiman random forest regression algorithm as well as the details for the algorithms used to determine the prediction importance can be found here. Below is a visual of the variable importance output from the randomForest() function when run on my data:
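    A rough sketch of that computation (the data frame name reviews and the exact call are assumptions; the original post shows only the resulting importance plot):

    library(randomForest)

    set.seed(42)
    rf_fit <- randomForest(
      overall ~ seat_comfort + cabin_service + food_bev + entertainment + ground_service,
      data = reviews,          # assumed name of the scraped reviews data frame
      importance = TRUE,
      na.action = na.omit
    )

    importance(rf_fit)   # relative importance of each predictor
    varImpPlot(rf_fit)   # visual comparison of predictor importance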

    One thing to note is that the above computation was done on a version of my dataset in which I had imputed “3” ratings into all the missing ratings. This was done with the idea that if someone had omitted a variable from their review, they likely felt ambivalent about it and would give a middling rating of 3/5 if pressed. However, this was an assumption that could alter my results so I did two things:

    1. Ran the same computation on three different versions of my data: one with “3” imputed for NAs; one with the variable mean imputed for NAs in that variable; one with only reviews that were entirely complete. The above result, as mentioned, is from the first version of imputed data. While the importance ratings (the numbers themselves) changed from version to version, the order of variable importance was constant throughout. This allows me to conclude that according to the Breiman algorithm, ground service was the most important variable in predicting a customer’s overall rating for a flight, followed by seat comfort, cabin service, food and beverage, and entertainment (in that order).
    2. Analyzed the proportion of reviews that included each variable. Imputing missing values inevitably skews results, but eliminating them costs us potentially valuable information. In this case, I believed missingness in various fields tended to fall into the category of “Missing Not at Random,” meaning that the cause for missingness was actually related to the value of the variable in question; in particular, I believed fields with high amounts of missingness were likely less important to customers than fields with very little missingness. To analyze this, I graphed the proportion of reviews that included each variable:
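    The proportion computation in point 2 can be sketched with current dplyr/tidyr as follows (the reviews data frame name is an assumption):

    library(dplyr)
    library(tidyr)

    # Proportion of reviews that filled in each rating field
    reviews %>%
      summarise(across(c(seat_comfort, cabin_service, food_bev,
                         entertainment, ground_service),
                       ~ mean(!is.na(.)))) %>%
      pivot_longer(everything(),
                   names_to = "field", values_to = "prop_included")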

    From this we see that cabin service and seat comfort are fields that are included in nearly every review, while ground service is only included in about 55% of the reviews.

    Insight:  Cabin service and seat comfort are filled in by nearly every customer who writes a review, so these aspects of a flight are important to most customers. Since these variables rank as second and third most important in predicting overall flight score according to the Breiman random forest algorithm, we may conclude that these two fields have the most influence on a customer’s overall rating of a flight. Furthermore, while ground service appears to be the field most often omitted from reviews (and therefore possibly the least important to customers in general), overall flight rating is highly dependent on ground service for those customers who do include it. In other words, most people don’t care too much about ground service, but those who do care a lot. 

     

    Question 2

    How are US airlines performing across different aspects of customer flight experience?

    Given our results in Question 1, an airline may now want to compare itself to other airlines and to the industry as a whole across the variables of cabin service, entertainment, food and beverage, ground service, and seat comfort. To analyze this, I counted the number of reviews giving 1, 2, 3, 4, 5, and NA ratings for each variable and within each airline, as well as for the industry as a whole. For seat comfort ratings specifically, we have the following results:

     

    Each of the five variables (cabin service, entertainment, food and beverage, ground service, and seat comfort) was explored, and going over all of them leads to the following observations:

    • JetBlue Airways had the best ratings for seat comfort across all airlines and thus should market themselves as the industry leaders in seat comfort. Similarly, Alaska Airlines should market themselves as the industry leader in cabin service. Given the results from Question 1, both JetBlue and Alaska would likely see a boost in sales if customers knew they lead in seat comfort and cabin service since these are the variables (of the five studied so far) that impact customers’ overall impressions of a flight the most.
    • The industry as a whole sees quite a lot of low or missing ratings in entertainment. An airline that has received middling ratings in other fields could distinguish itself with entertainment (inflight movies, wifi access, etcetera) and potentially influence more customers to care more about their inflight entertainment experiences.
    • Spirit Airlines consistently receives mostly 1 ratings, which suggests that across all considered fields customers tend to be unhappy with their experience. Yet, Spirit Airlines continues to grow. This suggests a need for more exploration into the needs and wants of airline customers.

    Insight: On the whole the US airline industry is doing the best with cabin service and the worst with ground service and seat comfort (in these fields, there were fewer 5s than any other rating). In addition, there is a huge lack of ratings for entertainment. Within each of the five fields studied, there are opportunities for individual airlines to capitalize on the difference between their own performance and that of the industry: either boast about leading in a currently important field (seat comfort or cabin service) or put money towards becoming a leader in an overlooked arena (entertainment, ground service, food and beverage).

     

    Question 3

    What words come up most frequently in positive reviews? In negative reviews?

    The previous questions aimed to better understand airline customer views on five specific aspects of flight experience (cabin service, entertainment, food and beverage, ground service, seat comfort), but since these five fields do not account for all that could impact a customer’s overall experience, I wanted to analyze the actual text of their reviews. To do this, I used word clouds generated by a combination of the “tm”, “wordcloud,” and “memoise” packages in R. I analyzed the positive and negative reviews separately and for both individual airlines and the industry as a whole. Positive reviews were those with overall ratings of 6/10 or better and negative reviews were those with overall ratings of 5/10 or worse.
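    A rough sketch of the word-cloud construction (it assumes a character vector review_text holding either the positive or the negative review texts; the post's own code may differ):

    library(tm)
    library(wordcloud)
    library(RColorBrewer)

    corpus <- VCorpus(VectorSource(review_text))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    tdm   <- TermDocumentMatrix(corpus)
    freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

    wordcloud(names(freqs), freqs, max.words = 100,
              colors = brewer.pal(8, "Dark2"))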

    Here are the positive and negative word clouds for the industry as a whole:

    In both the positive and negative reviews, the word ‘time’ is one of the top three most-used words. This suggests that, indeed, the five fields considered in questions 1 and 2 did not capture all that is important to flyers (an obvious statement to flyers, of course!). Looking through the word clouds for all the airlines individually further emphasizes that customers primarily rave about and complain about time-related topics. Interestingly, the word ‘hours’ comes up in negative reviews, suggesting that customers can tolerate a moderate amount of lateness or delay (say, under an hour) but lose patience after that. Conversely, no amount of time comes up in positive reviews; potentially, any amount of time that an airline can save a customer is rewarded.

    The words ‘seats’ and ‘service’ still appear in the top five words too, though, so the analysis from the previous questions is corroborated.

    In addition to providing additional information about the trends in the industry, individual airlines can benefit from these word cloud analyses by seeing words that appear in their clouds and not in others (or vice versa). For example, ‘Atlanta’ comes up in the negative reviews for Delta, suggesting that Delta has problems in Atlanta; similarly, Frontier has problems in Denver, and Hawaiian airline customers mention class – both positively and negatively – more than those of any other airline. Spirit, although it ranks as the worst among the airlines in nearly every one of the five fields considered in questions 1 and 2, has a negative word cloud that is not distinguished from the other negative word clouds. That is, Spirit customers still complain the most in text reviews about delays and time, and in fact, Spirit customers mention the word ‘seat’ in their positive reviews!

    Insight: Airline customers write about time more than anything else, followed by service and seats. Given the surprising findings about Spirit Airlines, it’s possible that saving/wasting time is an even stronger predictor of a customer’s overall flight rating than anything else. In business practice, this would suggest that anything an airline can do to save customers time would boost their sales; for some airlines, making tweaks even in single cities may lead to higher level of customer satisfaction.

     

    Conclusion and future directions

    Although a myriad of factors impact a flyer's experience with a flight, airlines can boost customer satisfaction by focusing on a few main aspects, particularly time, seat comfort, and cabin service. This project only scratches the surface in understanding all that shapes flyers' impressions, and each flight aspect merits a full investigation of its own. For example, research into the cost and benefit of improving seat comfort would be highly beneficial for an airline looking to improve sales through seat comfort. In addition, I'd like to employ more advanced machine learning techniques to help predict customer flight ratings and, subsequently, sales. For this, I'd like to look into the unused data I scraped about routes, traveller type, cabin, and date flown, as well as incorporate weather and economic data.

    The post Increasing Airline Customer Satisfaction appeared first on NYC Data Science Academy Blog.



    Deep learning and the German Data Science Job Market

    Mon, 10/30/2017 - 01:00

    (This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

    Almost two years ago, I wrote a short post on the German data science market, analysing the open positions on the key job platforms Stepstone, Monster and Indeed. Recently I received requests to update the post with fresh data.

    So here it comes. To add a slightly different spin, I thought it would be interesting to see how widespread and prevalent the deep learning / AI trend shows up in the job market.

    To get started, I scraped the number of available job openings from the sites Monster, Stepstone and Indeed, searching for the term “Data Scientist”.

    To put the numbers in perspective: in December 2015 I found 65 jobs on Indeed, 36 on Monster and 36 on Stepstone. As a first take-away, the data science market has grown somewhere between 5- and 10-fold.
    The ~600 positions on Indeed are advertised by 254 companies, almost a 5-fold increase versus 2015.

    In order to see how important deep learning is, I compare the number of open positions for that search term with other data-related search terms. There are about 400 open positions mentioning deep learning; data mining and machine learning are more common by a factor of 2 and a factor of 4.5, respectively.

    Looking at advertised methods, we see that it is always good to have some standard data preparation and statistics skills under your belt. Looking at the technology search terms, we see a huge interest in Spark (almost a 10-fold increase compared to the 2015 data). What is especially interesting here is that there are “only” 330 data engineer positions; hence big data technologies are mentioned across a much wider range of job postings.

    Finally, looking at programming languages and business intelligence tools we see that Java and Excel are still highly in demand.

    Two years ago I concluded by stating:
    Given this analysis, I would conclude that Germany is not (yet) a hot market for data scientist. Both in terms of regional distribution as well as the low absolute and relative number of open positions suggests that the job market is not well developed.

    Given the 5-fold increase in positions, I think German companies have finally realized that the digital transformation requires highly skilled, data-educated employees.



    2017-04 Speeding up gridSVG

    Sun, 10/29/2017 - 23:23

    (This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

    This report describes changes in version 1.6-0 of the ‘gridSVG’ package for R. The most important result of these changes is that ‘gridSVG’ is now much faster at generating SVG files (in situations where it used to be painfully slow).

    Paul Murrell

    Download



    Big Data Transforms

    Sun, 10/29/2017 - 16:33

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    As part of our consulting practice Win-Vector LLC has been helping a few clients stand-up advanced analytics and machine learning stacks using R and substantial data stores (such as relational database variants such as PostgreSQL or big data systems such as Spark).



    Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."

    The R package DBI gives us direct access to SQL and the package dplyr gives us access to a transform grammar that can either be executed or translated into SQL.

    But, as we point out in the replyr README: moving from in-memory R to large data systems is always a bit of a shock as you lose a lot of your higher order data operators or transformations. Missing operators include:

    • union (binding by rows many data frames into a single data frame).
    • split (splitting a single data frame into many data frames).
    • pivot (moving row values into columns).
    • un-pivot (moving column values to rows).

    To repeat: if you are an R user used to using one of dplyr::bind_rows(), base::split(), tidyr::spread(), or tidyr::gather(), you will find these functions do not work on remote data sources, but they have replacement implementations in the replyr package.

    For example:

    library("RPostgreSQL") ## Loading required package: DBI suppressPackageStartupMessages(library("dplyr")) isSpark <- FALSE # Can work with PostgreSQL my_db <- DBI::dbConnect(dbDriver("PostgreSQL"), host = 'localhost', port = 5432, user = 'postgres', password = 'pg') # # Can work with Sparklyr # my_db <- sparklyr::spark_connect(version='2.2.0', # master = "local") # isSpark <- TRUE d <- dplyr::copy_to(my_db, data.frame(x = c(1,5), group = c('g1', 'g2'), stringsAsFactors = FALSE), 'd') print(d) ## # Source: table [?? x 2] ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres] ## x group ## ## 1 1 g1 ## 2 5 g2 # show dplyr::bind_rows() fails. dplyr::bind_rows(list(d, d)) ## Error in bind_rows_(x, .id): Argument 1 must be a data frame or a named atomic vector, not a tbl_dbi/tbl_sql/tbl_lazy/tbl

    The replyr package supplies R accessible implementations of these missing operators for large data systems such as PostgreSQL and Spark.

    For example:

    # using the development version of replyr https://github.com/WinVector/replyr
    library("replyr")
    ## Loading required package: seplyr
    ## Loading required package: wrapr
    ## Loading required package: cdata
    packageVersion("replyr")
    ## [1] '0.8.2'

    # binding rows
    dB <- replyr_bind_rows(list(d, d))
    print(dB)
    ## # Source:   table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##       x group
    ##
    ## 1     1    g1
    ## 2     5    g2
    ## 3     1    g1
    ## 4     5    g2

    # splitting frames
    replyr_split(dB, 'group')
    ## $g2
    ## # Source:   table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##       x group
    ##
    ## 1     5    g2
    ## 2     5    g2
    ##
    ## $g1
    ## # Source:   table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##       x group
    ##
    ## 1     1    g1
    ## 2     1    g1

    # pivoting
    pivotControl <- buildPivotControlTable(d,
                                           columnToTakeKeysFrom = 'group',
                                           columnToTakeValuesFrom = 'x',
                                           sep = '_')
    dW <- moveValuesToColumnsQ(keyColumns = NULL,
                               controlTable = pivotControl,
                               tallTableName = 'd',
                               my_db = my_db,
                               strict = FALSE) %>%
      compute(name = 'dW')
    print(dW)
    ## # Source:   table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##   group_g1 group_g2
    ##
    ## 1        1        5

    # un-pivoting
    unpivotControl <- buildUnPivotControlTable(nameForNewKeyColumn = 'group',
                                               nameForNewValueColumn = 'x',
                                               columnsToTakeFrom = colnames(dW))
    moveValuesToRowsQ(controlTable = unpivotControl,
                      wideTableName = 'dW',
                      my_db = my_db)
    ## # Source:   table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##      group     x
    ##
    ## 1 group_g1     1
    ## 2 group_g2     5

    The point is: using the replyr package you can design in terms of higher-order data transforms, even when working with big data in R. Designs in terms of these operators tend to be succinct, powerful, performant, and maintainable.

    To master the terms moveValuesToRows and moveValuesToColumns I suggest trying the following two articles:

    if(isSpark) {
      status <- sparklyr::spark_disconnect(my_db)
    } else {
      status <- DBI::dbDisconnect(my_db)
    }
    my_db <- NULL

    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Practical Machine Learning with R and Python – Part 4

    Sun, 10/29/2017 - 14:40

    (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

    This is the 4th installment of my ‘Practical Machine Learning with R and Python’ series. In this part I discuss classification with Support Vector Machines (SVMs), using both a Linear and a Radial basis kernel, and Decision Trees. Further, a closer look is taken at some of the metrics associated with binary classification, namely accuracy vs precision and recall. I also touch upon Validation curves, Precision-Recall curves, ROC curves and AUC, with equivalent code in R and Python.

    This post is a continuation of my 3 earlier posts on Practical Machine Learning in R and Python
    1. Practical Machine Learning with R and Python – Part 1
    2. Practical Machine Learning with R and Python – Part 2
    3. Practical Machine Learning with R and Python – Part 3

    The RMarkdown file with the code and the associated data files can be downloaded from Github at MachineLearning-RandPython-Part4

    Support Vector Machines (SVMs) are another useful Machine Learning model that can be used for both regression and classification problems. When used for classification, SVMs compute the hyperplane that separates the 2 classes with the maximum margin. To do this the features may be transformed into a larger multi-dimensional feature space. SVMs can be used with different kernels, namely linear, polynomial or radial basis, to determine the best fitting model for a given classification problem.

    In the 2nd part of this series, Practical Machine Learning with R and Python – Part 2, I mentioned the various metrics that are used in classification ML problems, namely Accuracy, Precision, Recall and F1 score. Accuracy gives the fraction of data that were correctly classified as belonging to the +ve or -ve class. However, ‘accuracy’ by itself is not a good enough measure because it does not take into account the fraction of the data that were incorrectly classified. This issue becomes even more critical in certain domains. For example, a surgeon who would like to detect cancer would rather err on the side of caution and classify a possibly non-cancerous patient as possibly having cancer than mis-classify a malignancy as benign. Here we would like to increase recall, or sensitivity, which is given by Recall = TP/(TP+FN); in other words, we try to reduce mis-classification by either increasing the true positives (TP) or reducing the false negatives (FN).

    On the other hand, search algorithms would like to increase precision, which tries to reduce the number of irrelevant results returned by a search. Precision = TP/(TP+FP). In other words, we do not want ‘false positives’, i.e. irrelevant results, to show up in the search results, so there is a need to reduce the false positives.
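    As a quick aside (a minimal sketch that is not part of the original post), these metrics can be computed directly from the four confusion-matrix counts; the counts below are purely hypothetical:

    # Hypothetical counts from a 2 x 2 confusion matrix (TP, FP, FN, TN are assumptions)
    TP <- 82; FP <- 3; FN <- 3; TN <- 54

    accuracy  <- (TP + TN) / (TP + TN + FP + FN)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)            # also called sensitivity
    f1        <- 2 * precision * recall / (precision + recall)

    round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 4)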

    When we try to increase ‘precision’, we do so at the cost of ‘recall’, and vice-versa. I found this diagram and explanation in Wikipedia very useful (Source: Wikipedia):

    “Consider a brain surgeon tasked with removing a cancerous tumor from a patient’s brain. The surgeon needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain she removes to ensure she has extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the brain she removes to ensure she extracts only cancer cells. This decision increases precision but reduces recall. That is to say, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).”

    1.1a. Linear SVM – R code

    In the R code below I use an SVM with a linear kernel.

    source('RFunctions-1.R')
    library(dplyr)
    library(e1071)
    library(caret)
    library(reshape2)
    library(ggplot2)
    # Read data. Data from SKLearn
    cancer <- read.csv("cancer.csv")
    cancer$target <- as.factor(cancer$target)
    # Split into training and test sets
    train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
    train <- cancer[train_idx, ]
    test <- cancer[-train_idx, ]
    # Fit a linear basis kernel. Do not scale the data
    svmfit=svm(target~., data=train, kernel="linear",scale=FALSE)
    ypred=predict(svmfit,test)
    # Print a confusion matrix
    confusionMatrix(ypred,test$target)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction  0  1
    ##          0 54  3
    ##          1  3 82
    ##
    ##                Accuracy : 0.9577
    ##                  95% CI : (0.9103, 0.9843)
    ##     No Information Rate : 0.5986
    ##     P-Value [Acc > NIR] : <2e-16
    ##
    ##                   Kappa : 0.9121
    ##  Mcnemar's Test P-Value : 1
    ##
    ##             Sensitivity : 0.9474
    ##             Specificity : 0.9647
    ##          Pos Pred Value : 0.9474
    ##          Neg Pred Value : 0.9647
    ##              Prevalence : 0.4014
    ##          Detection Rate : 0.3803
    ##    Detection Prevalence : 0.4014
    ##       Balanced Accuracy : 0.9560
    ##
    ##        'Positive' Class : 0

    1.1b Linear SVM – Python code

    The code below creates an SVM with a linear basis kernel in Python and also dumps the corresponding classification metrics.

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.datasets import make_classification, make_blobs
    from sklearn.metrics import confusion_matrix
    from matplotlib.colors import ListedColormap
    from sklearn.datasets import load_breast_cancer

    # Load the cancer data
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
    clf = LinearSVC().fit(X_train, y_train)
    print('Breast cancer dataset')
    print('Accuracy of Linear SVC classifier on training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of Linear SVC classifier on test set: {:.2f}'
          .format(clf.score(X_test, y_test)))
    ## Breast cancer dataset
    ## Accuracy of Linear SVC classifier on training set: 0.92
    ## Accuracy of Linear SVC classifier on test set: 0.94

    1.2 Dummy classifier

    Often when we perform classification tasks using any ML model, namely logistic regression, SVM, neural networks etc., it is very useful to determine how well the ML model performs against a dummy classifier. A dummy classifier uses some simple computation, like the frequency of the majority class, instead of fitting an ML model. It is essential that our ML model does much better than the dummy classifier. This problem is even more important with imbalanced classes where we may have only about 10% of +ve samples. If an ML model we create has an accuracy of about 0.90, it is evident that our classifier is not doing any better than a dummy classifier, which can just take a majority count of this imbalanced class and also come up with 0.90. We need to be able to do better than that.

    In the examples below (1.3a & 1.3b) it can be seen that SVMs with ‘radial basis’ kernel with unnormalized data, for both R and Python, do not perform any better than the dummy classifier.

    1.2a Dummy classifier – R code

    R does not seem to have an explicit dummy classifier, so I created a simple one that predicts the majority class. Sklearn in Python also includes other strategies like uniform, stratified etc., but these should be possible to create in R as well.
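    As an illustration of one of those other strategies, here is a rough sketch (not the author's code) of a "stratified" dummy classifier in base R, which samples predictions with the class frequencies seen in training; the author's majority-class implementation follows below:

    # Sketch only: a "stratified" dummy classifier in base R.
    # Assumes train and test are data frames with a factor column named 'target'.
    DummyStratifiedAccuracy <- function(train, test) {
        probs <- prop.table(table(train$target))
        preds <- sample(names(probs), size = nrow(test),
                        replace = TRUE, prob = probs)
        mean(preds == as.character(test$target))   # accuracy on the test set
    }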

    # Create a simple dummy classifier that computes the ratio of the majority class to the total
    DummyClassifierAccuracy <- function(train, test, type="majority"){
        if(type=="majority"){
            count <- sum(train$target==1)/dim(train)[1]
        }
        count
    }

    cancer <- read.csv("cancer.csv")
    cancer$target <- as.factor(cancer$target)
    # Create training and test sets
    train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
    train <- cancer[train_idx, ]
    test <- cancer[-train_idx, ]
    # Dummy classifier majority class
    acc=DummyClassifierAccuracy(train,test)
    sprintf("Accuracy is %f",acc)
    ## [1] "Accuracy is 0.638498"

    1.2b Dummy classifier – Python code

    This dummy classifier uses the majority class.

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import confusion_matrix

    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
    # Negative class (0) is most frequent
    dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
    y_dummy_predictions = dummy_majority.predict(X_test)
    print('Dummy classifier accuracy on test set: {:.2f}'
          .format(dummy_majority.score(X_test, y_test)))
    ## Dummy classifier accuracy on test set: 0.63

    1.3a – Radial SVM (un-normalized) – R code

    SVMs perform better when the data is normalized or scaled. The 2 examples below show that an SVM with a radial basis kernel on unnormalized data does not perform any better than the dummy classifier.

    library(dplyr)
    library(e1071)
    library(caret)
    library(reshape2)
    library(ggplot2)

    # Radial SVM unnormalized
    train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
    train <- cancer[train_idx, ]
    test <- cancer[-train_idx, ]
    # Unnormalized data
    svmfit=svm(target~., data=train, kernel="radial",cost=10,scale=FALSE)
    ypred=predict(svmfit,test)
    confusionMatrix(ypred,test$target)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction  0  1
    ##          0  0  0
    ##          1 57 85
    ##
    ##                Accuracy : 0.5986
    ##                  95% CI : (0.5131, 0.6799)
    ##     No Information Rate : 0.5986
    ##     P-Value [Acc > NIR] : 0.5363
    ##
    ##                   Kappa : 0
    ##  Mcnemar's Test P-Value : 1.195e-13
    ##
    ##             Sensitivity : 0.0000
    ##             Specificity : 1.0000
    ##          Pos Pred Value : NaN
    ##          Neg Pred Value : 0.5986
    ##              Prevalence : 0.4014
    ##          Detection Rate : 0.0000
    ##    Detection Prevalence : 0.0000
    ##       Balanced Accuracy : 0.5000
    ##
    ##        'Positive' Class : 0

    1.3b – Radial SVM (un-normalized) – Python code

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Load the cancer data
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
    clf = SVC(C=10).fit(X_train, y_train)
    print('Breast cancer dataset (unnormalized features)')
    print('Accuracy of RBF-kernel SVC on training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of RBF-kernel SVC on test set: {:.2f}'
          .format(clf.score(X_test, y_test)))
    ## Breast cancer dataset (unnormalized features)
    ## Accuracy of RBF-kernel SVC on training set: 1.00
    ## Accuracy of RBF-kernel SVC on test set: 0.63

    1.5a – Radial SVM (Normalized) – R Code

    The data is scaled (normalized) before using the SVM model. The SVM model has 2 parameters: a) C – a large C means less regularization, a small C more regularization; b) gamma – a small gamma gives a larger, smoother decision boundary with more misclassification, while a larger gamma gives a tighter decision boundary.

    The R code below computes the accuracy as the regularization parameter is changed.

    trainingAccuracy <- NULL
    testAccuracy <- NULL
    C1 <- c(.01,.1, 1, 10, 20)
    for(i in C1){
        svmfit=svm(target~., data=train, kernel="radial",cost=i,scale=TRUE)
        ypredTrain <- predict(svmfit,train)
        ypredTest = predict(svmfit,test)
        a <- confusionMatrix(ypredTrain,train$target)
        b <- confusionMatrix(ypredTest,test$target)
        trainingAccuracy <- c(trainingAccuracy,a$overall[1])
        testAccuracy <- c(testAccuracy,b$overall[1])
    }
    print(trainingAccuracy)
    ##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
    ## 0.6384977 0.9671362 0.9906103 0.9976526 1.0000000
    print(testAccuracy)
    ##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
    ## 0.5985915 0.9507042 0.9647887 0.9507042 0.9507042

    a <- rbind(C1,as.numeric(trainingAccuracy),as.numeric(testAccuracy))
    b <- data.frame(t(a))
    names(b) <- c("C1","trainingAccuracy","testAccuracy")
    df <- melt(b,id="C1")
    ggplot(df) + geom_line(aes(x=C1, y=value, colour=variable),size=2) +
        xlab("C (SVC regularization)value") + ylab("Accuracy") +
        ggtitle("Training and test accuracy vs C(regularization)")

    1.5b – Radial SVM (normalized) – Python

    The Radial basis kernel is used on normalized data for a range of ‘C’ values and the result is plotted.

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    # Load the cancer data
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    print('Breast cancer dataset (normalized with MinMax scaling)')
    trainingAccuracy=[]
    testAccuracy=[]
    for C1 in [.01,.1, 1, 10, 20]:
        clf = SVC(C=C1).fit(X_train_scaled, y_train)
        acctrain=clf.score(X_train_scaled, y_train)
        accTest=clf.score(X_test_scaled, y_test)
        trainingAccuracy.append(acctrain)
        testAccuracy.append(accTest)

    # Create a dataframe
    C1=[.01,.1, 1, 10, 20]
    trainingAccuracy=pd.DataFrame(trainingAccuracy,index=C1)
    testAccuracy=pd.DataFrame(testAccuracy,index=C1)
    # Plot training and test accuracy as a function of C
    df=pd.concat([trainingAccuracy,testAccuracy],axis=1)
    df.columns=['trainingAccuracy','testAccuracy']
    fig1=df.plot()
    fig1=plt.title('Training and test accuracy vs C (SVC)')
    fig1.figure.savefig('fig1.png', bbox_inches='tight')
    ## Breast cancer dataset (normalized with MinMax scaling)

    Output image: 

    1.6a Validation curve – R code

    Sklearn includes a function for creating validation curves: it varies a parameter and computes and plots the accuracy as gamma or C is changed. I did not find an equivalent in R, but I think it is a useful function, so I have created the R equivalent below.

    # The R equivalent of np.logspace
    seqLogSpace <- function(start,stop,len){
        a = seq(log10(10^start), log10(10^stop), length=len)
        10^a
    }

    # Read the data. This is taken from the SKlearn cancer data
    cancer <- read.csv("cancer.csv")
    cancer$target <- as.factor(cancer$target)
    set.seed(6)

    # Create the range of the parameter in log space
    param_range = seqLogSpace(-3,2,20)
    # Initialize the overall training and test accuracy to NULL
    overallTrainAccuracy <- NULL
    overallTestAccuracy <- NULL

    # Loop over the parameter range of gamma
    for(i in param_range){
        # Set no of folds
        noFolds=5
        # Create the rows which fall into different folds from 1..noFolds
        folds = sample(1:noFolds, nrow(cancer), replace=TRUE)
        # Initialize the training and test accuracy of folds to 0
        trainingAccuracy <- 0
        testAccuracy <- 0
        # Loop through the folds
        for(j in 1:noFolds){
            # The training set is all rows for which the fold index is != j (k-1 folds -> training)
            train <- cancer[folds!=j,]
            # The rows which have j as the index become the test set
            test <- cancer[folds==j,]
            # Create a SVM model for this fold
            svmfit=svm(target~., data=train, kernel="radial",gamma=i,scale=TRUE)
            # Add up the fold accuracy for training and test separately
            ypredTrain <- predict(svmfit,train)
            ypredTest = predict(svmfit,test)
            # Create confusion matrices
            a <- confusionMatrix(ypredTrain,train$target)
            b <- confusionMatrix(ypredTest,test$target)
            # Get the accuracy
            trainingAccuracy <- trainingAccuracy + a$overall[1]
            testAccuracy <- testAccuracy + b$overall[1]
        }
        # Compute the average accuracy over the K folds for parameter value 'i'
        overallTrainAccuracy = c(overallTrainAccuracy, trainingAccuracy/noFolds)
        overallTestAccuracy = c(overallTestAccuracy, testAccuracy/noFolds)
    }

    # Create a dataframe
    a <- rbind(param_range, as.numeric(overallTrainAccuracy), as.numeric(overallTestAccuracy))
    b <- data.frame(t(a))
    names(b) <- c("C1","trainingAccuracy","testAccuracy")
    df <- melt(b,id="C1")
    # Plot on a log axis
    ggplot(df) + geom_line(aes(x=C1, y=value, colour=variable),size=2) +
        xlab("C (SVC regularization)value") + ylab("Accuracy") +
        ggtitle("Training and test accuracy vs C(regularization)") +
        scale_x_log10()

    1.6b Validation curve – Python

    Compute and plot the validation curve as gamma is varied.

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import validation_curve

    # Load the cancer data
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X_cancer)

    # Create gamma values from 10^-3 to 10^2 with 20 equally spaced intervals
    param_range = np.logspace(-3, 2, 20)
    # Compute the validation curve
    train_scores, test_scores = validation_curve(SVC(), X_scaled, y_cancer,
                                                 param_name='gamma',
                                                 param_range=param_range, cv=10)

    # Plot the figure
    fig2=plt.figure()
    # Compute the mean and standard deviation
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    fig2=plt.title('Validation Curve with SVM')
    fig2=plt.xlabel('$\gamma$ (gamma)')
    fig2=plt.ylabel('Score')
    fig2=plt.ylim(0.0, 1.1)
    lw = 2
    fig2=plt.semilogx(param_range, train_scores_mean, label='Training score',
                      color='darkorange', lw=lw)
    fig2=plt.fill_between(param_range, train_scores_mean - train_scores_std,
                          train_scores_mean + train_scores_std, alpha=0.2,
                          color='darkorange', lw=lw)
    fig2=plt.semilogx(param_range, test_scores_mean, label='Cross-validation score',
                      color='navy', lw=lw)
    fig2=plt.fill_between(param_range, test_scores_mean - test_scores_std,
                          test_scores_mean + test_scores_std, alpha=0.2,
                          color='navy', lw=lw)
    fig2.figure.savefig('fig2.png', bbox_inches='tight')

    Output image: 

    1.7a Validation Curve (Preventing data leakage) – Python code

    In this course, Applied Machine Learning in Python, the Professor states that when we apply the same data transformation to an entire dataset, it will cause data leakage. “The proper way to do cross-validation when you need to scale the data is not to scale the entire dataset with a single transform, since this will indirectly leak information into the training data about the whole dataset, including the test data (see the lecture on data leakage later in the course). Instead, scaling/normalizing must be computed and applied for each cross-validation fold separately”

    So I apply separate scaling to the training and testing folds and plot. In the lecture the Prof states that this can be done using pipelines.
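    For readers who want an R analogue, caret's train() is generally understood to estimate preProcess transformations on the training portion of each resample only, which achieves the same goal as a pipeline. A hedged sketch, assuming the same cancer.csv used earlier and the kernlab package for method = "svmRadial":

    library(caret)
    # Sketch only: per-fold MinMax-style scaling via preProcess = "range"
    cancer <- read.csv("cancer.csv")
    cancer$target <- as.factor(cancer$target)
    set.seed(6)
    fit <- train(target ~ ., data = cancer,
                 method = "svmRadial",                     # requires the kernlab package
                 preProcess = "range",                     # scaling estimated within each fold
                 trControl = trainControl(method = "cv", number = 5),
                 tuneLength = 5)
    print(fit)

    The author's manual fold-wise Python implementation is shown below.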

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.cross_validation import KFold
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    # Read the data
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    # Set the parameter range
    param_range = np.logspace(-3, 2, 20)
    # Set number of folds
    folds=5
    # Initialize
    overallTrainAccuracy=[]
    overallTestAccuracy=[]

    # Loop over the parameter range
    for c in param_range:
        trainingAccuracy=0
        testAccuracy=0
        kf = KFold(len(X_cancer),n_folds=folds)
        # Partition into training and test folds
        for train_index, test_index in kf:
            # Partition the data according to the fold indices generated
            X_train, X_test = X_cancer[train_index], X_cancer[test_index]
            y_train, y_test = y_cancer[train_index], y_cancer[test_index]
            # Scale X_train and X_test separately within the fold
            scaler = MinMaxScaler()
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)
            # Fit a SVC model for each C
            clf = SVC(C=c).fit(X_train_scaled, y_train)
            # Compute the training and test score
            acctrain=clf.score(X_train_scaled, y_train)
            accTest=clf.score(X_test_scaled, y_test)
            trainingAccuracy += np.sum(acctrain)
            testAccuracy += np.sum(accTest)
        # Compute the mean training and testing accuracy
        overallTrainAccuracy.append(trainingAccuracy/folds)
        overallTestAccuracy.append(testAccuracy/folds)

    overallTrainAccuracy=pd.DataFrame(overallTrainAccuracy,index=param_range)
    overallTestAccuracy=pd.DataFrame(overallTestAccuracy,index=param_range)
    # Plot training and test accuracy as a function of the parameter
    df=pd.concat([overallTrainAccuracy,overallTestAccuracy],axis=1)
    df.columns=['trainingAccuracy','testAccuracy']

    fig3=plt.title('Validation Curve with SVM')
    fig3=plt.xlabel('$\gamma$ (gamma)')
    fig3=plt.ylabel('Score')
    fig3=plt.ylim(0.5, 1.1)
    lw = 2
    fig3=plt.semilogx(param_range, overallTrainAccuracy, label='Training score',
                      color='darkorange', lw=lw)
    fig3=plt.semilogx(param_range, overallTestAccuracy, label='Cross-validation score',
                      color='navy', lw=lw)
    fig3=plt.legend(loc='best')
    fig3.figure.savefig('fig3.png', bbox_inches='tight')

    Output image: 

    1.8 a Decision trees – R code

    Decision trees in R can be built with the rpart package and plotted with rpart.plot.

    library(rpart)
    library(rpart.plot)
    rpart = NULL
    # Create a decision tree
    m <- rpart(Species~., data=iris)
    # Plot
    rpart.plot(m, extra=2, main="Decision Tree - IRIS")

     

    1.8 b Decision trees – Python code

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import tree
    from sklearn.model_selection import train_test_split
    import graphviz

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 3)
    clf = DecisionTreeClassifier().fit(X_train, y_train)

    print('Accuracy of Decision Tree classifier on training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of Decision Tree classifier on test set: {:.2f}'
          .format(clf.score(X_test, y_test)))

    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names,
                                    filled=True, rounded=True,
                                    special_characters=True)
    graph = graphviz.Source(dot_data)
    graph
    ## Accuracy of Decision Tree classifier on training set: 1.00
    ## Accuracy of Decision Tree classifier on test set: 0.97

    1.9a Feature importance – R code

    I found the following code, which had a snippet for feature importance. Sklearn has a nice method for this. For some reason the results in R and Python are different. Any thoughts?

    set.seed(3)
    # load the library
    library(mlbench)
    library(caret)
    # load the dataset
    cancer <- read.csv("cancer.csv")
    cancer$target <- as.factor(cancer$target)
    # Split as data
    data <- cancer[,1:31]
    target <- cancer[,32]

    # Train the model
    model <- train(data, target, method="rf", preProcess="scale",
                   trControl=trainControl(method = "cv"))
    # Compute variable importance
    importance <- varImp(model)
    # summarize importance
    print(importance)
    # plot importance
    plot(importance)
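    One possible reason for the difference is that the R code above fits a random forest (method = "rf") while the Python code below fits a single decision tree, so the two importance measures are not directly comparable. A hedged sketch of a closer R analogue, using rpart's variable.importance on the same cancer.csv, might look like this:

    library(rpart)
    # Sketch only: a single decision tree, closer to the Python DecisionTreeClassifier below
    cancer <- read.csv("cancer.csv")
    cancer$target <- as.factor(cancer$target)
    m <- rpart(target ~ ., data = cancer,
               control = rpart.control(maxdepth = 4, minbucket = 8))
    # Variable importance as computed by rpart
    sort(m$variable.importance, decreasing = TRUE)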

    1.9b Feature importance – Python code

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_breast_cancer

    # Read the data
    cancer = load_breast_cancer()
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
    # Use the DecisionTreeClassifier
    clf = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 8,
                                 random_state = 0).fit(X_train, y_train)
    c_features = len(cancer.feature_names)
    print('Breast cancer dataset: decision tree')
    print('Accuracy of DT classifier on training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of DT classifier on test set: {:.2f}'
          .format(clf.score(X_test, y_test)))

    # Plot the feature importances
    fig4=plt.figure(figsize=(10,6),dpi=80)
    fig4=plt.barh(range(c_features), clf.feature_importances_)
    fig4=plt.xlabel("Feature importance")
    fig4=plt.ylabel("Feature name")
    fig4=plt.yticks(np.arange(c_features), cancer.feature_names)
    fig4=plt.tight_layout()
    plt.savefig('fig4.png', bbox_inches='tight')
    ## Breast cancer dataset: decision tree
    ## Accuracy of DT classifier on training set: 0.96
    ## Accuracy of DT classifier on test set: 0.94

    Output image: 

    1.10a Precision-Recall, ROC curves & AUC- R code

    I tried several R packages for plotting Precision-Recall and ROC curves and computing AUC; PRROC seems to work well. The Precision-Recall curves show the tradeoff between precision and recall: the higher the precision, the lower the recall, and vice versa. ROC curves that hug the top left corner indicate high sensitivity and specificity, and hence excellent accuracy.

    source("RFunctions-1.R") library(dplyr) library(caret) library(e1071) library(PRROC) # Read the data (this data is from sklearn!) d <- read.csv("digits.csv") digits <- d[2:66] digits$X64 <- as.factor(digits$X64) # Split as training and test sets train_idx <- trainTestSplit(digits,trainPercent=75,seed=5) train <- digits[train_idx, ] test <- digits[-train_idx, ] # Fit a SVM model with linear basis kernel with probabilities svmfit=svm(X64~., data=train, kernel="linear",scale=FALSE,probability=TRUE) ypred=predict(svmfit,test,probability=TRUE) head(attr(ypred,"probabilities")) ## 0 1 ## 6 7.395947e-01 2.604053e-01 ## 8 9.999998e-01 1.842555e-07 ## 12 1.655178e-05 9.999834e-01 ## 13 9.649997e-01 3.500032e-02 ## 15 9.994849e-01 5.150612e-04 ## 16 9.999987e-01 1.280700e-06 # Store the probability of 0s and 1s m0<-attr(ypred,"probabilities")[,1] m1<-attr(ypred,"probabilities")[,2] # Create a dataframe of scores scores <- data.frame(m1,test$X64) # Class 0 is data points of +ve class (in this case, digit 1) and -ve class (digit 0) #Compute Precision Recall pr <- pr.curve(scores.class0=scores[scores$test.X64=="1",]$m1, scores.class1=scores[scores$test.X64=="0",]$m1, curve=T) # Plot precision-recall curve plot(pr)

    # Plot the ROC curve
    roc <- roc.curve(m0, m1, curve=TRUE)
    plot(roc)

    1.10b Precision-Recall, ROC curves & AUC- Python code

    For Python, Logistic Regression is used to plot the Precision-Recall and ROC curves and to compute the AUC.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_digits
    from sklearn.metrics import precision_recall_curve
    from sklearn.metrics import roc_curve, auc

    # Load the digits
    dataset = load_digits()
    X, y = dataset.data, dataset.target

    # Create 2 classes - i) Digit 1 (from digit 1) ii) Digit 0 (from all other digits)
    # Make a copy of the target
    z = y.copy()
    # Replace all non 1's as 0
    z[z != 1] = 0

    X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
    # Fit a LR model
    lr = LogisticRegression().fit(X_train, y_train)

    # Compute the decision scores
    y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
    y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))
    # Show the decision_function scores for first 20 instances
    y_score_list

    precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
    closest_zero = np.argmin(np.abs(thresholds))
    closest_zero_p = precision[closest_zero]
    closest_zero_r = recall[closest_zero]

    # Plot
    plt.figure()
    plt.xlim([0.0, 1.01])
    plt.ylim([0.0, 1.01])
    plt.plot(precision, recall, label='Precision-Recall Curve')
    plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
    plt.xlabel('Precision', fontsize=16)
    plt.ylabel('Recall', fontsize=16)
    plt.axes().set_aspect('equal')
    plt.savefig('fig5.png', bbox_inches='tight')

    # Compute and plot the ROC
    y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
    fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
    roc_auc_lr = auc(fpr_lr, tpr_lr)

    plt.figure()
    plt.xlim([-0.01, 1.00])
    plt.ylim([-0.01, 1.01])
    plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
    plt.legend(loc='lower right', fontsize=13)
    plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
    plt.axes()
    plt.savefig('fig6.png', bbox_inches='tight')

    output

    output

    1.10c Precision-Recall, ROC curves & AUC- Python code

    In the code below, classification probabilities are used to compute and plot the precision-recall curve (the same approach extends to the ROC curve and AUC).

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_digits
    from sklearn.svm import LinearSVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.metrics import precision_recall_curve

    dataset = load_digits()
    X, y = dataset.data, dataset.target
    # Make a copy of the target
    z = y.copy()
    # Replace all non 1's as 0
    z[z != 1] = 0

    X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)

    svm = LinearSVC()
    # Need to use CalibratedClassifierCV to predict probabilities for LinearSVC
    clf = CalibratedClassifierCV(svm)
    clf.fit(X_train, y_train)
    y_proba_lr = clf.predict_proba(X_test)

    precision, recall, thresholds = precision_recall_curve(y_test, y_proba_lr[:,1])
    closest_zero = np.argmin(np.abs(thresholds))
    closest_zero_p = precision[closest_zero]
    closest_zero_r = recall[closest_zero]

    #plt.figure(figsize=(15,15),dpi=80)
    plt.figure()
    plt.xlim([0.0, 1.01])
    plt.ylim([0.0, 1.01])
    plt.plot(precision, recall, label='Precision-Recall Curve')
    plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
    plt.xlabel('Precision', fontsize=16)
    plt.ylabel('Recall', fontsize=16)
    plt.axes().set_aspect('equal')
    plt.savefig('fig7.png', bbox_inches='tight')

    output

    Note: As with other posts in this series on ‘Practical Machine Learning with R and Python’, this post is based on these 2 MOOC courses
    1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online
    2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

    Conclusion

    This 4th part looked at SVMs with linear and radial basis, decision trees, precision-recall tradeoff, ROC curves and AUC.

    Stick around for further updates. I’ll be back!
    Comments, suggestions and corrections are welcome.

    Also see
    1. A primer on Qubits, Quantum gates and Quantum Operations
    2. Dabbling with Wiener filter using OpenCV
    3. The mind of a programmer
    4. Sea shells on the seashore
    5. yorkr pads up for the Twenty20s: Part 1 – Analyzing team's match performance

    To see all posts see Index of posts


    To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Capturing the Brexit vote in data

    Sun, 10/29/2017 - 13:27

    (This article was first published on fReigeist » R, and kindly contributed to R-bloggers)

    In my recent research I have worked on understanding the key correlates of Brexit. This paper is joint work with my co-authors Sascha Becker and Dennis Novy and has now been published in Economic Policy. After having gone through the peer-review process, I am very happy to share the data and the underlying code.

    On the construction of a rich local authority level dataset

    After the EU Referendum, I started working on assembling a dataset to study the question to what extent low turnout among potential remain voters may help to understand the result of the referendum. In particular, back then I looked at the extent to which rainfall may have discouraged people to turn up to vote in the first place. After all, a lot of polls, the financial markets and betting markets seemed to indicate that Remain would win.

    It was not inconceivable to imagine that turnout may have been particularly low among a set of voters living in the London commuter belt, who were turned off from voting on a working day after an arduous journey into London that was marred by train cancelations due to the bad weather.

    The result then was that there appears to have been a weak effect on turnout, but this effect was not biased in either direction with regard to voter preferences: Remain would have lost also on a sunny day (we confirm this finding in the new paper with a different rainfall data set).

    In any case, this was the starting point to a significant data collection effort. In particular we have combined data from all the 380 local authorities in England, Scotland and Wales as well as some data of ward level voting across five cities. The resulting set of covariates is quite sizable and they can be roughly grouped into the following four categories:

    1. characteristics of the underlying economic structure of an area, like the unemployment rate, wages or sectoral employment shares and changes therein,
    2. demographic and human capital characteristics like age structure, education or life satisfaction,
    3. exposure to the EU measured by indicators like the level of EU transfers, trade and the growth rate of immigration from other EU countries,
    4. public service provision and fiscal consolidation that covers the share of public employment, as well as the reduction in public spending per capita among others and NHS performance metrics.

    To analyse which of these covariates are strong correlates of the Leave vote share in an area, we used a simple machine-learning method that chooses the subset of covariates that best predicts the referendum Leave share. The idea is to build a robust predictive model that can achieve significant out-of-sample prediction accuracy.

    More formally, best subset selection solves the following optimization problem to obtain an optimal vector beta that can robustly predict the vote leave share y_c
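    The original post displays the formulation as an image; a standard statement of the best subset problem, reconstructed here with y_c the Leave share and x_c the vector of covariates in local authority c, is

    $$\min_{\beta}\; \sum_{c} \left( y_c - x_c^{\top}\beta \right)^2 \quad \text{subject to} \quad \lVert \beta \rVert_0 \le k,$$

    where the constraint counts the number of non-zero coefficients and k is the allowed subset size.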

    While the formulation may be a bit technical, it boils down to estimating all possible ways to combine regressors. That is, all possible models that include 1, 2, 3, …, p covariates are estimated. This amounts to estimating 2^p models, which becomes infeasible very quickly. The Lasso and other methods, such as forward and backward stepwise selection, solve approximations to the combinatorial problem that BSS solves. The method is usually taught at the beginning of machine learning courses as it provides a great way to illustrate the bias-variance tradeoff.

    The result can be summarized in a simple bar chart and is somewhat striking:

     

    R2 of best models with different groups of regressors

    What this suggests is that “Fundamental factors” such as economic structure, demographics and human capital are the strongest correlates of the Leave vote in the UK; the direct level effects of migration and trade exposure captured in the group of regressors called “EU Exposure” seem second order.

    Also striking is that even very simple empirical models with just 10–20 variables do a good job of capturing the variation in the vote leave share across the whole of the UK. The best model, which includes variables measuring vote shares of different parties (especially UKIP) from the 2014 European Parliamentary election, captures 95% of the overall variation.

    The observation that variables capturing direct exposure to the European Union, in particular immigration, explain relatively little of the Leave vote seems at odds with voter narratives around the EU referendum, which were dominated by the immigration topic. In fact, the 2015 British Election Study suggests that voters considered immigration to be the single most important issue of the time.

    Unigram word cloud constructed out of responses to question “what is single most important issue facing country at the time” from the 2015 BES study

    In the above word cloud the keyword immigration is quite striking. Looking at the bigram word cloud drawn from the British Election Study, the themes of the individual responses become even more apparent.

    Bigram word cloud constructed out of responses to question “what is single most important issue facing country at the time” from the 2015 BES study

    There, phrases like “many people”, “small island” and “uncontrolled immigration” appear in addition to the simple immigration keyword. But other themes, such as “national debt”, “health service” and concerns about the “welfare system”, also feature quite prominently. Overall this suggests that immigration may have been a salient feature in the public debate, but it seems at odds with the fact that the variable groups pertaining to EU immigration capture very little of the variation in the intensity of the leave vote across the UK.

     

    In another paper with Sascha we confirm this observation in a panel setup. We show that a local area’s exposure to Eastern European immigration has, at best, only a small impact on anti-EU preferences in the UK, as measured by UKIP voting in European Parliamentary elections.

    While our code for the Economic Policy paper is implemented in Stata, it is very easy to replicate this work in R. Below is an example of how you would implement BSS in R through the leaps package.

    The leaps package has a nice visualization of which regressors are included in which of the best models. The below example replicates the results in our Table 1.

    library(data.table)
    library(haven)
    library(leaps)

    DTA <- data.table(read_dta(file="data/brexitevote-ep2017.dta"))

    # TABLE 1 IN THE PAPER WOULD BE REPLICATED BY RUNNING
    TABLE1 <- regsubsets(Pct_Leave ~ InitialEUAccession + MigrantEUAccessionGrowth +
                           InitialOldEUMember + MigrantOldEUMemberGrowth +
                           InitialElsewhere + MigrantElsewhereGrowth +
                           Total_EconomyEU_dependence + eufundspercapitapayment13 +
                           EU75Leaveshare,
                         nbest=1, data=DTA)

    plot(TABLE1, scale="r2")

     

    Plotting best subset results

    If you want to replicate the word clouds from the 2015 British Election study, the following lines of code will do. You will need to download the British Election Study dataset here.

    library(data.table)
    library(haven)
    library(quanteda)

    DTA <- data.table(read_dta(file="data/bes_f2f_original_v3.0.dta"))

    C <- corpus(DTA$A1)
    docnames(C) <- DTA$finalserialno

    C.dfmunigram <- dfm(C, tolower = TRUE, stem = FALSE,
                        removeNumbers=TRUE, removePunct=TRUE,
                        remove = stopwords("english"), ngrams=1)
    C.dfm <- dfm(C, tolower = TRUE, stem = FALSE,
                 removeNumbers=TRUE, removePunct=TRUE,
                 remove = stopwords("english"), ngrams=2)

    plot(C.dfmunigram)
    plot(C.dfm)

    To leave a comment for the author, please follow the link and comment on their blog: fReigeist » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Moving to Blogdown

    Sun, 10/29/2017 - 02:00

    (This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

    As I announced last friday on Twitter I have moved Data Imaginist to Blogdown from Jekyll. This is going to be a short post about why I did it. There’s not going to be much tutorial about this; the migration is as painless as can be expected and besides, other more qualified people have already made great tutorials about the matter.

    A little over a year ago I started blogging, using what I expect most were using at the time (in R-land anyway) – the Jekyll/Github Pages combo. I’d stolen a script from David Robinson that converted new RMarkdown files into Markdown documents for Jekyll, resulting in a workflow quite like what Blogdown offers. So when Blogdown arrived, I had little inclination to look more into it. Blogdown might be newer and more flashy, but it didn’t seem to offer enough to warrant the time investment it would take to migrate my posts and learn about Hugo. Arguably it is possible to ignore the Hugo backend when using Blogdown, but I like to tinker and would never be satisfied using an engine I didn’t understand (this might explain why I have been poking at ggplot2 to the extent I have).

    Bottomline: I let it pass…

    Then, little by little, I was drawn in anyway. The first seducer was the slew of blogposts describing the bliss that migrating to Blogdown had brought – in addition, those posts also spoke highly of the hosting from Netlify and their free HTTPS support. Being around Bob Rudis (in the Twitterverse) had made me aware of the need for secure browsing habits, so this appealed a lot to me (even though the HTTPS support came through Let’s Encrypt – something Bob vehemently opposes). The second seducer was more of a repulsor from my current setup: my page was falling apart… Due to changes in Jekyll my internal links were no longer functioning and I would need to go through my whole theme to fix it. What was most annoying was that this stabbed me in the back: Jekyll had been upgraded on Github Pages at some point, silently wreaking havoc on my site without me knowing it – clearly an unacceptable setup. Lastly, as convenient as the automatic Jekyll building by Github was (when it didn’t break my site), it was just as annoying to build locally to preview a post before publishing. Maybe it is because I haven’t spent much time in Ruby-world, but I never quite managed to get everything installed correctly so I could ensure that my own system matched what Github was using… The final push came when I for a brief period was the designer of the Blogdown logo and decided that now was the right time to make the switch.

    As I said in the beginning, this will not be a post about how to switch to Blogdown from Jekyll as I’m horrible at writing such posts. I can, however, describe my approach to migrating. I started by browsing the available Hugo themes and to my joy the one I was currently using had been ported and improved upon, so the choice was easy (to be honest I’m not really pleased with the look of my blog, but the Cactus theme is pretty stylish in its minimalism). Once the theme was sorted out I began moving my posts into the new system one by one to see how it went. To my surprise it went predominantly without a hitch. Blogdown sets some other default figure sizes, which resulted in some of my animations becoming unbearably large, so this had to be adjusted, but other than that it was mainly a matter of switching the Jekyll macros for Hugo shortcodes. My most used macro was for linking between blogposts, which is akin to the ref shortcode, but it took me a while to figure out that the argument to ref had to be enclosed in escaped quotes to survive the RMarkdown + Pandoc conversion. This could be a nice thing for Blogdown to take care of, as it feels pretty cumbersome to write a path as '\\"my-post.html\\"'…

    Once I had gotten all my posts successfully into Blogdown I needed to get the rest going, which for the most part meant porting all my hacks and changes from the Jekyll theme into the Hugo version. One thing I’m pretty serious about is how links to my posts look when shared on Twitter, so I had made a system where I could set the Twitter card type in the preamble of a post (large vs small) as well as a picture to use instead of my default logo. Porting this meant getting a bit into the Hugo templating system so that I could migrate my changes to the HTML header that gives these nice cards on Twitter/Facebook/LinkedIn. In the end HTML templates seem to be so alike nowadays that moving from one to the next is mostly a matter of understanding the data binding and how it differs between the systems. If you have some custom templates in Jekyll I can assure you that they probably won’t take long to port over to Hugo…

    My last big todo was my RSS feed. Had I only had one it would not have been an issue at all, but as it is my page actually has two – a general one and one unique to posts in the R category (this was only made for R-Bloggers). Hugo comes with automatic RSS generation for both the main site and for different tags and categories, but they all reside at different locations. My two feeds both reside at the root of my page – something not really supported by Hugo. But I couldn’t accept breaking subscriptions or links, so I scavenged the net and came to a solution – as my git commits tell, it was not straightforward…

    In the end I prevailed. If you are in a similar situation do look at my source code for the solution or contact me – I won’t bore the remaining 99.9 % with the details. The RSS episode was my first experience battling the Hugo system, and arguably only because I had to support a setup from Jekyll that is handled in another way by Hugo…

    I wish I had a lot to tell you about my experience with Netlify, but I haven’t. Their service is the posterchild of easy, one-click solutions and I can whole-heartedly recommend them. The switch from Github to Netlify hosting took a total of around 20 minutes, including setting up HTTPS.

    As easy as I have hopefully made all of this sound, I really, really, really hope that something new and better doesn’t come out within a year. Doing this sort of infrastructure stuff is not something I particularly enjoy and I could have given my site a new look or begun developing the next version of gganimate instead of basically getting to the same page I had before. But sometimes you have to clean your room before you can play, and hopefully playtime is coming soon.


    To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    It’s a dirty job, but someone’s got to do it..

    Sun, 10/29/2017 - 01:00

    (This article was first published on HighlandR, and kindly contributed to R-bloggers)

    A tidytext analysis of Faith No More lyrics –

    Is this a midlife crisis?

    I wanted to ease myself back into text mining, specifically using the tidytext package, as I haven’t had to do any at work for well over a year.

    I’ve been thinking about some of the old bands of the 90’s, some of whom split up, and then reformed. I was interested to see how lyrics evolve over time, and to see what changes there were as the band matured.

    I’ve decided to look at Faith No More, because:

    • they recently reformed after an 18 year split
    • they had numerous line up changes, the main one being a change of vocalist shortly before the release of their 3rd (and most successful) album, “The Real Thing”.
    • Said vocalist, Mike Patton, was only 21 when he joined. I thought it might be interesting to see how his lyrics evolved over time
    • Their 4th album “Angel Dust” has been named the most influential rock album of all time.

    Let’s get one thing out of the way – there will be some review of the band’s history as well as being an R article but I promise – there will be no wordclouds.

    To start – let’s look at the number of tracks being analysed by album. I’ve excluded “We Care A Lot” from “Introduce Yourself”, as it’s the same as the title track from their first album and would skew the analysis. I’ve also excluded cover versions and bonus tracks, and a track from Sol Invictus, because the title is a very naughty word indeed.

    Most of the examples here follow the code available on the most excellent http://tidytextmining.com/. Big thanks to the package authors, Julia Silge and David Robinson for this package. I had done some text mining prior to the release of this package, but as someone used to dataframes, I soon found my progress limited to following a small number of tutorials and not knowing how to manipulate term document matrices etc to conduct further analysis.
    Tidytext is a game-changer because the data remains in dataframe/tibble format, and we can use dplyr and ggplot2 to wrangle and visualise.

    I had originally hoped this whole exercise would give me some experience with web-scraping – but the site I chose only allowed the use of an API – and while there was a Python one, there was no R alternative. Regardless – I was able to collate the data I needed.

    The original dataframe consisted of Title, Album and a lyrics column, and this was transformed into a format suitable for text analysis:

    text_df <- data %>%
      unnest_tokens(word, lyrics) %>%
      filter(word %notin% c("b","e","a","g","r","s","i","v","la")) %>%
      mutate(linenumber = row_number())
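    Note that %notin% is not a base R operator; the author presumably defines it elsewhere (for example in a helper script). A common one-line definition, shown here as an assumption, is:

    # Assumed helper: the negation of %in%
    `%notin%` <- Negate(`%in%`)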

    One song (“Be Aggressive”) has a cheerleader style chorus where the title is spelled out letter by letter – so, I excluded these, along with “la”, which was also featuring highly in my initial plot of common words. Along with filtering out common “stop words”, I arrived at the following plot:

    This looks pretty boring, so let’s at least split it by album:

    We Care A Lot

    Unsurprisingly, “We Care A Lot” dominates the first album, given that it is repeated endlessly throughout the title track. Live favourite “As The Worm Turns” doesn’t feature though.

    Introduce Yourself

    Looking at “Introduce Yourself”, I’m surprised the title track doesn’t feature more heavily . Of the top words shown here, I recognise a couple from “Anne’s Song”. None from my under the radar favourites “Faster Disco” and “Chinese Arithmetic” though.

    The Real Thing

    Now on to the golden age of the band – Mike Patton unleashed his 6-octave vocal range and guitarist Jim Martin was still alternately shredding and providing short, clever solos. It took me a while to figure out why “yeah” was so high, then I remembered the chorus of “Epic”. Never heard Faith No More? Start with this one. Also featuring here are words from “Underwater Love”, “Surprise! You’re Dead!” and personal favourite “From Out of Nowhere”.

    Angel Dust

    “Angel Dust” is dominated from words in just 2 songs – the aforementioned “Be Aggressive” and “Midlife Crisis”.. Other popular tracks from this album were “A Small Victory”, and their hit cover of the Commodore’s “Easy”.

    King For A Day…Fool For A Lifetime

    “King For A Day…Fool For A Lifetime” marks the start of the post-Jim Martin era and stands out in this plot in that it has a number of different tracks with frequently mentioned words. This album saw the band alternate many musical styles – sometimes in the same track – an example being “Just A Man”. This was one of the album’s highlights (see also – “Evidence”) but there was some filler on there too unfortunately.

    Album Of The Year

    The ironically titled “Album of the Year” reverts to being dominated by a few tracks, including single “Last Cup of Sorrow”. Other highlights were “Ashes To Ashes” and “Stripsearch”.

    Shortly after the release of this album, and after years of in-fighting, the band called it a day. They didn’t match the success of “The Real Thing” in the US, although the rest of the world was more switched on and “Angel Dust” actually outsold it in some areas.

    Even towards the end, they were popular enough in the UK to appear on “TFI Friday” (popular 90’s live entertainment show, hosted by then top presenter Chris Evans) 2 weeks running. While Britpop was a thing, FNM played “Ashes to Ashes” and turned in one of the finest live performances on UK TV (according to the internet).

    And, for a long time (18 years) it looked like that was that. An avant-garde rock band consigned to history. But, following several years of occasional live shows, the band returned in their final lineup and released “Sol Invictus”. I have to confess I have not listened to this album, apart from a few tracks. “Sunny Side Up” being the stand-out for me so far.

    Enough history, let’s look at some more plots and R code.

    song_words <- data %>%
      unnest_tokens(word, lyrics) %>%
      count(Title, word, sort = TRUE) %>%
      filter(n > 30) %>%
      ungroup()

    ggplot(song_words, aes(reorder(word, n), n, fill=Title)) +
      geom_col() +
      xlab(NULL) +
      ylab(NULL) +
      coord_flip() +
      theme_ipsum() +
      scale_fill_ipsum() +
      #scale_fill_viridis(option ="C", discrete = TRUE) +
      ggtitle("Most Common Words - All Faith No More Studio Songs",
              subtitle = "Frequency >30. Including common stop words") +
      theme(legend.position = "bottom")

    ## now remove stop words and the letters from
    ## "Be Aggressive"
    song_words_filtered <- data %>%
      unnest_tokens(word, lyrics) %>%
      anti_join(stop_words) %>%
      filter(stringr::str_detect(word, "[a-z`]$"),
             !word %in% stop_words$word) %>%
      filter(word %notin% c("b","e","a","g","r","s","i","v","la")) %>%
      count(Title, word, sort = TRUE) %>%
      filter(n > 20) %>%
      ungroup()

    ggplot(song_words_filtered, aes(reorder(word, n), n, fill=Title)) +
      geom_col() +
      xlab(NULL) +
      ylab(NULL) +
      coord_flip() +
      theme_ipsum() +
      scale_fill_ipsum() +
      ggtitle("Most Common Words - All Faith No More Studio Songs",
              subtitle = "Frequency >30. Excluding common stop words") +
      theme(legend.position = "bottom")

    Term frequency by album – the number of times a word is used in the lyrics divided by the total number of words in the lyrics. A lot of words occur rarely and very few occur frequently:

    Inverse document frequency (and tf-idf) – an attempt to identify words that don’t occur frequently but are important.
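    The code behind these two plots is not shown in the post; a minimal sketch of how tf and tf-idf are typically computed with tidytext's bind_tf_idf(), assuming the same data frame of Title, Album and lyrics used above, would be:

    library(dplyr)
    library(tidytext)

    # Sketch only: word counts per album, with tf, idf and tf-idf attached
    album_words <- data %>%
      unnest_tokens(word, lyrics) %>%
      count(Album, word, sort = TRUE) %>%
      bind_tf_idf(word, Album, n)

    album_words %>% arrange(desc(tf_idf))

    A data frame like this (Album, word, n) also appears to be what the correlation code near the end of the post expects.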

    Now we’ve looked at common and important words, let’s see if we can get a handle on the sentiment of those words.
    Admittedly, this first plot is a bit of a gimmick, but rather than use geom_col, or geom_point, I thought I’d use the ggimage package, and mark the overall album sentiment using the relevant album covers.

    Here’s the code:

    # overall album sentiment
    albumsentiment <- text_df %>%
      inner_join(get_sentiments("bing")) %>%
      count(Album, sentiment) %>%
      spread(sentiment, n, fill = 0) %>%
      mutate(sentiment = positive - negative)

    # create vector of images
    albumsentiment$img <- c("2017-10-22-WCAL.jpg","2017-10-22-IY.jpg",
                            "2017-10-22-TRT.jpg","2017-10-22-AD.jpg",
                            "2017-10-22-KFAD.jpg","2017-10-22-AOTY.jpg",
                            "2017-10-22-SI.jpg")

    ggplot(albumsentiment, aes(Album, sentiment, fill = Album)) +
      #geom_col(show.legend = FALSE) +
      theme_ipsum() +
      ggExtra::removeGrid() +
      coord_flip() +
      geom_image(aes(image=img), size=.15) +
      ggtitle("Overall Sentiment Score by Album",
              subtitle = "Faith No More Studio Albums - Excludes cover versions") +
      labs(x = NULL, y = NULL) +
      geom_text(aes(label=sentiment), hjust=0, nudge_y=7) +
      theme(axis.text.x=element_blank(),
            axis.ticks.x=element_blank())

    So only “Introduce Yourself” managed to get a positive overall sentiment score. I’m surprised that “Album of the Year” scores as highly as it does, given the band were on the verge of splitting up. However – this is simply a case of summing positive & negative scores by individual word and finding the overall difference – so it’s definitely not an exact science. Songs like “Collision” and “Helpless” do not convey positive emotion to me.

    Here is a more straightforward plot of sentiment by track:

    Topic Models

    I’m going to take the easy way out and refer you to the tidytext website to explain topic models. I have to be honest, I don’t really see any themes that make any sense to me here. So, I thought I should look at lyrics by individual vocalist:

    Here I take the first 5 topics by both to see if there is anything in common:

    After a bit of digging around, it turned out only 2 terms existed in both sets:

    Hmm, “Love” and “World”. Seems a bit “touchy feely”, as they say. Perhaps I should look at this by album and see what topics appear?

    I’d expected perhaps that words belonging to the same song would get grouped together, but that doesn’t appear to happen very often.
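    For readers who want to try this step themselves, here is a minimal sketch of fitting a topic model from the word counts built earlier (this is not the post’s exact code; it assumes the song_words_filtered data frame from above, and k = 5 is an arbitrary choice):

    library(topicmodels)

    song_dtm <- song_words_filtered %>%
      cast_dtm(Title, word, n)          # one document per song

    song_lda <- LDA(song_dtm, k = 5, control = list(seed = 1234))

    # top ten terms per topic
    tidy(song_lda, matrix = "beta") %>%
      group_by(topic) %>%
      top_n(10, beta) %>%
      arrange(topic, desc(beta))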

    I also looked at splitting the lyrics into trigrams (sequences of 3 words, e.g. “We Care A”, “Care A Lot”, “A Lot We”, “Lot We Care” etc.). By far the most common trigram was “Separation Anxiety Hey” from “Separation Anxiety”.
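    A quick sketch of how trigrams like these can be produced with tidytext (assuming the same data frame and lyrics column as above; not the post’s exact code):

    trigrams <- data %>%
      unnest_tokens(trigram, lyrics, token = "ngrams", n = 3)

    trigrams %>%
      count(trigram, sort = TRUE)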

    I also looked at trigrams where “Not” was the first of the 3 words, to see whether the words that followed were of positive or negative sentiment – first the second word of the three, then the final word.

    This can be expanded to look at other types of negating words:
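    A hedged sketch of that negation analysis, reusing the trigrams built above and the Bing lexicon already used for the album scores (the vector of negating words here is illustrative):

    library(tidyr)

    negation_words <- c("not", "no", "never", "without")

    trigrams %>%
      separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
      filter(word1 %in% negation_words) %>%
      inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%   # swap in word3 for the final word
      count(word1, word2, sentiment, sort = TRUE)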

    I also tried producing a network plot of the trigrams. Bit of a mess to be honest, but here it is:

    Finally – a correlation plot looking at how closely linked each album is to the others, in terms of the words used:

    library(widyr)

    album_word_cors <- album_words %>%
      pairwise_cor(Album, word, n, sort = TRUE)

    album_word_cors

    library(ggraph)
    library(igraph)

    set.seed(1234)

    album_word_cors %>%
      filter(correlation > .05) %>%
      graph_from_data_frame() %>%
      ggraph(layout = "fr") +
      geom_edge_link(aes(alpha = correlation, width = correlation)) +
      geom_node_point(size = 6, color = "lightblue") +
      geom_node_text(aes(label = name), repel = TRUE) +
      theme_void()

    ggsave("2017-10-22-album-word-correlations.png", width = 8, height = 5)

    This is more interesting – although the correlations are pretty weak overall. The Real Thing, Angel Dust and King For A Day have the strongest correlations with each other.
    The correlation also fades over time: the lyrics used in The Real Thing have relatively little in common with Album of the Year.

    There appears to be little (or no) correlation between Album of the Year and Sol Invictus – that 18-year gap really made a difference. But what is interesting is the thick line between Sol Invictus and Introduce Yourself – 28 years apart, and from two different vocalists.

    However, the sting in the tail is that Mike Patton has said that he chose words more for their sound, and to fit the music, than for their actual meaning. That would explain the lyrics of “Midlife Crisis” and “A Small Victory”, but it seems to short-change the cleverness of the lyrics in tracks like “Everything’s Ruined” and “Kindergarten”. Maybe this means this has been a massive waste of time? Regardless, I have enjoyed looking back at some great music from this wonderfully unique band.

    Code for the plots, and the plots themselves, is available here



    R Meetup with Hadley Wickham

    Sat, 10/28/2017 - 21:57

    (This article was first published on R – Data Science Los Angeles, and kindly contributed to R-bloggers)

    In our next meetup, we’ll have Hadley (I think no introduction needed).

    Packages made easy
    by Hadley Wickham

    In this talk I’ll show you how easy it is to make an R package by live coding one in front of you! You’ll learn about three useful packages:
    * usethis: automating the setup of key package components
    * devtools: for automating many parts of the package development workflow
    * testthat: to write unit tests that make you confident your code is correct

    Timeline:

    – 6:30pm arrival, drinks and networking
    – 7:00pm talk

    You must have a confirmed RSVP, and please arrive by 6:55pm at the latest. Please RSVP here on Eventbrite.

    Venue: Edmunds, 2401 Colorado Avenue (this is Edmunds’ NEW LOCATION, don’t go to the old one)
    Park underneath the building (Colorado Center), Edmunds will validate.



    Spring Budget 2017: Circle Visualisation

    Fri, 10/27/2017 - 23:49

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    It’s time to branch out into a new area of data visualisation: proportional area plots. These plots use area to show the proportions between different related values. A common type of proportional area plot is the tree map. We are going to use the same principle, but with circles.

    A common subject for area visualisation is budgets and money, which will be the subject of this post.

    We are going to take a look at the Spring Budget 2017 in the UK.

    Page four of the document has two pie charts detailing how public money will be spent this year and where it will come from.

    I couldn’t find this data in any tables, so we will have to create our data frames manually:

    spending <- data.frame(c("Social protection", "Health", "Education", "Other", "Defence",
                             "Debt interest", "Transport", "Housing\n & environment",
                             "Public order", "Social\n services", "Industry"),
                           c(245, 149, 102, 50, 48, 46, 37, 36, 34, 32, 23),
                           c(802, 802, 802, 802, 802, 802, 802, 802, 802, 802, 802))
    names(spending) <- c("Expenditure", "Spending", "Total")

    income <- data.frame(c("Income tax", "VAT", "National Insurance", "Other taxes", "Borrowing",
                           "Other (non taxes)", "Corporation Tax", "Excise duties",
                           "Council tax", "Business\n rates"),
                         c(175, 143, 130, 80, 58, 54, 52, 48, 32, 30),
                         c(802, 802, 802, 802, 802, 802, 802, 802, 802, 802))
    names(income) <- c("Source", "Income", "Total")

    I am including ‘borrowing’ in the Government’s income section, although it will be the only circle that won’t be a tax.

    To draw our circles we will be using the packcircles package.

    library("packcircles") library("ggplot2") options(scipen = 999) library(extrafont) library(extrafontdb) loadfonts() fonts() t <- theme_bw() + theme(panel.grid = element_blank(), axis.text=element_blank(), axis.ticks=element_blank(), axis.title=element_blank())

    The first thing to do is to use the functions within the package to generate the radii and coordinates, as per this code here.

    # circle areas
    areas  <- income$Income
    areas2 <- spending$Spending

    # income
    packing1 <- circleProgressiveLayout(areas)
    dat1 <- circleLayoutVertices(packing1)

    # spending
    packing2 <- circleProgressiveLayout(areas2)
    dat2 <- circleLayoutVertices(packing2)

    Next up we are going to move to the visualisation. It would be more informative if we could display income and expenditure side-by-side.

    Like last time when we compared Donald Trump and Hillary Clinton on Facebook, we can use facetting for this purpose in ggplot2.

    I adapted this code for this purpose.

    dat <- rbind(
      cbind(dat1, set = 1),
      cbind(dat2, set = 2)
    )

    p <- ggplot(data = dat, aes(x, y)) +
      geom_polygon(aes(group = id, fill = -set),
                   colour = "black", show.legend = FALSE) +
      theme(text = element_text(family = "Browallia New", size = 50)) +
      coord_equal() +
      ggtitle("Income and expenditure for the UK Government") +
      scale_colour_manual(values = c("white", "White")) +
      facet_grid(~set, labeller = as_labeller(
        c('1' = "Income", '2' = "Expenditure")))

    With no labels:

    This is looking good, but obviously we have no way of knowing which circle corresponds to which area of income or expenditure. In ggplot2 you can use geom_text() to add annotations. In order for them to be positioned correctly, we need to build a relationship between the labels and the circles. We can do this by creating centroids from the coordinates we already have for the circles.

    If we have accurate centroid coordinates then our labels will always appear in the centre of our circles.

    Going back to our dat data frame:

    > str(dat)
    'data.frame':   546 obs. of  4 variables:
     $ x  : num  0 -0.277 -1.092 -2.393 -4.099 ...
     $ y  : num  0 2.2 4.25 6.05 7.46 ...
     $ id : int  1 1 1 1 1 1 1 1 1 1 ...
     $ set: num  1 1 1 1 1 1 1 1 1 1 ...

    Our ID column shows which circle it belongs to, and our set column shows whether it belongs in income or expenditure. From this we can deduce that each circle is plotted using 26 pairs of x/y coordinates (at row 27 the ID changes from one to two).

    The mean of each 26 pairs is the centre of the corresponding circle.

    Our dat data frame is 546 rows long. We need a sequence of numbers, 26 apart, going up to 546:

    sequence <- data.frame(c(x = (seq(from = 1, to = 546, by = 26))),
                           y = seq(from = 26, to = 547, by = 26))
    names(sequence) <- c("x", "y")   # give the columns simple names
    head(sequence)

         x   y
    x1   1  26
    x2  27  52
    x3  53  78
    x4  79 104
    x5 105 130
    x6 131 156

    Next up we will create a data frame that will contain our centroids:

    centroid_df <- data.frame(seq(from = 1, to = 521, by = 1))
    names(centroid_df) <- c("number")

    There are probably a few different ways to run the next section, and likely more elegantly than what we’re about to do.

    This for loop calculates the mean of every window of 26 consecutive coordinates and stores the results in the new centroid_df.

    for (i in 1:521) {
      j = i + 25
      centroid_df$x[i] <- mean(dat$x[i:j])
      centroid_df$y[i] <- mean(dat$y[i:j])
    }

    Now we bring in our sequence data frame to filter out the ones we don’t need in a new coords data frame, leaving just the ones that correspond correctly to the circles:

    coords <- merge(sequence, centroid_df, by.x = "x", by.y = "number")
    names(coords) <- c("number1", "number2", "x", "y")
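    As hinted above, there are more elegant routes to the centroids; one sketch (not from the original post) collapses the sequence, loop and merge steps into a single grouped summary with dplyr:

    library(dplyr)

    # the centroid of each circle is just the mean of its 26 polygon vertices
    coords_alt <- dat %>%
      group_by(set, id) %>%
      summarise(x = mean(x), y = mean(y)) %>%
      ungroup()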

    Next up, for labelling purposes, we clarify which set each pair of coordinates belongs to:

    coords$set[1:21] <- 1
    coords$set[11:21] <- 2   # this overrides some of the values set in the line above

    Finally we will paste in the names and amounts from our original income and spending data frames into a new label column:

    coords$label <- NA
    coords$label[1:21]  <- paste(income$Source, "\n£", income$Income, "bn")
    coords$label[11:21] <- paste(spending$Expenditure, "\n£", spending$Spending, "bn")
    coords$label <- gsub("£ ", "£", coords$label)
    coords$label <- gsub(" bn", "bn", coords$label)

    Now let’s add in the labels:

    p <- p +
      geom_text(data = coords,
                aes(x = x, y = y, label = label, colour = factor(set)),
                show.legend = FALSE)
    p

    Gives this plot:

    There we have it, some nicely formatted labels showing what each circle refers to, and how much money it represents.

    Brief analysis

    On the left hand side we see that income tax is the biggest cash cow for the Government. Incidentally, about 28 per cent of that is paid for by the top 1% of earners:

    The top 5% are liable for almost half the UK's total income tax.

    Not enough for Jeremy Corbyn, so what would he like it to be? pic.twitter.com/poeC2N6Sn9

    — Rob Grant (@robgrantuk) June 21, 2017

    Social protection means benefits, essentially. A huge chunk of that is the State Pension, followed by various working-age benefits and tax credits.

    Notice that as a country we spend almost as much paying off the interest on our debts as we do on our own defence.

    Health and education – those two ‘classic’ public services – cost a combined £250bn.

    Conclusion

    I am a fan of using these circles to visualise this kind of data, especially money.

    If you have lots of networked data (for instance, if you had breakdowns of how exactly the social protection budget will be spent) then ggraph would be a good option for you to produce plots like this:

    Comment below if you have questions!



      Stan Roundup, 27 October 2017

      Fri, 10/27/2017 - 21:00

      (This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

      I missed two weeks and haven’t had time to create a dedicated blog for Stan yet, so we’re still here. This is only the update for this week. From now on, I’m going to try to concentrate on things that are done, not just in progress, so you can get a better feel for the pace of things getting done.

      Not one, but two new devs!

      This is my favorite news to post, hence the exclamation.

      • Matthijs Vákár from University of Oxford joined the dev team. Matthijs’s first major commit is a set of GLM functions for negative binomial with log link (2–6 times speedup), normal linear regression with identity link (4–5 times), Poisson with log link (factor of 7) and bernoulli with logit link (9 times). Wow! And he didn’t just implement the straight-line case—this is a fully vectorized implementation as a density, so we’ll be able to use them this way:

        int y[N];          // observations
        matrix[N, K] x;    // "data" matrix
        vector[K] beta;    // slope coefficients
        real alpha;        // intercept coefficient

        y ~ bernoulli_logit_glm(x, beta, alpha);

        These stand in for what is now written as

        y ~ bernoulli_logit(x * beta + alpha);

        and before that was written

        y ~ bernoulli(inv_logit(x * beta + alpha));

        Matthijs also successfully defended his Ph.D. thesis—welcome to the union, Dr. Vákár.

      • Andrew Johnson from Curtin University also joined. In his first bold move, he literally refactored the entire math test suite to bring it up to cpplint standard. He’s also been patching doc and other issues.

      Visitors

      • Kentaro Matsura, author of Bayesian Statistical Modeling Using Stan and R (in Japanese) visited and we talked about what he’s been working on and how we’ll handle the syntax for tuples in Stan.

      • Shuonan Chen visited the Stan meeting, then met with Michael (and me a little bit) to talk bioinformatics—specifically about single-cell PCR data and modeling covariance due to pathways. She had a well-annotated copy of Kentaro’s book!

      Other news

      • Bill Gillespie presented a Stan and Torsten tutorial at ACoP.

      • Charles Margossian had a poster at ACoP on mixed solving (analytic solutions with forcing functions); his StanCon submission on steady state solutions with the algebraic solver was accepted.

      • Krzysztof Sakrejda nailed down the last bit of the standalone function compilation, so we should be rid of regexp based C++ generation in RStan 2.17 (coming soon).

      • Ben Goodrich has been cleaning up a bunch of edge cases in the math lib (hard things like the Bessel functions) and also added a chol2inv() function that inverts the matrix corresponding to a Cholesky factor (naming from LAPACK under review—suggestions welcome).

      • Bob Carpenter and Mitzi Morris taught a one-day Stan class in Halifax at Dalhousie University. Lots of fun seeing Stan users show up! Mike Lawrence, of Stan tutorial YouTube fame, helped people with debugging and installs—nice to finally meet him in person.

      • Ben Bales got the metric initialization into CmdStan, so we’ll finally be able to restart (the metric used to be called the mass matrix—it’s just the inverse of a regularized estimate of global posterior covariance during warmup).

      • Michael Betancourt just returned from SACNAS (diversity in STEM conference attended by thousands).

      • Michael also revised his history of MCMC paper, which has been conditionally accepted for publication. Read it on arXiv first.

      • Aki Vehtari was awarded a two-year postdoc for a joint project working on Stan algorithms and models, jointly supervised with Andrew Gelman and split between Helsinki and New York. Sounds like fun!

      • Breck Baldwin and Sean Talts headed out to Austin for the NumFOCUS summit, where they spent two intensive days talking largely about project governance and sustainability.

      • Imad Ali is leaving Columbia to work for the NBA league office (he’ll still be in NYC) as a statistical analyst! That’s one way to get access to the data!

      The post Stan Roundup, 27 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.



      Projects chapter added to “Empirical software engineering using R”

      Fri, 10/27/2017 - 20:06

      (This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

      The Projects chapter of my Empirical software engineering book has been added to the draft pdf (download here).

      This material turned out to be harder to bring together than I had expected.

      Building software projects is a bit like making sausages in that you don’t want to know the details, or in this case those involved are not overly keen to reveal the data.

      There are lots of papers on requirements, but remarkably little data (Soo Ling Lim’s work being the main exception).

      There are lots of papers on effort prediction, but they tend to rehash the same data and the quality of research is poor (i.e., tweaking equations to get a better fit; no explanation of why the tweaks might have any connection to reality). I had not realised that Norden did all the heavy lifting on what is sometimes called the Putnam model; Putnam was essentially an evangelist. The Parr curve is a better model (sorry, no pdf), but lacked an evangelist.

      Accurate estimates are unrealistic: lots of variation between different people and development groups, the client keeps changing the requirements and developer turnover is high.

      I did turn up a few interesting data-sets and Rome came to the rescue in places.

      I have been promised more data and am optimistic some will arrive.

      As always, if you know of any interesting software engineering data, please tell me.

      I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London that holds 30 or so people plus a projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

      Reliability chapter next.



      Microsoft R Open 3.4.2 now available

      Fri, 10/27/2017 - 16:50

      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      Microsoft R Open (MRO), Microsoft's enhanced distribution of open source R, has been upgraded to version 3.4.2 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R 3.4.2 and updates the bundled packages.

      MRO is 100% compatible with all R packages. MRO 3.4.2 points to a fixed CRAN snapshot taken on October 15, 2017, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).
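      For example, a minimal usage sketch of checkpoint (the snapshot date here is arbitrary):

      library(checkpoint)
      checkpoint("2017-09-01")   # scans the project for packages and installs/loads them from the CRAN snapshot of that date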

      MRO 3.4.2 is based on R 3.4.2, a minor update to the R engine (you can see the detailed list of updates to R here). This update is backwards-compatible with R 3.4.1 (and MRO 3.4.1), so you shouldn't encounter any new issues by upgrading.

      We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

      MRAN: Download Microsoft R Open



      Substitute levels in a factor or character vector

      Fri, 10/27/2017 - 11:05

      I’ve been using the ggplot2 package a lot recently. When creating a legend or tick marks on the axes, ggplot2 uses the levels of a character or factor vector. Most of the time, I am working with coded variables that use some abbreviation of the “true” meaning (e.g. “f” for female and “m” for male, or single characters for a location: “S” for Stuttgart and “M” for Mannheim).

      In my plots, I don’t want these codes but the full name of the level. Since I am not aware of any super-fast and easy-to-use function in base R (let me know in the comments if there is one), I came up with a very simple function and put it in my .Rprofile (which means it is available whenever I start R). I called it “.replace.levels”. The dot before the name means that it is invisible and does not show up in my Global Environment overview in RStudio. You have to call it with .replace.levels(), of course.

      This is how it works:
      .replace.levels takes two arguments: vec and replace.list

      vec is the vector where substitutions shall be made.
      replace.list is a list with named elements. Each occurrence of an element’s name in vec is replaced by that element’s content.

      The function also checks that the levels in vec and the names in replace.list match up (in both directions). If not, it throws an error. It is a very simple function, but it saves me a lot of typing.

      Example:

      a <- c("B", "C", "F")
      .replace.levels(a, list("B" = "Braunschweig", "C" = "Chemnitz", "F" = "Frankfurt"))
      [1] "Braunschweig" "Chemnitz"     "Frankfurt"

      Here it is:

      .replace.levels <- function (vec, replace.list) {
        # Checking if all levels to be replaced are in vec (and other way around)
        cur.levels <- unique(as.character(vec))
        not.in.cur.levels <- setdiff(cur.levels, names(replace.list))
        not.in.new.levels <- setdiff(names(replace.list), cur.levels)
        if (length(not.in.cur.levels) != 0 | length(not.in.new.levels) != 0) {
          stop("The following elements do not match: ", paste0(not.in.cur.levels, not.in.new.levels, collapse = ", "))
        }
        for (el in 1:length(replace.list)) {
          vec <- gsub(names(replace.list[el]), replace.list[[el]], vec)
        }
        vec
      }
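      As an aside (not part of the original post), one base R alternative for the simple exact-match case is a named-vector lookup:

      lookup <- c("B" = "Braunschweig", "C" = "Chemnitz", "F" = "Frankfurt")
      unname(lookup[as.character(a)])
      [1] "Braunschweig" "Chemnitz"     "Frankfurt"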


      Gold-Mining – Week 8 (2017)

      Thu, 10/26/2017 - 22:17

      (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

      Week 8 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

      The post Gold-Mining – Week 8 (2017) appeared first on Fantasy Football Analytics.



      Estimating mean variance and mean absolute bias of a regression tree by bootstrapping using foreach and rpart packages

      Thu, 10/26/2017 - 20:40

      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      by Błażej Moska, computer science student and data science intern 

      One of the most important things in predictive modelling is how our algorithm copes with various datasets, both training and testing (previously unseen). This is closely connected with the concept of the bias-variance tradeoff.

      Roughly speaking, the variance of an estimator describes how the estimator’s value varies from dataset to dataset. It is defined as follows:

      \[ \textrm{Var}[ \widehat{f} (x) ]=E[(\widehat{f} (x)-E[\widehat{f} (x)])^{2} ] \]

      \[ \textrm{Var}[\widehat{f}(x)]=E[\widehat{f}(x)^{2}]-E[\widehat{f}(x)]^{2} \]

      Bias is defined as follows:

      \[ \textrm{Bias}[ \widehat{f} (x)]=E[\widehat{f}(x)-f(x)]=E[\widehat{f}(x)]-f(x) \]

      One could think of bias as a measure of the estimator’s ability to approximate the target function. Typically, reducing bias results in increased variance and vice versa.

      \(E[X]\) is the expected value; it can be estimated using the mean, since the mean is an unbiased estimator of the expected value.

      We can estimate variance and bias by bootstrapping the original training dataset, that is, by sampling row indexes of the original data frame with replacement, then drawing the rows that correspond to these indexes to obtain new data frames. This operation is repeated nsampl times, where nsampl is the parameter describing the number of bootstrap samples.

      Variance and bias are estimated per value, that is to say, per observation/row of the original dataset (we calculate variance and bias across the predictions made on the bootstrap samples for that row). We then obtain a vector of variances and a vector of biases, each of the same length as the number of observations in the original dataset. For the purpose of this article, a mean was calculated for each of these two vectors. We will treat these two means as our estimates of mean bias and mean variance. If we don't want to measure the direction of the bias, we can take absolute values of the bias.

      Because bias and variance can be controlled by parameters passed to the rpart function, we can also survey how these parameters affect tree variance. The most commonly used parameters are cp (the complexity parameter), which describes how much each split must decrease the overall lack of fit (for regression, the variance of the response) in order to be attempted, and minsplit, which defines the minimum number of observations needed in a node to attempt a split.

      The operations mentioned above are rather expensive computationally: we need to create nsampl bootstrap samples, grow nsampl trees, calculate nsampl predictions, nrow variances and nrow biases, and repeat those operations for each parameter value (the length of the cp or minsplit vector). For that reason the foreach package was used, to take advantage of parallelism. The procedure still can't be considered fast, but it was much faster than without the foreach package.

      So, summing up, the procedure looks as follows:

      1. Create bootstrap samples (by bootstrapping original dataset)
      2. Train model on each of these bootstrap datasets
      3. Calculate the mean of these trees' predictions (for each observation) and compare these predictions with the values of the original dataset (in other words, calculate the bias for each row)
      4. Calculate variance of predictions for each row (estimate variance of an estimator-regression tree)
      5. Calculate mean bias/absolute bias and mean variance 
      R Code
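      A minimal sketch of the procedure described above, using rpart and foreach with a doParallel backend (the function and argument names here are illustrative, not the author's original code):

      library(rpart)
      library(foreach)
      library(doParallel)

      estimate_bias_variance <- function(df, formula, nsampl = 100, cp = 0.01, minsplit = 20) {
        cl <- makeCluster(parallel::detectCores() - 1)
        registerDoParallel(cl)
        on.exit(stopCluster(cl))

        # one column of predictions (on the original rows) per bootstrap sample
        preds <- foreach(b = 1:nsampl, .combine = cbind, .packages = "rpart") %dopar% {
          idx <- sample(nrow(df), replace = TRUE)            # bootstrap indexes
          fit <- rpart(formula, data = df[idx, ],
                       control = rpart.control(cp = cp, minsplit = minsplit))
          predict(fit, newdata = df)
        }

        y <- df[[all.vars(formula)[1]]]                      # the response variable
        row_bias <- rowMeans(preds) - y                      # bias per observation
        row_var  <- apply(preds, 1, var)                     # variance per observation

        c(mean_abs_bias = mean(abs(row_bias)),
          mean_variance = mean(row_var))
      }

      # e.g. sweep over a grid of cp values:
      # sapply(c(0.001, 0.01, 0.1),
      #        function(cp) estimate_bias_variance(mtcars, mpg ~ ., nsampl = 200, cp = cp))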



      Analysing Cryptocurrency Market in R

      Thu, 10/26/2017 - 15:01

      (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

      The cryptocurrency market has been growing so rapidly that, being an analyst, I was intrigued by what it comprises. In this post, I'll explain how we can analyse the cryptocurrency market in R with the help of the coinmarketcapr package, which is an R wrapper around the coinmarketcap API.

      To get started, let us load the library into our R session and plot the top 5 cryptocurrencies.

      library(coinmarketcapr)

      plot_top_5_currencies()

      Gives this plot:

      The above plot clearly shows how Bitcoin is leading the market, but it does not give us a picture of how the market share is split among the various cryptocurrencies, so let us get the complete data for them.

      market_today <- get_marketcap_ticker_all()
      head(market_today[, 1:8])

                  id         name symbol rank price_usd  price_btc X24h_volume_usd market_cap_usd
      1      bitcoin      Bitcoin    BTC    1   5568.99        1.0    2040540000.0  92700221345.0
      2     ethereum     Ethereum    ETH    2   297.408  0.0537022     372802000.0  28347433482.0
      3       ripple       Ripple    XRP    3  0.204698 0.00003696     100183000.0   7887328954.0
      4 bitcoin-cash Bitcoin Cash    BCH    4   329.862  0.0595624     156369000.0   5512868154.0
      5     litecoin     Litecoin    LTC    5    55.431   0.010009     124636000.0   2967255097.0
      6         dash         Dash   DASH    6   287.488  0.0519109      46342600.0   2197137527.0

      Having extracted the complete data for the various cryptocurrencies, let us try to visualize the market share split with a treemap. For plotting, let us extract only the two columns id and market_cap_usd, convert market_cap_usd to numeric, and do a little number formatting for the treemap labels.

      library(treemap)

      df1 <- na.omit(market_today[, c('id', 'market_cap_usd')])
      df1$market_cap_usd <- as.numeric(df1$market_cap_usd)
      df1$formatted_market_cap <- paste0(df1$id, '\n', '$',
                                         format(df1$market_cap_usd, big.mark = ',', scientific = F, trim = T))

      treemap(df1,
              index = 'formatted_market_cap',
              vSize = 'market_cap_usd',
              title = 'Cryptocurrency Market Cap',
              fontsize.labels = c(12, 8),
              palette = 'RdYlGn')

      Gives this plot:

      The above visualization shows that the whole cryptocurrency market is propped up primarily by two currencies, Bitcoin and Ethereum, and even the second-ranked Ethereum is far behind Bitcoin, which is the driving factor of this market. But it is also fascinating (and shocking at the same time) that Bitcoin and Ethereum together create a 100+ billion dollar (USD) market. Whether this is a sign of a bubble or not, we'll leave for market analysts to speculate; but as data scientists and analysts, we have a lot of insights to extract from the above data, and it should be interesting to analyse such an expensive market.
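      As a quick sanity check on that combined figure (a sketch, not in the original post; the numbers will differ as the market moves):

      top2 <- df1$id %in% c("bitcoin", "ethereum")
      sum(df1$market_cap_usd[top2]) / sum(df1$market_cap_usd)   # share of total market cap held by the top two coins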


