R-bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 4 hours 47 min ago

Making a Shiny dashboard using ‘highcharter’ – Analyzing Inflation Rates

Mon, 10/30/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Shiny is an R package that lets R developers and users build interactive web apps entirely in R. It lets R users analyze and visualize data and deploy machine learning models directly as web applications. You can host standalone apps on a webpage, embed them in R Markdown documents, or build dashboards and forecasting applications, and you can extend Shiny apps with CSS themes, htmlwidgets, and JavaScript actions. Both the client-side front end and the server-side logic are written in R. More details on this package can be found here.
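To give a feel for how little code a working app needs, here is a minimal, self-contained sketch (the input and output names are purely illustrative):

library(shiny)

# UI: one input control and one output placeholder
ui <- fluidPage(
  sliderInput("n", "Number of observations", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: re-draws the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Sample of random normal values")
  })
}

shinyApp(ui = ui, server = server)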

I recently learned Shiny and started developing a web application with it, and since then I have been using it in nearly every data science and analytics project. The syntax is easy to understand, and there are plenty of good articles and documentation available to help you learn it. I had a background in developing full-stack web applications with HTML, CSS, JavaScript and other JS-based scripting languages, so I found the syntax particularly approachable.

This article is meant for people who have at least a basic understanding of how to use Shiny in R, or some experience developing web apps. That said, an intermediate or even beginner R user can pick up Shiny and its syntax fairly quickly and start developing web apps.

In this article I am going to demonstrate how to make a dashboard using the shinydashboard package. shinydashboard builds on Shiny and is specifically designed for building and deploying dashboards in R. More details on this package can be found here.
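As a rough orientation before the full dashboard code later in the article, every shinydashboard UI is a dashboardPage() built from a header, a sidebar and a body; a bare-bones sketch:

library(shiny)
library(shinydashboard)

# Skeleton of a shinydashboard app: header, sidebar and body
ui <- dashboardPage(
  dashboardHeader(title = "Demo dashboard"),
  dashboardSidebar(
    sidebarMenu(menuItem("Home", tabName = "home", icon = icon("dashboard")))
  ),
  dashboardBody(
    tabItems(tabItem(tabName = "home", h2("Hello from shinydashboard")))
  )
)

server <- function(input, output) {}

shinyApp(ui, server)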

Secondly, I am going to demonstrate how to use the highcharter package, an R wrapper for the Highcharts JavaScript visualization library, to build interactive and good-looking plots. Its syntax is similar to the qplot() function from the ggplot2 package, so if you have experience with ggplot2 you will find it easy to pick up. Details about highcharter can be found here.

Analyzing Inflation Rates

In this article I am going to demonstrate how to use shinydashboard and highcharter package in R to analyze and visualize Inflation rates of major economies and other economic and regional trade unions.

What are inflation rates

The inflation rate is the general rate at which the prices of goods and services within an economy are rising, and consequently the rate at which the purchasing power of the currency is declining. High inflation is generally bad for an economy because it erodes the value of money. Central banks therefore typically work towards keeping inflation low while avoiding deflation.

Deflation is the opposite of inflation and occurs when the inflation rate turns negative. Deflation is more harmful for an economy because it means that the prices of goods and services are falling. That may sound great for consumers, but what it actually reflects is that demand for goods and services has declined over a long period, which is a strong signal that a recession is on its way, bringing job losses, declining wages and a big hit to stock portfolios. Deflation also slows an economy's growth: as prices fall, people defer (postpone) purchases in the hope of a better deal, so companies have to cut the prices of their goods and products, which in turn forces them to lower employees' wages.

I won't explain these economic terms in much more depth here. I encourage readers to read up on them further; it is quite an interesting subject.

Let's start with designing the UI in R

In Shiny you can write the UI and server-side code in a single file, but I prefer to keep the client-side and server-side code in separate files. This keeps the app modular and easy to understand, and stops things from getting messy when the code grows long.

#loading the packages
library(shinydashboard)
require(shiny)
require(highcharter)

#layout of the dashboard
#defining character vectors for select inputs
country<-c("India","United States","Mexico","Canada","China, People's Republic of",
           "Japan","Russian Federation","Germany","United Kingdom","European Union",
           "ASEAN-5","New Zealand","Australia","Netherlands","Luxembourg",
           "France","Qatar","United Arab Emirates","Saudi Arabia")

unions<-c("Major advanced economies (G7)","European Union","Emerging and Developing Europe",
          "ASEAN-5","Commonwealth of Independent States","Emerging and Developing Asia",
          "Latin America and the Caribbean",
          "Middle East, North Africa, Afghanistan, and Pakistan")

#function used to define the dashboard
dashboardPage(
  #defines header
  skin = "red",

  #header of the dashboard
  dashboardHeader(title="Inflation Rates", dropdownMenu()),

  #defines sidebar of the dashboard
  dashboardSidebar(
    sidebarMenu(
      #the sidebar menu items
      menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
      menuItem("About", tabName = "about", icon = icon("th")),
      menuItem("Trade Unions", tabName="unions", icon=icon("signal")),
      menuItem("World", tabName="world", icon=icon("globe"))
    )),

  #defines the body of the dashboard
  dashboardBody(
    #to add external CSS
    tags$head(
      tags$link(rel = "stylesheet", type = "text/css", href = "custom.css")
    ),

    tabItems(
      #first tab menu - dashboard; the first argument should be the 'tabName' value of the menuItem function
      tabItem(tabName = "dashboard",
        fluidRow(
          column(12,
            #box() is similar to a 'div' element in HTML
            box(selectInput("country", label="Select Country", choices=country),
                width = 12) # end box
          ), #end column

          #box for plotting the time series plot
          column(12,
            box(
              #defines a highcharter output plot which will be built on the server side
              highchartOutput("hcontainer"),
              width="12") #end box2
          ), #end column

          br(), #line break
          h4("Relative inflation rates time series plot", align="center"),
          br(),
          column(12,
            box(highchartOutput("hc2"), width=12))
        ), #end row

        h4("Made with love from", strong("Anish Singh Walia")),
        a("R code for this project", target="_blank",
          href="https://github.com/anishsingh20/Analzying-Inflation-Rates-Worldwide")
      ),

      #second tab menu - About
      tabItem(tabName="about",
        h2("What is Inflation ?", style="text-align:center"),
        br(),
        br(),
        box(width=12, height="400px",
          p(style="font-size:20px", strong("Inflation"), "rates are the general rate at which price of the goods and services within a particular economy are rising and the purchasing power of the currency is declining due to the higher priced goods. High inflation is definitely not good for an economy because it will always reduce the value for money. In general central banks of an economy tries to and work towards reducing the inflation rate and avoiding deflation."),
          p(style="font-size:20px", strong("Deflation"), "is opposite of inflation. Deflation occurs when the inflation rates become negative or are below 0. Deflation is more harmful and dangerous for an economy because it means that the prices of goods and services are going to decrease. Now, this sounds amazing for consumers like us. But what actually happens is that the demand for goods and services have declined over a long term of time. This directly indicates that a recession is on its way. This brings job losses, declining wages and a big hit to the stock portfolio. Deflation slows economy's growth. As prices fall, people defer(postpone) purchases in hope of a better lower price deal. Due to this companies and firms have to cut down the cost of their goods and products which directly affects the wages of the employees which have to be lowered.")
        )
      ),

      tabItem(tabName = "unions",
        h3("Time series of Inflation rates of Economic trade unions", align="center"),
        fluidRow(
          column(12,
            box(selectInput("region", label="Select Economic Region", choices=unions), width = 12)
          ),
          box(highchartOutput("hc3"), width=12)
        ) # end row
      ),

      tabItem(tabName = "world",
        h3("World's Inflation Rates", align="center"),
        box(highchartOutput("hc4"), width=12)
      )
    ) #end tabitems
  ) #end body
) #end dashboard

In the above code you will notice that the functions used to design the UI closely mirror HTML elements: Shiny exposes them as R functions, with the HTML attributes as function arguments. It is hard to explain every detail here, so I encourage readers to check the documentation and help files of the functions above, and to look at the resources I will add at the end.

The highchartOutput(id) function takes an 'id' as an argument and defines a placeholder for a highcharter plot that will be built and rendered on the server side. Here we only design and define the UI and layout of the dashboard.
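The pairing works like any other Shiny output: highchartOutput() reserves a placeholder in the UI and renderHighchart() on the server fills it. A minimal sketch with an illustrative id and a built-in dataset:

library(shiny)
library(highcharter)

ui <- fluidPage(
  # placeholder in the UI; "demo_chart" is just an illustrative id
  highchartOutput("demo_chart")
)

server <- function(input, output) {
  # the server builds the chart that fills the placeholder
  output$demo_chart <- renderHighchart({
    hchart(mtcars, "scatter", hcaes(x = wt, y = mpg))
  })
}

shinyApp(ui, server)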

Let’s write the Server logic

All the server-side code goes in a separate file named server.R. It contains only the logic needed to reactively and dynamically build the various plots.

The inflation rates dataset is downloaded from the IMF (International Monetary Fund) website and is publicly available. I first had to do some preprocessing and transformations using the tidyr and dplyr packages to convert the dataset to the desired format.
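The reshaping is essentially a wide-to-long transformation: the IMF sheet has one column per year, and gather() stacks those year columns into year/value pairs. A toy sketch of the idea (column names and values are made up):

library(dplyr)
library(tidyr)

# Toy wide-format data: one row per region, one column per year
wide <- data.frame(region = c("India", "Japan"),
                   `1980` = c(11.3, 7.8),
                   `1981` = c(13.1, 4.9),
                   check.names = FALSE)

# Stack the year columns into a long region/year/inflation format
long <- wide %>%
  gather(key = "year", value = "inflation", -region) %>%
  mutate(year = as.integer(year))
# 'long' now has one row per region and year, with columns region, year, inflation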

require(shinydashboard)
require(ggplot2)
require(dplyr)
require(highcharter) #to plot amazing time series plots
library(readxl)
require(tidyr)

#loading the dataset
inflation <- read_excel("inflation.xls")

year<-c(1980:2022) #making a vector consisting of all years
year<-as.character(year) #converting to character type to use in gather()

#converting dataset to a long format
inf <- inflation %>% gather(year, key = "Year", value = "InflationRate")
inf <- na.omit(inf) #omitting NA values
names(inf)<-c("region","year","inflation")
inf$year<-as.integer(inf$year)

India<-filter(inf,region=="India")
India$inflation<-as.numeric(India$inflation)
India$year<-as.numeric(India$year)
China<-filter(inf,region=="China, People's Republic of")
Ger<-filter(inf,region=="Germany")
Japan<-filter(inf,region=="Japan")
US<-filter(inf,region=="United States")
EU<-filter(inf,region=="European Union")
UK<-filter(inf,region=="United Kingdom")
Fr<-filter(inf,region=="France")
uae<-filter(inf,region=="United Arab Emirates")

#server function and logic
server <- function(input, output) {

  output$hcontainer <- renderHighchart({
    #write all R code inside this
    #input$country extracts the select-input value from the UI, and a dataframe
    #is built for the selected country
    df <- inf %>% filter(region==input$country)
    df$inflation<-as.numeric(df$inflation)
    df$year<-as.numeric(df$year)

    hchart(df, "line", hcaes(x = year, y = inflation)) %>%
      hc_exporting(enabled = TRUE) %>% #to enable downloading the plot image
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>% #designing the hoverable tooltip
      hc_title(text="Time series plot of Inflation Rates",align="center") %>% #title
      hc_subtitle(text="Data Source: IMF",align="center") %>% #subtitle of the plot
      hc_add_theme(hc_theme_elementary()) #adding theme
      #to add 3-d effects
      #hc_chart(type = "column",
      #         options3d = list(enabled = TRUE, beta = 15, alpha = 15))
  }) # end hcontainer

  output$hc2 <- renderHighchart({
    highchart() %>%
      hc_xAxis(categories=inf$year) %>%
      hc_add_series(name = "India", data = India$inflation) %>%
      hc_add_series(name = "USA", data = US$inflation) %>%
      hc_add_series(name = "UK", data = UK$inflation) %>%
      hc_add_series(name = "China", data = China$inflation) %>%
      hc_add_series(name = "Germany", data = Ger$inflation) %>%
      hc_add_series(name = "Japan", data = Japan$inflation) %>%
      #to add colors
      hc_colors(c("red","blue","green","purple","darkpink","orange")) %>%
      hc_add_theme(hc_theme_elementary())
  }) # end hc2

  output$hc3 <- renderHighchart({
    union <- inf %>% filter(region==input$region)
    union$year<-as.numeric(union$year)
    union$inflation<-as.numeric(union$inflation)

    hchart(union, "line", hcaes(x = year, y = inflation)) %>%
      hc_exporting(enabled = TRUE) %>%
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>%
      hc_title(text="Time series plot of Inflation Rates for Economic Unions",align="center") %>%
      hc_subtitle(text="Data Source: IMF",align="center") %>%
      hc_add_theme(hc_theme_elementary())
  }) #end hc3

  output$hc4 <- renderHighchart({
    world <- inf %>% filter(region=="World")
    world$year<-as.numeric(world$year)
    world$inflation<-as.numeric(world$inflation)

    hchart(world, "line", hcaes(x = year, y = inflation)) %>%
      hc_exporting(enabled = TRUE) %>%
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>%
      hc_title(text="Time series plot of Inflation Rates for World",align="center") %>%
      hc_subtitle(text="Data Source: IMF",align="center") %>%
      hc_add_theme(hc_theme_elementary())
  }) #end hc4
}

In the above code we access each plot defined in the UI by its id, in the form output$id. We use the function renderHighchart({ ... }) to render the plots, and all the R code that builds an output goes inside it. As you can see, inside this function is simply the R code needed to make the plots.

The function hchart(df, type, hcaes(x, y), ...) works in the same fashion as qplot(): it takes the data, the type of plot, and the plot aesthetics as its arguments. Everything else, such as the title and subtitle, is piped onto it, much like adding layers and plotting options in ggplot2. For more details read http://jkunst.com/highcharter/hchart.html
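For instance, a quick standalone sketch outside Shiny, using a built-in dataset rather than the inflation data:

library(highcharter)

# hchart(data, type, hcaes(...)) mirrors qplot(): data, chart type, aesthetics
hchart(iris, "scatter", hcaes(x = Sepal.Length, y = Sepal.Width, group = Species)) %>%
  hc_title(text = "Sepal width vs sepal length") %>%  # extras are piped on, like ggplot2 layers
  hc_add_theme(hc_theme_elementary())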

Screenshots of the Dashboard:

Deploying a Shiny app takes just a few minutes. You need an account on shinyapps.io, which offers infrastructure as a service (IaaS) to run and manage your application, and you can deploy your app directly from the RStudio IDE.
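If you prefer scripting the deployment instead of RStudio's Publish button, the rsconnect package can do it; a minimal sketch, assuming your shinyapps.io account token is already configured and with an illustrative app name:

library(rsconnect)

# One-time account setup; the token and secret come from your shinyapps.io dashboard
# setAccountInfo(name = "<account>", token = "<token>", secret = "<secret>")

# Deploy the directory containing ui.R and server.R
deployApp(appDir = ".", appName = "inflation-dashboard")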

Conclusion

This was a short tutorial on how to build a dashboard in R using the shinydashboard and highcharter packages. The article is not meant to explain everything in detail, but rather to demonstrate how to use shinydashboard and highcharter to build dashboards and web apps. I will attach various resources below that helped me and that should help you learn to develop web apps in R.

Resources

I hope this article is motivating enough to at least get you started with learning and developing your own web apps in R and using highcharter to visualize data. For any doubts or queries, feel free to drop a comment; I will be happy to help. Do like and share the article.
Happy coding!

    Related Post

    1. Time Series Analysis in R Part 2: Time Series Transformations
    2. Time Series Analysis in R Part 1: The Time Series Object
    3. Parsing Text for Emotion Terms: Analysis & Visualization Using R
    4. Using MongoDB with R
    5. Finding Optimal Number of Clusters


    Increasing Airline Customer Satisfaction

    Mon, 10/30/2017 - 02:38

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    Motivation

    The sheer size of the airline industry provides a reason to care about it: it affects not only millions of people directly (flyers, pilots, engineers, etcetera), but also millions more indirectly by the heft of its economic presence. In a December 2016 report, the International Air Transport Association (IATA) wrote:

    “While airline industry profits are expected to have reached a cyclical peak in 2016 of $35.6 billion, a soft landing in profitable territory is expected in 2017 with a net profit of $29.8 billion. 2017 is expected to be the eighth year in a row of aggregate airline profitability, illustrating the resilience to shocks that have been built into the industry structure. On average, airlines will retain $7.54 for every passenger carried.” (reference)

    As a resident of the US, the daughter of an airline pilot, and a semi-frequent flyer, I have a personal interest in the US airline industry. In looking at carriers by region in the same article mentioned above, the IATA concluded: “The strongest financial performance is being delivered by airlines in North America. Net post-tax profits will be the highest at $18.1 billion next year […]. The net margin for the region’s carriers is also expected to be the strongest at 8.5% with an average profit of $19.58/passenger.”

    Although the North American airline industry is strong, it must be ever-vigilant about keeping up with customer demands in order to maintain its continued growth and its continued position as industry leader across regions. Of course, success in this regard requires airlines to know what customers care about in the first place. Discovering what airline customers like and dislike about their flight experiences was the starting point for this project. To view the project in an interactive Shiny App, click here.

    Data 

    To understand more precisely which aspects of a flight shape customers’ opinions, I decided to scrape the website Skytrax, which collects customer-written reviews of flights for nearly every airline in operation. A typical review appears as follows:

    To get the data from posts of this kind into an analyzable format, I wrote a python script using Selenium to define a web scraping spider, the code for which can be found here on my Github. Of the hundreds of airlines available for reviewing on Skytrax, I limited the scope of my investigation to the eleven largest US-based companies: Alaska Airlines, Allegiant Air, American Airlines, Delta Air Lines, Frontier Airlines, Hawaiian Airlines, JetBlue Airways, Southwest Airlines, Spirit Airlines, United Airlines, and Virgin America. The variables I included in the scrape were:

    • airline: Airline with which the review-writer flew
    • overall: Overall airline rating (out of 10) given by the review-writer
    • author: Name of the review-writer
    • date: Date the review was written
    • customer_review: Text of the customer review
    • aircraft: Aircraft class/type on which the review-writer flew (possibilities too numerous to list; example: Boeing 737)
    • traveller_type: Type of traveller of the review-writer (Business, Couple Leisure, Family Leisure, Solo Leisure)
    • cabin: Type of cabin/class in which the review-writer flew (Business Class, Economy Class, First Class, Premium Economy)
    • route: Origin and destination of the flight (example: Chicago to Boston)
    • date_flown: Month and year in which the flight in question took place
    • seat_comfort: Rating (out of 5) of seat comfort
    • cabin_service: Rating (out of 5) of the inflight service
    • food_bev: Rating (out of 5) of the quality of the inflight food and beverages
    • entertainment: Rating (out of 5) of the availability and connectivity of the inflight entertainment; includes movies and wifi connectivity
    • ground_service: Rating (out of 5) of the service on the ground before and/or after the flight
    • value_for_money: Rating (out of 5) of the value of the airline against the cost of a ticket
    • recommended: Does the review-writer plan to recommend the airline to others (Yes or No)

    One detail of the data worth mentioning is that the variables coming from the “table” in the review (aircraft, type of traveller, etcetera) were not required fields for the review writer. This meant that different reviews could contain different subsets of variables from that table. This affected both the engineering and analysis aspects of this project. On the engineering side, the changing tables from review to review forced me to use Selenium rather than the much faster python package Scrapy. On the analysis side, reviews that did not contain all possible table fields would have missing values (NAs) for those omitted variables, and missingness was leveraged to help answer the more focused questions that follow.

     

    Question 1

Out of seat comfort, cabin service, food and beverages, entertainment, and ground service, which aspect of a flight has the most influence on a customer's overall rating?

    This is a classic machine learning question that is easy to ask but difficult to answer, the difficulty lying in the potentially subtle interactions among the predictor variables. To attack this question in my project, I turned to random forests; this choice was motivated by the need to avoid machine learning techniques (such as linear regression) that rely on normality assumptions for the predictor variables. Given the biased nature of reviews, nearly all of the predictor variables in question here (seat comfort, cabin service, food and beverages, entertainment, ground service) were not normally distributed but rather bimodal, with peaks near ratings of 1 and 5.

    I used the randomForest() function from the R package “randomForest”,  which uses the non-parametric Breiman random forest algorithm to produce regression models. As a side perk, it estimates the importance of each predictor variable, relative to the others in question, in predicting the response variable. This output is what I used to determine which of my five variables was most important to the overall flight rating. The details of the Breiman random forest regression algorithm as well as the details for the algorithms used to determine the prediction importance can be found here. Below is a visual of the variable importance output from the randomForest() function when run on my data:
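For reference, a minimal sketch of such a call, on made-up ratings data rather than the project's actual dataset:

library(randomForest)

# Made-up ratings data standing in for the scraped Skytrax reviews
set.seed(42)
reviews <- data.frame(
  overall        = sample(1:10, 200, replace = TRUE),
  seat_comfort   = sample(1:5, 200, replace = TRUE),
  cabin_service  = sample(1:5, 200, replace = TRUE),
  food_bev       = sample(1:5, 200, replace = TRUE),
  entertainment  = sample(1:5, 200, replace = TRUE),
  ground_service = sample(1:5, 200, replace = TRUE)
)

# Regression random forest; importance = TRUE stores the variable importance measures
rf <- randomForest(overall ~ ., data = reviews, importance = TRUE, ntree = 500)

# Inspect and plot the relative importance of each predictor
importance(rf)
varImpPlot(rf)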

    One thing to note is that the above computation was done on a version of my dataset in which I had imputed “3” ratings into all the missing ratings. This was done with the idea that if someone had omitted a variable from their review, they likely felt ambivalent about it and would give a middling rating of 3/5 if pressed. However, this was an assumption that could alter my results so I did two things:

    1. Ran the same computation on three different versions of my data: one with “3” imputed for NAs; one with the variable mean imputed for NAs in that variable; one with only reviews that were entirely complete. The above result, as mentioned, is from the first version of imputed data. While the importance ratings (the numbers themselves) changed from version to version, the order of variable importance was constant throughout. This allows me to conclude that according to the Breiman algorithm, ground service was the most important variable in predicting a customer’s overall rating for a flight, followed by seat comfort, cabin service, food and beverage, and entertainment (in that order).
    2. Analyzed the proportion of reviews that included each variable. Imputing missing values inevitably skews results, but eliminating them costs us potentially valuable information. In this case, I believed missingness in various fields tended to fall into the category of “Missing Not at Random,” meaning that the cause for missingness was actually related to the value of the variable in question; in particular, I believed fields with high amounts of missingness were likely less important to customers than fields with very little missingness. To analyze this, I graphed the proportion of reviews that included each variable:

    From this we see that cabin service and seat comfort are fields that are included in nearly every review, while ground service is only included in about 55% of the reviews.

    Insight:  Cabin service and seat comfort are filled in by nearly every customer who writes a review, so these aspects of a flight are important to most customers. Since these variables rank as second and third most important in predicting overall flight score according to the Breiman random forest algorithm, we may conclude that these two fields have the most influence on a customer’s overall rating of a flight. Furthermore, while ground service appears to be the field most often omitted from reviews (and therefore possibly the least important to customers in general), overall flight rating is highly dependent on ground service for those customers who do include it. In other words, most people don’t care too much about ground service, but those who do care a lot. 

     

    Question 2

    How are US airlines performing across different aspects of customer flight experience?

    Given our results in Question 1, an airline may now want to compare itself to other airlines and to the industry as a whole across the variables of cabin service, entertainment, food and beverage, ground service, and seat comfort. To analyze this, I counted the number of reviews giving 1, 2, 3, 4, 5, and NA ratings for each variable and within each airline, as well as for the industry as a whole. For seat comfort ratings specifically, we have the following results:
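In dplyr terms this tally is a grouped count over airline and rating value; a rough sketch, assuming a reviews data frame with airline and seat_comfort columns (names illustrative):

library(dplyr)
library(tidyr)

# Count how many reviews gave each seat_comfort rating (NA included) per airline
seat_counts <- reviews %>%
  group_by(airline, seat_comfort) %>%
  summarise(n = n()) %>%
  ungroup()

# Spread into a wide airline-by-rating table for easier comparison
seat_counts %>% spread(key = seat_comfort, value = n, fill = 0)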

     

Each of the five variables (cabin service, entertainment, food and beverage, ground service, and seat comfort) was explored, and going over all of them led to the following observations:

    • JetBlue Airways had the best ratings for seat comfort across all airlines and thus should market themselves as the industry leaders in seat comfort. Similarly, Alaska Airlines should market themselves as the industry leader in cabin service. Given the results from Question 1, both JetBlue and Alaska would likely see a boost in sales if customers knew they lead in seat comfort and cabin service since these are the variables (of the five studied so far) that impact customers’ overall impressions of a flight the most.
    • The industry as a whole sees quite a lot of low or missing ratings in entertainment. An airline that has received middling ratings in other fields could distinguish itself with entertainment (inflight movies, wifi access, etcetera) and potentially influence more customers to care more about their inflight entertainment experiences.
    • Spirit Airlines consistently receives mostly 1 ratings, which suggests that across all considered fields customers tend to be unhappy with their experience. Yet, Spirit Airlines continues to grow. This suggests a need for more exploration into the needs and wants of airline customers.

    Insight: On the whole the US airline industry is doing the best with cabin service and the worst with ground service and seat comfort (in these fields, there were fewer 5s than any other rating). In addition, there is a huge lack of ratings for entertainment. Within each of the five fields studied, there are opportunities for individual airlines to capitalize on the difference between their own performance and that of the industry: either boast about leading in a currently important field (seat comfort or cabin service) or put money towards becoming a leader in an overlooked arena (entertainment, ground service, food and beverage).

     

    Question 3

    What words come up most frequently in positive reviews? In negative reviews?

    The previous questions aimed to better understand airline customer views on five specific aspects of flight experience (cabin service, entertainment, food and beverage, ground service, seat comfort), but since these five fields do not account for all that could impact a customer’s overall experience, I wanted to analyze the actual text of their reviews. To do this, I used word clouds generated by a combination of the “tm”, “wordcloud,” and “memoise” packages in R. I analyzed the positive and negative reviews separately and for both individual airlines and the industry as a whole. Positive reviews were those with overall ratings of 6/10 or better and negative reviews were those with overall ratings of 5/10 or worse.
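A condensed sketch of how such a cloud can be built with those packages, assuming a character vector review_text holding the positive (or negative) review texts; the app's actual code also caches results with memoise:

library(tm)
library(wordcloud)
library(RColorBrewer)

# Build and clean a corpus from the raw review text
corpus <- VCorpus(VectorSource(review_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Word frequencies from the term-document matrix
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Draw the cloud of the most frequent words
wordcloud(names(freq), freq, max.words = 100, colors = brewer.pal(8, "Dark2"))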

    Here are the positive and negative word clouds for the industry as a whole:

    In both the positive and negative reviews, the word ‘time’ is one of the top three most-used words. This suggests that, indeed, the five fields considered in questions 1 and 2 did not capture all that is important to flyers (an obvious statement to flyers, of course!). Looking through the word clouds for all the airlines individually further emphasizes that customers primarily rave about and complain about time-related topics. Interestingly, the word ‘hours’ comes up in negative reviews, suggesting that customers can tolerate a moderate amount of lateness or delay (say, under an hour) but lose patience after that. Conversely, no amount of time comes up in positive reviews; potentially, any amount of time that an airline can save a customer is rewarded.

    The words ‘seats’ and ‘service’ still appear in the top five words too, though, so the analysis from the previous questions is corroborated.

    In addition to providing additional information about the trends in the industry, individual airlines can benefit from these word cloud analyses by seeing words that appear in their clouds and not in others (or vice versa). For example, ‘Atlanta’ comes up in the negative reviews for Delta, suggesting that Delta has problems in Atlanta; similarly, Frontier has problems in Denver, and Hawaiian airline customers mention class – both positively and negatively – more than those of any other airline. Spirit, although it ranks as the worst among the airlines in nearly every one of the five fields considered in questions 1 and 2, has a negative word cloud that is not distinguished from the other negative word clouds. That is, Spirit customers still complain the most in text reviews about delays and time, and in fact, Spirit customers mention the word ‘seat’ in their positive reviews!

    Insight: Airline customers write about time more than anything else, followed by service and seats. Given the surprising findings about Spirit Airlines, it’s possible that saving/wasting time is an even stronger predictor of a customer’s overall flight rating than anything else. In business practice, this would suggest that anything an airline can do to save customers time would boost their sales; for some airlines, making tweaks even in single cities may lead to higher level of customer satisfaction.

     

    Conclusion and future directions

    Despite the fact that there are a myriad of factors that impact a flyer’s experience with a flight, airlines can boost customer satisfaction by focusing on a few main aspects of flights – particularly time, seat comfort, and cabin service. This project only scratches the surface in understanding all that impacts flyers’ impressions and each flight aspect merits a full investigation of its own. For example, research into the cost and benefit of improving seat comfort would be highly beneficial for an airline looking to improve sales through seat comfort. In addition, I’d like to employ more advanced machine learning techniques to help predict customer flight ratings and, subsequently, sales. For this, I’d like to look into the unused data I scraped about routes, traveller type, cabin, and date flown, as well as incorporate weather and economic data.

    The post Increasing Airline Customer Satisfaction appeared first on NYC Data Science Academy Blog.



    Deep learning and the German Data Science Job Market

    Mon, 10/30/2017 - 01:00

    (This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

Almost 2 years ago, I wrote a short post on the German data science market by analysing the open positions on the key job platforms: Stepstone, Monster and Indeed. Recently I received requests to update the post with fresh data.

    So here it comes. To add a slightly different spin, I thought it would be interesting to see how widespread and prevalent the deep learning / AI trend shows up in the job market.

    To get started, I scraped the number of available job openings from the sites: Monster, Stepstone and Indeed searching for the term “Data Scientist”.

To put the numbers in perspective, in December 2015 I found 65 jobs on Indeed, 36 on Monster and 36 on Stepstone. As a first take-away, the data science market has grown somewhere between 5- and 10-fold.
The ~600 positions on Indeed are advertised by 254 companies, which is almost a 5-fold increase versus 2015.

In order to see how important deep learning is, I compare the number of open positions for that search term with other data-related search terms. We see that there are 400 open positions with the keyword. Data mining and machine learning are more common by a factor of 2 and 4.5, respectively.

Looking at advertised methods, we see that it is always good to have some standard data preparation and statistics skills under your belt. Looking at the technology search terms, we see a huge interest in Spark (almost a 10-fold increase compared to the 2015 data). What is especially interesting here is that there are “only” 330 data engineer positions; hence big data technologies are mentioned in a much wider range of job postings.

    Finally, looking at programming languages and business intelligence tools we see that Java and Excel are still highly in demand.

Two years ago I concluded by stating:
    Given this analysis, I would conclude that Germany is not (yet) a hot market for data scientist. Both in terms of regional distribution as well as the low absolute and relative number of open positions suggests that the job market is not well developed.

Given the 5-fold increase in positions, I think German companies have finally realized that the digital transformation requires highly skilled, data-educated employees.



    2017-04 Speeding up gridSVG

    Sun, 10/29/2017 - 23:23

    (This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

    This report describes changes in version 1.6-0 of the ‘gridSVG’ package for R. The most important result of these changes is that ‘gridSVG’ is now much faster at generating SVG files (in situations where it used to be painfully slow).

    Paul Murrell

    Download



    Big Data Transforms

    Sun, 10/29/2017 - 16:33

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

As part of our consulting practice Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using R and substantial data stores (relational databases such as PostgreSQL, or big data systems such as Spark).



    Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."

    The R package DBI gives us direct access to SQL and the package dplyr gives us access to a transform grammar that can either be executed or translated into SQL.

    But, as we point out in the replyr README: moving from in-memory R to large data systems is always a bit of a shock as you lose a lot of your higher order data operators or transformations. Missing operators include:

    • union (binding by rows many data frames into a single data frame).
    • split (splitting a single data frame into many data frames).
    • pivot (moving row values into columns).
    • un-pivot (moving column values to rows).

I can repeat this. If you are an R user used to one of dplyr::bind_rows(), base::split(), tidyr::spread(), or tidyr::gather(), you will find these functions do not work on remote data sources, but they have replacement implementations in the replyr package.

    For example:

library("RPostgreSQL")
## Loading required package: DBI
suppressPackageStartupMessages(library("dplyr"))

isSpark <- FALSE

# Can work with PostgreSQL
my_db <- DBI::dbConnect(dbDriver("PostgreSQL"),
                        host = 'localhost', port = 5432,
                        user = 'postgres', password = 'pg')

# # Can work with Sparklyr
# my_db <- sparklyr::spark_connect(version='2.2.0',
#                                  master = "local")
# isSpark <- TRUE

d <- dplyr::copy_to(my_db,
                    data.frame(x = c(1,5),
                               group = c('g1', 'g2'),
                               stringsAsFactors = FALSE),
                    'd')
print(d)
## # Source:   table [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
##       x group
##
## 1     1    g1
## 2     5    g2

# show dplyr::bind_rows() fails.
dplyr::bind_rows(list(d, d))
## Error in bind_rows_(x, .id): Argument 1 must be a data frame or a named atomic
## vector, not a tbl_dbi/tbl_sql/tbl_lazy/tbl

    The replyr package supplies R accessible implementations of these missing operators for large data systems such as PostgreSQL and Spark.

    For example:

# using the development version of replyr https://github.com/WinVector/replyr
library("replyr")
## Loading required package: seplyr
## Loading required package: wrapr
## Loading required package: cdata
packageVersion("replyr")
## [1] '0.8.2'

# binding rows
dB <- replyr_bind_rows(list(d, d))
print(dB)
## # Source:   table [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
##       x group
##
## 1     1    g1
## 2     5    g2
## 3     1    g1
## 4     5    g2

# splitting frames
replyr_split(dB, 'group')
## $g2
## # Source:   table [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
##       x group
##
## 1     5    g2
## 2     5    g2
##
## $g1
## # Source:   table [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
##       x group
##
## 1     1    g1
## 2     1    g1

# pivoting
pivotControl <- buildPivotControlTable(d,
                                       columnToTakeKeysFrom = 'group',
                                       columnToTakeValuesFrom = 'x',
                                       sep = '_')
dW <- moveValuesToColumnsQ(keyColumns = NULL,
                           controlTable = pivotControl,
                           tallTableName = 'd',
                           my_db = my_db,
                           strict = FALSE) %>%
  compute(name = 'dW')
print(dW)
## # Source:   table [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
##   group_g1 group_g2
##
## 1        1        5

# un-pivoting
unpivotControl <- buildUnPivotControlTable(nameForNewKeyColumn = 'group',
                                           nameForNewValueColumn = 'x',
                                           columnsToTakeFrom = colnames(dW))
moveValuesToRowsQ(controlTable = unpivotControl,
                  wideTableName = 'dW',
                  my_db = my_db)
## # Source:   table [?? x 2]
## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
##      group     x
##
## 1 group_g1     1
## 2 group_g2     5

    The point is: using the replyr package you can design in terms of higher-order data transforms, even when working with big data in R. Designs in terms of these operators tend to be succinct, powerful, performant, and maintainable.

    To master the terms moveValuesToRows and moveValuesToColumns I suggest trying the following two articles:

if(isSpark) {
  status <- sparklyr::spark_disconnect(my_db)
} else {
  status <- DBI::dbDisconnect(my_db)
}
my_db <- NULL


    Practical Machine Learning with R and Python – Part 4

    Sun, 10/29/2017 - 14:40

    (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

This is the 4th installment of my ‘Practical Machine Learning with R and Python’ series. In this part I discuss classification with Support Vector Machines (SVMs), using both a linear and a radial basis kernel, and Decision Trees. Further, a closer look is taken at some of the metrics associated with binary classification, namely accuracy versus precision and recall. I also touch upon validation curves, precision-recall, ROC curves and AUC, with equivalent code in R and Python.

    This post is a continuation of my 3 earlier posts on Practical Machine Learning in R and Python
    1. Practical Machine Learning with R and Python – Part 1
    2. Practical Machine Learning with R and Python – Part 2
    3. Practical Machine Learning with R and Python – Part 3

    The RMarkdown file with the code and the associated data files can be downloaded from Github at MachineLearning-RandPython-Part4

Support Vector Machines (SVMs) are another useful machine learning model that can be used for both regression and classification problems. When used for classification, SVMs compute the hyperplane that separates the two classes with the maximum margin. To do this the features may be transformed into a larger multi-dimensional feature space. SVMs can be used with different kernels, namely linear, polynomial or radial basis, to determine the best fitting model for a given classification problem.

In the 2nd part of this series, Practical Machine Learning with R and Python – Part 2, I mentioned the various metrics that are used in classification ML problems, namely Accuracy, Precision, Recall and F1 score. Accuracy gives the fraction of data that were correctly classified as belonging to the +ve or -ve class. However, accuracy by itself is not a good enough measure because it does not take into account the fraction of the data that were incorrectly classified. This issue becomes even more critical in certain domains. For example, a surgeon who would like to detect cancer would err on the side of caution and classify even a possibly non-cancerous patient as possibly having cancer, rather than misclassify a malignancy as benign. Here we would like to increase recall, or sensitivity, which is given by Recall = TP/(TP+FN); that is, we try to reduce misclassification by either increasing the true positives (TP) or reducing the false negatives (FN).

On the other hand, search algorithms would like to increase precision, which tries to reduce the number of irrelevant results in the search output: Precision = TP/(TP+FP). In other words, we do not want false positives (irrelevant results) to appear in the search results, so there is a need to reduce the false positives.
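Both metrics are easy to compute directly from confusion-matrix counts; a small sketch in R with made-up numbers:

# Made-up confusion-matrix counts for illustration
TP <- 80; FP <- 10; FN <- 20

precision <- TP / (TP + FP)   # fraction of predicted positives that are correct
recall    <- TP / (TP + FN)   # fraction of actual positives that are found
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, F1 = f1), 3)
## precision    recall        F1
##     0.889     0.800     0.842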

    When we try to increase ‘precision’, we do so at the cost of ‘recall’, and vice-versa. I found this diagram and explanation in Wikipedia very useful Source: Wikipedia

    “Consider a brain surgeon tasked with removing a cancerous tumor from a patient’s brain. The surgeon needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain she removes to ensure she has extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the brain she removes to ensure she extracts only cancer cells. This decision increases precision but reduces recall. That is to say, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).”

    1.1a. Linear SVM – R code

    In R code below I use SVM with linear kernel

source('RFunctions-1.R')
library(dplyr)
library(e1071)
library(caret)
library(reshape2)
library(ggplot2)

# Read data. Data from SKLearn
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)

# Split into training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Fit a linear basis kernel. Do not scale the data
svmfit=svm(target~., data=train, kernel="linear",scale=FALSE)
ypred=predict(svmfit,test)

# Print a confusion matrix
confusionMatrix(ypred,test$target)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 54  3
##          1  3 82
##
##                Accuracy : 0.9577
##                  95% CI : (0.9103, 0.9843)
##     No Information Rate : 0.5986
##     P-Value [Acc > NIR] : <2e-16
##
##                   Kappa : 0.9121
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.9474
##             Specificity : 0.9647
##          Pos Pred Value : 0.9474
##          Neg Pred Value : 0.9647
##              Prevalence : 0.4014
##          Detection Rate : 0.3803
##    Detection Prevalence : 0.4014
##       Balanced Accuracy : 0.9560
##
##        'Positive' Class : 0

1.1b Linear SVM – Python code

The code below creates an SVM with a linear kernel in Python and also prints the corresponding classification metrics.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
clf = LinearSVC().fit(X_train, y_train)
print('Breast cancer dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
## Breast cancer dataset
## Accuracy of Linear SVC classifier on training set: 0.92
## Accuracy of Linear SVC classifier on test set: 0.94

1.2 Dummy classifier

Often when we perform classification tasks using any ML model, namely logistic regression, SVM, neural networks etc., it is very useful to determine how well the ML model performs against a dummy classifier. A dummy classifier uses some simple rule, like the frequency of the majority class, instead of fitting an ML model, and it is essential that our ML model does much better than the dummy classifier. This check is even more important with imbalanced classes where we have, say, only about 10% of +ve samples. If an ML model we create has an accuracy of about 0.90, it is evident that our classifier is not doing any better than a dummy classifier which simply predicts the majority class of this imbalanced dataset and also achieves 0.90. We need to be able to do better than that.

    In the examples below (1.3a & 1.3b) it can be seen that SVMs with ‘radial basis’ kernel with unnormalized data, for both R and Python, do not perform any better than the dummy classifier.

    1.2a Dummy classifier – R code

R does not seem to have an explicit dummy classifier, so I created a simple dummy classifier that predicts the majority class. Scikit-learn in Python also includes other strategies like uniform, stratified, etc., but these should be possible to create in R as well.

# Create a simple dummy classifier that computes the ratio of the majority class to the total
DummyClassifierAccuracy <- function(train,test,type="majority"){
  if(type=="majority"){
    count <- sum(train$target==1)/dim(train)[1]
  }
  count
}

cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)

# Create training and test sets
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Dummy classifier majority class
acc=DummyClassifierAccuracy(train,test)
sprintf("Accuracy is %f",acc)
## [1] "Accuracy is 0.638498"

1.2b Dummy classifier – Python code

    This dummy classifier uses the majority class.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
y_dummy_predictions = dummy_majority.predict(X_test)
print('Dummy classifier accuracy on test set: {:.2f}'
      .format(dummy_majority.score(X_test, y_test)))
## Dummy classifier accuracy on test set: 0.63

1.3a – Radial SVM (un-normalized) – R code

SVMs perform better when the data is normalized or scaled. The 2 examples below show that an SVM with a radial basis kernel does not perform any better than the dummy classifier when the data is left unscaled.

library(dplyr)
library(e1071)
library(caret)
library(reshape2)
library(ggplot2)

# Radial SVM unnormalized
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]

# Unnormalized data
svmfit=svm(target~., data=train, kernel="radial",cost=10,scale=FALSE)
ypred=predict(svmfit,test)
confusionMatrix(ypred,test$target)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0  0  0
##          1 57 85
##
##                Accuracy : 0.5986
##                  95% CI : (0.5131, 0.6799)
##     No Information Rate : 0.5986
##     P-Value [Acc > NIR] : 0.5363
##
##                   Kappa : 0
##  Mcnemar's Test P-Value : 1.195e-13
##
##             Sensitivity : 0.0000
##             Specificity : 1.0000
##          Pos Pred Value : NaN
##          Neg Pred Value : 0.5986
##              Prevalence : 0.4014
##          Detection Rate : 0.0000
##    Detection Prevalence : 0.0000
##       Balanced Accuracy : 0.5000
##
##        'Positive' Class : 0

1.4b – Radial SVM (un-normalized) – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
clf = SVC(C=10).fit(X_train, y_train)
print('Breast cancer dataset (unnormalized features)')
print('Accuracy of RBF-kernel SVC on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of RBF-kernel SVC on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))
## Breast cancer dataset (unnormalized features)
## Accuracy of RBF-kernel SVC on training set: 1.00
## Accuracy of RBF-kernel SVC on test set: 0.63

1.5a – Radial SVM (Normalized) – R code

The data is scaled (normalized) before using the SVM model. The SVM model has 2 parameters: a) C – a large C means less regularization and a small C means more regularization; b) gamma – a small gamma gives a smoother decision boundary with more misclassification, while a larger gamma gives a tighter decision boundary.

The R code below computes the accuracy as the regularization parameter is changed.

trainingAccuracy <- NULL
testAccuracy <- NULL
C1 <- c(.01,.1, 1, 10, 20)
for(i in C1){
  svmfit=svm(target~., data=train, kernel="radial",cost=i,scale=TRUE)
  ypredTrain <-predict(svmfit,train)
  ypredTest=predict(svmfit,test)
  a <-confusionMatrix(ypredTrain,train$target)
  b <-confusionMatrix(ypredTest,test$target)
  trainingAccuracy <-c(trainingAccuracy,a$overall[1])
  testAccuracy <-c(testAccuracy,b$overall[1])
}
print(trainingAccuracy)
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
## 0.6384977 0.9671362 0.9906103 0.9976526 1.0000000
print(testAccuracy)
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
## 0.5985915 0.9507042 0.9647887 0.9507042 0.9507042

a <-rbind(C1,as.numeric(trainingAccuracy),as.numeric(testAccuracy))
b <- data.frame(t(a))
names(b) <- c("C1","trainingAccuracy","testAccuracy")
df <- melt(b,id="C1")
ggplot(df) + geom_line(aes(x=C1, y=value, colour=variable),size=2) +
  xlab("C (SVC regularization) value") + ylab("Accuracy") +
  ggtitle("Training and test accuracy vs C (regularization)")

    1.5b – Radial SVM (normalized) – Python

    The Radial basis kernel is used on normalized data for a range of ‘C’ values and the result is plotted.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print('Breast cancer dataset (normalized with MinMax scaling)')
trainingAccuracy=[]
testAccuracy=[]
for C1 in [.01,.1, 1, 10, 20]:
    clf = SVC(C=C1).fit(X_train_scaled, y_train)
    acctrain=clf.score(X_train_scaled, y_train)
    accTest=clf.score(X_test_scaled, y_test)
    trainingAccuracy.append(acctrain)
    testAccuracy.append(accTest)

# Create a dataframe
C1=[.01,.1, 1, 10, 20]
trainingAccuracy=pd.DataFrame(trainingAccuracy,index=C1)
testAccuracy=pd.DataFrame(testAccuracy,index=C1)

# Plot training and test accuracy as a function of C
df=pd.concat([trainingAccuracy,testAccuracy],axis=1)
df.columns=['trainingAccuracy','testAccuracy']
fig1=df.plot()
fig1=plt.title('Training and test accuracy vs C (SVC)')
fig1.figure.savefig('fig1.png', bbox_inches='tight')
## Breast cancer dataset (normalized with MinMax scaling)

    Output image: 

    1.6a Validation curve – R code

Scikit-learn includes a function for creating validation curves by varying a parameter (such as gamma or C) and computing and plotting the accuracy as it changes. I did not find this in R, but I think it is a useful function, so I have created the R equivalent of it.

# The R equivalent of np.logspace
seqLogSpace <- function(start,stop,len){
  a=seq(log10(10^start),log10(10^stop),length=len)
  10^a
}

# Read the data. This is taken from the SKlearn cancer data
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)
set.seed(6)

# Create the range of C1 in log space
param_range = seqLogSpace(-3,2,20)

# Initialize the overall training and test accuracy to NULL
overallTrainAccuracy <- NULL
overallTestAccuracy <- NULL

# Loop over the parameter range of Gamma
for(i in param_range){
  # Set no of folds
  noFolds=5
  # Create the rows which fall into different folds from 1..noFolds
  folds = sample(1:noFolds, nrow(cancer), replace=TRUE)
  # Initialize the training and test accuracy of folds to 0
  trainingAccuracy <- 0
  testAccuracy <- 0

  # Loop through the folds
  for(j in 1:noFolds){
    # The training is all rows for which the row is != j (k-1 folds -> training)
    train <- cancer[folds!=j,]
    # The rows which have j as the index become the test set
    test <- cancer[folds==j,]
    # Create a SVM model for this
    svmfit=svm(target~., data=train, kernel="radial",gamma=i,scale=TRUE)

    # Add up all the fold accuracy for training and test separately
    ypredTrain <-predict(svmfit,train)
    ypredTest=predict(svmfit,test)

    # Create confusion matrix
    a <-confusionMatrix(ypredTrain,train$target)
    b <-confusionMatrix(ypredTest,test$target)
    # Get the accuracy
    trainingAccuracy <-trainingAccuracy + a$overall[1]
    testAccuracy <-testAccuracy+b$overall[1]
  }

  # Compute the average of accuracy for K folds for number of features 'i'
  overallTrainAccuracy=c(overallTrainAccuracy,trainingAccuracy/noFolds)
  overallTestAccuracy=c(overallTestAccuracy,testAccuracy/noFolds)
}

# Create a dataframe
a <- rbind(param_range,as.numeric(overallTrainAccuracy), as.numeric(overallTestAccuracy))
b <- data.frame(t(a))
names(b) <- c("C1","trainingAccuracy","testAccuracy")
df <- melt(b,id="C1")

# Plot in log axis
ggplot(df) + geom_line(aes(x=C1, y=value, colour=variable),size=2) +
  xlab("C (SVC regularization) value") + ylab("Accuracy") +
  ggtitle("Training and test accuracy vs C (regularization)") +
  scale_x_log10()

    1.6b Validation curve – Python

    Compute and plot the validation curve as gamma is varied.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_cancer)

# Create gamma values from 10^-3 to 10^2 with 20 equally spaced intervals
param_range = np.logspace(-3, 2, 20)
# Compute the validation curve
train_scores, test_scores = validation_curve(SVC(), X_scaled, y_cancer,
                                             param_name='gamma',
                                             param_range=param_range, cv=10)

# Plot the figure
fig2 = plt.figure()
# Compute the mean and standard deviation across the CV folds
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

fig2 = plt.title('Validation Curve with SVM')
fig2 = plt.xlabel('$\gamma$ (gamma)')
fig2 = plt.ylabel('Score')
fig2 = plt.ylim(0.0, 1.1)
lw = 2
fig2 = plt.semilogx(param_range, train_scores_mean, label='Training score',
                    color='darkorange', lw=lw)
fig2 = plt.fill_between(param_range, train_scores_mean - train_scores_std,
                        train_scores_mean + train_scores_std, alpha=0.2,
                        color='darkorange', lw=lw)
fig2 = plt.semilogx(param_range, test_scores_mean, label='Cross-validation score',
                    color='navy', lw=lw)
fig2 = plt.fill_between(param_range, test_scores_mean - test_scores_std,
                        test_scores_mean + test_scores_std, alpha=0.2,
                        color='navy', lw=lw)
fig2.figure.savefig('fig2.png', bbox_inches='tight')

    Output image: 

    1.7a Validation Curve (Preventing data leakage) – Python code

In the course Applied Machine Learning in Python, the Professor states that applying the same data transformation to an entire dataset causes data leakage. “The proper way to do cross-validation when you need to scale the data is not to scale the entire dataset with a single transform, since this will indirectly leak information into the training data about the whole dataset, including the test data (see the lecture on data leakage later in the course). Instead, scaling/normalizing must be computed and applied for each cross-validation fold separately”

So I apply the scaling separately to the training and test folds and then plot. In the lecture the Professor states that this can also be done using pipelines.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cross_validation import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Read the data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Set the parameter range
param_range = np.logspace(-3, 2, 20)
# Set number of folds
folds = 5
# Initialize
overallTrainAccuracy = []
overallTestAccuracy = []

# Loop over the parameter range
for c in param_range:
    trainingAccuracy = 0
    testAccuracy = 0
    kf = KFold(len(X_cancer), n_folds=folds)
    # Partition into training and test folds
    for train_index, test_index in kf:
        # Partition the data according to the fold indices generated
        X_train, X_test = X_cancer[train_index], X_cancer[test_index]
        y_train, y_test = y_cancer[train_index], y_cancer[test_index]
        # Scale X_train and X_test separately within each fold
        scaler = MinMaxScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        # Fit a SVC model for each C
        clf = SVC(C=c).fit(X_train_scaled, y_train)
        # Compute the training and test score
        acctrain = clf.score(X_train_scaled, y_train)
        accTest = clf.score(X_test_scaled, y_test)
        trainingAccuracy += np.sum(acctrain)
        testAccuracy += np.sum(accTest)
    # Compute the mean training and test accuracy
    overallTrainAccuracy.append(trainingAccuracy/folds)
    overallTestAccuracy.append(testAccuracy/folds)

overallTrainAccuracy = pd.DataFrame(overallTrainAccuracy, index=param_range)
overallTestAccuracy = pd.DataFrame(overallTestAccuracy, index=param_range)

# Plot training and test accuracy as a function of C
df = pd.concat([overallTrainAccuracy, overallTestAccuracy], axis=1)
df.columns = ['trainingAccuracy', 'testAccuracy']

fig3 = plt.title('Validation Curve with SVM')
fig3 = plt.xlabel('C (SVC regularization)')
fig3 = plt.ylabel('Score')
fig3 = plt.ylim(0.5, 1.1)
lw = 2
fig3 = plt.semilogx(param_range, overallTrainAccuracy, label='Training score',
                    color='darkorange', lw=lw)
fig3 = plt.semilogx(param_range, overallTestAccuracy, label='Cross-validation score',
                    color='navy', lw=lw)
fig3 = plt.legend(loc='best')
fig3.figure.savefig('fig3.png', bbox_inches='tight')

    Output image: 
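
As an aside, a rough R analogue of this fold-wise scaling can be had with caret, which learns its preProcess transformation inside each resampling iteration rather than on the whole dataset. This is not from the course, just a minimal sketch (assuming the cancer.csv file used in the earlier R sections, and requiring the kernlab package for method = "svmRadial"):

library(caret)

# caret learns the preProcess transformation on the training portion of each
# resample and applies it to the held-out fold, so no information leaks from
# the test fold into the scaling
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)

fit <- train(target ~ ., data = cancer,
             method = "svmRadial",
             preProcess = "range",   # MinMax-style scaling, computed per resample
             trControl = trainControl(method = "cv", number = 5))
fit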

1.8a Decision trees – R code

Decision trees in R can be built and plotted using the rpart and rpart.plot packages.

library(rpart)
library(rpart.plot)
# Create a decision tree
m <- rpart(Species~., data=iris)
# Plot
rpart.plot(m, extra=2, main="Decision Tree - IRIS")

     

1.8b Decision trees – Python code

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
import graphviz

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 3)
clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
## Accuracy of Decision Tree classifier on training set: 1.00
## Accuracy of Decision Tree classifier on test set: 0.97

1.9a Feature importance – R code

I found the following R code, which includes a snippet for computing feature importance. Sklearn has a nice method for this. For some reason the results in R and Python differ. Any thoughts?

set.seed(3)
# load the libraries
library(mlbench)
library(caret)
# load the dataset
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)
# Split as data
data <- cancer[,1:31]
target <- cancer[,32]
# Train the model
model <- train(data, target, method="rf", preProcess="scale",
               trControl=trainControl(method = "cv"))
# Compute variable importance
importance <- varImp(model)
# summarize importance
print(importance)
# plot importance
plot(importance)

1.9b Feature importance – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Read the data
cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

# Use the DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 8,
                             random_state = 0).fit(X_train, y_train)
c_features = len(cancer.feature_names)
print('Breast cancer dataset: decision tree')
print('Accuracy of DT classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of DT classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))

# Plot the feature importances
fig4 = plt.figure(figsize=(10,6), dpi=80)
fig4 = plt.barh(range(c_features), clf.feature_importances_)
fig4 = plt.xlabel("Feature importance")
fig4 = plt.ylabel("Feature name")
fig4 = plt.yticks(np.arange(c_features), cancer.feature_names)
fig4 = plt.tight_layout()
plt.savefig('fig4.png', bbox_inches='tight')
## Breast cancer dataset: decision tree
## Accuracy of DT classifier on training set: 0.96
## Accuracy of DT classifier on test set: 0.94

    Output image: 

    1.10a Precision-Recall, ROC curves & AUC- R code

I tried several R packages for plotting Precision-Recall and ROC curves and computing the AUC; PRROC seems to work well. Precision-Recall curves show the tradeoff between precision and recall: the higher the precision, the lower the recall, and vice versa. ROC curves that hug the top left corner indicate high sensitivity and specificity, and an excellent accuracy.

    source("RFunctions-1.R") library(dplyr) library(caret) library(e1071) library(PRROC) # Read the data (this data is from sklearn!) d <- read.csv("digits.csv") digits <- d[2:66] digits$X64 <- as.factor(digits$X64) # Split as training and test sets train_idx <- trainTestSplit(digits,trainPercent=75,seed=5) train <- digits[train_idx, ] test <- digits[-train_idx, ] # Fit a SVM model with linear basis kernel with probabilities svmfit=svm(X64~., data=train, kernel="linear",scale=FALSE,probability=TRUE) ypred=predict(svmfit,test,probability=TRUE) head(attr(ypred,"probabilities")) ## 0 1 ## 6 7.395947e-01 2.604053e-01 ## 8 9.999998e-01 1.842555e-07 ## 12 1.655178e-05 9.999834e-01 ## 13 9.649997e-01 3.500032e-02 ## 15 9.994849e-01 5.150612e-04 ## 16 9.999987e-01 1.280700e-06 # Store the probability of 0s and 1s m0<-attr(ypred,"probabilities")[,1] m1<-attr(ypred,"probabilities")[,2] # Create a dataframe of scores scores <- data.frame(m1,test$X64) # Class 0 is data points of +ve class (in this case, digit 1) and -ve class (digit 0) #Compute Precision Recall pr <- pr.curve(scores.class0=scores[scores$test.X64=="1",]$m1, scores.class1=scores[scores$test.X64=="0",]$m1, curve=T) # Plot precision-recall curve plot(pr)

# Plot the ROC curve
roc <- roc.curve(m0, m1, curve=TRUE)
plot(roc)

    1.10b Precision-Recall, ROC curves & AUC- Python code

For Python, Logistic Regression is used to plot the Precision-Recall and ROC curves and compute the AUC.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve, auc

# Load the digits
dataset = load_digits()
X, y = dataset.data, dataset.target

# Create 2 classes - i) Digit 1 (from digit 1) ii) Digit 0 (from all other digits)
# Make a copy of the target
z = y.copy()
# Replace all non 1's as 0
z[z != 1] = 0

X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
# Fit a LR model
lr = LogisticRegression().fit(X_train, y_train)

# Compute the decision scores
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))
# Show the decision_function scores for the first 20 instances
y_score_list

precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

# Plot the Precision-Recall curve
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.savefig('fig5.png', bbox_inches='tight')

# Compute and plot the ROC curve and AUC
y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.axes()
plt.savefig('fig6.png', bbox_inches='tight')

    output

    output

    1.10c Precision-Recall, ROC curves & AUC- Python code

In the code below, classification probabilities are used to compute and plot the precision-recall curve, the ROC curve and the AUC.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import precision_recall_curve

dataset = load_digits()
X, y = dataset.data, dataset.target
# Make a copy of the target
z = y.copy()
# Replace all non 1's as 0
z[z != 1] = 0

X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
svm = LinearSVC()
# Need to use CalibratedClassifierCV to predict probabilities for LinearSVC
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba_lr = clf.predict_proba(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_proba_lr[:,1])
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

# plt.figure(figsize=(15,15), dpi=80)
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.savefig('fig7.png', bbox_inches='tight')

    output

Note: As with other posts in this series on ‘Practical Machine Learning with R and Python’, this post is based on these 2 MOOC courses:
1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online
2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

    Conclusion

This 4th part looked at SVMs with linear and radial basis function kernels, decision trees, feature importance, the precision-recall tradeoff, ROC curves and AUC.

    Stick around for further updates. I’ll be back!
Comments, suggestions and corrections are welcome.

    Also see
    1. A primer on Qubits, Quantum gates and Quantum Operations
    2. Dabbling with Wiener filter using OpenCV
    3. The mind of a programmer
    4. Sea shells on the seashore
5. yorkr pads up for the Twenty20s: Part 1- Analyzing team's match performance

    To see all posts see Index of posts


    To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Capturing the Brexit vote in data

    Sun, 10/29/2017 - 13:27

    (This article was first published on fReigeist » R, and kindly contributed to R-bloggers)

    In my recent research I have worked on understanding the key correlates of Brexit. This paper is joint work with my co-authors Sascha Becker and Dennis Novy and has now been published in Economic Policy. After having gone through the peer-review process, I am very happy to share the data and the underlying code.

    On the construction of a rich local authority level dataset

    After the EU Referendum, I started working on assembling a dataset to study the question to what extent low turnout among potential remain voters may help to understand the result of the referendum. In particular, back then I looked at the extent to which rainfall may have discouraged people to turn up to vote in the first place. After all, a lot of polls, the financial markets and betting markets seemed to indicate that Remain would win.

It was not inconceivable that turnout may have been particularly low among a set of voters living in the London commuter belt, who were turned off from voting on a working day after an arduous journey into London that was marred by train cancellations due to the bad weather.

    The result then was that there appears to have been a weak effect on turnout, but this effect was not biased in either direction with regard to voter preferences: Remain would have lost also on a sunny day (we confirm this finding in the new paper with a different rainfall data set).

    In any case, this was the starting point to a significant data collection effort. In particular we have combined data from all the 380 local authorities in England, Scotland and Wales as well as some data of ward level voting across five cities. The resulting set of covariates is quite sizable and they can be roughly grouped into the following four categories:

    1. characteristics of the underlying economic structure of an area, like the unemployment rate, wages or sectoral employment shares and changes therein,
    2. demographic and human capital characteristics like age structure, education or life satisfaction,
    3. exposure to the EU measured by indicators like the level of EU transfers, trade and the growth rate of immigration from other EU countries,
    4. public service provision and fiscal consolidation that covers the share of public employment, as well as the reduction in public spending per capita among others and NHS performance metrics.

To analyse which of these covariates are strong correlates of the Leave vote share in an area, we used a simple machine-learning method that chooses the subset of covariates that best predicts the referendum Leave share. The idea is to build a robust predictive model that can achieve significant out-of-sample prediction accuracy.

    More formally, best subset selection solves the following optimization problem to obtain an optimal vector beta that can robustly predict the vote leave share y_c
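
The optimization problem itself is shown as an image in the original post; as a reconstruction (my notation, not necessarily the authors'), best subset selection with at most k regressors can be written as

\min_{\beta} \; \sum_{c} \left( y_c - x_c'\beta \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \mathbf{1}\{\beta_j \neq 0\} \le k,

where y_c is the Leave vote share in local authority c, x_c is the vector of p covariates for that area, and the constraint caps the number of non-zero coefficients at k.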

While the formulation may be a bit technical, it boils down to estimating all possible ways to combine regressors. That means all possible models that include 1, 2, 3, …, p covariates are estimated. This amounts to estimating 2^p models, which very quickly becomes infeasible. Lasso and other methods, such as forward and backward stepwise selection, solve approximations to the combinatorial problem that BSS solves. The method is usually taught at the beginning of machine learning courses as it provides a great way to illustrate the bias-variance tradeoff.

    The result can be summarized in a simple bar chart and is somewhat striking:

     

    R2 of best models with different groups of regressors

What this suggests is that “Fundamental factors” such as economic structure, demographics and human capital are the strongest correlates of the Leave vote in the UK; the direct level effects of migration and trade exposure captured in the group of regressors called “EU Exposure” seem second order.

Also striking is that even very simple empirical models with just 10–20 variables do a good job of capturing the variation in the Leave vote share across the whole of the UK. The best model, which includes variables measuring vote shares of different parties (especially UKIP) from the 2014 European Parliamentary election, captures 95% of the overall variation.

The observation that the variables capturing direct exposure to the European Union, in particular immigration, matter comparatively little seems at odds with voter narratives around the EU referendum, which were dominated by the immigration topic. In fact, the 2015 British Election Study suggests that voters considered immigration to be the single most important issue of the time.

    Unigram word cloud constructed out of responses to question “what is single most important issue facing country at the time” from the 2015 BES study

In the above word cloud, the keyword immigration is quite striking. Looking at the bigram word cloud drawn from the British Election Study, the themes of the individual responses become even more apparent.

    Bigram word cloud constructed out of responses to question “what is single most important issue facing country at the time” from the 2015 BES study

Here, things like “many people”, “small island” and “uncontrolled immigration” appear in addition to the simple immigration keyword. Other themes, such as “national debt”, “health service” and concerns about the “welfare system”, also feature prominently. Overall this suggests that immigration was a salient feature of the public debate, which sits oddly with the fact that the variable groups pertaining to EU immigration capture very little of the variation in the intensity of the Leave vote across the UK.

     

In another paper with Sascha we confirm this observation in a panel setup. We show that a local area’s exposure to Eastern European immigration has, at best, only a small impact on anti-EU preferences in the UK as measured by UKIP voting in European Parliamentary Elections.

While our code for the Economic Policy paper is implemented in Stata, it is very easy to replicate this work in R. Below is an example of how you would implement BSS in R through the leaps package.

    The leaps package has a nice visualization of which regressors are included in which of the best models. The below example replicates the results in our Table 1.

library(data.table)
library(haven)
library(leaps)

DTA <- data.table(read_dta(file="data/brexitevote-ep2017.dta"))

# TABLE 1 IN THE PAPER WOULD BE REPLICATED BY RUNNING
TABLE1 <- regsubsets(Pct_Leave ~ InitialEUAccession + MigrantEUAccessionGrowth +
                       InitialOldEUMember + MigrantOldEUMemberGrowth +
                       InitialElsewhere + MigrantElsewhereGrowth +
                       Total_EconomyEU_dependence + eufundspercapitapayment13 +
                       EU75Leaveshare,
                     nbest=1, data=DTA)

plot(TABLE1, scale="r2")

     

    Plotting best subset results

    If you want to replicate the word clouds from the 2015 British Election study, the following lines of code will do. You will need to download the British Election Study dataset here.

library(data.table)
library(haven)
library(quanteda)

DTA <- data.table(read_dta(file="data/bes_f2f_original_v3.0.dta"))

C <- corpus(DTA$A1)
docnames(C) <- DTA$finalserialno

C.dfmunigram <- dfm(C, tolower = TRUE, stem = FALSE, removeNumbers=TRUE, removePunct=TRUE,
                    remove = stopwords("english"), ngrams=1)
C.dfm <- dfm(C, tolower = TRUE, stem = FALSE, removeNumbers=TRUE, removePunct=TRUE,
             remove = stopwords("english"), ngrams=2)

plot(C.dfmunigram)
plot(C.dfm)

    To leave a comment for the author, please follow the link and comment on their blog: fReigeist » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Moving to Blogdown

    Sun, 10/29/2017 - 02:00

    (This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

As I announced last Friday on Twitter, I have moved Data Imaginist from Jekyll to Blogdown. This is going to be a short post about why I did it. There’s not going to be much of a tutorial here; the migration is as painless as can be expected and besides, other more qualified people have already made great tutorials about the matter.

A little over a year ago I started blogging, using what I expect most were using at the time (in R-land anyway) — the Jekyll/Github Pages combo. I’d stolen a script from David Robinson that converted new RMarkdown files into Markdown documents for Jekyll, resulting in a workflow quite like what Blogdown offers. So when Blogdown arrived, I had little inclination to look more into it. Blogdown might be newer and more flashy, but it didn’t seem to offer enough to warrant the time investment it would take to migrate my posts and learn about Hugo. Arguably it is possible to ignore the Hugo backend when using Blogdown, but I like to tinker and would never be satisfied using an engine I didn’t understand (this might explain why I have been poking at ggplot2 to the extent I have).

    Bottomline: I let it pass…

Then, little by little, I was drawn in anyway. The first seducer was the slew of blog posts describing the bliss that migrating to Blogdown had brought – in addition, those posts also spoke highly of the hosting from Netlify and their free HTTPS support. Being around Bob Rudis (in the Twitterverse) had made me aware of the need for secure browsing habits, so this appealed a lot to me (even though the HTTPS support came through Let’s Encrypt – something Bob vehemently opposes). The second seducer was more of a repulsor from my current setup: my page was falling apart… Due to changes in Jekyll my internal links were no longer functioning and I would need to go through my whole theme to fix it. What was most annoying was that this stabbed me in the back. Jekyll had been upgraded on Github Pages at some point, silently wreaking havoc on my site without me knowing it – clearly an unacceptable setup. Lastly, as convenient as the automatic Jekyll building by Github was (when it didn’t break my site), it was just as annoying to do it locally to preview a post before publishing. Maybe it is because I haven’t spent much time in Ruby-world, but I never quite managed to get everything installed correctly so I could ensure that my own system matched what Github was using… The final push came when I, for a brief period, was the designer of the Blogdown logo and decided that now was the right time to make the switch.

As I said in the beginning, this will not be a post about how to switch to Blogdown from Jekyll, as I’m horrible at writing such posts. I can, however, describe my approach to migrating. I started by browsing the available Hugo themes and, to my joy, the one I was currently using had been ported and improved upon, so the choice was easy (to be honest I’m not really pleased with the look of my blog, but the Cactus theme is pretty stylish in its minimalism). Once the theme was sorted out I began moving my posts into the new system one by one to see how it went. To my surprise it went predominantly without a hitch. Blogdown sets some different default figure sizes, which resulted in some of my animations becoming unbearably large, so this had to be adjusted, but other than that it was mainly a matter of switching the Jekyll macros for Hugo shortcodes. My most used macro was for linking between blog posts, which is akin to the ref shortcode, but it took me a while to figure out that the argument to ref had to be enclosed in escaped quotes to survive the RMarkdown + Pandoc conversion. This could be a nice thing for Blogdown to take care of, as it feels pretty cumbersome to write a path as '\\"my-post.html\\"'…

Once I had gotten all my posts successfully into Blogdown I needed to get the rest going, which for the most part meant porting all my hacks and changes from the Jekyll theme into the Hugo version. One thing I’m pretty serious about is how links to my posts look when shared on Twitter, so I had made a system where I could set the Twitter card type in the preamble of a post (large vs small) as well as a picture to use instead of my default logo. Porting this meant getting a bit into the Hugo templating system so that I could migrate my changes to the HTML header that gives these nice cards on Twitter/Facebook/LinkedIn. In the end, HTML templating systems seem to be so alike nowadays that moving from one to the next is mostly a matter of understanding the data binding and how it differs between the systems. If you have some custom templates in Jekyll, I can assure you that they probably won’t take long to port over to Hugo…

My last big todo was my RSS feed. Had I only had one it would not have been an issue at all, but as it is my page actually has two – a general one and one unique to posts in the R category (this was only made for R-Bloggers). Hugo comes with automatic RSS generation for both the main site and for different tags and categories, but they all reside at different locations. My two feeds both reside at the root of my page – something not really supported by Hugo. But I couldn’t accept breaking subscriptions or links, so I scavenged the net and came to a solution – as my git commits tell, it was not straightforward…

In the end I prevailed. If you are in a similar situation, do look at my source code for the solution or contact me – I won’t bore the remaining 99.9% with the details. The RSS episode was my first experience battling the Hugo system, and it arguably only happened because I had to support a setup from Jekyll that is handled another way by Hugo…

I wish I had a lot to tell you about my experience with Netlify, but I don’t. Their service is the poster child of easy, one-click solutions and I can whole-heartedly recommend them. The switch from Github to Netlify hosting took a total of around 20 minutes, including setting up HTTPS.

    As easy as I have hopefully made all of this sound, I really, really, really hope that something new and better doesn’t come out within a year. Doing this sort of infrastructure stuff is not something I particularly enjoy and I could have given my site a new look or begun developing the next version of gganimate instead of basically getting to the same page I had before. But sometimes you have to clean your room before you can play, and hopefully playtime is coming soon.


    To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    It’s a dirty job, but someone’s got to do it..

    Sun, 10/29/2017 - 01:00

    (This article was first published on HighlandR, and kindly contributed to R-bloggers)

    A tidytext analysis of Faith No More lyrics –

    Is this a midlife crisis?

I wanted to ease myself back into text mining, specifically using the tidytext package, as I haven’t had to do any at work for well over a year.

    I’ve been thinking about some of the old bands of the 90’s, some of whom split up, and then reformed. I was interested to see how lyrics evolve over time, and to see what changes there were as the band matured.

    I’ve decided to look at Faith No More, because:

    • they recently reformed after an 18 year split
    • they had numerous line up changes, the main one being a change of vocalist shortly before the release of their 3rd (and most successful) album, “The Real Thing”.
    • Said vocalist, Mike Patton, was only 21 when he joined. I thought it might be interesting to see how his lyrics evolved over time
    • Their 4th album “Angel Dust” has been named the most influential rock album of all time.

    Let’s get one thing out of the way – there will be some review of the band’s history as well as being an R article but I promise – there will be no wordclouds.

To start – let’s look at the number of tracks being analysed by album. I’ve excluded “We Care A Lot” from “Introduce Yourself”, as it’s the same as the title track from their first album and would skew the analysis. I’ve also excluded cover versions, bonus tracks and a track from Sol Invictus, because the title is a very naughty word indeed.

    Most of the examples here follow the code available on the most excellent http://tidytextmining.com/. Big thanks to the package authors, Julia Silge and David Robinson for this package. I had done some text mining prior to the release of this package, but as someone used to dataframes, I soon found my progress limited to following a small number of tutorials and not knowing how to manipulate term document matrices etc to conduct further analysis.
    Tidytext is a game-changer because the data remains in dataframe/tibble format, and we can use dplyr and ggplot2 to wrangle and visualise.

    I had originally hoped this whole exercise would give me some experience with web-scraping – but the site I chose only allowed the use of an API – and while there was a Python one, there was no R alternative. Regardless – I was able to collate the data I needed.

    The original dataframe consisted of Title, Album and a lyrics column, and this was transformed into a format suitable for text analysis:

# %notin% is assumed to be the negation of %in%, defined earlier in the original post,
# e.g. `%notin%` <- Negate(`%in%`)
text_df <- data %>%
  unnest_tokens(word, lyrics) %>%
  filter(word %notin% c("b","e","a","g","r","s","i","v","la")) %>%
  mutate(linenumber = row_number())

    One song (“Be Aggressive”) has a cheerleader style chorus where the title is spelled out letter by letter – so, I excluded these, along with “la”, which was also featuring highly in my initial plot of common words. Along with filtering out common “stop words”, I arrived at the following plot:

    This looks pretty boring, so let’s at least split it by album:

    We Care A Lot

    Unsurprisingly, “We Care A Lot” dominates the first album, given that it is repeated endlessly throughout the title track. Live favourite “As The Worm Turns” doesn’t feature though.

    Introduce Yourself

Looking at “Introduce Yourself”, I’m surprised the title track doesn’t feature more heavily. Of the top words shown here, I recognise a couple from “Anne’s Song”. None from my under-the-radar favourites “Faster Disco” and “Chinese Arithmetic” though.

    The Real Thing

Now on to the golden age of the band – Mike Patton unleashed his 6-octave vocal range and guitarist Jim Martin was still alternately shredding and providing short, clever solos. It took me a while to figure out why “yeah” was so high, then I remembered the chorus of “Epic”. Never heard Faith No More? Start with this one. Also featuring here are words from “Underwater Love”, “Surprise! You’re Dead!” and personal favourite “From Out of Nowhere”.

    Angel Dust

“Angel Dust” is dominated by words from just 2 songs – the aforementioned “Be Aggressive” and “Midlife Crisis”. Other popular tracks from this album were “A Small Victory” and their hit cover of the Commodores’ “Easy”.

    King For A Day…Fool For A Lifetime

    “King For A Day…Fool For A Lifetime” marks the start of the post-Jim Martin era and stands out in this plot in that it has a number of different tracks with frequently mentioned words. This album saw the band alternate many musical styles – sometimes in the same track – an example being “Just A Man”. This was one of the album’s highlights (see also – “Evidence”) but there was some filler on there too unfortunately.

    Album Of The Year

    The ironically titled “Album of the Year” reverts to being dominated by a few tracks, including single “Last Cup of Sorrow”. Other highlights were “Ashes To Ashes” and “Stripsearch”.

Shortly after the release of this album, and after years of in-fighting, the band called it a day. They didn’t match the success of “The Real Thing” in the US, although the rest of the world was more switched on, and “Angel Dust” actually outsold it in some areas.

    Even towards the end, they were popular enough in the UK to appear on “TFI Friday” (popular 90’s live entertainment show, hosted by then top presenter Chris Evans) 2 weeks running. While Britpop was a thing, FNM played “Ashes to Ashes” and turned in one of the finest live performances on UK TV (according to the internet).

    And, for a long time (18 years) it looked like that was that. An avant-garde rock band consigned to history. But, following several years of occasional live shows, the band returned in their final lineup and released “Sol Invictus”. I have to confess I have not listened to this album, apart from a few tracks. “Sunny Side Up” being the stand-out for me so far.

    Enough history, let’s look at some more plots and R code.

song_words <- data %>%
  unnest_tokens(word, lyrics) %>%
  count(Title, word, sort = TRUE) %>%
  filter(n > 30) %>%
  ungroup()

ggplot(song_words, aes(reorder(word, n), n, fill = Title)) +
  geom_col() +
  xlab(NULL) +
  ylab(NULL) +
  coord_flip() +
  theme_ipsum() +
  scale_fill_ipsum() +
  #scale_fill_viridis(option ="C", discrete = TRUE) +
  ggtitle("Most Common Words - All Faith No More Studio Songs",
          subtitle = "Frequency >30. Including common stop words") +
  theme(legend.position = "bottom")

## now remove stop words and the letters from "Be Aggressive"
song_words_filtered <- data %>%
  unnest_tokens(word, lyrics) %>%
  anti_join(stop_words) %>%
  filter(stringr::str_detect(word, "[a-z`]$"),
         !word %in% stop_words$word) %>%
  filter(word %notin% c("b","e","a","g","r","s","i","v","la")) %>%
  count(Title, word, sort = TRUE) %>%
  filter(n > 20) %>%
  ungroup()

ggplot(song_words_filtered, aes(reorder(word, n), n, fill = Title)) +
  geom_col() +
  xlab(NULL) +
  ylab(NULL) +
  coord_flip() +
  theme_ipsum() +
  scale_fill_ipsum() +
  ggtitle("Most Common Words - All Faith No More Studio Songs",
          subtitle = "Frequency >20. Excluding common stop words") +
  theme(legend.position = "bottom")

Term frequency by album – the number of times a word is used in the lyrics divided by the total number of words in the lyrics. A lot of words occur rarely and very few occur frequently:

Inverse document frequency (combined with term frequency as tf-idf) – an attempt to identify words that don’t occur frequently overall but are important to a particular album.
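
The code for these tf and tf-idf plots is not shown in the post; a minimal sketch of how they could be produced with tidytext (assuming the data frame of Title/Album/lyrics used throughout, and producing an album_words object with the same Album/word/n shape as the one used for the correlation plot further down) might be:

library(dplyr)
library(tidytext)
library(ggplot2)

# Word counts per album, with term frequency and tf-idf
album_words <- data %>%
  unnest_tokens(word, lyrics) %>%
  count(Album, word, sort = TRUE) %>%
  bind_tf_idf(word, Album, n)

# Highest tf-idf words per album
album_words %>%
  group_by(Album) %>%
  top_n(5, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(reorder(word, tf_idf), tf_idf, fill = Album)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~Album, scales = "free_y") +
  labs(x = NULL, y = "tf-idf")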

    Now we’ve looked at common and important words, let’s see if we can get a handle on the sentiment of those words.
    Admittedly, this first plot is a bit of a gimmick, but rather than use geom_col, or geom_point, I thought I’d use the ggimage package, and mark the overall album sentiment using the relevant album covers.

    Here’s the code:

# overall album sentiment
albumsentiment <- text_df %>%
  inner_join(get_sentiments("bing")) %>%
  count(Album, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)

# create vector of images
albumsentiment$img <- c("2017-10-22-WCAL.jpg", "2017-10-22-IY.jpg",
                        "2017-10-22-TRT.jpg", "2017-10-22-AD.jpg",
                        "2017-10-22-KFAD.jpg", "2017-10-22-AOTY.jpg",
                        "2017-10-22-SI.jpg")

ggplot(albumsentiment, aes(Album, sentiment, fill = Album)) +
  #geom_col(show.legend = FALSE) +
  theme_ipsum() +
  ggExtra::removeGrid() +
  coord_flip() +
  geom_image(aes(image = img), size = .15) +
  ggtitle("Overall Sentiment Score by Album",
          subtitle = "Faith No More Studio Albums - Excludes cover versions") +
  labs(x = NULL, y = NULL) +
  geom_text(aes(label = sentiment), hjust = 0, nudge_y = 7) +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

So only “Introduce Yourself” managed to get a positive overall sentiment score. I’m surprised that “Album of the Year” scores as highly as it does, given the band were on the verge of splitting up. However – this is simply a case of summing positive and negative scores by individual word and finding the overall difference – so it’s definitely not an exact science. Songs like “Collision” and “Helpless” do not convey positive emotion to me.

    Here is a more straightforward plot of sentiment by track:

    Topic Models

    I’m going to take the easy way out and refer you to the tidytext website to explain topic models. I have to be honest, I don’t really see any themes that make any sense to me here. So, I thought I should look at lyrics by individual vocalist:

Here I take the top terms from the first 5 topics for each vocalist to see if there is anything in common:
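
The topic-model code itself is not shown; a minimal sketch of the approach with topicmodels and tidytext (assuming a text_df like the one above, plus a hypothetical Vocalist column identifying who sang each track) might look like this:

library(dplyr)
library(tidytext)
library(topicmodels)

# One document per song for a given vocalist, cast to a document-term matrix
patton_dtm <- text_df %>%
  filter(Vocalist == "Patton") %>%   # Vocalist is an assumed, illustrative column
  count(Title, word) %>%
  cast_dtm(Title, word, n)

# Fit LDA with 5 topics and tidy the per-topic word probabilities (beta)
patton_lda <- LDA(patton_dtm, k = 5, control = list(seed = 1234))
patton_topics <- tidy(patton_lda, matrix = "beta")

# Top 5 terms in each topic
patton_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  arrange(topic, -beta)

The same steps, repeated for the earlier vocalist's songs, give the two sets of terms compared below.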

    After a bit of digging around, it turned out only 2 terms existed in both sets:

Hmm, “Love” and “World”. Seems a bit “touchy feely”, as they say. Perhaps I should look at this by album and see what topics appear?

    I’d expected perhaps that words belonging to the same song would get grouped together, but that doesn’t appear to happen very often.

I also looked at splitting the lyrics into trigrams (sequences of 3 words, e.g. “We Care A”, “Care A Lot”, “A Lot We”, “Lot We Care”, etc.). By far the most common trigram was “Separation Anxiety Hey” from “Separation Anxiety”.

I also looked at words where “Not” was the first of the 3 words in the trigram, to see if they were of positive or negative sentiment. First the second word of the 3, then the final word of the 3.

    This can be expanded to look at other types of negating words:
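
The trigram code is not shown in the post either; a sketch with tidytext (again assuming the data frame of Title/Album/lyrics) could look like:

library(dplyr)
library(tidyr)
library(tidytext)

# Split lyrics into trigrams and separate the three words
trigrams <- data %>%
  unnest_tokens(trigram, lyrics, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

# Most common trigrams
trigrams %>%
  count(word1, word2, word3, sort = TRUE)

# Words that follow negating words, joined to a sentiment lexicon
negation_words <- c("not", "no", "never", "without")
trigrams %>%
  filter(word1 %in% negation_words) %>%
  inner_join(get_sentiments("bing"), by = c("word2" = "word")) %>%
  count(word1, word2, sentiment, sort = TRUE)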

    I also tried producing a network plot of the trigrams. Bit of a mess to be honest, but here it is:

Finally – a correlation plot looking at how closely linked each album is to the others, in terms of the words used:

library(widyr)

album_word_cors <- album_words %>%
  pairwise_cor(Album, word, n, sort = TRUE)
album_word_cors

library(ggraph)
library(igraph)
set.seed(1234)

album_word_cors %>%
  filter(correlation > .05) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, width = correlation)) +
  geom_node_point(size = 6, color = "lightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

ggsave("2017-10-22-album-word-correlations.png", width = 8, height = 5)

This is more interesting, although the correlations are pretty weak overall. The Real Thing, Angel Dust and King For A Day have the strongest correlations with each other.
Over time the correlation falls away: there appears to be less correlation between the lyrics used in The Real Thing and those in Album of the Year.

There appears to be little (or no) correlation between Album of the Year and Sol Invictus – that 18-year gap really made a difference. But what is interesting is the thick line between Sol Invictus and Introduce Yourself – 28 years apart, and from 2 different vocalists.

However, the sting in the tail is that Mike Patton has said that he chose words more for their sound, and to fit the music, than for their actual meaning. That would explain the lyrics of “Midlife Crisis” and “A Small Victory”, but it seems to short-change the cleverness of the lyrics in tracks like “Everything’s Ruined” and “Kindergarten”. Maybe this means this has been a massive waste of time? Regardless, I have enjoyed looking back at some great music from this wonderfully unique band.

Code for the plots, and the plots themselves, is available here


    To leave a comment for the author, please follow the link and comment on their blog: HighlandR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    R Meetup with Hadley Wickham

    Sat, 10/28/2017 - 21:57

    (This article was first published on R – Data Science Los Angeles, and kindly contributed to R-bloggers)

    In our next meetup, we’ll have Hadley (I think no introduction needed).

    Packages made easy
    by Hadley Wickham

In this talk I’ll show you how easy it is to make an R package by live coding one in front of you! You’ll learn about three useful packages:
    * usethis: automating the setup of key package components
    * devtools: for automating many parts of the package development workflow
    * testthat: to write unit tests that make you confident your code is correct

    Timeline:

    – 6:30pm arrival, drinks and networking
    – 7:00pm talk

    You must have a confirmed RSVP and please arrive by 6:55pm the latest. Please RSVP here on Eventbrite.

    Venue: Edmunds, 2401 Colorado Avenue (this is Edmunds’ NEW LOCATION, don’t go to the old one)
    Park underneath the building (Colorado Center), Edmunds will validate.


    To leave a comment for the author, please follow the link and comment on their blog: R – Data Science Los Angeles. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Spring Budget 2017: Circle Visualisation

    Fri, 10/27/2017 - 23:49

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

It’s time to branch out into a new area of data visualisation: proportional area plots. These plots use area to show the proportions between different related values. A common type of proportional area plot is the tree map. We are going to use the same principle, but with circles.

    A common subject for area visualisation is budgets and money, which will be the subject of this post.

    We are going to take a look at the Spring Budget 2017 in the UK.

    Page four of the document has two pie charts detailing how public money will be spent this year and where it will come from.

    I couldn’t find this data in any tables, so we will have to create our data frames manually:

spending <- data.frame(c("Social protection","Health","Education","Other","Defence",
                         "Debt interest","Transport","Housing\n & environment",
                         "Public order","Social\n services","Industry"),
                       c(245,149,102,50,48,46,37,36,34,32,23),
                       c(802,802,802,802,802,802,802,802,802,802,802))
names(spending) <- c("Expenditure","Spending","Total")

income <- data.frame(c("Income tax","VAT","National Insurance","Other taxes","Borrowing",
                       "Other (non taxes)","Corporation Tax","Excise duties",
                       "Council tax","Business\n rates"),
                     c(175,143,130,80,58,54,52,48,32,30),
                     c(802,802,802,802,802,802,802,802,802,802))
names(income) <- c("Source","Income","Total")

    I am including ‘borrowing’ in the Government’s income section, although it will be the only circle that won’t be a tax.

    To draw our circles we will be using the packcircles package.

    library("packcircles") library("ggplot2") options(scipen = 999) library(extrafont) library(extrafontdb) loadfonts() fonts() t <- theme_bw() + theme(panel.grid = element_blank(), axis.text=element_blank(), axis.ticks=element_blank(), axis.title=element_blank())

    The first thing to do is to use the functions within the package to generate the radii and coordinates, as per this code here.

# circle areas
areas <- income$Income
areas2 <- spending$Spending

# income
packing1 <- circleProgressiveLayout(areas)
dat1 <- circleLayoutVertices(packing1)

# spending
packing2 <- circleProgressiveLayout(areas2)
dat2 <- circleLayoutVertices(packing2)

    Next up we are going to move to the visualisation. It would be more informative if we could display income and expenditure side-by-side.

    Like last time when we compared Donald Trump and Hillary Clinton on Facebook, we can use facetting for this purpose in ggplot2.

    I adapted this code for this purpose.

dat <- rbind(
  cbind(dat1, set = 1),
  cbind(dat2, set = 2)
)

p <- ggplot(data = dat, aes(x, y)) +
  geom_polygon(aes(group = id, fill = -set),
               colour = "black", show.legend = FALSE) +
  theme(text = element_text(family = "Browallia New", size = 50)) +
  coord_equal() +
  ggtitle("Income and expenditure for the UK Government") +
  scale_colour_manual(values = c("white", "White")) +
  facet_grid(~set, labeller = as_labeller(
    c('1' = "Income", '2' = "Expenditure")))

    With no labels:

    This is looking good, but obviously we have no way of knowing which circle corresponds to which area of income or expenditure. In ggplot2 you can use geom_text() to add annotations. In order for them to be positioned correctly, we need to build a relationship between the labels and the circles. We can do this by creating centroids from the coordinates we already have for the circles.

    If we have accurate centroid coordinates then our labels will always appear in the centre of our circles.

    Going back to our dat data frame:

> str(dat)
'data.frame': 546 obs. of  4 variables:
 $ x  : num  0 -0.277 -1.092 -2.393 -4.099 ...
 $ y  : num  0 2.2 4.25 6.05 7.46 ...
 $ id : int  1 1 1 1 1 1 1 1 1 1 ...
 $ set: num  1 1 1 1 1 1 1 1 1 1 ...

    Our ID column shows which circle it belongs to, and our set column shows whether it belongs in income or expenditure. From this we can deduce that each circle is plotted using 26 pairs of x/y coordinates (at row 27 the ID changes from one to two).

    The mean of each 26 pairs is the centre of the corresponding circle.

    Our dat data frame is 546 rows long. We need a sequence of numbers, 26 apart, going up to 546:

sequence <- data.frame(c(x = (seq(from = 1, to = 546, by = 26))),
                       y = seq(from = 26, to = 547, by = 26))
names(sequence)
head(sequence)
      x   y
x1    1  26
x2   27  52
x3   53  78
x4   79 104
x5  105 130
x6  131 156

    Next up we will create a data frame that will contain our centroids:

centroid_df <- data.frame(seq(from = 1, to = 521, by = 1))
names(centroid_df) <- c("number")

There are probably a few different ways to run the next section, some of them more elegant than what we’re about to do (a more compact dplyr alternative is sketched after the loop).

This for loop calculates the mean of every window of 26 consecutive coordinate pairs and stores the results in the new centroid_df.

for (i in 1:521) {
  j = i + 25
  centroid_df$x[i] <- mean(dat$x[i:j])
  centroid_df$y[i] <- mean(dat$y[i:j])
}
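
For reference, the same centroids can be had more compactly by averaging the polygon vertices by circle; a dplyr sketch (assuming the dat data frame built above):

library(dplyr)

# One centroid per circle: the mean of its 26 polygon vertices
centroids <- dat %>%
  group_by(set, id) %>%
  summarise(x = mean(x), y = mean(y))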

    Now we bring in our sequence data frame to filter out the ones we don’t need in a new coords data frame, leaving just the ones that correspond correctly to the circles:

coords <- merge(sequence, centroid_df, by.x = "x", by.y = "number")
names(coords) <- c("number1","number2","x","y")

    Next up for labelling purposes is to clarify which set each pair of coordinates belongs to:

coords$set[1:21] <- 1
coords$set[11:21] <- 2  # this overrides some of the values set in the line above

    Finally we will paste in the names and amounts from our original income and spending data frames into a new label column:

coords$label <- NA
coords$label[1:21] <- paste(income$Source, "\n£", income$Income, "bn")
coords$label[11:21] <- paste(spending$Expenditure, "\n£", spending$Spending, "bn")
coords$label <- gsub("£ ", "£", coords$label)
coords$label <- gsub(" bn", "bn", coords$label)

    Now let’s add in the labels:

p <- p +
  geom_text(data = coords,
            aes(x = x, y = y, label = label, colour = factor(set)),
            show.legend = FALSE)
p

    Gives this plot:

    There we have it, some nicely formatted labels showing what each circle refers to, and how much money it represents.

    Brief analysis

    On the left hand side we see that income tax is the biggest cash cow for the Government. Incidentally, about 28 per cent of that is paid for by the top 1% of earners:

    The top 5% are liable for almost half the UK's total income tax.

    Not enough for Jeremy Corbyn, so what would he like it to be? pic.twitter.com/poeC2N6Sn9

    — Rob Grant (@robgrantuk) June 21, 2017

    Social protection means benefits, essentially. A huge chunk of that is the State Pension, followed by various working-age benefits and tax credits.

    Notice that as a country we spend almost as much paying off the interest on our debts as we do on our own defence.

    Health and education – those two ‘classic’ public services – cost a combined £250bn.

    Conclusion

    I am a fan of using these circles to visualise this kind of data, especially money.

    If you have lots of networked data (for instance, if you had breakdowns of how exactly the social protection budget will be spent) then ggraph would be a good option for you to produce plots like this:

    Comment below if you have questions!

      Related Post

      1. Qualitative Research in R
      2. Multi-Dimensional Reduction and Visualisation with t-SNE
      3. Comparing Trump and Clinton’s Facebook pages during the US presidential election, 2016
      4. Analyzing Obesity across USA
      5. Can we predict flu deaths with Machine Learning and R?

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Stan Roundup, 27 October 2017

      Fri, 10/27/2017 - 21:00

      (This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers)

      I missed two weeks and haven’t had time to create a dedicated blog for Stan yet, so we’re still here. This is only the update for this week. From now on, I’m going to try to concentrate on things that are done, not just in progress so you can get a better feel for the pace of things getting done.

      Not one, but two new devs!

      This is my favorite news to post, hence the exclamation.

• Matthijs Vákár from University of Oxford joined the dev team. Matthijs’s first major commit is a set of GLM functions for negative binomial with log link (2–6 times speedup), normal linear regression with identity link (4–5 times), Poisson with log link (factor of 7) and bernoulli with logit link (9 times). Wow! And he didn’t just implement the straight-line case—this is a fully vectorized implementation as a density, so we’ll be able to use them this way:

  int y[N];          // observations
  matrix[N, K] x;    // "data" matrix
  vector[K] beta;    // slope coefficients
  real alpha;        // intercept coefficient

  y ~ bernoulli_logit_glm(x, beta, alpha);

        These stand in for what is now written as

        y ~ bernoulli_logit(x * beta + alpha);

        and before that was written

        y ~ bernoulli(inv_logit(x * beta + alpha));

        Matthijs also successfully defended his Ph.D. thesis—welcome to the union, Dr. Vákár.

      • Andrew Johnson from Curtin University also joined. In his first bold move, he literally refactored the entire math test suite to bring it up to cpplint standard. He’s also been patching doc and other issues.

      Visitors

      • Kentaro Matsura, author of Bayesian Statistical Modeling Using Stan and R (in Japanese) visited and we talked about what he’s been working on and how we’ll handle the syntax for tuples in Stan.

      • Shuonan Chen visited the Stan meeting, then met with Michael (and me a little bit) to talk bioinformatics—specifically about single-cell PCR data and modeling covariance due to pathways. She had a well-annotated copy of Kentaro’s book!

      Other news

      • Bill Gillespie presented a Stan and Torsten tutorial at ACoP.

      • Charles Margossian had a poster at ACoP on mixed solving (analytic solutions with forcing functions); his StanCon submission on steady state solutions with the algebraic solver was accepted.

      • Krzysztof Sakrejda nailed down the last bit of the standalone function compilation, so we should be rid of regexp based C++ generation in RStan 2.17 (coming soon).

      • Ben Goodrich has been cleaning up a bunch of edge cases in the math lib (hard things like the Bessel functions) and also added a chol2inv() function that inverts the matrix corresponding to a Cholesky factor (naming from LAPACK under review—suggestions welcome).

      • Bob Carpenter and Mitzi Morris taught a one-day Stan class in Halifax at Dalhousie University. Lots of fun seeing Stan users show up! Mike Lawrence, of Stan tutorial YouTube fame, helped people with debugging and installs—nice to finally meet him in person.

• Ben Bales got the metric initialization into CmdStan, so we’ll finally be able to restart (the metric used to be called the mass matrix—it’s just the inverse of a regularized estimate of the global posterior covariance during warmup).

      • Michael Betancourt just returned from SACNAS (diversity in STEM conference attended by thousands).

• Michael also revised his history of MCMC paper, which has been conditionally accepted for publication. Read it on arXiv first.

      • Aki Vehtari was awarded a two-year postdoc for a joint project working on Stan algorithms and models jointly supervised with Andrew Gelman; it’ll also be joint between Helsinki and New York. Sounds like fun!

      • Breck Baldwin and Sean Talts headed out to Austin for the NumFOCUS summit, where they spent two intensive days talking largely about project governance and sustainability.

      • Imad Ali is leaving Columbia to work for the NBA league office (he’ll still be in NYC) as a statistical analyst! That’s one way to get access to the data!

      The post Stan Roundup, 27 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.


      Projects chapter added to “Empirical software engineering using R”

      Fri, 10/27/2017 - 20:06

      (This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

      The Projects chapter of my Empirical software engineering book has been added to the draft pdf (download here).

      This material turned out to be harder to bring together than I had expected.

      Building software projects is a bit like making sausages in that you don’t want to know the details, or in this case those involved are not overly keen to reveal the data.

      There are lots of papers on requirements, but remarkably little data (Soo Ling Lim’s work being the main exception).

      There are lots of papers on effort prediction, but they tend to rehash the same data and the quality of research is poor (i.e., tweaking equations to get a better fit; no explanation of why the tweaks might have any connection to reality). I had not realised that Norden did all the heavy lifting on what is sometimes called the Putnam model; Putnam was essentially an evangelist. The Parr curve is a better model (sorry, no pdf), but lacked an evangelist.

      Accurate estimates are unrealistic: lots of variation between different people and development groups, the client keeps changing the requirements and developer turnover is high.

      I did turn up a few interesting data-sets and Rome came to the rescue in places.

      I have been promised more data and am optimistic some will arrive.

      As always, if you know of any interesting software engineering data, please tell me.

      I’m looking to rerun the workshop on analyzing software engineering data. If anybody has a venue in central London, that holds 30 or so people+projector, and is willing to make it available at no charge for a series of free workshops over several Saturdays, please get in touch.

      Reliability chapter next.


      Microsoft R Open 3.4.2 now available

      Fri, 10/27/2017 - 16:50

      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Open (MRO), Microsoft's enhanced distribution of open source R, has been upgraded to version 3.4.2 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to the latest R 3.4.2 and updates the bundled packages.

      MRO is 100% compatible with all R packages. MRO 3.4.2 points to a fixed CRAN snapshot taken on October 15 2017, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).
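For readers new to checkpoint, here is a minimal sketch of that workflow (an illustration added here, not code from the announcement; the snapshot date is simply the one this MRO release points to):

# Minimal sketch: pin package installation/loading to a fixed CRAN snapshot.
library(checkpoint)
checkpoint("2017-10-15")   # the snapshot date MRO 3.4.2 points to
library(ggplot2)           # packages now resolve against that snapshot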

MRO 3.4.2 is based on R 3.4.2, a minor update to the R engine (you can see the detailed list of updates to R here). This update is backwards-compatible with R 3.4.1 (and MRO 3.4.1), so you shouldn't encounter any new issues when upgrading.

      We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

      MRAN: Download Microsoft R Open


      Substitute levels in a factor or character vector

      Fri, 10/27/2017 - 11:05

I’ve been using the ggplot2 package a lot recently. When creating a legend or tick marks on the axes, ggplot2 uses the levels of a character or factor vector. Most of the time, I am working with coded variables that use some abbreviation of the "true" meaning (e.g. "f" for female and "m" for male, or a single character for a location: "S" for Stuttgart and "M" for Mannheim).

In my plots, I don't want these codes but the full name of the level. Since I am not aware of any super-fast and easy-to-use function in base R (let me know in the comments if there is one), I came up with a very simple function and put it in my .Rprofile (which means that it is available whenever I start R). I called it ".replace.levels". The dot before the name means that it is invisible and does not show up in my Global Environment overview in RStudio. You have to call it with .replace.levels(), of course.

      This is how it works:
      .replace.levels takes two arguments: vec and replace.list

      vec is the vector where substitutions shall be made.
      replace.list is a list with named elements. The names of the elements are substituted for the contents of the elements.

      The function also checks if all elements (both old and new ones) appear in vec. If not, it throws an error. It is a very simple function, but it saves me a lot of typing.

      Example:

a <- c("B", "C", "F")
.replace.levels(a, list("B" = "Braunschweig", "C" = "Chemnitz", "F" = "Frankfurt"))
[1] "Braunschweig" "Chemnitz"     "Frankfurt"

      Here it is:

      .replace.levels <- function (vec, replace.list) {
        # Checking if all levels to be replaced are in vec (and other way around)
        cur.levels <- unique(as.character(vec))
        not.in.cur.levels <- setdiff(cur.levels, names(replace.list))
        not.in.new.levels <- setdiff(names(replace.list), cur.levels)
        if (length(not.in.cur.levels) != 0 | length(not.in.new.levels) != 0) {
    stop("The following elements do not match: ", paste0(not.in.cur.levels, not.in.new.levels, collapse = ", "))
        }
        for (el in 1:length(replace.list)) {
          vec <- gsub(names(replace.list[el]), replace.list[[el]], vec)
        }
        vec
      }
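
As a side note (added here, not part of the original post): for factors specifically, base R's levels<- replacement function accepts a named list mapping new labels to old levels, and for plain character vectors a named lookup vector achieves a similar substitution. A small sketch:

# Side note (not from the original post): two base-R alternatives.
# For factors, levels<- accepts a named list (new label = old level):
f <- factor(c("B", "C", "F"))
levels(f) <- list(Braunschweig = "B", Chemnitz = "C", Frankfurt = "F")
f
# [1] Braunschweig Chemnitz     Frankfurt
# Levels: Braunschweig Chemnitz Frankfurt

# For character vectors, a named lookup vector does the same job:
a <- c("B", "C", "F")
unname(c(B = "Braunschweig", C = "Chemnitz", F = "Frankfurt")[a])
# [1] "Braunschweig" "Chemnitz"     "Frankfurt"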


      Gold-Mining – Week 8 (2017)

      Thu, 10/26/2017 - 22:17

      (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

      Week 8 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

      The post Gold-Mining – Week 8 (2017) appeared first on Fantasy Football Analytics.


      Estimating mean variance and mean absolute bias of a regression tree by bootstrapping using foreach and rpart packages

      Thu, 10/26/2017 - 20:40

      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      by Błażej Moska, computer science student and data science intern 

One of the most important things in predictive modelling is how our algorithm will cope with various datasets, both training and testing (previously unseen). This is strictly connected with the concept of the bias-variance tradeoff.

Roughly speaking, the variance of an estimator describes how its value ranges from dataset to dataset. It is defined as follows:

      \[ \textrm{Var}[ \widehat{f} (x) ]=E[(\widehat{f} (x)-E[\widehat{f} (x)])^{2} ] \]

\[ \textrm{Var}[\widehat{f}(x)]=E[\widehat{f}(x)^{2}]-E[\widehat{f}(x)]^{2} \]

      Bias is defined as follows:

      \[ \textrm{Bias}[ \widehat{f} (x)]=E[\widehat{f}(x)-f(x)]=E[\widehat{f}(x)]-f(x) \]

One could think of bias as a measure of how well the estimator is able to approximate the true function. Typically, reducing bias results in increased variance and vice versa.

\(E[X]\) is an expected value; it can be estimated using a mean, since the mean is an unbiased estimator of the expected value.

We can estimate variance and bias by bootstrapping the original training dataset, that is, by sampling with replacement the indexes of the original dataframe, then drawing the rows which correspond to these indexes to obtain new dataframes. This operation is repeated nsampl times, where nsampl is the parameter describing the number of bootstrap samples.

Variance and bias are estimated for one value, that is to say, for one observation/row of the original dataset (we calculate variance and bias over the predictions made on the bootstrap samples). We then obtain a vector of variances/biases which is of the same length as the number of observations in the original dataset. For the purpose of this article, a mean value was calculated for each of these two vectors. We will treat these two means as our estimates of mean bias and mean variance. If we don't want to measure the direction of the bias, we can take absolute values of the bias.

Because bias and variance can be controlled by parameters passed to the rpart function, we can also survey how these parameters affect tree variance. The most commonly used parameters are cp (complexity parameter), which describes how much each split must decrease the overall variance of the decision variable in order to be attempted, and minsplit, which defines the minimum number of observations needed to attempt a split.
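
As a point of reference, an illustrative sketch (not code from the original post) of how these two parameters are passed to rpart through rpart.control:

library(rpart)

# cp: a split must improve the overall fit by at least this factor to be attempted
# minsplit: minimum number of observations in a node before a split is attempted
fit <- rpart(Sepal.Length ~ ., data = iris,
             control = rpart.control(cp = 0.01, minsplit = 20))
printcp(fit)   # complexity table of the fitted tree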

The operations mentioned above are rather expensive in computational terms: we need to create nsampl bootstrap samples, grow nsampl trees, calculate nsampl predictions, nrow variances and nrow biases, and repeat those operations for each parameter value (the length of the cp or minsplit vector). For that reason the foreach package was used, to take advantage of parallelism. The above procedure still can't be considered fast, but it was much faster than without the foreach package.

      So, summing up, the procedure looks as follows:

      1. Create bootstrap samples (by bootstrapping original dataset)
      2. Train model on each of these bootstrap datasets
      3. Calculate mean of predictions of these trees (for each observation) and compare these predictions with values of the original datasets (in other words, calculate bias for each row)
      4. Calculate variance of predictions for each row (estimate variance of an estimator-regression tree)
      5. Calculate mean bias/absolute bias and mean variance 
      R Code
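
The author's code is not reproduced in this excerpt; the sketch below is only an illustration of the five steps above under stated assumptions (a data frame df with a numeric response column y; the function name, nsampl default and example data are mine, not the author's):

library(foreach)
library(rpart)

# Illustrative sketch: bootstrap the data, grow a tree per sample, predict on
# the original rows, then average per-row bias and variance.
estimate_bias_variance <- function(df, nsampl = 100, cp = 0.01, minsplit = 20) {
  n <- nrow(df)
  # One row of predictions per bootstrap sample, one column per original observation
  preds <- foreach(i = seq_len(nsampl), .combine = rbind, .packages = "rpart") %dopar% {
    idx  <- sample(n, replace = TRUE)                      # 1. bootstrap indexes
    tree <- rpart(y ~ ., data = df[idx, ],                 # 2. grow a tree on the sample
                  control = rpart.control(cp = cp, minsplit = minsplit))
    predict(tree, newdata = df)                            #    predict on the original rows
  }
  mean_pred <- colMeans(preds)                             # 3. mean prediction per row
  bias      <- mean_pred - df$y                            #    per-row bias
  variance  <- apply(preds, 2, var)                        # 4. per-row variance
  c(mean_abs_bias = mean(abs(bias)),                       # 5. aggregate over rows
    mean_variance = mean(variance))
}

# Usage sketch (register a parallel backend first, e.g. with doParallel):
# cl <- parallel::makeCluster(2); doParallel::registerDoParallel(cl)
# dat <- data.frame(y = mtcars$mpg, wt = mtcars$wt, hp = mtcars$hp)
# estimate_bias_variance(dat, nsampl = 200)
# parallel::stopCluster(cl)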


      Analysing Cryptocurrency Market in R

      Thu, 10/26/2017 - 15:01

      (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

The cryptocurrency market has been growing rapidly and, being an analyst, I was intrigued by what it comprises. In this post, I'll explain how we can analyse the cryptocurrency market in R with the help of the coinmarketcapr package, which is an R wrapper around the coinmarketcap API.

To get started, let us load the library into our R session and plot the top 5 cryptocurrencies.

library(coinmarketcapr)
plot_top_5_currencies()

      Gives this plot:

The above plot clearly shows how Bitcoin is leading the market, but it does not give us a picture of how the market share is split among the various cryptocurrencies, so let us get the complete data for them.

market_today <- get_marketcap_ticker_all()
head(market_today[,1:8])

            id         name symbol rank price_usd  price_btc X24h_volume_usd market_cap_usd
1      bitcoin      Bitcoin    BTC    1   5568.99        1.0    2040540000.0  92700221345.0
2     ethereum     Ethereum    ETH    2   297.408  0.0537022     372802000.0  28347433482.0
3       ripple       Ripple    XRP    3  0.204698 0.00003696     100183000.0   7887328954.0
4 bitcoin-cash Bitcoin Cash    BCH    4   329.862  0.0595624     156369000.0   5512868154.0
5     litecoin     Litecoin    LTC    5    55.431   0.010009     124636000.0   2967255097.0
6         dash         Dash   DASH    6   287.488  0.0519109      46342600.0   2197137527.0

Having extracted the complete data for the various cryptocurrencies, let us try to visualize the market-share split with a treemap. For plotting, let us extract only the two columns id and market_cap_usd, convert market_cap_usd to numeric type, and apply a little number formatting for the treemap labels.

library(treemap)
df1 <- na.omit(market_today[, c('id', 'market_cap_usd')])
df1$market_cap_usd <- as.numeric(df1$market_cap_usd)
df1$formatted_market_cap <- paste0(df1$id, '\n', '$',
                                   format(df1$market_cap_usd, big.mark = ',', scientific = F, trim = T))
treemap(df1,
        index = 'formatted_market_cap',
        vSize = 'market_cap_usd',
        title = 'Cryptocurrency Market Cap',
        fontsize.labels = c(12, 8),
        palette = 'RdYlGn')

      Gives this plot:

The above visualization shows that the whole cryptocurrency market is propped up primarily by two currencies, Bitcoin and Ethereum, and even the second-ranked Ethereum is far behind Bitcoin, which is the driving factor of this market. But it is also fascinating (and shocking at the same time) that Bitcoin and Ethereum together make up a 100 billion dollar (USD) market. Whether this is a sign of a bubble or not, we'll leave for market analysts to speculate; but as data scientists or analysts, we have a lot of insights to extract from the above data, and it should be interesting to analyse such an expensive market.

        Related Post

        1. Time Series Analysis in R Part 3: Getting Data from Quandl
        2. Pulling Data Out of Census Spreadsheets Using R
        3. Extracting Tables from PDFs in R using the Tabulizer Package
        4. Extract Twitter Data Automatically using Scheduler R package
        5. An Introduction to Time Series with JSON Data

        Two upcoming webinars

        Wed, 10/25/2017 - 22:20

        (This article was first published on Revolutions, and kindly contributed to R-bloggers)

        Two new Microsoft webinars are taking place over the next week that may be of interest:

        AI Development in Azure using Data Science Virtual Machines

        The Azure Data Science Virtual Machine (DSVM) provides a comprehensive development and production environment to Data Scientists and AI-savvy developers. DSVMs are specialized virtual machine images that have been curated, configured, tested and heavily used by Microsoft engineers and data scientists. DSVM is an integral part of the Microsoft AI Platform and is available for customers to use through the Microsoft Azure cloud. In this session, we will first introduce DSVM, familiarize attendees with the product, including our newest offering, namely Deep Learning Virtual Machines (DLVMs). That will be followed by technical deep-dives into samples of end-to-end AI development and deployment scenarios that involve deep learning. We will also cover scenarios involving cloud based scale-out and parallelization.

        This webinar runs from 10-11AM Pacific Time on Thursday October 26, and is presented by Gopi Kumar, Principal Program Manager, Paul Shealy, Senior Software Engineer, and Barnam Bora, Program Manager, at Microsoft. Register here.

        Document Collection Analysis

        With the extremely large volumes of data, especially unstructured text data, that are being collected every day, a huge challenge facing customers is the need for tools and techniques to organize, search, and understand this vast quantity of text. This webinar demonstrates an efficient and automated end-to-end workflow around analyzing large document collections and serving your downstream NLP tasks. We’ll demonstrate how to summarize and analyze a large collection of documents, including techniques such as phrase learning, topic modeling, and topic model analysis using the Azure Machine Learning Workbench.

        This webinar runs from 10-11AM Pacific Time on Tuesday October 31, and will be presented by Ke Huang, Data Scientist at Microsoft. Register here.


        Robust IRIS Dataset?

        Wed, 10/25/2017 - 20:10

        (This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

This blog post was born out of pure curiosity about the robustness of the IRIS dataset. Biological datasets do not need to be that big, in comparison to datasets of customers, consumption, stocks and anything else that might be volatile.

When still at university, on one occasion I can remember, we were measuring the length of frog legs and other froggy features. And after just a couple of measurements, further predictions were steady. Also, any kind of sampling (RS and SRS, cluster/stratified sampling, sampling with replacement and many other creative ways of sampling) proved to be rigid and robust, and would converge quickly to a good result.

Therefore, I decided to put the IRIS dataset to the test using a simple classification method: first calculating the simple Euclidean distance, then finding the nearest neighbours and, based on that, checking the membership of the flower type against the labels.

Accuracy of the prediction was tested by mapping the original species to the predicted ones. And the test was: how small can the training dataset be and still give a good result?
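
As an aside, here is a hedged R sketch of the same idea (not the author's implementation, which used Python for the distance calculations): nearest-neighbour classification on a random train/test split, with accuracy measured against the true species.

library(class)   # knn(): Euclidean-distance k-nearest-neighbour classification

set.seed(42)
train_frac <- 0.10                                   # e.g. the 10%/90% split listed below
idx <- sample(nrow(iris), size = round(train_frac * nrow(iris)))

pred <- knn(train = iris[idx, 1:4],                  # training features
            test  = iris[-idx, 1:4],                 # test features
            cl    = iris$Species[idx],               # training labels
            k     = 3)

mean(pred == iris$Species[-idx])                     # proportion of correct predictions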

        After some Python and R code, the results were in.

I have tested the following pairs (train : test sample size):

        • 80% – 20%
        • 60% – 40%
        • 50% – 50%
        • 30% – 70%
        • 10% – 90%

Note that the IRIS dataset has 150 observations, evenly distributed among the three species. The following Python code loops through the calculation of the Euclidean distance:

for x in range(3000):
    exec(open("./classification.py").read(), globals())
    x += 1

At the end, I generated a file with the results.

With these results, a simple piece of R code was used to generate the scatter plot:

library(ggplot2)
setwd("C:\\Predictions\\")
df_pred <- data.frame(read.table("results_split.txt", sep = ";"))
p <- ggplot(df_pred, aes(df_pred$V3, df_pred$V1))
p <- p + geom_point(aes(df_pred$V3))
p <- p + labs(x = "Accuracy (%) of predictions", y = "Size of training subset")
p

Which resulted in the following graph:

The graph clearly shows that a 10% training set (15 out of 150 observations) can generate very accurate predictions every 1,35x times.

Other pairs, when taking 30% or 50% of the data as the training set, will almost always give close to 100% accuracy.

A snippet of the Python code used to compute the Euclidean distance:

import math

def eucl_dist(set1, set2, length):
    # Sum the squared differences over the first 'length' features, then take the square root
    distance = 0
    for x in range(length):
        distance += pow(set1[x] - set2[x], 2)
    return math.sqrt(distance)

        and neighbours:

import operator

def find_neighbors(train, test, nof_class):
    # Distance from the test point to every training point
    distances = []
    length_dist = len(test) - 1
    for x in range(len(train)):
        dist = eucl_dist(test, train[x], length_dist)
        distances.append((train[x], dist))
    # Sort by distance and keep the nof_class closest training points
    distances.sort(key=operator.itemgetter(1))
    neighbour = []
    for x in range(nof_class):
        neighbour.append(distances[x][0])
    return neighbour

         

In conclusion, the IRIS dataset is, due to the nature of the measurements and observations, robust and rigid; one can get very good accuracy results on a small training set. Everything beyond 30% of the data for training the model is, for this particular case, just additional overhead.

        Happy R & Python coding!

