R-bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 9 hours 47 min ago

Press Release: Business Science Partners With Method Data Science To Accelerate Your Data Science Career

Wed, 07/18/2018 - 13:11

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

The goal is simple: to educate and empower future data scientists so they can help organizations gain data-driven results. This is why it was a no-brainer when the opportunity came up for Business Science to partner with Method Data Science, the go-to data science accelerator for aspiring data scientists. Now Method Data Scientists will get exclusive lectures from Business Science Instructors and have discounted access to Business Science University, the revolutionary online education platform for learning data science for business, along with instructor trainings as part of the Method Data Science accelerator program. This is big news for current and future data scientists seeking to gain real-world experience while learning how to deliver results to organizations!

“The goal is simple: to educate and empower future data scientists so they can help organizations gain data-driven results.”

Press Release: Business Science Partners With Method Data Science

Learn about our new partnership that delivers value to aspiring data scientists!

A Partnership That Delivers Real-World Experience And Education

Founded by data science leader and mentor, Beau Walker, Method Data Science helps data scientists (and future data scientists) gain real-world experience working with companies. It’s like an accelerated internship where Method Data Scientists work on teams of 3 to 4 helping companies with their data science needs.



Follow Method Data Science on LinkedIn

For data scientists, gaining real-world experience is often the biggest hurdle in advancing their career. This is the problem Method solves. Method’s students gain tangible experience they can add to their resumes and discuss with potential employers in interviews. This is invaluable.

“For data scientists, gaining real-world experience is often the biggest hurdle in advancing their career. This is the problem Method solves.”

Working on data science projects and teams is half the battle, the other half is gaining the knowledge of the data science tools and systems needed to generate Return-On-Investment (ROI) for the business. This is where Business Science provides the 1 + 1 = 10 effect.



Follow Business Science on LinkedIn

Business Science provides the premium online education service Business Science University, which teaches business-specific courses in R and, coming soon, Python! We just landed Favio Vazquez, Principal Data Scientist at OXXO, as an instructor for the Python version of DS4B 201!



Favio Vazquez, Principal Data Scientist at OXXO, joins Business Science University as a Python and R instructor!

Learning While Gaining Experience

Business Science University is the perfect complement to the Method Data Science accelerator. The premium Data Science For Business (DS4B) Virtual Workshop teaches the fundamentals of Data Science For Business through an end-to-end data science project. The flagship course, Data Science For Business (DS4B 201), can be completed while the student is in the Method Data Science accelerator program. Taken in parallel, the DS4B 201 course multiplies the student’s impact by combining top-tier data science education with real-world experience.

The end result is a strong foundation of data science tools, systems, and experience that help aspiring data scientists bring data-driven results to organizations.

“The end result is a strong foundation of data science tools, systems, and experience that help aspiring data scientists bring data-driven results to organizations.”

What does the partnership look like? More Value!

When you join Method Data Science, you’re now powered by Business Science:

  • Exclusive lectures on data science tools, systems, frameworks, and more from Business Science instructors!

  • Access to all Business Science University programs, workshops, and courses at a heavily discounted rate!

  • Access to Business Science Community Slack Channels for each course!

We look forward to teaching you as part of the next cohort of Method Data Scientists. Join Method Data Science today!

Watch Beau’s Method Data Science YouTube Video

Learn more about what Beau does at Method Data Science!

Watch The University YouTube Video

Learn more about what Matt does at Business Science University!

Business Science University

If you are looking to take the next step and learn Data Science For Business (DS4B), Business Science University is for you! Our goal is to empower data scientists through teaching the tools and techniques we implement every day. You’ll learn:

  • TO SOLVE A REAL WORLD CHURN PROBLEM: Employee Turnover!
  • Data Science Framework: Business Science Problem Framework
  • Tidy Eval
  • H2O Automated Machine Learning
  • LIME Feature Explanations
  • Sensitivity Analysis
  • Tying data science to financial improvement (ROI-Driven Data Science)
DS4B Virtual Workshop: Predicting Employee Attrition

Did you know that an organization that loses 200 high-performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that helps managers make better decisions and executives see results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.


Get 15% OFF!

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in DS4B 301 (Building A Shiny Web App)

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.


Get 15% OFF!

Get The Best Resources In Data Science. Every Friday!

Sign up for our free “5 Topic Friday” Newsletter. Every week, I’ll send you the five coolest topics in data science for business that I’ve found that week. These could be new R packages, free books, or just some fun to end the week on.

Sign Up For Five-Topic-Friday!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:


Analysis of predictions for the 2018 World Cup

Wed, 07/18/2018 - 08:00

(This article was first published on R – NC233, and kindly contributed to R-bloggers)

We are the champions! Although we did not have time to build a prediction model for the 2018 FIFA World Cup (FiveThirtyEight made a very nice one, see here), that did not stop us from running a prediction contest among current and former statistician colleagues on the Scorecast website. The results were as follows:

Player   Score
Nic      102
Cle      100
Ron      100
Lud       96
Tho       90
Lio       88
Lis       87
Pap       86
Mau       84
Yan       78
Ant       78
Lau       75
Thi       71
Arn       56
Oli       28
Mar        7

A different points system?

The points system used by Scorecast works as follows: picking the right winner earns a small number of points; predicting both the right winner and the right goal difference earns a few more; and predicting the exact score earns the maximum number of points. That maximum increases as the competition progresses: the final is worth more points than a group-stage match. The system takes no account of pre-existing odds (as, for example, Mon petit prono does), nor of the fact that some matches are predicted correctly by everyone while others are called correctly by a single player, or by no one at all.

I therefore propose a slight change to how points are awarded: every match within the same round (group stage, quarter-finals, etc.) carries the same pot of points, which is split among the players who predicted it correctly, with a bonus for those who got the right goal difference or the exact score. The pot grows throughout the competition, so that the knockout stages carry more weight in the final ranking.
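A rough sketch of how this redistribution could be computed (not the original code; it assumes a hypothetical data frame bets with one row per player and match, logical columns good_winner, good_diff and good_score, and a pot column giving the points at stake for that match's round):

library(dplyr)

bets %>%
  # weight 1/2/3 for right winner / right goal difference / exact score
  mutate(weight = good_winner + good_diff + good_score) %>%
  group_by(match) %>%
  mutate(new_points = ifelse(sum(weight) > 0, pot * weight / sum(weight), 0)) %>%
  group_by(player) %>%
  summarise(modified_score = sum(new_points))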

Why do this? To reward players who made more original and potentially riskier bets, or at least who were the only ones to have the right intuition. Here are the results:

Player   Score   Modified score
Mau       84     185
Lud       96     163
Nic      102     144
Tho       90     136
Ant       78     135
Cle      100     126
Ron      100     123
Lis       87     120
Lio       88     115
Pap       86     108
Yan       78     105
Lau       75     100
Thi       71      90
Arn       56      78
Oli       28      43
Mar        7      10

The ranking changes noticeably under this new scoring method! Though of course some players might have bet differently had these rules been in place…

Choice of scores

One of the main difficulties of this kind of prediction is that it is not enough to know (or think you know) who will win the match; you also have to state the expected score. Let's see whether the predictions of all the bettors in our league were relevant compared to the actual scores. For each scoreline, we compute the percentage of matches that ended with that result on the one hand, and the percentage of bets placed on that scoreline on the other. We then look at the difference between the two percentages, illustrated by the heatmap below. Green cells correspond to scorelines that were predicted too rarely; red cells to scorelines that were predicted very often but occurred rarely or not at all.
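A sketch of how such a heatmap could be built (not the original code; it assumes a hypothetical data frame scores with one row per possible scoreline and columns home, away, pct_actual and pct_predicted):

library(ggplot2)
library(dplyr)

scores %>%
  mutate(diff = pct_actual - pct_predicted) %>%
  ggplot(aes(x = home, y = away, fill = diff)) +
  geom_tile() +
  scale_fill_gradient2(low = "red", mid = "white", high = "green",
                       midpoint = 0, name = "actual - predicted (%)") +
  labs(x = "Goals (team 1)", y = "Goals (team 2)",
       title = "Under-predicted (green) and over-predicted (red) scorelines")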

We clearly overestimated the number of 2-1, 3-0 and 4-0 results (the latter never occurred at this World Cup); these are in fact the only "big" scorelines that were overestimated in the predictions, while all the others were undervalued. This suggests the bets were made conservatively, avoiding outlandish scorelines such as a 7-0 win for Saudi Arabia against Russia!

Data analysis and clustering

Finally, one last possible use of this dataset is to analyse it in order to extract classes of bettors with broadly similar profiles (or at least similar successes), and to see what separates them. Several methods are available for this.

Let's start with a classic: agglomerative hierarchical clustering (Classification Ascendante Hiérarchique, or CAH, in French), a method that starts from single-person groups and, at each step, merges the two groups whose fusion increases the within-group inertia the least. Put less technically, it merges the two most similar groups, step by step, until the whole population is reached. This type of method is often represented by a dendrogram, which looks a bit like a phylogenetic tree in evolutionary biology and illustrates how the classes are built, from the bottom up.
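The clustering itself can be sketched in a few lines of base R (again assuming a hypothetical matrix points_mat with one row per player and one column per match); Ward's criterion mirrors the inertia-based merging described above:

d <- dist(points_mat)                  # distances between players
tree <- hclust(d, method = "ward.D2")  # agglomerative clustering, Ward criterion
plot(tree, main = "Dendrogram of the players", xlab = "", sub = "")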

We see many coherent pairs, indicating bettors with comparable profiles (for example, Mar and Oli, two players who missed a good part of the competition, either by stopping their bets or by joining partway through), and a split between the four players on the left and the rest (themselves largely split between the three leftmost players and the others).

Another possibility is to use Principal Component Analysis (PCA), which we have already used in a football context here and here (en). The idea is to summarise a matrix containing a lot of information (for each player, the points obtained from their bets on every match) in a minimal number of dimensions, called axes, that are enough to give a good idea of how the dataset is organised.
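A sketch of such a PCA with FactoMineR (not necessarily the code used here), with the total number of points added as a supplementary variable as in the figures below:

library(FactoMineR)

X <- cbind(points_mat, total = rowSums(points_mat))   # hypothetical points matrix
res <- PCA(X, quanti.sup = ncol(X), graph = FALSE)    # total points as supplementary variable
plot(res, axes = c(1, 2), choix = "ind")              # players on dimensions 1-2
plot(res, axes = c(2, 3), choix = "ind")              # players on dimensions 2-3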

Applying this method here, this is what we get on the first few axes:




Axis 1 often falls victim to what is called the "size effect": individuals with large values on some variables tend to have large values on the other variables too, and symmetrically for individuals with small values. Indeed, the supplementary variable, the total number of points obtained (with the Scorecast method), shown in blue, lies close to axis 1. This means that individuals on the right of the axis tend to have a high score, while those on the left did not do very well with their predictions.

We also see that the projections onto the planes formed by dimensions 1-2 and 2-3 tend to place close together the individuals that the clustering above paired up. This shows a certain consistency, which is always reassuring!

Looking in more detail, axes 2 and 3 seem to correspond to the following bets, which therefore discriminate between the players:

  • For axis 2, having predicted correctly the Peru-Denmark, Mexico-Sweden, Brazil-Switzerland, Spain-Russia and Argentina-Croatia matches
  • For axis 3, having predicted correctly the Japan-Senegal, Switzerland-Costa Rica, Denmark-France and Brazil-Mexico matches

It is hard to find an interpretation for these axes…


Just use a scatterplot. Also, Sydney sprawls.

Wed, 07/18/2018 - 07:34

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

Dual-axes at tipping-point

Sydney’s congestion at ‘tipping point’ blares the headline and to illustrate, an interactive chart with bars for city population densities, points for commute times and of course, dual-axes.

Yuck. OK, I guess it does show that Sydney is one of three cities that are low density, but have comparable average commute times to higher-density cities. But if you’re plotting commute time versus population density…doesn’t a different kind of chart come to mind first? y versus x. C’mon.

Let’s explore.

First: do we even believe the numbers? Comments on the article point out that the population density for Phoenix was corrected after publication, and question the precise meaning of city area. Hovering over the graphic to obtain the values, then visiting Wikipedia’s city pages, we can create a Google spreadsheet which I hope is publicly-visible at this link. Here’s a graphical summary.

It looks like the figures for population density used in the article (smh_density) correspond roughly to metro population divided by metro area (metro_density). We’ll come back to whether population density is a good predictor or comparator later.

Let’s think about how this data was visualised in the article. We have cities, each of which has two measurements: population density and average commute time. Bars for density, labelled with city name. Points for commute time, aligned with the bars. A y-axis for each measurement. Result: ugly.

Presumably the authors wanted to answer the question “how is commute time related to population density?” So why not just plot one versus the other?

  • Plot population density (x) versus average commute time as a scatterplot
  • Label each point with the city name
  • Optionally: draw attention to the values using size and/or colour
  • Optionally: add a summary statistic (mean, median)

Let’s do that and in addition, showcase the wonderful googlesheets R package, to import the Google sheet straight into R.

library(googlesheets)
library(tidyverse)

city_commutes <- gs_url("https://docs.google.com/spreadsheets/d/1JrOYsHx4fJAEVA8XvO5ey-Ud-tumX9RA-xiAvLyPFWM/edit?usp=sharing") %>%
  gs_read()

city_commutes %>%
  ggplot(aes(smh_density, smh_commute)) +
  geom_point(aes(size = smh_commute, color = smh_density)) +
  guides(size = FALSE, color = FALSE) +
  geom_text(aes(label = city), vjust = -0.9, hjust = 0.74) +
  labs(x = "Density (people per sq. km)",
       y = "Average commute time (min)",
       title = "Average one-way commute time versus population density",
       subtitle = "Dashed lines indicate median values for the dataset") +
  geom_hline(aes(yintercept = median(smh_commute)), linetype = "dashed", color = "darkred") +
  geom_vline(aes(xintercept = median(smh_density)), linetype = "dashed", color = "darkred") +
  theme_bw() +
  scale_color_viridis_c()

Result. I’d argue that Sydney’s position as “low-density, high commute time” is apparent. You might also make statements such as “Sydney and San Francisco have comparable commute times to Toronto, despite having much lower population densities.”

Phoenix is just a weird outlier; I don’t know why it was even included in the original article.

But wait. Back to the table of values. Note that Wikipedia provides population density figures for all cities (wp_density). For most of the cities, this value is calculated as city_pop / city_area. Except for Sydney, where it is calculated as metro_pop / metro_area. If we plot commute time versus wp_density, things look rather different.

And if we drop Phoenix from the dataset, Sydney stands all alone – but note that the range of average commute time is only about 5 minutes.

What I think the data are really telling us is that Sydney thinks its city area is “all of Sydney”, whereas European and North American cities define city area differently. It’s not surprising then, that commute time compares unfavourably, since the population density value for Sydney is artificially low and not comparable with the other cities. We can make this apparent by plotting against metro area instead.
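A minimal variant of the earlier code produces these alternative views; the density column names (wp_density, metro_density) are assumptions about the sheet, so swap in whichever density you want on the x-axis:

city_commutes %>%
  ggplot(aes(metro_density, smh_commute)) +
  geom_point() +
  geom_text(aes(label = city), vjust = -0.9) +
  labs(x = "Metro population density (people per sq. km)",
       y = "Average commute time (min)") +
  theme_bw()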

This makes more intuitive sense. If you’re further away your commute takes longer. Unless you’re in Los Angeles, which is all big roads.

In conclusion:

  • Scatterplots – good
  • News article’s interpretation of factors affecting commute time – poor

Monte Carlo Shiny: Part Three

Wed, 07/18/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In previous posts, we covered how to run a Monte Carlo simulation and how to visualize the results. Today, we will wrap that work into a Shiny app wherein a user can build a custom portfolio, and then choose a number of simulations to run and a number of months to simulate into the future.

A link to that final Shiny app is here and here is a snapshot:

We will use RMarkdown to build our Shiny application by inserting runtime: shiny into the yaml header. This will alert the server (or our laptop) that this is an interactive document. The yaml header also gives us a space for the title and to specify the format as flexdashboard. This is what the yaml header looks like for the app.

---
title: "Monte Carlo"
runtime: shiny
output:
  flexdashboard::flex_dashboard:
    orientation: rows
    source_code: embed
---

Note that when using RMarkdown and runtime: shiny we do not need to worry about separate UI and server logic, just inputs and outputs.

Our first code chunk is the setup, wherein we can load packages or data, just as we do with an R Notebook or static RMarkdown file.

# This is the setup chunk
library(tidyverse)
library(highcharter)
library(tidyquant)
library(timetk)

Our first substantive task is to build an input sidebar where users choose five stocks and weights, a starting date, the number of months to be simulated and the number of simulations to be run.

The sidebar looks like this

The code for building that sidebar starts with textInput("stock1",...)) to create a space where the user can type a stock symbol and then numericInput("w1",...) to create a space where the user can enter a numeric weight. We want those entry spaces to be on the same line so we will nest them inside of a call to fluidRow().

Since we have five stocks and weights, we repeat this five times. Notice that the stock symbol field uses textInput() because the user needs to enter text, and the weight field uses numericInput() because the user needs to enter a number.

fluidRow(
  column(6, textInput("stock1", "Stock 1", "SPY")),
  column(5, numericInput("w1", "Portf. %", 25, min = 1, max = 100))
)

# Repeat this fluidRow() four more times, changing names to
# stock2, stock3, stock4, stock5 and w2, w3, w4, w5

Let’s dissect one of those fluid rows line-by-line.

fluidRow() creates the row.

column(6, ...) creates a column for our stock ticker input with a width of 6 (out of Shiny's 12-unit grid).

textInput("stock1", "Stock 1", "SPY") creates our first text input field.

We assigned it stock1 which means it will be referenced in downstream code as input$stock1. We labeled it with “Stock 1”, which is what the end user will see when viewing the app.

Finally, we set “SPY” as the default initial value. If the user does nothing, the value will be this default.

We also include a row where the user can choose a start date with dateInput(...).

fluidRow(
  column(7, dateInput("date", "Starting Date", "2013-01-01", format = "yyyy-mm-dd"))
)

Next, we create the numericInput fields for the number of months to simulate and the number of simulations to run.

fluidRow(
  column(5, numericInput("sim_months", "Months", 120, min = 6, max = 240, step = 6)),
  column(5, numericInput("sims", "Sims", 51, min = 31, max = 101, step = 10))
)

We now have all the inputs from the user and are almost ready to start calculating. Before we do so, let’s ask the user to click submit.

actionButton("go", "Submit")

The ‘submit’ button is very important because it enables the use of eventReactive() to control our computation. An eventReactive() is a reactive function that will not start until it observes some event. Without it, our reactives would start firing each time a user changed an input.

In the next code chunk (and our subsequent calculation chunks as well), we tell prices to wait for input$go by calling eventReactive(input$go...). When the user clicks, the reactive inputs get passed to our function.

prices <- eventReactive(input$go, {

  symbols <- c(input$stock1, input$stock2, input$stock3, input$stock4, input$stock5)

  getSymbols(symbols, src = 'yahoo', from = input$date,
             auto.assign = TRUE, warnings = FALSE) %>%
    map(~Ad(get(.))) %>%
    reduce(merge) %>%
    `colnames<-`(symbols)
})

From here, we pass the prices and weights to a portfolio returns code flow, which should look familiar from the first post. The only difference is that we are passing in the reactively chosen prices instead of the statically defined prices when we constructed the portfolio ourselves.

portfolio_returns_tq_rebalanced_monthly <- eventReactive(input$go, {

  prices <- prices()
  w <- c(input$w1/100, input$w2/100, input$w3/100, input$w4/100, input$w5/100)

  portfolio_returns_tq_rebalanced_monthly <-
    prices %>%
    to.monthly(indexAt = "last", OHLC = FALSE) %>%
    tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
    gather(asset, returns, -date) %>%
    group_by(asset) %>%
    mutate(returns = (log(returns) - log(lag(returns)))) %>%
    tq_portfolio(assets_col = asset,
                 returns_col = returns,
                 weights = w,
                 col_rename = "returns",
                 rebalance_on = "months")
})

We now have a reactive object called portfolio_returns_tq_rebalanced_monthly and need to use it to find the mean and standard deviation of returns. Those are the parameters we need for the simulation.

mean_port_return <- eventReactive(input$go, {

  portfolio_returns_tq_rebalanced_monthly <- portfolio_returns_tq_rebalanced_monthly()

  mean(portfolio_returns_tq_rebalanced_monthly$returns)
})

stddev_port_return <- eventReactive(input$go, {

  portfolio_returns_tq_rebalanced_monthly <- portfolio_returns_tq_rebalanced_monthly()

  sd(portfolio_returns_tq_rebalanced_monthly$returns)
})

Next, we define one of our simulation functions that we discussed in the previous post.

simulation_accum_1 <- function(init_value, N, mean, stdev) {
  tibble(c(init_value, 1 + rnorm(N, mean, stdev))) %>%
    `colnames<-`("returns") %>%
    mutate(growth = accumulate(returns, function(x, y) x * y)) %>%
    select(growth)
}

Then, we call eventReactive() to run the simulation following the same logic as we did above.

sims <- eventReactive(input$go, {input$sims})

monte_carlo_sim <- eventReactive(input$go, {

  sims <- sims()

  starts <-
    rep(1, sims) %>%
    set_names(paste("sim", 1:sims, sep = ""))

  map_dfc(starts, simulation_accum_1,
          N = input$sim_months,
          mean = mean_port_return(),
          stdev = stddev_port_return()) %>%
    mutate(month = seq(1:nrow(.))) %>%
    select(month, everything()) %>%
    `colnames<-`(c("month", names(starts))) %>%
    gather(sim, growth, -month) %>%
    group_by(sim) %>%
    mutate_all(funs(round(., 2)))
})

We now have a reactive object called monte_carlo_sim(), which holds our 51 simulations of the custom portfolio. We can visualize it with highcharter, exactly as we did in the visualization post. We pass the reactive object directly to highcharter by calling hchart(monte_carlo_sim()...). Note that we begin the chunk with renderHighchart(), which alerts the file that the visualization is a reactively defined plot, not a statically defined plot. If this were a ggplot visualization, we would start the call with renderPlot().

renderHighchart(
  hchart(monte_carlo_sim(),
         type = 'line',
         hcaes(y = growth, x = month, group = sim)) %>%
    hc_title(text = paste(sims(), "Simulations", sep = " ")) %>%
    hc_xAxis(title = list(text = "months")) %>%
    hc_yAxis(title = list(text = "dollar growth"),
             labels = list(format = "${value}")) %>%
    hc_add_theme(hc_theme_flat()) %>%
    hc_exporting(enabled = TRUE) %>%
    hc_legend(enabled = FALSE)
)

And finally, we isolate the minimum, median and maximum simulations for visualization and pass them to highcharter.

renderHighchart({

  sim_summary <-
    monte_carlo_sim() %>%
    summarise(final = last(growth)) %>%
    summarise(
      max = max(final),
      min = min(final),
      median = median(final))

  mc_max_med_min <-
    monte_carlo_sim() %>%
    filter(
      last(growth) == sim_summary$max ||
      last(growth) == sim_summary$median ||
      last(growth) == sim_summary$min)

  hchart(mc_max_med_min,
         type = 'line',
         hcaes(y = growth, x = month, group = sim)) %>%
    hc_title(text = "Min Max Median Simulations") %>%
    hc_xAxis(title = list(text = "months")) %>%
    hc_yAxis(title = list(text = "dollar growth"),
             labels = list(format = "${value}")) %>%
    hc_add_theme(hc_theme_flat()) %>%
    hc_exporting(enabled = TRUE) %>%
    hc_legend(enabled = FALSE)
})

That wraps up our Monte Carlo application, and it showcases a powerful use for Shiny (we could have written any simulation function and then displayed the results). The end user sees the charts, and we, as the R coders, can run whatever functions we wish under the hood. Thanks for reading and happy coding!


Video: R for AI, and the Not Hotdog workshop

Tue, 07/17/2018 - 18:18

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Earlier this year at the QCon.ai conference, I gave a short presentation, "The Case for R, for AI developers". I also presented an interactive workshop, using R and the Microsoft Cognitive Services APIs, to automatically generate captions for images, and to create a tool to recognize images of hotdogs. Video of both the presentation and the workshop (which starts at the 10:00 mark) is now available to view at the QCon.ai website.

You can find the slides for the presentation here. The R code for the "Not Hotdog" workshop is available as an Azure Notebook which you can clone and use to follow along with the workshop.

InfoQ: R for AI developers


How to add Trend Lines to Visualizations in Displayr

Tue, 07/17/2018 - 14:00

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In Displayr, Visualizations of chart type Column, Bar, Area, Line and Scatter all support trend lines. Trend lines can be linear or non-parametric (cubic spline, Friedman’s super-smoother or LOESS).

Adding a linear trend line

Linear trend lines can be added to a chart by fitting a regression to each series in the data source. In the chart below, the linear trends are shown as dotted lines in the color corresponding to the data series. We see there is considerable fluctuation in the frequency of each search term. But the trend lines clarify that the overall trend for SPSS is downward, whereas the trend for Stata is increasing.

The data for this chart was generated by clicking Insert > More > Data > Google Trends. In the textbox for Topic(s) we typed in a comma-separated list of search terms (i.e., “SPSS, Stata”). This creates a table recording the number of times each term was searched for each week. Any input table with a similar structure can be used to create a chart.

If we click on the table, the object inspector showing the properties of this output is shown on the right. Under the Properties tab, expand the General group to see the name of the table, in this case google.trends.

We create a chart by selecting Insert > Visualizations > Line. In the dropdown for Output in ‘Pages’ select the name of the input table (i.e., google.trends). On the Chart tab in the object inspector, look for the Trend lines group. Set the Line of best fit dropdown to Linear. We also tick the checkbox for Ignore last data point.  This option is useful for ignoring the last time period which may be incomplete if the data is in the process of being collected.

Trend lines using non-parametric smoothers

In many cases, we want to estimate a trend that is not constrained to a straight line. To estimate smooth (non-parametric) trend lines, we can use cubic splines, Friedman’s super smoother or LOESS. Note that LOESS uses a fixed span of 0.75, which may sometimes be overly large. In contrast, the cubic spline and Friedman’s smoother use cross-validation to select a span, and they are usually better at identifying the important features. For example, in the figure below, the LOESS trend line suggests there is a gradual decrease in river flow from 1870 to 1900. However, the cubic spline and Friedman’s super smoother pick up a sharp decrease in 1898.

This example uses Nile, which is a built-in dataset in R. It is a time-series object, so to load the dataset in Displayr, first create an R Output and type the following code in the textbox for R code:

data(Nile)
x = Nile

The second line is necessary to assign a name to the time series data. You will then be able to use the data set in a chart by selecting x in the Output in ‘Pages’ dropdown (under Data Source in the Inputs tab).
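Outside Displayr, the same comparison of smoothers can be sketched in a few lines of plain R (an illustration only, not what Displayr runs internally):

data(Nile)
year <- as.numeric(time(Nile))
flow <- as.numeric(Nile)

plot(year, flow, type = "l", col = "grey60",
     xlab = "Year", ylab = "Annual flow")
lines(smooth.spline(year, flow), col = "blue")                       # cubic spline, CV-chosen span
lines(supsmu(year, flow), col = "darkgreen")                         # Friedman's super smoother
lines(year, predict(loess(flow ~ year, span = 0.75)), col = "red")   # LOESS, fixed span
legend("topright", legend = c("cubic spline", "super smoother", "LOESS"),
       col = c("blue", "darkgreen", "red"), lty = 1, bty = "n")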

Trend lines are added by going to the Chart tab in the object inspector and selecting a method (cubic spline, Friedman’s super smoother or LOESS) for the line of best fit. Once this option is selected, other controls to customize the appearance of line of best fit will be shown under the Trend Lines group. You may also want to adjust the opacity or the color palette (under the Data series group) of the data (i.e., column bars).

To find out more about visualizations in Displayr, head to our blog now. Ready to use trend lines for your own data? Grab your free trial of Displayr now.


Clean Your Data in Seconds with This R Function

Tue, 07/17/2018 - 14:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

All data needs to be clean before you can explore and create models. Common sense, right? Cleaning data can be tedious, but I created a function that will help.

The function does the following:

  • Clean Data from NA’s and Blanks
  • Separate the clean data – Integer dataframe, Double dataframe, Factor dataframe, Numeric dataframe, and Factor and Numeric dataframe.
  • View the new dataframes
  • Create a view of the summary() and describe() output from the clean data.
  • Create histograms of the data frames.
  • Save all the objects

This will happen in seconds.

Package

First, load the Hmisc package. I always save the original file.
The code below is the engine that cleans the data file:

cleandata <- dataname[complete.cases(dataname),]

The function

The function is below. You need to copy the code and save it in an R file. Run the code and the function cleanme will appear.

cleanme <- function(dataname){

  # SAVE THE ORIGINAL FILE
  oldfile <- write.csv(dataname, file = "oldfile.csv", row.names = FALSE, na = "")

  # CLEAN THE FILE. SAVE THE CLEAN FILE. IMPORT THE CLEAN FILE. CHANGE IT TO A DATAFRAME.
  cleandata <- dataname[complete.cases(dataname),]
  cleanfile <- write.csv(cleandata, file = "cleanfile.csv", row.names = FALSE, na = "")
  cleanfileread <- read.csv(file = "cleanfile.csv")
  cleanfiledata <- as.data.frame(cleanfileread)

  # SUBSETTING THE DATA BY TYPE
  logicmeint <- cleanfiledata[, sapply(cleanfiledata, is.integer)]
  logicmedouble <- cleanfiledata[, sapply(cleanfiledata, is.double)]
  logicmefactor <- cleanfiledata[, sapply(cleanfiledata, is.factor)]
  logicmenum <- cleanfiledata[, sapply(cleanfiledata, is.numeric)]
  mainlogicmefactors <- cleanfiledata[, sapply(cleanfiledata, is.factor) | sapply(cleanfiledata, is.numeric)]

  # VIEW ALL FILES
  View(cleanfiledata)
  View(logicmeint)
  View(logicmedouble)
  View(logicmefactor)
  View(logicmenum)
  View(mainlogicmefactors)
  #describeFast(mainlogicmefactors)

  # ANALYTICS OF THE MAIN DATAFRAME
  cleansum <- summary(cleanfiledata)
  print(cleansum)
  cleandec <- describe(cleanfiledata)
  print(cleandec)

  # ANALYTICS OF THE FACTOR DATAFRAME
  factorsum <- summary(logicmefactor)
  print(factorsum)
  factordec <- describe(logicmefactor)
  print(factordec)

  # ANALYTICS OF THE NUMBER DATAFRAME
  numbersum <- summary(logicmenum)
  print(numbersum)
  numberdec <- describe(logicmenum)
  print(numberdec)

  mainlogicmefactorsdec <- describe(mainlogicmefactors)
  print(mainlogicmefactorsdec)
  mainlogicmefactorssum <- summary(mainlogicmefactors)
  print(mainlogicmefactorssum)

  #savemenow <- saveRDS("cleanmework.rds")
  #readnow <- readRDS(savemenow)

  # HISTOGRAM PLOTS OF ALL TYPES
  hist(cleanfiledata)
  hist(logicmeint)
  hist(logicmedouble)
  hist(logicmefactor)
  hist(logicmenum)
  #plot(mainlogicmefactors)

  save(cleanfiledata, logicmeint, mainlogicmefactors, logicmedouble, logicmefactor,
       logicmenum, numberdec, numbersum, factordec, factorsum, cleandec, oldfile,
       cleandata, cleanfile, cleanfileread, file = "cleanmework.RData")
}

Type in and run:

cleanme(dataname)

When all the data frames appear, run the following to load the saved workspace objects:

load("cleanmework.RData")

Enjoy

    Related Post

    1. Hands-on Tutorial on Python Data Processing Library Pandas – Part 2
    2. Hands-on Tutorial on Python Data Processing Library Pandas – Part 1
    3. Using R with MonetDB
    4. Recording and Measuring Your Musical Progress with R
    5. Spark RDDs Vs DataFrames vs SparkSQL – Part 4 Set Operators

    pinp 0.0.6: Two new options

    Tue, 07/17/2018 - 12:22

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A small feature release of our pinp package for snazzier one- or two-column vignettes got onto CRAN a little earlier.

    It offers two new options. Saghir Bashir addressed a long-standing “help needed!” issue and contributed code to select papersize options via the YAML header. And I added support for the collapse option of knitr, also via YAML header selection.
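    For instance, a vignette header using both new options might look like the following (a sketch only; check the package documentation for the accepted papersize values):

    ---
    title: "My Vignette"
    output: pinp::pinp
    papersize: a4        # new in 0.0.6
    collapse: true       # new in 0.0.6, default is false
    ---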

    A screenshot of the package vignette can be seen below. Additional screenshots are at the pinp page.

    The NEWS entry for this release follows.

    Changes in pinp version 0.0.6 (2018-07-16)
    • Added YAML header option ‘papersize’ (Saghir Bashir in #54 and #58 fixing #24).

    • Added YAML header option ‘collapse’ with default ‘false’ (#59).

    Courtesy of CRANberries, there is a comparison to the previous release. More information is on the pinp page. For questions or comments use the issue tracker off the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    10 Jobs for R users from around the world (2018-07-17)

    Tue, 07/17/2018 - 10:32
    To post your R job on the next post

    Just visit  this link and post a new R job  to the R community.

    You can post a job for  free  (and there are also “featured job” options available for extra exposure).

    Current R jobs

    Job seekers:  please follow the links below to learn more and apply for your R job of interest:

    Featured Jobs

     

    All New R Jobs
    1. Freelance
      R programmer & population data scientist United Nations, DESA, Population Division – Posted by pgerland
      Anywhere
      17 Jul 2018
    2. Freelance
      Data Analytics Instructor Level Education From Northeastern University – Posted by LilyMeyer
      Boston Massachusetts, United States
      16 Jul 2018
    3. Freelance
      Regex enthusiast needed for summer job MoritzH
      Anywhere
      13 Jul 2018
    4. Full-Time
      Financial Systems Analyst National Audit Office – Posted by ahsinebadian
      London England, United Kingdom
      13 Jul 2018
    5. Full-Time
      Data Scientist National Audit Office – Posted by ahsinebadian
      London England, United Kingdom
      13 Jul 2018
    6. Full-Time
      Shiny-R-Developer INSERM U1153 – Posted by menyssa
      Paris Île-de-France, France
      9 Jul 2018
    7. Freelance
      Senior Data Scientist Data Science Talent – Posted by damiendeighan
      Frankfurt am Main Hessen, Germany
      6 Jul 2018
    8. Full-Time
      Behavioral Risk Research Associate @ New York EcoHealth Alliance – Posted by EcoHealth Alliance
      New York New York, United States
      23 Jun 2018
    9. Full-Time
      postdoc in psychiatry: machine learning in human genomics University of Iowa – Posted by michaelsonlab
      Anywhere
      18 Jun 2018
    10. Full-Time
      Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
      New York New York, United States
      15 Jun 2018

     

    In  R-users.com  you can see  all  the R jobs that are currently available.

    R-users Resumes

    R-users also has a  resume section  which features CVs from over 300 R users. You can  submit your resume  (as a “job seeker”) or  browse the resumes  for free.

    (you may also look at  previous R jobs posts ).


    Using leaflet, just because

    Tue, 07/17/2018 - 08:22

    (This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

    I love it when researchers take the time to share their knowledge of the computational tools that they use. So first, let me point you at Environmental Computing, a site run by environmental scientists at the University of New South Wales, which has a good selection of R programming tutorials.

    One of these is Making maps of your study sites. It was written with the specific purpose of generating simple, clean figures for publications and presentations, which it achieves very nicely.

    I’ll be honest: the sole motivator for this post is that I thought it would be fun to generate the map using Leaflet for R as an alternative. You might use Leaflet if you want:

    • An interactive map that you can drag, zoom, click for popup information
    • A “fancier” static map with geographical features of interest
    • Concise and clean code which uses pipes and doesn’t require that you process shapefiles

    Without further ado:

    library(leaflet)
    library(readr)

    point_data <- read_csv("http://environmentalcomputing.net/wp-content/uploads/2018/03/point_data.csv")

    leaflet(point_data) %>%
      fitBounds(lng1 = min(point_data$lon) - 0.11, lat1 = min(point_data$lat) - 0.11,
                lng2 = max(point_data$lon) + 0.11, lat2 = max(point_data$lat) + 0.11) %>%
      addCircleMarkers(~lon, ~lat,
                       radius = 3,
                       opacity = 100,
                       color = "white",
                       label = as.character(point_data$Site),
                       labelOptions = labelOptions(noHide = TRUE,
                                                   textOnly = TRUE,
                                                   style = list(color = "white"),
                                                   direction = "auto",
                                                   offset = c(0, -10))) %>%
      addScaleBar(options = list(imperial = FALSE)) %>%
      addProviderTiles(providers$Esri.WorldImagery)

    Esri.WorldImagery

    Building a Leaflet map is similar to ggplot2, except that the pipe symbol is used to add layers instead of “+”. In the example above we create a leaflet object from the coordinate data file, add a bounding box based on the latitude/longitude values, add markers and a scale bar.
    In the final step we add a map tile from a provider. The example here is named Esri.WorldImagery. You can preview the complete set of providers at this site.

    Hydda.Base

    Leaflet maps are designed to be pretty, so there are not many provider tiles that might be called plain or minimalist. For something less “satellite imagery” and more “traditional map”, try changing the provider to Hydda.Base.
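    For example, reusing point_data from the chunk above, the swap is a one-line change (sketch only):

    leaflet(point_data) %>%
      addCircleMarkers(~lon, ~lat, radius = 3, color = "black") %>%
      addProviderTiles(providers$Hydda.Base)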

    Esri.WorldGrayCanvas

    In the final example, we change the label colour to black and use the Esri.WorldGrayCanvas, which is about as plain and monochrome as Leaflet gets.

    If you use RStudio, Leaflet maps are rendered in the Viewer pane, from where you can export to HTML or image files (though the latter is a little broken for me right now). RStudio also makes it easy to explore and install other useful Leaflet-associated packages, such as leaflet.extras.

    So that’s Leaflet for R. One issue is that there are quite a few new functions to learn, some of which have many options. Fortunately the introductory documentation is very good.


    What’s inside? pkginspector provides helpful tools for inspecting package contents

    Tue, 07/17/2018 - 02:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)



    R packages are widely used in science, yet the code behind them often does not come under scrutiny. To address this lack, rOpenSci has been a pioneer in developing a peer review process for R packages. The goal of pkginspector is to help that process by providing a means to better understand the internal structure of R packages. It offers tools to analyze and visualize the relationship among functions within a package, and to report whether or not functions’ interfaces are consistent. If you are reviewing an R package (maybe your own!), pkginspector is for you.

    We began building pkginspector during unconf18, with support from rOpenSci and guidance from Noam Ross. The package focuses on facilitating a few of the many tasks involved in reviewing a package; it is one of a collection of packages, including pkgreviewr (rOpenSci) and goodpractice, among others, that are devoted to this project. (The division of labor among these packages is under discussion). If you’re not familiar with rOpenSci’s package review process, “How rOpenSci uses Code Review to Promote Reproducible Science” provides context.

    Function calls

    rev_fn_summary() helps you analyze function calls. It takes a package path and returns a table of information about its functions. Consider this example included in pkginspector:

    # devtools::install_github("ropenscilabs/pkginspector")
    library(pkginspector)
    path <- pkginspector_example("viridisLite")
    rev_fn_summary(path)
    ##       f_name
    ## 1    cividis
    ## 2    inferno
    ## 3      magma
    ## 4     plasma
    ## 5    viridis
    ## 6 viridisMap
    ##                                                                     f_args
    ## 1               cividis (n, alpha = 1, begin = 0, end = 1, direction = 1)
    ## 2               inferno (n, alpha = 1, begin = 0, end = 1, direction = 1)
    ## 3                 magma (n, alpha = 1, begin = 0, end = 1, direction = 1)
    ## 4                plasma (n, alpha = 1, begin = 0, end = 1, direction = 1)
    ## 5 viridis (n, alpha = 1, begin = 0, end = 1, direction = 1, option = "D")
    ## 6      viridisMap (n = 256, alpha = 1, begin = 0, end = 1, direction = 1,
    ##   calls called_by dependents
    ## 1     1         0          0
    ## 2     1         0          0
    ## 3     1         0          0
    ## 4     1         0          0
    ## 5     0         4          4
    ## 6     0         0          0

    The example shows that the number of functions called by cividis(), inferno(), magma() and plasma() is 1, 1, 1 and 1, and that these functions are called by 0, 0, 0 and 0 functions. viridis(), in contrast, calls 0 functions but is called by 4 functions. In this case, the number of dependents is 4. Dependents are counted recursively and include any functions in the calling chain. For example, if A calls B and B calls C, we would say that C is called by 1 (B) but has 2 dependents (A & B). rev_fn_summary() also provides information about function parameters in the f_args column.

    What’s not working: We know that we miss function calls if they are passed as parameters to purrr::map() and do.call() functions. There may be other systematic misses as well.

    Visualization

    vis_package() helps you visualize the network of functions’ dependencies (interactive example).

    vis_package(path, physics = FALSE)

    Argument default usage

    rev_args() identifies all the functions’ arguments used in a given package. It returns a dataframe whose main column, default_consistent, indicates whether or not the default value of the argument is consistent across the functions that use it. This helps to evaluate the complexity of the package and to identify potential sources of confusion, for example, if the meaning or default value of the same argument varies across functions.

    rev_args(path)$arg_df
    ##    arg_name n_functions default_consistent default_consistent_percent
    ## 1         n           6              FALSE                   83.33333
    ## 2     alpha           6               TRUE                  100.00000
    ## 3     begin           6               TRUE                  100.00000
    ## 4       end           6               TRUE                  100.00000
    ## 5 direction           6               TRUE                  100.00000
    ## 6    option           2               TRUE                  100.00000

    The example shows that the parameter n is used inconsistently. A look at the viridisLite code reveals that the default value of n is 256 in one function but missing in all others. This flags a potential issue that deserves further investigation. In this case, the odd function out – viridisMap() – has a clear use case that is different from the others.

    In sum

    If you are building or reviewing an R package, pkginspector can help you better understand its complex structure. This is an important step towards improving your code and research. While pkginspector will expand in scope, the features built during and since unconf18 are already useful. For example, if you’ve tried sketching out the relationship among functions in a package with pencil and paper, you will appreciate the ability to call vis_package() to create a network diagram.

    Our broader vision for pkginspector is a tool that guides both the development and review of R packages and provides automated checks on subtle differences in package functions that inevitably arise during the development process. The package will (hopefully) grow and exist as a living toolbox for development and review. If you have ideas for tools that could be added to pkginspector to facilitate the process of reviewing a package, we encourage you to open an issue.


    Hamiltonian tails

    Tue, 07/17/2018 - 00:18

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

    “We demonstrate HMC’s sensitivity to these parameters by sampling from a bivariate Gaussian with correlation coefficient 0.99. We consider three settings (ε,L) = {(0.16; 40); (0.16; 50); (0.15; 50)}” Ziyu Wang, Shakir Mohamed, and Nando De Freitas. 2013

    In an experiment with my PhD student Changye Wu (who wrote all R codes used below), we looked back at a strange feature in a 2013 ICML paper by Wang, Mohamed, and De Freitas, namely a rather poor performance of a Hamiltonian Monte Carlo (leapfrog) algorithm on a two-dimensional strongly correlated Gaussian target, for very specific values of the parameters (ε,L) of the algorithm.
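    As a rough illustration of the setting (a sketch, not the code behind the figures), one can trace a single leapfrog trajectory on such a target:

    # Bivariate Gaussian target with correlation 0.99; U(q) = q' Sigma^{-1} q / 2
    Sigma <- matrix(c(1, 0.99, 0.99, 1), 2, 2)
    Sigma_inv <- solve(Sigma)
    grad_U <- function(q) as.vector(Sigma_inv %*% q)

    leapfrog_path <- function(q, p, eps, L) {
      path <- matrix(NA_real_, nrow = L + 1, ncol = 2)
      path[1, ] <- q
      p <- p - eps / 2 * grad_U(q)           # initial half step for momentum
      for (l in 1:L) {
        q <- q + eps * p                     # full step for position
        path[l + 1, ] <- q
        if (l < L) p <- p - eps * grad_U(q)  # full step for momentum
      }
      path
    }

    set.seed(42)
    plot(leapfrog_path(q = c(0, 0), p = rnorm(2), eps = 0.15, L = 50),
         type = "b", xlab = "x1", ylab = "x2",
         main = "Leapfrog path, (eps, L) = (0.15, 50)")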

    The Gaussian target associated with this sample stands right in the middle of the two clouds, as identified by Wang et al. And the leapfrog integration path for (ε,L)=(0.15,50)

    keeps jumping between the two ridges (or tails), with no stop in the middle. Changing ever so slightly (ε,L) to (ε,L)=(0.16,40) does not modify the path very much

    but the HMC output is quite different since the cloud then sits right on top of the target

    with no clear explanation except for a sort of periodicity in the leapfrog sequence associated with the velocity generated at the start of the code. Looking at the Hamiltonian values for (ε,L)=(0.15,50)

    and for (ε,L)=(0.16,40)

    does not help, except to point at a sequence located far in the tails of this Hamiltonian, surprisingly varying when supposed to be constant. At first, we thought the large value of ε was to blame but much smaller values still return poor convergence performances. As below for (ε,L)=(0.01,450)


    Continuous deployment of package documentation with pkgdown and Travis CI

    Mon, 07/16/2018 - 19:32

    (This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

    The problem

pkgdown is an R package that can create a beautiful-looking website for your own R package. Built and maintained by Hadley Wickham and his gang of prolific contributors, this package can parse the documentation files and vignettes for your package and build a website from them with a single command: build_site(). This is what such a pkgdown-generated website looks like in action.
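A minimal sketch of that local workflow, assuming pkgdown is already installed:

library(pkgdown)
# run this from the root directory of your package;
# the rendered site ends up in the docs/ folder by default
build_site()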

The HTML files that pkgdown generates are stored in a docs folder. If your source code is hosted on GitHub, you just have to commit this folder to GitHub, navigate to the Settings panel of your GitHub repo and enable GitHub Pages to host the docs folder at https://<username>.github.io/<repo>/. It's remarkably easy and a great first step. In fact, this is how the pkgdown-built website for pkgdown itself is hosted.

    Although it’s an elegant flow, there are some issues with this approach. First, you’re committing files that were automatically generated even though the source required to build them is already stored in the package. In general, it’s not good practice to commit automatically generated files to your repo. What if you update your documentation, and commit the changes without rerendering the pkgdown website locally? Your repo files will be out of sync, and the pkgdown website will not reflect the latest changes. Second, there is no easy way to control when you release your documentation. Maybe you want to work off of the master branch, but you don’t want to update the docs until you’ve done a CRAN release and corresponding GitHub release. With the ad-hoc approach of committing the docs folder, this would be tedious.

    The solution

There's a quick fix for these concerns though, and that is to use Travis CI. Travis CI is a continuous integration tool that is free for open-source projects. When configured properly, Travis will pick up on any changes you make to your repo. For R packages, Travis is typically used to automatically run the battery of unit tests and check whether the package builds on several previous versions of R, among other things. But that's not all; Travis is also capable of doing deployments. In this case, I'll show you how you can set up Travis so it automatically builds the pkgdown website for you and commits the web files to the gh-pages branch, which is subsequently used by GitHub to host your package website. To see how it's set up for an R package in production, check out the testwhat package on GitHub, which we use at DataCamp to grade student submissions and give useful feedback. In this tutorial, I will set up pkgdown for the tutorial package, another one of DataCamp's open-source projects to make your blogs interactive.

    The steps
    1. Go to https://travis-ci.org and link your GitHub account.
    2. On your Travis CI profile page, enable Travis for the project repo that you want to build the documentation for. The next time you push a change to your GitHub project, Travis will be notified and will try to build your project. More on that later.

3. In the DESCRIPTION file of your R package, add pkgdown to the Suggests list of packages. This ensures that when Travis builds/installs your package, it will also install pkgdown so we can use it for building the website.

    4. In the .gitignore file, make sure that the entire docs folder is ignored by git: add the line docs/*.
    5. Add a file with the name .travis.yml to your repo’s root folder, with the following content:

  language: r
  cache: packages
  after_success:
    - Rscript -e 'pkgdown::build_site()'
  deploy:
    provider: pages
    skip-cleanup: true
    github-token: $GITHUB_PAT
    keep-history: true
    local-dir: docs
    on:
      branch: master

      This configuration file is very short, but it’s doing a lot of different things. Jeroen Ooms and Jim Hester are maintaining a default Travis build configuration for R packages that does a lot of things for you out of the box. A Travis config file with only the language: r tag would already build, test and check your package for inconsistencies. Let’s go over the other fields:

      • cache: packages tells Travis to cache the package installs between builds. This will significantly speed up your package build time if you have some package dependencies.
      • after_success tells Travis which steps to take when the R CMD CHECK step has succeeded. In our case, we’re telling Travis to build the pkgdown website, which will create a docs folder on Travis’s servers.
      • Finally, deploy asks Travis to go ahead and upload the files in the docs folder (local-dir) to GitHub pages, as specified through provider: pages. The on field tells Travis to do this deployment step if the change that triggered a build happened on the master branch.

For a full overview of the settings, you can visit this help article. We do not have to specify the GitHub branch that the docs are pushed to, as it defaults to gh-pages.

6. Notice that the deploy step also features a github-token field that takes an environment variable. Travis needs this key to make changes to the gh-pages branch. To get these credentials and make sure Travis can find them:

      • Go to your GitHub profile settings and create a new personal access token (PAT) under the Developer Settings tab. Give it a meaningful description, and make sure to generate a PAT that has either the public_repo (for public packages) or repo (for private packages) scope.

      • Copy the PAT and head over to the Travis repository settings, where you can specify environment variables. Make sure to name the environment variable GITHUB_PAT.

    7. The build should be good to go now! Commit the changes to your repo (DESCRIPTION and .travis.yml) to the master branch of your GitHub repo with a meaningful message.

    8. Travis will be notified and get to work: it builds the package, checks it, and if these steps are successful, it will build the pkgdown website and upload it to gh-pages.

9. GitHub notices that a gh-pages branch has been created, and immediately hosts it at https://<username>.github.io/<repo>/. In our case, that is https://datacamp.github.io/tutorial. Have a look. What a beauty! Without any additional configuration, pkgdown has built a website with the GitHub README as the homepage, a full overview of all exported functions, and vignettes under the Articles section.

From now on, every time you update the master branch of your package and the package checks pass, your documentation website will be updated automatically. You no longer have to worry about keeping the generated files in sync with your actual in-package documentation and vignettes. You can also easily tweak the deployment process so it only builds the documentation whenever you make a GitHub release. Along the way, you got continuous integration for your R package for free: the next time you make a change, Travis will notify you if it broke any tests or checks.

    Happy packaging!


    To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Twitter coverage of the useR! 2018 conference

    Mon, 07/16/2018 - 14:09

    (This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

    In summary:

The code that generated the report (which I've used heavily and written about before) is on GitHub too. A few changes were required compared with previous reports, due to changes in the rtweet package and a weird issue with kable tables breaking Markdown headers.

    I love that the most popular media attachment is a screenshot of a Github repo.


    To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Smoothing Time Series Data

    Mon, 07/16/2018 - 06:58

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

These include both global methods, which involve fitting a regression over the whole time series, and more flexible local methods, which relax the assumption that the trend is described by a single parametric function. Further details about how to construct estimated smooths in R can be found here.

    1. Global trends over time i. Linear

One of the simplest ways to identify a trend is to fit a linear regression model to the time series.

    ii. Quadratic

For more flexibility, we can also fit the time series to a quadratic expression — that is, we use linear regression with the expanded basis functions (predictors) 1, x, x².

    iii. Polynomial

If the linear model is not flexible enough, it can be useful to try a higher-order polynomial. In practice, polynomials of degrees higher than three are rarely used. As demonstrated in the example below, changing from a quadratic to a cubic trend line does not always significantly improve the goodness of fit.
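As an illustration of these global fits (the post's own example data is not included in the feed, so the sketch below uses R's built-in Nile series):

yr   <- as.numeric(time(Nile))
flow <- as.numeric(Nile)

fit_lin  <- lm(flow ~ yr)           # linear trend
fit_quad <- lm(flow ~ poly(yr, 2))  # quadratic trend
fit_cub  <- lm(flow ~ poly(yr, 3))  # cubic trend

plot(yr, flow, type = "l", xlab = "Year", ylab = "Flow")
lines(yr, fitted(fit_lin), col = "red")
lines(yr, fitted(fit_quad), col = "blue")
lines(yr, fitted(fit_cub), col = "darkgreen")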

     

    2. Local smoothers

The first three approaches assume that the time series follows a single trend. Often, we want to relax this assumption. For example, we do not want variation at the beginning of the time series to affect estimates near the end of the time series. In the following section, we demonstrate the use of local smoothers using the Nile data set (included in R's built-in data sets). It contains measurements of the annual river flow of the Nile over 100 years and is less regular than the data set used in the first example.

    i. Moving averages

    The easiest local smoother to grasp intuitively is the moving average (or running mean) smoother. It consists of taking the mean of a fixed number of nearby points. As we only use nearby points, adding new data to the end of the time series does not change estimated values of historical results. Even with this simple method we see that the question of how to choose the neighborhood is crucial for local smoothers. Increasing the bandwidth from 5 to 20 suggests that there is a gradual decrease in annual river flow from 1890 to 1905, instead of a sharp decrease at around 1900.
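A minimal sketch of a running-mean smoother (the figures themselves are not reproduced here; the bandwidths 5 and 20 are the ones discussed above):

ma <- function(x, k) stats::filter(x, rep(1 / k, k), sides = 2)  # centered moving average

plot(Nile, col = "grey", ylab = "Flow")
lines(ma(Nile, 5), col = "red")    # narrow neighborhood
lines(ma(Nile, 20), col = "blue")  # wide neighborhood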

    ii. Running line

The running-line smoother reduces this bias by fitting a linear regression in a local neighborhood of the target value x_i. A popular algorithm using the running-line smoother is Friedman's super-smoother, which uses cross-validation to find the best span. As seen in the plot below, Friedman's super-smoother with the cross-validated span is able to detect the sharp decrease in annual river flow at around 1900.
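Friedman's super-smoother is available in the stats package as supsmu(), which cross-validates the span by default; a rough sketch:

sm <- supsmu(as.numeric(time(Nile)), as.numeric(Nile))  # span chosen by cross-validation
plot(Nile, col = "grey", ylab = "Flow")
lines(sm$x, sm$y, col = "red", lwd = 2)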

    iii. Kernel smoothers

An alternative to specifying a fixed neighborhood is to downweight observations the further they are from the target value. In the figure below, we see that the continuous Gaussian kernel gives a smoother trend than a moving average or running-line smoother.
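A Gaussian kernel smoother can be sketched with ksmooth(); the bandwidth below is an arbitrary choice for illustration, not a value from the post:

ks <- ksmooth(as.numeric(time(Nile)), as.numeric(Nile),
              kernel = "normal", bandwidth = 10)
plot(Nile, col = "grey", ylab = "Flow")
lines(ks$x, ks$y, col = "red", lwd = 2)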

    iv. Smoothing splines

    Splines consist of a piece-wise polynomial with pieces defined by a sequence of knots where the pieces join smoothly. It is most common to use cubic splines. Higher order polynomials can have erratic behavior at the boundaries of the domain.

The smoothing spline avoids the problem of over-fitting by using regularized regression. This involves minimizing a criterion that includes both the least-squares error and a roughness penalty. Knots are initially placed at all of the data points, but the smoothing spline avoids over-fitting because the roughness penalty shrinks the coefficients of some of the basis functions towards zero. The smoothing parameter lambda controls the trade-off between goodness of fit and smoothness, and it can be chosen by cross-validation.
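In base R this corresponds to smooth.spline(), which picks the smoothing parameter by (generalized) cross-validation unless you fix it yourself; a minimal sketch:

sp <- smooth.spline(as.numeric(time(Nile)), as.numeric(Nile))
plot(Nile, col = "grey", ylab = "Flow")
lines(sp$x, sp$y, col = "red", lwd = 2)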

    v. LOESS

    LOESS (locally estimated scatterplot smoother) combines local regression with kernels by using locally weighted polynomial regression (by default, quadratic regression with tri-cubic weights). It is one of the most frequently used smoothers because of its flexibility. However, unlike Friedman’s super smoother or the smoothing spline, LOESS does not use cross-validation to select a span.
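A rough LOESS sketch with loess(); the span below is an assumed value, since (as noted above) LOESS does not cross-validate it for you:

yr   <- as.numeric(time(Nile))
flow <- as.numeric(Nile)
lo   <- loess(flow ~ yr, degree = 2, span = 0.3)  # locally quadratic fit
plot(Nile, col = "grey", ylab = "Flow")
lines(yr, fitted(lo), col = "red", lwd = 2)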

    Find out more about data visualizations here. 

     

     


    To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Customizing styler – the quick way

    Mon, 07/16/2018 - 02:00

    (This article was first published on Lorenz Walthert, and kindly contributed to R-bloggers)

I am currently experiencing problems with getting my posts in full length on
r-bloggers. You can continue reading here in case only the first paragraph is
rendered.

One cool thing that happens if your work resonates in the community is that you
    see other people using it. In this blog post I am going to address a typical
    question people have when they want to use a source code formatter – in
    particular styler:

    I don’t like rule xyz of the tidyverse style guide, which is the default
    style guide implemented in styler. How can I tell styler not to apply it?

    Theory

    First, I think reading the docs would be a good approach. There are two
    resources:

    • The help file
      for the function tidyverse_style(), which returns the transformer functions that prettify your code. It has a few interesting arguments, some of which are rather complex.1
    • If you can’t get styler behaving the way you want using the arguments of
      tidyverse_style(), you have another option, which is described in a vignette: Creating your own style guide. Yes, I admit, it’s pretty long and if you don’t want to become a styler expert, it may be a little bit overwhelming.

    If you don’t care about how to create new rules but you simply want to remove
    a rule, I have good news for you: There is a quick way to do it. These are the
    steps you need to complete in order to do it:

    • Figure out which transformer function in the transformers returned
by tidyverse_style() corresponds to the rule you want to remove.
    • Set that element in the list to NULL, which is equivalent to removing it.
    • Pass the list to style_text as a transformer.
    Practice

Let's assume you want to remove the rule that turns = into <- for assignment.
    That means you want string = "hi there"

    to remain unchanged after applying styler. This is not the case if you use the
    default style guide of styler:

library(styler)
style_text("string = 'hi there'")
string <- "hi there"

    So you need to figure out which rule is responsible for this. Let’s check the
    transformer categories used with the tidyverse style guide.

transformers <- tidyverse_style()
names(transformers)
## [1] "initialize"         "line_break"         "space"
## [4] "token"              "indention"          "use_raw_indention"
## [7] "reindention"

    From the aforementioned
    vignette:

    We note that there are different types of transformer functions. initialize
    initializes some variables in the nested parse table (so it is not actually a
    transformer), and the other elements modify either spacing, line breaks or
    tokens. use_raw_indention is not a function, it is just an option.

    Now, we can look at the names of the rules that are sub-elements of the
    transformer categories.

library(styler)
purrr::modify_depth(transformers, 1, names)
## $initialize
## [1] "initialize"
##
## $line_break
## [1] "remove_line_break_before_curly_opening"
## [2] "remove_line_break_before_round_closing_after_curly"
## [3] "remove_line_break_before_round_closing_fun_dec"
## [4] "style_line_break_around_curly"
## [5] "set_line_break_after_opening_if_call_is_multi_line"
## [6] "set_line_break_before_closing_call"
## [7] "remove_line_break_in_empty_fun_call"
## [8] "add_line_break_after_pipe"
##
## $space
## [1] "indent_braces"
## [2] "unindent_fun_dec"
## [3] "indent_op"
## [4] "indent_eq_sub"
## [5] "indent_without_paren"
## [6] "fix_quotes"
## [7] "remove_space_before_closing_paren"
## [8] "remove_space_before_opening_paren"
## [9] "add_space_after_for_if_while"
## [10] "add_space_before_brace"
## [11] "remove_space_before_comma"
## [12] "style_space_around_math_token"
## [13] "style_space_around_tilde"
## [14] "spacing_around_op"
## [15] "spacing_around_comma"
## [16] "remove_space_after_opening_paren"
## [17] "remove_space_after_excl"
## [18] "set_space_after_bang_bang"
## [19] "remove_space_before_dollar"
## [20] "remove_space_after_fun_dec"
## [21] "remove_space_around_colons"
## [22] "start_comments_with_space"
## [23] "remove_space_after_unary_pm_nested"
## [24] "spacing_before_comments"
## [25] "set_space_between_levels"
## [26] "set_space_between_eq_sub_and_comma"
##
## $token
## [1] "force_assignment_op"
## [2] "resolve_semicolon"
## [3] "add_brackets_in_pipe"
## [4] "remove_terminal_token_before_and_after"
## [5] "wrap_if_else_multi_line_in_curly"
##
## $indention
## [1] "update_indention_ref_fun_dec"
##
## $use_raw_indention
## NULL
##
## $reindention
## [1] "indention" "comments_only"

    Spotted the rule we want to get rid of? It’s under token and it’s called
    force_assignment_op. I agree, we could have chosen a better name. If you are
not sure whether you can guess what a rule does from its name alone, you can also
    have a look at the function declaration of this (unexported) function.

styler:::force_assignment_op
## function (pd)
## {
##     to_replace <- pd$token == "EQ_ASSIGN"
##     pd$token[to_replace] <- "LEFT_ASSIGN"
##     pd$text[to_replace] <- "<-"
##     pd
## }

    Next, you simply set that element to NULL.

    transformers$token$force_assignment_op <- NULL

    And you can use the modified transformer list as input to style_text()

style_text("string = 'hi there'", transformers = transformers)
string = "hi there"

That's it. Note that the transformer functions and how they are returned by
tidyverse_style() are not part of the exposed API. This means that the order,
the naming, etc. may change. For example, I only recently spotted that the rule
that handles quotes (fix_quotes) is in the category space, which is clearly
wrong, and I think I will move it over to token in a future release of styler.

Some other rules and their transformers
• You don't like multi-line if-else statements getting wrapped in curly
  braces: transformers$token$wrap_if_else_multi_line_in_curly.
• You don't like multi-line calls to be broken before the first named argument:
      transformers$line_break$set_line_break_after_opening_if_call_is_multi_line (interacting with transformers$line_break$set_line_break_before_closing_call).
    • You don’t like the line being broken after the pipe:
      transformers$line_break$add_line_break_after_pipe
    • You don’t like single quotes to be replaced by double quotes:
      transformers$space$fix_quotes.
    • You don’t like comments to start with one space:
      transformers$space$start_comments_with_space

    I think you get the idea. I nevertheless recommend using the tidyverse style
    guide
    as is since

    • it is a well-established, thought-through style.
• using a consistent style (no matter which) reduces friction in the community.

    In case you want to add a custom rule, the vignette Customizing
    styler
    is still the
    way to go. If you have questions, don’t hesitate to post on Stackoverflow or
    leave a comment below.

1. One example is math_token_spacing. It requires an input that is typically easiest to create with another function, e.g. specify_math_token_spacing().

    To leave a comment for the author, please follow the link and comment on their blog: Lorenz Walthert. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Sleek & Shiny

    Mon, 07/16/2018 - 02:00

    (This article was first published on R on Gianluca Baio, and kindly contributed to R-bloggers)

    I was trying to find an apt image to accompany this post, but googling “sleek and shiny” mostly brought up images of women with ridiculously, well, sleek and shiny hair (and magical products to achieve just that), which has nothing to do with what I want to talk about here (not that this has ever stopped me including pics that have nothing to do with the actual post, of course…).

    Anyway, Andrea has been doing lots of work and has suggested an upgrade to the BCEAweb webapp, mainly involving a move towards Shiny navbarPage. I think the results are quite nice — well, dare I say sleek & shiny?

All the code we have used is in our GitHub repository, of course.
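For readers who have not seen the layout mentioned above, a hypothetical navbarPage skeleton looks roughly like this (the tab names are invented, not taken from BCEAweb):

library(shiny)

ui <- navbarPage(
  "A BCEAweb-style app",
  tabPanel("Welcome", h3("Introduction")),
  tabPanel("Analysis", plotOutput("res_plot")),
  tabPanel("Report", verbatimTextOutput("res_summary"))
)

server <- function(input, output, session) {
  output$res_plot    <- renderPlot(plot(rnorm(100)))
  output$res_summary <- renderPrint(summary(rnorm(100)))
}

# shinyApp(ui, server)  # uncomment to launch the app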


    To leave a comment for the author, please follow the link and comment on their blog: R on Gianluca Baio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    RcppClassic 0.9.11

    Mon, 07/16/2018 - 01:44

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new maintenance release, now at version 0.9.11, of the RcppClassic package arrived earlier today on CRAN. This package provides a maintained version of the otherwise deprecated initial Rcpp API, which no new projects should use, as the normal Rcpp API is so much better.

Per another request from CRAN, we updated the source code in four places to no longer use dynamic exception specifications. These were deprecated in C++11, and g++-7 and above now complain about each use. No other changes were made.

    CRANberries also reports the changes relative to the previous release.

    Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Six Sigma DMAIC Series in R – Part 3

    Sun, 07/15/2018 - 15:22

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Hope you liked Part 1 and Part 2 of this series. In this Part 3, we will go through the tools used during the analyze phase of the Six Sigma DMAIC cycle. In this phase, the available data is used to identify the key process inputs and their relation to the output.

    We will go through the following tools:

    • Charts in R
    • Hypothesis Testing in R
    Charts in R

In Six Sigma projects, charts play a very important role. They are used to support the interpretation of data, as a graphical representation of data provides a better understanding of the process. So charts are used not only in the analyze phase, but at every step of a Six Sigma project.

    Bar Chart

A bar chart is a very simple chart in which quantities are shown as the heights of bars. Each bar represents a level of the factor for which the variable under study is measured. A bar chart is usually the best graphical representation for counts. For example, the printer cartridge manufacturer distributes its product to five regions, and an unexpected number of defective cartridges has been returned in the last month. A bar plot of defects by region is shown below:

library("SixSigma")
with(ss.data.pc.r,
     barplot(pc.def,
             names.arg = pc.regions,
             las = 2,
             main = "Barplot of defects by region",
             sub = "Printer cartridge example"))

    Output as below:

    Histogram

A histogram is a bar chart for continuous variables. It shows the distribution of the measurements of a variable and is used to assess how that variable is distributed: Is the variable centered or skewed? Are the observations close to the central values, or is the distribution spread out? Is it a normal distribution?

    For example, histograms for the variables volume and density in the ss.data.pc data set.

    hist(ss.data.pc$pc.volume, main="Printer Cartridge Volume", xlab="Volume", col="#DDDDDD")

    Output as below:

    Superimpose a density line for a theoretical normal distribution whose mean is 16 and standard deviation is 1

hist(ss.data.pc$pc.volume,
     main = "Printer Cartridge Volume",
     xlab = "Volume",
     col = "#BBBBBB",
     border = "white",
     bg = "red",
     freq = FALSE,
     ylim = c(0, 0.4))
curve(dnorm(x, 16, 1), add = TRUE, lty = 2, lwd = 2)
lines(density(ss.data.pc$pc.volume), lwd = 2)

    Output as below:

    Scatterplot

A scatter plot is an important tool to reveal relationships between two variables. There are three types of correlation between two variables: positive correlation (high values of one variable go with high values of the other), negative correlation (high values of one variable go with low values of the other), and no correlation (the variables are unrelated).

In the ss.data.pc data set, we have two continuous variables: pc.volume and pc.density. A scatter plot can be used to look for patterns in their relationship.

plot(pc.volume ~ pc.density,
     data = ss.data.pc,
     main = "Searching correlation between Density and Volume",
     col = "#666666",
     pch = 16,
     sub = "Printer Cartridge Example",
     xlab = "Density",
     ylab = "Volume of Ink")

    Output as below:

    Boxplot

The box-and-whisker chart, also known as the box plot, graphically summarizes the distribution of a continuous variable. The sides of the box are the first and third quartiles (25th and 75th percentiles, respectively); thus, inside the box we have the middle 50% of the data. The median is plotted as a line that crosses the box. The box plot tells us whether the distribution is centered or skewed (the position of the median with respect to the rest of the data), whether there are outliers (points outside the whiskers), and whether the data are close to the central values (small whiskers or boxes).

    It is useful when we want to compare groups and check if there are differences among them. Example: In a production line, we have three fillers for the printer cartridges. We want to determine if there are any differences among the fillers and identify outliers.

    boxplot(pc.volume ~ pc.filler, data = ss.data.pc, col = "#CCCCCC", main = "Box Plot of Volume by Filler", sub = "Printer Cartridge Example", xlab = "Filler", ylab = "Volume")

    Output as below:

    Hypothesis Testing

    A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses. So it is intended to confirm or validate some conjectures about the process we are analyzing. For example, if we have data from a process that are normally distributed and we want to verify if the mean of the process has changed with respect to the historical mean, we should make the following hypothesis test:
    H0: μ = μ0
    H1: μ not equal μ0
    where H0 denotes the null hypothesis and H1 denotes the alternative hypothesis. Thus we are testing H0 (“the mean has not changed”) vs. H1 (“the mean has changed”).

Hypothesis testing can be performed in two ways: one-sided tests and two-sided tests. An example of the former is when we want to know if the mean of a process has increased:
H0: μ = μ0,
H1: μ > μ0.
Hypothesis testing tries to find evidence about the refutability of the null hypothesis using probability theory. We want to check whether the data point towards the alternative hypothesis, and we reject the null hypothesis only if the data provide enough evidence against it. The threshold for “enough evidence” is set through the significance level α. A 5% significance level is a widely accepted value in most cases.

    There are some functions in R to perform hypothesis tests, for example, t.test for means, prop.test for proportions, var.test and bartlett.test for variances, chisq.test for contingency table tests and goodness-of-fit tests, poisson.test for Poisson distributions, binom.test for binomial distributions, shapiro.test for normality tests. Usually, these functions also provide a confidence interval for the parameter tested.

    Example: Hypothesis test to verify if the length of the strings is different from the target value of 950 mm:
    H0 : μ = 950,
    H1 : μ not equal to 950.

t.test(ss.data.strings$len, mu = 950, conf.level = 0.95)

        One Sample t-test

data:  ss.data.strings$len
t = 0.6433, df = 119, p-value = 0.5213
alternative hypothesis: true mean is not equal to 950
95 percent confidence interval:
 949.9674 950.0640
sample estimates:
mean of x
 950.0157

As the p-value is greater than 0.05, we cannot reject the null hypothesis that the mean is equal to the target value.

    Two means can also be compared, for example, for two types of strings. To compare the length of the string types E6 and E1, we use the following code:

data.E1 <- ss.data.strings$len[ss.data.strings$type == "E1"]
data.E6 <- ss.data.strings$len[ss.data.strings$type == "E6"]
t.test(data.E1, data.E6)

        Welch Two Sample t-test

data:  data.E1 and data.E6
t = -0.3091, df = 36.423, p-value = 0.759
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1822016  0.1339911
sample estimates:
mean of x mean of y
 949.9756  949.9997

    Variances can be compared using the var.test function:

var.test(data.E1, data.E6)

        F test to compare two variances

data:  data.E1 and data.E6
F = 1.5254, num df = 19, denom df = 19, p-value = 0.3655
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6037828 3.8539181
sample estimates:
ratio of variances
          1.525428

    In this case, the statistic used is the ratio between variances, and the null hypothesis is “the ratio of variances is equal to 1,” that is, the variances are equal.
Proportions can be compared using the prop.test function. Do we have the same proportion of defects for every string type? For example, compare types E1 and A5:

defects <- data.frame(type = ss.data.strings$type,
                      res = ss.data.strings$res < 3)
defects <- aggregate(res ~ type, data = defects, sum)
prop.test(defects$res, rep(20, 6))

        6-sample test for equality of proportions without continuity correction

data:  defects$res out of rep(20, 6)
X-squared = 5.1724, df = 5, p-value = 0.3952
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 prop 4 prop 5 prop 6
  0.05   0.00   0.00   0.10   0.00   0.05

The p-value for the test of equal proportions is greater than 0.05, so we cannot reject the null hypothesis that the proportions are equal.
    A normality test to check if the data follow a normal distribution can be performed with the shapiro.test() function:

shapiro.test(ss.data.strings$len)

        Shapiro-Wilk normality test

data:  ss.data.strings$len
W = 0.98463, p-value = 0.1903

    The statistic used to perform this hypothesis test is the Shapiro–Wilk statistic. In
    this test, the hypotheses are as follows:
    H0 : The data are normally distributed.
    H1 : The data are not normally distributed.

As the p-value is greater than 0.05, we cannot reject the hypothesis of normality; we do not have enough evidence to reject the assumption that the data are normally distributed.

    A type I error occurs when we reject the null hypothesis and it is true. We commit a type II error when we do not reject the null hypothesis and it is false. The probability of the former is represented as α, and it is the significance level of the hypothesis test (1 − α is the confidence level). The probability of the latter is represented as β, and the value 1−β is the statistical power of the test.
    The R functions power.prop.test and power.t.test can be used to determine the power or the sample size.

Example: A Black Belt plans to perform a new analysis and wants to find out the sample size needed to estimate the mean length of the strings with a maximum error of δ = 0.1 cm. He sets the significance level (α = 0.05) and the power (1 − β = 0.90). The sample size can be determined using the following command:

power.t.test(delta = 0.1, power = 0.9, sig.level = 0.05,
             sd = sd(ss.data.strings$len))

     Two-sample t test power calculation

              n = 151.2648
          delta = 0.1
             sd = 0.2674321
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group

So the sample size must be at least 152 observations per group.

    In next part, we will go through Improve Phase of Six Sigma DMAIC process. Please let me know your feedback in the comments section. Make sure to like & share it. Happy Learning!!


      Related Post

      1. Prediction Interval, the wider sister of Confidence Interval
      2. Six Sigma DMAIC Series in R – Part 2
      3. Six Sigma DMAIC Series in R – Part 1
      4. Implementation and Interpretation of Control Charts in R
      5. Time series model of forecasting future power demand

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      A thought experiment: How CRAN saved 3,620 (working) lives

      Sun, 07/15/2018 - 14:00

      (This article was first published on Topics in R, and kindly contributed to R-bloggers)

Given the vast number of R packages available today, it makes sense (at least to me, as a trained economist) to ask a simple yet difficult question: How much value has been created by all those packages?

As all R stuff on CRAN is open-source (which is a blessing), there is no measurable GDP contribution in terms of market value that we can use to provide a quick answer. But all of us R users know the pleasant feeling, if not to say the excitement, of finding a package that provides exactly the functionality we have been looking for. This saves us the time of developing the functionality ourselves. So, apparently, the time saved is one way to estimate the beneficial effect of package sharing on CRAN. Here comes a simple (and not too serious) approach to estimating this effect. (Side note: I am well aware of the extremely high concentration of capable statisticians and data scientists in the R community, so be clement with my approach; I am, as you will see shortly, not aiming at delivering a scientific paper on the matter, although it might be worthwhile to do so. If there are already papers on the topic out there, I am sure they have figured out much better approaches; in this case, please simply leave a comment below.)

Without further ado, let's get right into it: Since records began, the RStudio CRAN server has seen 1,121,724,508 package downloads as of today (afternoon [CET] of July 14th, 2018). This number was generated by running through all the 12,781 R packages identified with the CRAN_package_db() function from the tools package and adding up their download figures, which I retrieved from the CRAN server logs via RStudio CRAN's HTTP interface; this interface returns a JSON result which can easily be read using the fromJSON() function from the jsonlite package. To be a bit more precise: the whole operation was done with the buildIndex() function from my package packagefinder, as this integrates all of this functionality.

Let's assume 30% of these downloads are 'false positives', i.e. cases in which the user realized the package is not really suitable for his/her purposes (and of course, in a more sophisticated approach we would need to account for package dependencies as well; we neglect them here for the sake of simplicity). Removing the 'false positives' leaves us with 785,207,156 downloads. Next, we assume that everyone who has downloaded a package would have developed the package's functionality on his/her own if the package had not been available on CRAN. And let us further assume that this development effort would have taken one hour of work on average for each package. (You can play with the parameters of the model, but one hour seems really, really low, at least to me; let's keep it conservative for now.) But R users are not only extremely capable, almost ingenious programmers, they also have an incredible work ethic: of course, everyone who works with R is an Elon Musk-style worker, which means he or she "puts in 80 to 90 hour work weeks, every week" (Musk in his own words). So, let's be conservative and assume an agreeable 80-hour work week (there should be at least some work-life balance, after all; I mean, some people even have a family!).

Calculating our model with these parameters leads to the almost incredible amount of 188,235 work years saved (if you assume a year of 365/7 = 52.14 weeks; of course, our hard-working R user does not have any time for vacation or any other time off). If you assume a working life spans the ages of 18 to 70, this means an amount of time has been saved by sharing packages on CRAN that is equivalent to 3,620 working lives. A truly incredible number. For all those who want to do the math themselves, here is the R code I used:

library(packagefinder)

# Attention: The next statement takes quite some time to run
# (> 1 hour on my machine)
buildIndex("searchindex.rdata", TRUE)
load("searchindex.rdata")

package.count           <- sum(searchindex$index$DOWNL_TOTAL)
package.count.corrected <- package.count * 0.7
package.hours           <- package.count.corrected * 1
package.workweeks       <- package.hours / 80
package.workyears       <- package.workweeks / (365 / 7)
package.worklifes       <- package.workyears / (70 - 18)


      To leave a comment for the author, please follow the link and comment on their blog: Topics in R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
