R-bloggers – R news and tutorials contributed by hundreds of R bloggers

rOpenSci 2019 Code of Conduct Transparency Report

Thu, 01/16/2020 - 01:00

First published on rOpenSci - open tools for open science.

In January 2019, we announced the release of rOpenSci’s Code of Conduct version 2.0. This version includes a named Committee, greater detail about unacceptable behaviors, instructions on how to make a report, and information on how reports are handled. We are committed to transparency with our community while upholding the privacy of victims and people who report incidents.

Our Code of Conduct applies to all people participating in the rOpenSci community, including rOpenSci staff and leadership. It applies to all modes of interaction online including GitHub project repositories, the rOpenSci discussion forum, Slack, Community Calls, and in person at rOpenSci-hosted events or events officially endorsed by rOpenSci, including social gatherings affiliated with an event.

In 2019, we received one Code of Conduct incident report. We published a brief, anonymized summary in our forum in a dedicated, read-only category.

Summary of Reported Incidents

We received a code of conduct report for demeaning comments made on rOpenSci’s Slack. One committee member was recused from handling this report; the remaining members reviewed the messages and determined that they violated the code of conduct. We contacted the sender and informed them that such comments are not acceptable under our CoC. We also informed the reporter of our decision and the outcome.


rOpenSci Code of Conduct Annual Review

Thu, 01/16/2020 - 01:00

First published on rOpenSci - open tools for open science.

One year ago, we released our new Code of Conduct. At that time, the Code of Conduct Committee (authors of this post) agreed to do an annual review of the text, reporting form, and our internal guidelines.

We have made one change to the Code of Conduct text. Because some people who have experienced abuse prefer not to label themselves as victims, we changed “We are committed to transparency with our community while upholding the privacy of victims” to “… upholding the privacy of victims and people who report incidents”.

With this change, this will be version 2.1, dated January 15, 2020.

We are pleased to report that we, the Code of Conduct Committee members, Stefanie Butland (rOpenSci Community Manager), Scott Chamberlain (rOpenSci Co-founder and Technical Lead), and Kara Woo (independent community member), all wish to continue serving in 2020. We are responsible for receiving, investigating, deciding on, enforcing, and reporting on all reports of potential violations of our Code.

We welcome your feedback by email to conduct at ropensci.org, and we thank you for working with us to make rOpenSci a safe, enjoyable, friendly and enriching experience for everyone who participates.


On Cochran Theorem (and Orthogonal Projections)

Wed, 01/15/2020 - 17:46

First published on R-english – Freakonometrics.

Cochran Theorem – from The distribution of quadratic forms in a normal system, with applications to the analysis of covariance, published in 1934 – is probably the most important one in a regression course. It is an application of a nice result on quadratic forms of Gaussian vectors. More precisely, we can prove that if \boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{0},\mathbb{I}_d) is a random vector with d independent \mathcal{N}(0,1) components, then (i) if A is a (square) idempotent matrix, \boldsymbol{Y}^\top A\boldsymbol{Y}\sim\chi^2_r where r is the rank of A, and (ii) conversely, if \boldsymbol{Y}^\top A\boldsymbol{Y}\sim\chi^2_r then A is an idempotent matrix of rank r. And just in case, A being idempotent means that A^2=A, from which a lot of results can be derived (for instance on the eigenvalues).

The proof of that result (at least the (i) part) is nice: we diagonalize the matrix A, so that A=P\Delta P^\top with P orthonormal. Since A is idempotent, observe that A^2=P\Delta P^\top P\Delta P^\top=P\Delta^2 P^\top=P\Delta P^\top, so \Delta is a diagonal matrix satisfying \Delta^2=\Delta, and the terms on the diagonal of \Delta are either 0's or 1's. And because the rank of A (and of \Delta) is r, there are r 1's and d-r 0's. Now write \boldsymbol{Y}^\top A\boldsymbol{Y}=\boldsymbol{Y}^\top P\Delta P^\top\boldsymbol{Y}=\boldsymbol{Z}^\top \Delta\boldsymbol{Z}, where \boldsymbol{Z}=P^\top\boldsymbol{Y} satisfies \boldsymbol{Z}\sim\mathcal{N}(\boldsymbol{0},P^\top P), i.e. \boldsymbol{Z}\sim\mathcal{N}(\boldsymbol{0},\mathbb{I}_d). Thus \boldsymbol{Z}^\top \Delta\boldsymbol{Z}=\sum_{i:\Delta_{i,i}=1}Z_i^2\sim\chi^2_r. Nice, isn’t it?

And there is more (that will actually be strongly connected to Cochran Theorem). Let A=A_1+\dots+A_k. Then the two following statements are equivalent: (i) A is idempotent and \text{rank}(A)=\text{rank}(A_1)+\dots+\text{rank}(A_k); (ii) the A_i’s are idempotent and A_iA_j=0 for all i\neq j.
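Just to illustrate part (i), here is a quick simulation check in R (my own toy example, not from the original post: A is taken to be a projection matrix, which is idempotent with rank r),

set.seed(1)
d = 5
r = 2
V = matrix(rnorm(d*r), d, r)             # some d x r matrix of full column rank
A = V %*% solve(t(V) %*% V) %*% t(V)     # an idempotent matrix of rank r
max(abs(A %*% A - A))                    # numerically zero: A is indeed idempotent
Q = replicate(1e4, {Y = rnorm(d); drop(t(Y) %*% A %*% Y)})
c(mean(Q), var(Q))                       # close to r = 2 and 2r = 4, as for a chi-square with r df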

Now, let us talk about projections. Let \boldsymbol{y} be a vector in \mathbb{R}^n. Its projection on the space \mathcal V(\boldsymbol{v}_1,\dots,\boldsymbol{v}_p) (generated by those p vectors) is the vector \hat{\boldsymbol{y}}=\boldsymbol{V} \hat{\boldsymbol{a}} that minimizes \|\boldsymbol{y} -\boldsymbol{V} \boldsymbol{a}\| (in \boldsymbol{a}), where \boldsymbol{V} is the matrix whose columns are \boldsymbol{v}_1,\dots,\boldsymbol{v}_p. The solution is \hat{\boldsymbol{a}}=( \boldsymbol{V}^\top \boldsymbol{V})^{-1} \boldsymbol{V}^\top \boldsymbol{y} and \hat{\boldsymbol{y}} = \boldsymbol{V} \hat{\boldsymbol{a}}.
The matrix P=\boldsymbol{V} ( \boldsymbol{V}^\top \boldsymbol{V})^{-1} \boldsymbol{V}^\top is the orthogonal projection onto \mathcal V(\boldsymbol{v}_1,\dots,\boldsymbol{v}_p), and \hat{\boldsymbol{y}} = P\boldsymbol{y}.
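In R, we can check (in a small made-up example, not taken from the original post) that P is idempotent and that P\boldsymbol{y} coincides with the fitted values of a least-squares regression of \boldsymbol{y} on the columns of \boldsymbol{V},

set.seed(2)
n = 100
p = 3
V = cbind(1, matrix(rnorm(n*(p-1)), n, p-1))   # a design matrix with an intercept column
y = rnorm(n)
P = V %*% solve(t(V) %*% V) %*% t(V)           # orthogonal projection onto the column space of V
max(abs(P %*% P - P))                          # idempotent (numerically zero)
max(abs(P %*% y - fitted(lm(y ~ V - 1))))      # P y are exactly the OLS fitted values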

Now we can recall Cochran theorem. Let \boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{\mu},\sigma^2\mathbb{I}_d) for some \sigma>0 and \boldsymbol{\mu}. Consider pairwise orthogonal subspaces F_1,\dots,F_m of \mathbb{R}^d, with dimensions d_i. Let P_{F_i} be the orthogonal projection matrix on F_i. Then (i) the vectors P_{F_1}\boldsymbol{Y},\dots,P_{F_m}\boldsymbol{Y} are independent, with respective distributions \mathcal{N}(P_{F_i}\boldsymbol{\mu},\sigma^2\mathbb{I}_{d_i}) (when expressed in an orthonormal basis of F_i), and (ii) the random variables \|P_{F_i}(\boldsymbol{Y}-\boldsymbol{\mu})\|^2/\sigma^2 are independent and \chi^2_{d_i} distributed.

We can try to visualize those results. For instance, the orthogonal projection of a random vector has a Gaussian distribution. Consider a two-dimensional Gaussian vector

library(mnormt)
r = .7
s1 = 1
s2 = 1
Sig = matrix(c(s1^2,r*s1*s2,r*s1*s2,s2^2),2,2)
Sig
Y = rmnorm(n = 1000,mean=c(0,0),varcov = Sig)
plot(Y,cex=.6)
vu = seq(-4,4,length=101)
vz = outer(vu,vu,function (x,y) dmnorm(cbind(x,y),
mean=c(0,0), varcov = Sig))
contour(vu,vu,vz,add=TRUE,col='blue')
abline(a=0,b=2,col="red")

Consider now the projection of a point \boldsymbol{y}=(y_1,y_2) on the straight line with direction vector \overrightarrow{\boldsymbol{u}}, i.e. with slope a (say a=2). To get the projected point \boldsymbol{x}=(x_1,x_2), recall that x_2=ax_1 and \overrightarrow{\boldsymbol{xy}}\perp\overrightarrow{\boldsymbol{u}}. Hence, the following code will give us the orthogonal projections

p = function(a){
  # orthogonal projection of the points in Y onto the line y = a x
  x0 = (Y[,1]+a*Y[,2])/(1+a^2)
  y0 = a*x0
  cbind(x0,y0)
}

with

P = p(2)
for(i in 1:20) segments(Y[i,1],Y[i,2],P[i,1],P[i,2],lwd=4,col="red")
points(P[,1],P[,2],col="red",cex=.7)

Now, if we look at the distribution of points on that line, we get… a Gaussian distribution, as expected,

z = sqrt(P[,1]^2+P[,2]^2)*c(-1,+1)[(P[,1]>0)*1+1]
vu = seq(-6,6,length=601)
vv = dnorm(vu,mean(z),sd(z))
hist(z,probability = TRUE,breaks = seq(-6,6,by=.25)) # breaks wide enough to cover all projected points
lines(vu,vv,col="red")

Of course, we can use the matrix representation to get the projection on \overrightarrow{\boldsymbol{u}} (or actually on a normalized version of that vector)

a=2
U = c(1,a)/sqrt(a^2+1)
U
[1] 0.4472136 0.8944272
matP = U %*% solve(t(U) %*% U) %*% t(U)
matP %*% Y[1,]
[,1]
[1,] -0.1120555
[2,] -0.2241110
P[1,]
x0 y0
-0.1120555 -0.2241110

(which is consistent with our manual computation). Now, in Cochran theorem, we start with independent random variables,

Y = rmnorm(n = 1000,mean=c(0,0),varcov = diag(c(1,1)))

Then we consider the projection on \overrightarrow{\boldsymbol{u}} and \overrightarrow{\boldsymbol{v}}=\overrightarrow{\boldsymbol{u}}^\perp

U = c(1,a)/sqrt(a^2+1)
matP1 = U %*% solve(t(U) %*% U) %*% t(U)
P1 = Y %*% matP1
z1 = sqrt(P1[,1]^2+P1[,2]^2)*c(-1,+1)[(P1[,1]>0)*1+1]
V = c(a,-1)/sqrt(a^2+1)
matP2 = V %*% solve(t(V) %*% V) %*% t(V)
P2 = Y %*% matP2
z2 = sqrt(P2[,1]^2+P2[,2]^2)*c(-1,+1)[(P2[,1]>0)*1+1]

We can plot those two projections

plot(z1,z2)

and observe that the two are indeed independent Gaussian variables. And (of course) their squared norms are \chi^2_{1} distributed.
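A quick numerical check of those two claims, reusing z1 and z2 computed above (my own addition),

cor(z1, z2)                       # close to 0: the two projections are uncorrelated
c(mean(z1^2), var(z1^2))          # close to 1 and 2, as for a chi-square with 1 df
ks.test(z1^2, "pchisq", df = 1)   # no evidence against the chi^2_1 distribution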


Appsilon Data Science is now an RStudio Full Service Certified Partner

Wed, 01/15/2020 - 17:10

First published on Appsilon Data Science | End to End Data Science Solutions.

In November 2019, Appsilon Data Science became an RStudio Full Service Certified Partner.

We now officially provide support services for existing Shiny applications, Plumber APIs, and RStudio infrastructure. Further, we now provide Managed Services for the R environment. This means that we can handle ongoing support for your servers and infrastructure to keep applications running smoothly.  If you are interested in learning more about these services, please reach out!

This partnership solidifies our position as one of the leaders in the R/Shiny space. Building advanced Decision Support Systems has been a part of our path since we started.

RStudio Full Service Certified Partners resell RStudio professional products and employ RStudio certified R administrators and/or R user instructors who offer training, implementation, management, and development services that help customers succeed with R and RStudio.

This is a natural step for us, as we have been R evangelists since the first days of our company. We have even worked side by side with RStudio on many occasions with different clients.

We also share values with RStudio — a passion for community, open source, and empowering data scientists and companies everywhere. We both love to attend conferences and share what we’ve learned in our day-to-day work. The broader community benefits from this cooperation as well — our open source packages have been downloaded over 39,000 times and are praised by companies like Viacom and BCG.

Becoming a certified partner with RStudio fortifies the communication between our companies so that we can better serve our clients and help them with every step of the RStudio product implementation process. Some RStudio products that we can help you with include:

RStudio Connect is a publishing platform that will boost your R workflow with a click of a button. Publish your Shiny applications, R Markdown reports & dashboards, Plumber APIs, Python notebooks, and more in one convenient IT-managed location. Periodic reports can be scheduled with just a few clicks.

RStudio Package Manager is a repository management server optimized for R and RStudio. It allows you to organize, search, and browse R packages within your team and across your organization. Access all of CRAN or define curated subsets. Add your local packages and packages from Github. Control and manage all of the packages that your data scientists and data products depend on.

Shiny Server Pro is intended for projects requiring pure Shiny work. This is a field tested solution which can be readily scaled up and down. Shiny Server Pro doesn’t have all the features that you get with Connect, but if your use case doesn’t require RStudio Connect, then Shiny Server Pro might be a more economical solution. As with all other R products, we are happy to help with Shiny Server Pro.

Moving forward, we will continue to develop R Shiny applications for the data science and business communities alongside our new Partner. This will translate into a new, higher level of quality for our clients. Now, we can not only advise on solutions but also optimize their license and maintenance costs.

As you might know, we build decision support systems for enterprise and other types of organizations, such as the International Institute for Applied Systems Analysis (IIASA). We recently built a Shiny dashboard that will help IIASA understand and visualize natural disaster risk in Madagascar. Last year we also built R Shiny projects for Fortune 500 companies (including companies in the pharmaceutical, insurance, and logistics industries) and for universities and educational organizations (natural language processing). We intend to continue these endeavors in the future.

Plus we are happy to continue to offer services associated with setting up a production environment and a set of post-implementation support options such as development support, hosting, and administration services.

I am very excited about our collaboration with the RStudio team, and we look forward to fruitful cooperation in the future!

Meet us

We hope to see you at rstudio::conf in San Francisco on January 27. Please drop by the Appsilon Data Science booth in the Yosemite Room and say hello. Damian Rodziewicz is giving a presentation “Building a native iPad dashboard using plumber and RStudio Connect in Pharma” (Jan. 29, 4:23pm, Room 4). He will also give an e-poster “What does it take to make a Shiny prototype ready for production?” Pedro Coutinho Silva will give an e-poster “From Front-End to Shiny development. Augmenting the power of your dashboards with modern front-end tools.”

E-posters will be shown during the opening reception on Tuesday (Jan. 28) evening from 5:30 to 7:30pm. You can also find us at the Bay Area useR Group meetup  (conference tickets not required) at 6:30pm on January 27 at the Hilton San Francisco Union Square. 

Say hello to Pedro Coutinho (4th from left), Damian and Pawel (10th and 11th from left) at rstudio::conf(2020)

Thanks for reading. For more, follow me on Linkedin and Twitter.




RQuantLib 0.4.11: More polish

Wed, 01/15/2020 - 13:22

First published on Thinking inside the box.

New year, new RQuantLib! A new release 0.4.11 of RQuantLib arrived overnight on CRAN; and a Debian upload will follow shortly.

QuantLib is a very comprehensive free/open-source library for quantitative finance; RQuantLib connects it to the R environment and language.

This version does three new things. First, we fixed an oversight on our end and now allow a null calendar (as the C++ API does). Second, the package switched to tinytest, as a few of my other packages have done, allowing for very flexible testing during development and deployment—three cheers for easily testing installed packages too. Third, per a kind nag from Kurt Hornik, I updated a few calls which the current QuantLib 1.17 marks as deprecated. That led to a compile issue with 1.16, so the change is conditional in one part. The complete set of changes is listed below:

Changes in RQuantLib version 0.4.11 (2020-01-15)
  • Changes in RQuantLib code:

    • The ‘Null’ calendar without weekends or holidays is now recognized.

    • The package now uses tinytest for unit tests (Dirk in #140).

    • Calls deprecated-in-QuantLib 1.17 were updated (Dirk in #144).

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the RQuantLib page. Questions, comments etc should go to the new rquantlib-devel mailing list. Issue tickets can be filed at the GitHub repo.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Mapping World Languages’ Difficulty Relative to English

Wed, 01/15/2020 - 01:00

First published on Educators R Learners.

I was reading r/MapPorn and saw the image below:

As fate would have it, the very first comment was from Guridkt who said, “I’d like to see asia too, i wonder if there’s a map of that. (sic)”

So would I, so let's do it.

About the Data

This map is based on data published by the US Foreign Service Institute and, according to their website, it draws on the institute’s “70 years of experience in teaching languages to U.S. diplomats” and illustrates “the time usually required for a student to reach ‘Professional Working Proficiency’ in the language, or a score of ‘Speaking-3/Reading-3’ on the Interagency Language Roundtable scale” (emphasis added).

And how would you describe someone with “professional working proficiency?” Their website says someone with a 3 in speaking can “use the language to satisfy professional needs in a wide range of sophisticated and demanding tasks.”

Fair enough, but how do they measure whether someone can actually achieve those tasks? Through a 15 – 30 minute TELEPHONE interview. Yikes!!!!

Now for a few caveats about the map.

First, for the sake of ease and simplicity, I scraped Wikipedia to get the official languages of each country.

Second, in order to simplify the task of displaying the relative ease/difficulty for English speakers to learn the official language of a country, I selected just one official language for each country.

Finally, in the instances when a country had multiple official languages but only one is listed by the FSI, (e.g. Afghanistan with Pashto and Dari) I selected the one listed by FSI for display purposes.
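As an aside on the first caveat, here is a minimal sketch of that kind of scrape using rvest (the Wikipedia page, table position, and column layout are assumptions of mine, not details from the original post, so they may need adjusting):

library(rvest)
library(dplyr)

# Hypothetical example: grab the first wikitable on the page and keep the
# country and official-language columns; the real page may need more cleaning.
url = "https://en.wikipedia.org/wiki/List_of_official_languages_by_country_and_territory"
page = read_html(url)
tbl = page %>% html_element("table.wikitable") %>% html_table()
head(tbl[, 1:2])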

Now, for the map:

Clearly, according to the FSI, officials being sent to countries where Arabic, Mandarin, Cantonese, Korean or Japanese is the official language should receive more language training than those being sent to areas where Bahasa is spoken (i.e., Malaysia and Indonesia).

Finally, if you’re interested in the code I used to create the world map in the thumbnail and the map of Asia above, please find it here.

Happy coding!


an elegant sampler

Wed, 01/15/2020 - 00:20

First published on R – Xi'an's Og.

Following an X validated question on how to simulate a multinomial with fixed average, W. Huber produced a highly elegant and efficient resolution with the compact R code

tabulate(sample.int((k-1)*n, s-n) %% n + 1, n) + 1

where k is the number of classes, n the number of draws, and s is equal to n times the fixed average. The R function sample.int is an alternative to sample that seems faster.

sample.int((k-1)*n, s-n)

as the nonzero positions in a (k-1) x n matrix and adding a row of n 1’s leads to a simulation of integers between 1 and k by counting the 1’s in each of the n columns, which is the meaning of the picture above (the colour code is added after counting the number of 1’s). Since there are s 1’s in this matrix, the sum is automatically equal to s. Since the s-n positions are chosen uniformly over the (k-1) x n locations, the outcome is uniform. The rest of the R code is a brutally efficient way to translate the idea into a function. (By comparison, I brute-forced the question by suggesting a basic Metropolis algorithm.)
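A quick check of the construction in a toy setting (my own example: k = 5 classes, n = 100 draws and a fixed average of 3, hence s = 300),

set.seed(42)
k = 5; n = 100; s = 3*n
x = tabulate(sample.int((k-1)*n, s-n) %% n + 1, n) + 1
mean(x)    # exactly 3, by construction
range(x)   # all values stay between 1 and k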


sklearn Pipe Step Interface for vtreat

Tue, 01/14/2020 - 20:13

First published on R – Win-Vector Blog.

We’ve been experimenting with this for a while, and the next R vtreat package will have a back-port of the Python vtreat package sklearn pipe step interface (in addition to the standard R interface).

This means the user can easily express modeling intent by choosing between coder$fit_transform(train_data), coder$fit(train_data_cal)$transform(train_data_model), and coder$fit(application_data).
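For readers who have not used the sklearn convention, here is a toy sketch of the pattern itself: a stand-in “centering” step written as a reference class. This is only an illustration of what fit(), transform() and fit_transform() mean, not the actual vtreat implementation.

# A toy "pipe step": fit() learns state (column means), transform() applies it,
# and fit_transform() does both in one call. Chaining works because fit()
# returns the object itself.
Centerer = setRefClass("Centerer",
  fields = list(means = "numeric"),
  methods = list(
    fit = function(d) { means <<- colMeans(d); invisible(.self) },
    transform = function(d) sweep(d, 2, means),
    fit_transform = function(d) { fit(d); transform(d) }
  )
)

coder = Centerer$new()
head(coder$fit_transform(mtcars))            # learn and apply on the same data
head(coder$fit(mtcars)$transform(mtcars))    # or fit on one data set, then transform another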

We have also regenerated the current task-oriented vtreat documentation to demonstrate the new nested bias warning feature.

And we now have new versions of these documents showing the sklearn $fit_transform() style notation in R.

The original R interface is going to remain the standard interface for vtreat. It is more idiomatic R, and is taught in chapter 8 of Zumel, Mount; Practical Data Science with R, 2nd Edition, Manning 2019.

In contrast, the $fit_transform() notation will always just be an adaptor over the primary R interface. However, there is a lot to be learned from sklearn’s organization and ideas, so we felt we would make their naming convention available as a way of showing appreciation and giving credit. Some more of my notes on the grace of the sklearn interface as a good way to manage state and generative effects (see Brendan Fong, David I. Spivak; An Invitation to Applied Category Theory, Cambridge University Press, 2019) can be found here.


LondonR & R-Ladies to host ‘watch party’ at RStudio Conference, Jan 29th

Tue, 01/14/2020 - 11:54

First published on RBlog – Mango Solutions.

Mango Solutions, a full service certified partner of RStudio, are delighted to be supporting the rstudio::conf 2020 conference in San Francisco, on January 27th-30th. This annual tech conference promises to be ‘all things R’ and includes an impressive agenda covering tutorials, presentations and lightning talks from recognised R experts and developers.

As long-time champions of the R community and organisers of the LondonR user group, Mango has been invited to host the London leg of the rstudio::conf 2020 ‘around the world’ watch party, which we are hosting in conjunction with R-Ladies London. This is the first year that watch parties are being run; they will give the London-based R community the opportunity to see live-streams of the keynote sessions and interact with the San Francisco based hosts and guests in real time.

Liz Matthews, Head of Community & Education at Mango Solutions, said “this will be the first event of its kind and promises to be an exciting opportunity for the London R community. The watch party will be the next best thing to actually being in San Francisco and hearing about the latest developments in R.”

Celebrating a 10 year strategic relationship with RStudio, Liz said, “sponsoring the RStudio conference is a pivotal part of our relationship as an RStudio full service certified partner. Mango offers a full service offering for the RStudio suite of products, from the installation of an RStudio environment, best practice operating models, training and supporting the usage of R in production environments”.

Samantha Toet, Partner Marketing Specialist at RStudio Inc, said “I’m thrilled that Mango Solutions will be joining us both in person at the conference and remotely for our watch parties. This is a great way that we can support not only our partnership, but the data science community as a whole.”

Both the RStudio conference and Mango’s annual EARL conference provide the growing R community with an abundance of networking opportunities and learning experiences centred around R.

Having developed the R capability for some of the world’s largest organisations, Mango’s extensive knowledge of the R ecosystem, in addition to ancillary technologies such as Python and Spark, makes us a well-placed and trusted partner for data science and for scaling an appropriate R environment using RStudio platforms. At the conference, Mango Solutions will be providing details of their new support packages to assist enterprise clients with accelerating their use of RStudio products.

We look forward to meeting you in San Francisco and welcoming you to our live stream at LondonR in conjunction with R-Ladies.

The post LondonR & R-Ladies to host ‘watch party’ at RStudio Conference, Jan 29th appeared first on Mango Solutions.


Get Better: early career researcher development

Tue, 01/14/2020 - 11:52

First published on Rstats – quantixed.

How can we contribute to the development of early career researchers in a lab environment?

I’m talking about how people in the lab acquire “soft skills” or “get better” in ways that are parallel to doing research. This sort of training can get overlooked in the chase for new results and the excitement of doing biomedical research.

I’m testing out a strategy to develop the skills of people in the lab. It’s an experiment. I’ll share what we’re trying out here in case others want to pursue a similar approach.

My idea was to take a slot in the lab meeting rota and use it for sessions aimed at developing the skills of the people in the lab. This is a way of forcing me to schedule something at a set time. The list of things we’ve covered is below. I will update the links as-and-when the posts are ready.

  • Creating a research profile [link]
  • CV clinic
  • Writing a paper
  • Learning LaTeX and Overleaf
  • Learning to code in R

The post title comes from “Get Better” by The New Fast Automatic Daffodils. The version I have is on a compilation of Martin Hannett-produced tracks called “And Here Is The Young Man”.


Exploring the tightened EU CO2 emission standards for cars in 2020 – Why now selling an electric car can be worth an extra 18000€ for producers.

Tue, 01/14/2020 - 10:00

First published on Economics and R - R posts.


In this blog post I want to explore the EU regulation for average CO2 emissions of manufacturers’ car fleets using the EU data set of newly registered cars. The data was already studied on an aggregate level in my earlier post. Here we explore, in particular, the monetary implications of the regulatory details for different car manufacturers and car types. Commenting is possible at the bottom of the original post.

Let’s first have a look at the data set, which is available on a Github repository:

dat = fst::read_fst("cars2.fst")
NROW(dat)
## [1] 260639

# Show 2 example rows
dat[c(260566,100789),]
##        year pool firm country cn ft q it co2
## 260566 2018 vw group pc seat SK LEON diesel 5 110.6
## 100789 2014 bmw group bmw SE I3 petrol-electric 132 13.0
##        mass cm3 power elec wltp reduction_nedc
## 260566 1390.8 1820 105 NaN NaN NaN
## 100789 1390.0 647 NaN 115 NA NA

The data contains one row for each combination of year, country where the car was registered, manufacturing firm, car name cn and fuel type ft. The variable q measures the number of such registered cars and co2 their CO2 emissions in grams per km using the NEDC procedure.

Before 2020 the EU had the target that on average newly registered cars should not emit more than 130 g CO2/km and for 2020 the target has been reduced to 95 g CO2/km with further reduction in subsequent years. Let us compare those targets with the average tested emissions of all newly registered cars:

dat %>%
  group_by(year) %>%
  summarize(co2 = weighted.mean(co2, q, na.rm=TRUE)) %>%
  ggplot(aes(x=year,y=co2)) +
  geom_line() + geom_point() +
  ylab("g CO2/km") +
  # CO2 target before 2020
  geom_hline(yintercept = 130, linetype="dashed") +
  # CO2 target in 2020
  geom_hline(yintercept = 95, linetype="dashed")

Unless there is a big jump, it does not look like the new target will be reached in 2020. Let us take a look at the fleet averages for the 11 biggest (with respect to the number of new cars sold in Europe) car firms.

library(dplyrExtras)
library(forcats)

dat = dat %>%
  mutate_rows(is.na(firm), firm="Other") %>%
  mutate(
    firm = fct_reorder(firm, q, sum, .desc=TRUE),
    firm11 = firm %>% fct_other(.,levels(.)[1:11])
  )

dat %>%
  group_by(year, firm11) %>%
  summarize(co2 = weighted.mean(co2, q, na.rm=TRUE)) %>%
  ggplot(aes(x=year,y=co2)) +
  geom_line() + geom_point() +
  ylab("g CO2/km") +
  facet_wrap(~firm11) +
  geom_hline(yintercept = 130, linetype="dashed") +
  geom_hline(yintercept = 95, linetype="dashed")

We see quite a bit of diversity. Toyota has the lowest fleet averages and may well manage to reach the new 2020 target. Also the French car producers Renault, Peugeot and Citroen have low average emissions, even though they increased their emissions from 2016 to 2018. The highest emissions come from the 3 German premium car manufacturers BMW, Daimler and Audi.

If its fleet of newly registered cars produces too much CO2 on average, the manufacturer has to pay a fine to the EU. From 2019 onward the total fine is given by

Excess emissions * 95 €/(g CO2/km) * number of new passenger cars.

Let us for the moment ignore further adjustments specified in the regulation. Then the excess emissions would simply be the average emissions per car in the fleet minus the target (which is 95 g CO2/km in 2020). If all cars were produced by a single firm and the fleet stayed the same as in 2018 (with average CO2 emissions of 121 g CO2/km), the fine per car under the new emission target in 2020 would be given by:

(121-95) * 95€ = 26 * 95€ = 2470€

Quite a substantial amount. If the fleets are differentiated by firm, the fine would range from (102-95)*95€ = 665€ for an average Toyota to (133-95)*95€ = 3610€ for an average Daimler.
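These back-of-the-envelope numbers are easy to reproduce with a tiny helper function (my own illustration, not code from the regulation or the data set):

# Fine per car in € for a fleet with average emissions co2 (in g CO2/km),
# ignoring all the adjustments discussed below
fine_per_car = function(co2, target = 95, rate = 95) pmax(co2 - target, 0) * rate

fine_per_car(121)   # whole 2018 fleet: 2470 €
fine_per_car(102)   # average Toyota:    665 €
fine_per_car(133)   # average Daimler:  3610 €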

However, there are a series of further adjustments that affect the emission targets and fines. We will explore these adjustments step by step in this post. All details are taken from the official EU regulation here.

Manufacturer Pools

The EU allows groups of manufacturers to form pools such that every newly registered car of a pool will be part of a single fleet. If we have one manufacturer whose fleet is below the target and another one whose fleet is above, forming a joint pool will reduce the sum of the fines that have to be paid.

Firms belonging to the same mother company naturally belong to the same pool (e.g. Seat, Skoda, Audi, Volkswagen all belong to a common pool). But also unrelated firms can form a joint pool, e.g. Toyota and Mazda form a common pool in 2018. The rationale for allowing pools of unrelated firms is that otherwise the regulation could create incentives for mergers (e.g. Tesla would be an extremely attractive target) that increase market concentration and may harm consumers.

The following plot shows the development of average CO2 emissions of newly registered cars for the 8 biggest pools:

library(dplyrExtras)
library(forcats)

dat = dat %>%
  mutate_rows(is.na(pool), pool="Other") %>%
  mutate(
    pool = fct_reorder(pool, q, sum, .desc=TRUE),
    pool8 = pool %>% fct_other(.,levels(.)[1:8])
  )

dat %>%
  group_by(year, pool8) %>%
  summarize(co2 = weighted.mean(co2, q, na.rm=TRUE)) %>%
  ggplot(aes(x=year,y=co2)) +
  geom_line() + geom_point() +
  ylab("g CO2/km") +
  facet_wrap(~pool8) +
  # CO2 target before 2020
  geom_hline(yintercept = 130, linetype="dashed") +
  # CO2 target in 2020
  geom_hline(yintercept = 95, linetype="dashed")

Qualitatively, the picture looks similar to the one for individual firms. The Japanese Toyota-Mazda pool leads, followed by the pools led by French companies (psa-opel and renault), while the pools led by German or Italian companies (vw, ford, bmw, fca, daimler) perform worse.

Even though climate protection plays an important role in EU policy, it would still seem quite surprising if the EU passed a regulation that imposes on average substantially higher fines for European car companies than Japanese ones without making some adjustments that make this gap smaller.

Weight Adjustment

One implemented adjustment is that heavier cars have a more generous emission target. In 2019 and 2020 the formula for a pool’s fleet target is given by

target_pool = 95 + a*(mass_pool-M0)

where mass_pool is the average mass (in kg) of the cars in the pool and M0 is the average mass of all cars sold in previous years. The factor a is set from 2019 onward to 0.0333 and M0 is 1379.88 kg in 2019. This means every additional kg of mass of a pool's average car relaxes the pool's CO2 target by 0.0333 g CO2/km.
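As a small sketch (my own illustration, using the 2019/2020 parameters quoted above), the adjusted target can be written as a one-line function:

# Weight-adjusted 2020 CO2 target (in g CO2/km) for a pool with average mass mass_pool (in kg)
target_2020 = function(mass_pool, a = 0.0333, M0 = 1379.88) 95 + a*(mass_pool - M0)

target_2020(1379.88)   # a fleet of average mass gets exactly the 95 g target
target_2020(1500)      # a heavier fleet gets a more generous target (about 99 g)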

The following code shows the average fleet emissions together with the adjusted targets of the largest pools. The parameters a and M0 changed over the years; the corresponding values are in regulation.csv, which can also be found in my GitHub repository:

reg.df = read.csv("regulation.csv")

# By pool
pool.dat = dat %>%
  group_by(pool8, year) %>%
  summarize(
    co2_pool = weighted.mean(co2, q, na.rm = TRUE),
    mass_pool = weighted.mean(mass, q, na.rm=TRUE)
  ) %>%
  left_join(reg.df, by="year") %>%
  mutate(
    target_pool = target + a*(mass_pool-M0),
    target2020_pool = 95 + 0.033*(mass_pool-1379.88)
  )

ggplot(pool.dat, aes(x=year, y=co2_pool)) +
  facet_wrap(~pool8) +
  # The unadjusted target for 2020 of 95 g
  geom_hline(yintercept = 95, linetype="dashed") +
  # Target of the corresponding year
  geom_line(aes(x=year, y=target_pool), color="#5555aa") +
  # Target using formula for 2020
  geom_line(aes(x=year, y=target2020_pool), color="#5555aa") +
  geom_line() + geom_point()

We see how the weight adjustment mainly benefits the BMW and Daimler pools. The following plot shows the average fee per car with and without weight adjustments (still ignoring any further adjustments):

df = pool.dat %>%
  ungroup() %>%
  filter(year==2018) %>%
  mutate(
    non_adjusted = (co2_pool - 95)*95,
    weight_adjusted = (co2_pool-target2020_pool)*95,
    pool8 = fct_reorder(pool8, weight_adjusted),
    benefit = weight_adjusted < non_adjusted
  ) %>%
  tidyr::pivot_longer(
    cols = c("non_adjusted","weight_adjusted"),
    names_to = "fee_type", values_to = "fee"
  )

ggplot(df, aes(pool8, fee, color=fee_type)) +
  geom_line(aes(group=pool8), color="black") +
  geom_point(size=2) +
  coord_flip()

Without weight adjustment, Daimler would in 2020 pay a fine of (133-95)*95€ = 3610€ per car (given the 2018 fleet), while with weight adjustment the fine reduces to (133-102)*95€ = 2945€.

I am not privy to the internal bargaining that precedes EU law making, but it is not hard to imagine how the weight adjustment was a compromise in order to make Germany agree to a regulation with tight fleet standards on CO2 emissions. The argument was probably along the lines that large premium cars are quite different products than small cars and that there should be a way to account for that, in particular since distinguishing product classes is often done elsewhere: e.g. the EU also has different targets for vans (light commercial vehicles), which have an average target of 147 g CO2/km in 2020.

One can have different views on the weight adjustment compromise, but it does not seem surprising at all to me given the importance of the car industry for many parts of Germany. (Besides such compromises and the factors leading to them, there are several other reasons that make me personally doubt whether very tight fleet standards are a good instrument for emission reduction compared to the alternative of higher CO2 prices for fuel. I will discuss some pros and cons later in this post.)

One obvious drawback of the weight adjustment is that it reduces incentives to build lighter cars, which is an important channel to make cars more CO2 efficient. Let us take a look at how the CO2 emissions of a car relate to its mass (controlling for fuel type, engine power and year fixed effects):

library(lfe)
library(stargazer)
reg1 = felm(co2~mass+power|ft+year, data=dat)
stargazer(reg1, type = "html")

                      Dependent variable: co2
  mass                0.057*** (0.0002)
  power               0.274*** (0.001)
  Observations        189,192
  R2                  0.762
  Adjusted R2         0.762
  Residual Std. Error 26.993 (df = 189171)
  Note: *p<0.1; **p<0.05; ***p<0.01

If we are willing to interpret the estimated coefficient in a causal fashion, it suggests that 1 kg of extra mass increases emissions on average by 0.0574 g CO2/km. This means the weight adjustment factor of a=0.0333 does not fully cancel the incentive to reduce CO2 emissions by building cars with lower mass, but it cancels more than half of the effect.

Distribution of excess emissions over car models

The following plot provides more insight into how the excess emissions above each pool's target are distributed over the different car types sold in 2018:

dat2018 = dat %>%
  filter(year == 2018, ft != "unknown") %>%
  mutate(
    model = paste0(firm, " ", cn),
    fuel = fct_collapse(ft,
      petrol = "petrol",
      diesel = "diesel",
      plugin_hybrid = c("petrol-electric","diesel-electric"),
      electric = "electric",
      other = c("lpg","hydrogen","e85","ng","ng-biomethane")
    ) %>% fct_reorder(q, sum, na.rm=TRUE)
  ) %>%
  group_by(pool8) %>%
  mutate(
    q_pool = sum(q, na.rm=TRUE),
    mass_pool = weighted.mean(mass, q, na.rm=TRUE),
    co2_pool = weighted.mean(co2, q, na.rm=TRUE),
    target_pool = first(95 + 0.0333*(mass_pool-1379.88)),
    excess_co2 = co2 - target_pool,
    fee = (co2-target_pool - 0.0333*(mass-mass_pool))*95,
    share.in.pool = q / q_pool
  ) %>%
  ungroup()

pool.df = dat2018 %>%
  group_by(pool8) %>%
  summarize(excess_pool = first(co2_pool)-first(target_pool))

ggplot(dat2018 %>% filter(co2 <= 300), aes(y=excess_co2, x=fct_rev(pool8))) +
  geom_jitter(aes(color=fuel, size=q), alpha=0.15) +
  geom_hline(aes(yintercept=0)) +
  geom_violin(aes(weight=share.in.pool), alpha=0.7) +
  geom_point(data=pool.df, aes(y=excess_pool), size=2) +
  coord_flip() +
  theme_bw()

Except for Toyota-Mazda, no pool has a substantial share of petrol or diesel cars whose emissions are below the pool's 2020 target. Volkswagen has a few cars (classified as other) running on natural gas that are a little bit below the target.

Far below the target are the electric cars of all firms. That is because the EU regulation only counts emissions that are directly emitted by the car and ignores the emissions of the power plants that generate the electricity used by the car. Thus electric cars (as well as the very few hydrogen cars) count as zero-emission cars. For the same reason, plug-in hybrids, which can drive on an electric engine or on petrol, are also solidly below the emission targets.

Implicit fee for each car and huge implicit subsidies for electric cars

Given the parts of the regulation specified so far, selling in 2020 an additional car with emissions co2 and a given mass changes the total fee of a pool whose emissions are above the target by

(co2-target_pool - a*(mass-mass_pool)) * 95 Euro

The following plot shows the averages of these fees over fuel types for the different pools. (We still ignore an important additional feature of the regulation: super credits.)

dat2018 = dat2018 %>%
  group_by(pool8) %>%
  mutate(
    fee = (co2-target_pool - 0.0333*(mass-mass_pool))*95
  ) %>%
  ungroup()

sum.dat = dat2018 %>%
  group_by(pool8, fuel) %>%
  summarize(
    fee = weighted.mean(fee, q, na.rm=TRUE),
    q = sum(q, na.rm=TRUE),
    share.fuel = q / first(q_pool),
    label = paste0(round(fee), "€ ", round(100*share.fuel,1),"%")
  ) %>%
  filter(fuel != "other")

ggplot(sum.dat, aes(y=fee, x=fuel, fill=fuel, label=label)) +
  geom_col() +
  geom_text(size=4) +
  facet_wrap(~pool8) +
  coord_flip() +
  ylim(c(-13000,7000))

According to this calculation, selling an additional electric car would on average be worth between 8000€ and 10000€ for each manufacturer in 2020. These gains in terms of fee reduction from selling an electric car are huge for all pools.

(The differences between pools in the gains from selling an electric car are due to different masses of electric cars and other cars in the pools' fleets. For example, we find the lowest gain from selling an electric car for Daimler because most of its electric cars are the lightweight Smarts, whose mass is substantially below the average mass of a Daimler.)

Super Credits

Article 5a of the EU regulation makes the gains from selling electric cars even larger:

Article 5a
Super-credits for 95 g CO2/km target
In calculating the average specific emissions of CO2, each new passenger car with specific emissions of CO2 of less than 50 g CO2/km shall be counted as:
— 2 passenger cars in 2020,
— 1,67 passenger cars in 2021,
— 1,33 passenger cars in 2022,
— 1 passenger car from 2023,
for the year in which it is registered in the period from 2020 to 2022, subject to a cap of 7,5 g CO2/km over that period for each manufacturer.

I will discuss the cap of 7,5 g CO2/km in the last sentence of the regulation at the end of this blog post, in the Appendix. Assume for the moment that it is not binding. Then, via super credits, a car with emissions below 50 g CO2/km saves a pool in 2020 an additional:

(target_pool-co2) * 95 Euro

Let us look at the fee change from selling a particular car type in 2020, assuming a factor-2 super credit (and a non-binding cap):

dat2018 = dat2018 %>%
  group_by(pool8) %>%
  mutate(
    fee_sc = fee - ifelse(co2<50, target_pool-co2, 0)*95
  ) %>%
  ungroup()

sum.dat = dat2018 %>%
  group_by(pool8, fuel) %>%
  summarize(
    fee_sc = weighted.mean(fee_sc, q, na.rm=TRUE),
    q = sum(q, na.rm=TRUE),
    share.fuel = q / first(q_pool),
    label = paste0(round(fee_sc), "€ ", round(100*share.fuel,1),"%")
  ) %>%
  filter(fuel != "other")

ggplot(sum.dat, aes(y=fee_sc, x=fuel, fill=fuel, label=label)) +
  geom_col() +
  geom_text(size=4) +
  facet_wrap(~pool8) +
  coord_flip() +
  ylim(c(-22000,7000))

These are gigantic implicit subsidies for electric cars in 2020!

To put this in relation, we can compute that, with full super credits, a VW E-Golf would reduce Volkswagen's fees by 18906€:

library(stringtools)
dat2018 %>%
  filter(firm=="volkswagen", ft=="electric") %>%
  select(cn, fuel, fee, fee_sc, q) %>%
  head(1)
## # A tibble: 1 x 5
##   cn    fuel         fee  fee_sc     q
## 1 GOLF  electric  -9769. -18906.  1836

A quick Google search finds historical list prices of the E-Golf of 35900€ in 2018 and 31000€ in 2019. In both years the old CO2 target of 130 grams was still in place, which VW overachieved by a secure margin. This means VW did not pay CO2 fees for its fleet in 2018 and 2019. Thus selling an additional E-Golf did not reduce any CO2 fleet fees and should not have affected those historical prices.

If VW does not reach the new target in 2020 (which seems likely), it now has tremendously higher incentives to sell electric cars than in previous years: the fee reduction of 18906€ is more than half of the original price of an E-Golf, and even though super credits phase out in later years, the fee reduction of 9769€ without super credits is still almost a third of the 2019 price.

Given these numbers, it does not seem surprising that Volkswagen made massive investments into electric cars, even though some observers consider it a risky gamble (see e.g. here or here).

Thoughts on pricing

An interesting question is by how much electric car prices will fall in 2020. Probably by considerably less than the producers' marginal benefit from fee reduction, for several reasons:

  1. Basic microeconomics suggests that firms with some market power will typically not fully pass on marginal cost changes to consumers.

  2. Limited production capacity: it makes no sense to increase demand by price reductions to an extent that cannot be met by production.

  3. Consumers who bought electric cars in 2019 may feel cheated if car prices drop too drastically this year.

  4. The marginal fee reduction effect of an electric car gets smaller in future years as super credits phase out and the emission target further tightens. It is thus risky to create consumer expectations of very low electric car prices.

  5. The fee reduction effect only occurs for a car registered in the EU. But setting substantially lower prices in the EU than in other markets can be risky, e.g. because of arbitrage opportunities.

Points 3-5 suggest that car companies that are not yet capacity constrained may make attractive leasing offers for electric cars to large business clients, or to employees. If the manufacturers own car sharing or car rental companies, it is probably also quite attractive to increase the number of electric cars in those fleets in 2020.

I wonder whether, e.g. for a firm like ShareNow that has fleets in the EU and North America, it could now become attractive to swap the US and EU electric cars after one year in order to get more new BMW or Daimler electric car registrations in the EU. It would of course not be desirable from a climate point of view to ship cars over the ocean for regulatory arbitrage. The monetary incentives might be there, but probably the net gains are too small compared to the risk of image damage.

With whom does Tesla go swimming?

Tesla was not part of any pool in 2018. Yet, if it joins a pool that misses the 2020 target and Tesla sells, like in 2018, roughly 20000 cars (all electric), the other pool members would even without super credits save roughly between 8000*20000 = 160 Mio € and 10000*20000 = 200 Mio €, which is more than Tesla's 2018 profits (but less than 1% of its revenues). With super credits, the savings for the pool could be up to twice as large.

That is a nice time for a quiz:

(The interactive quiz is embedded in the original post.)

You can find the answer, for example, here.

If you think about it, it makes a lot of sense that Tesla joined a pool with FCA (Fiat and Chrysler). First, FCA had the highest excess CO2 emissions in 2018, i.e. FCA seems very likely to miss the 2020 and later targets. Second, FCA has not yet built electric cars itself, i.e. it is not a fierce competitor of Tesla.

It is interesting to ask how much in total Tesla will benefit from the tighter new EU regulations. On the one hand, it is nice to get money from FCA, and the higher attractiveness of electric cars may generally be good for Tesla. On the other hand, Tesla will most likely face much fiercer competition, given that all pools have strong incentives to sell more electric cars and may thus strongly reduce prices. It probably depends a lot on the maximum production capacities for electric cars of Tesla's competitors.

What are the actual CO2 emissions of electric cars?

Of course, the regulation's assumption that electric cars have zero CO2 emissions does not fit reality, since much electricity is still generated by burning coal and gas. The column elec in our data set contains, for electric cars and plug-in hybrids, the estimated electricity usage in Wh per driven km.

# Average Wh/km over all electric cars
dat %>%
  filter(ft == "electric") %>%
  summarize(Wh = weighted.mean(elec, q, na.rm=TRUE))
##         Wh
## 1 156.4098

# Average Wh/km over years
dat %>%
  filter(ft == "electric") %>%
  group_by(year) %>%
  summarize(Wh = weighted.mean(elec, q, na.rm=TRUE))
## # A tibble: 7 x 2
##    year    Wh
## 1  2012  148.
## 2  2013  153.
## 3  2014  152.
## 4  2015  160.
## 5  2016  156.
## 6  2017  147.
## 7  2018  163.

There is no clear time trend, so we just use the overall average of 156 Wh/km for our further calculations. We now use a data set on yearly CO2 emission intensities from electricity generation of different EU countries (until 2016) provided by the EU here to estimate the corresponding g CO2 emissions of driving an electric car for 1 km.

co2_int = read.csv("co2-intensity-electricity-generation.csv")
co2_int = co2_int %>%
  filter(year >= 2010) %>%
  mutate(
    co2_Wh = g_co2_kWH / 1000,
    co2_el_car = 156*co2_Wh,
    country = forcats::fct_reorder(country, co2_Wh, mean)
  )

co2_int %>% filter(country=="EU")
##   year country                    type g_co2_kWH co2_Wh co2_el_car
## 1 2010      EU  Electricity generation     346.9 0.3469    54.1164
## 2 2011      EU  Electricity generation     352.7 0.3527    55.0212
## 3 2012      EU  Electricity generation     353.2 0.3532    55.0992
## 4 2013      EU  Electricity generation     333.7 0.3337    52.0572
## 5 2014      EU  Electricity generation     320.4 0.3204    49.9824
## 6 2015      EU  Electricity generation     314.4 0.3144    49.0464
## 7 2016      EU  Electricity generation     295.8 0.2958    46.1448

# Show averages for EU
ggplot(co2_int %>% filter(country=="EU"), aes(x=year, y=co2_el_car)) +
  geom_line() + geom_point() +
  geom_hline(yintercept = 130, linetype="dashed") +
  geom_hline(yintercept = 95, linetype="dashed")

We see that the average newly registered electric car has emissions around 46 g CO2/km given the average CO2 intensity of EU electricity generation in 2016. While this is far above 0, it is only about 37% of the average CO2 emission of petrol and diesel cars between 2012 and 2018 (123 g CO2/km) and about half the new 2020 target of 95g CO2/km.

Compare Golf and E-Golf

Of course, the average electric car may differ in size and many other dimensions from the average petrol or diesel car in our sample. For a better comparison, let us compare the Volkswagen Golf from 2018 across different fuel types:

df = dat2018 %>%
  filter(firm=="volkswagen", has.substr(cn,"GOLF")) %>%
  group_by(fuel) %>%
  summarize(
    co2_el = weighted.mean(elec*0.2958, q, na.rm=TRUE),
    co2 = weighted.mean(co2, q, na.rm=TRUE),
    mass = weighted.mean(mass, q, na.rm=TRUE),
    q = sum(q, na.rm=TRUE)
  )
df

## # A tibble: 5 x 5
##   fuel          co2_el   co2  mass      q
## 1 electric        42.3    0   1615.  13886
## 2 plugin_hybrid   35.7   38.9 1615.   7876
## 3 other            NaN  118.  1327.  12065
## 4 diesel           NaN  110.  1378. 140701
## 5 petrol           NaN  120.  1323. 309126

The results are similar to the aggregate averages. The CO2 emissions of an average 2018 E-Golf with the average 2016 EU electricity mix are 38% of those of a diesel Golf and 35% of those of a petrol Golf. (The data also illustrate one problem of electric cars: they are much heavier than diesel or petrol cars. Furthermore, the average E-Golf has a rather short range. A longer range means more batteries and thus an even heavier car, which needs more electricity to move.)

The following plot shows how counting electricity as zero emissions and applying super credits (at the 2020 factor of 2 and with no binding cap) change the CO2 emissions of the 2018 fleets of the 8 largest pools, compared to using the CO2 emissions of the average EU electricity mix without super credits. We already add Tesla to the FCA pool here, since they form a joint pool in 2020.

dat2018 = dat2018 %>%
  mutate_rows(firm == "tesla", pool8 = "fca-tesla") %>%
  mutate_rows(startsWith(as.character(pool), "fca"), pool8 = "fca-tesla")

df = dat2018 %>%
  filter(pool8 != "Other") %>%
  mutate(
    co2_el = replace_na(elec*0.2958, 0),
    co2_fuel = replace_na(co2, 0),
    co2_tot = co2_el + co2_fuel,
    q_sc = ifelse(co2_fuel < 50, 2*q, q)
  ) %>%
  group_by(pool8) %>%
  summarize(
    zero_co2 = weighted.mean(co2_fuel, q, na.rm=TRUE),
    zero_co2_and_sc = weighted.mean(co2_fuel, q_sc, na.rm=TRUE),
    eu_co2 = weighted.mean(co2_tot, q, na.rm=TRUE)
  ) %>%
  mutate(
    pool8 = fct_reorder(pool8, eu_co2)
  ) %>%
  tidyr::pivot_longer(
    cols = c("zero_co2","zero_co2_and_sc", "eu_co2"),
    names_to = "electricity", values_to = "co2"
  )

ggplot(df, aes(pool8, co2, color=electricity)) +
  geom_line(aes(group=pool8), color="black") +
  geom_point(size=3) +
  coord_flip()

If the pools kept their 2018 fleet in 2020, BMW would benefit most from counting electric cars as zero emission cars, followed by Renault, FCA-Tesla and Daimler.

CO2 intensity of electric cars between countries

The CO2 intensity of electricity production differs quite substantially between countries. The following plot shows, for each EU country's electricity mix, the CO2 emissions in g/km of the average electric car sold in the EU:

ggplot(co2_int, aes(x=year, y=co2_el_car)) +
  geom_line() + geom_point() +
  facet_wrap(~country) +
  geom_hline(yintercept = 130, linetype="dashed") +
  geom_hline(yintercept = 95, linetype="dashed")

There is substantial heterogeneity between countries. While an electric car charged with the average electricity mix of Sweden, France or Austria would have very low emissions, in Poland it would have much higher emissions than the average petrol or diesel car.

Given the EU’s focus on a unified market, it is understandable that the regulation does not differentiate where in the EU a new car is registered. However, it is still hard to see any environmental benefit of electric cars driven in Poland that would justify the enormous implicit subsidies that the current fleet regulation implies.

One should also keep in mind that while electricity production in France and Sweden emits very little carbon, it creates nuclear waste. Opinions may differ about the relative dangers of climate change and nuclear power plants, though.

Discussion: Fleet CO2 targets vs pricing actual CO2 emissions

Are tight CO2 fleet standards a good instrument to reduce CO2 emissions?
Basic economic analysis rather points to a uniform price on every tonne of actually emitted CO2 (and the CO2 equivalents of other greenhouse gases) as the optimal instrument to fight climate change in a cost-efficient way. This price could be imposed by a tax or via a comprehensive CO2 trading scheme that would augment the current EU certificate trading with certificates on fossil fuel consumption for transport and household heating.

Compared to such higher CO2 prices, fleet targets have several disadvantages (see also here or here for more detailed discussions):

  • CO2 Fleet targets don’t take into account how many km a car is driven. It provides thus no incentives to drive less. In contrast, higher CO2 prices on fuels provide incentives for both buying more fuel efficient cars and for driving less.

  • Fleet targets only influence the price of newly sold cars. If tighter standards increase new car prices substantially, the incentives to buy new cars go down. The average car fleet gets older, which reduces fuel efficiency. In contrast, higher CO2 prices on fuel provide incentives to buy new, more fuel efficient cars.

  • We have seen how the CO2 emissions of electric cars depend strongly on where they are driven. This is not accounted for in the EU fleet standards. In contrast, with a common CO2 price for car and power plant fuels, it would automatically be relatively more attractive to drive electric cars in countries that produce electricity with little CO2 emissions than in countries that mainly rely on coal power plants.

  • The actual fleet target regulation contains compromises like the weight adjustment that reduce the incentives to make cars more CO2 efficient (see the small sketch after this list).
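To illustrate the weight adjustment point, here is a small sketch using the 2020 target formula that also appears in the code further below (95 + 0.033 * (mass - 1379.88)): a heavier fleet automatically gets a laxer target.

# Weight-adjusted 2020 target (same formula as used later in this post)
target2020 = function(mass) 95 + 0.033 * (mass - 1379.88)
target2020(1379.88)  # 95 g CO2/km at the reference mass
target2020(1479.88)  # 98.3 g CO2/km: 100 kg more mass relaxes the target by 3.3 g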

There are also some possible advantages of fleet targets compared to direct CO2 pricing.

  • Consumers could be myopic and, in their buying decisions, not sufficiently account for the fuel cost savings from more efficient cars. They may react more strongly to a direct increase in car prices induced by tighter fuel standards. But, e.g., Busse et al. (2013) find little evidence for such myopia.

  • Fleet standards that are not too tight, and that can be satisfied if manufacturers make a reasonable effort to reduce emissions, can provide incentives for car manufacturers to make cars more efficient without actually imposing a tax. This leads to a lower tax burden for car users than higher CO2 prices. It may make fleet standards politically easier to implement and may be considered fairer (in particular by car users).

  • Even tight fleet standards, which effectively imply taxes on new cars, may be politically easier to implement than comprehensive, higher CO2 prices. First, fleet standards are not called a tax, which probably gives less scope for political attacks. Second, one could possibly argue that poorer people who cannot afford new or big cars will not be hurt as much (of course, the weight adjustment partly countervails this argument). On the other hand, one could also support poorer households by redistributing the revenues from higher CO2 prices. Third, we actually see that quite tight fleet standards have been implemented in the EU, while there is still no comprehensive EU-wide CO2 certificate trading in sight that also encompasses emissions from car fuels and heating systems. This suggests that it was indeed politically easier to implement the tight fleet regulation, even though it seems less sensible than higher CO2 prices from an economic point of view.

Special Adjustment in 2020 only: Ignore 5% worst cars

There are some additional rules in the EU regulation, which I want to describe in the last 3 sections of this (quite long) blog post.

In 2020 (but no longer from 2021 onward), the EU fleet regulation allows pools to ignore the 5% of cars with the highest CO2 emissions when computing their fleet average. The following plot shows how this exemption reduces the average fees of the 8 largest pools. We also add the further reductions from the super credits for comparison.

dat2018 = dat2018 %>%
  group_by(pool8) %>%
  arrange(co2) %>%
  mutate(
    Q_pool = sum(q),
    q_cumsum = cumsum(q),
    perc_start = (q_cumsum-q) / Q_pool,
    perc_end = q_cumsum / Q_pool,
    q_95 = ifelse(perc_end <= 0.95, q, 0)
  ) %>%
  mutate_rows(perc_start < 0.95 & perc_end > 0.95,
    q_95 = q * (0.95 - perc_start)/(perc_end - perc_start)
  ) %>%
  ungroup()

df = dat2018 %>%
  filter(pool8 != "Other") %>%
  mutate(
    q_95_sc = ifelse(co2 < 50, 2*q_95, q_95)
  ) %>%
  group_by(pool8) %>%
  summarize(
    co2_95 = weighted.mean(co2, q_95, na.rm=TRUE),
    co2_95_sc = weighted.mean(co2, q_95_sc, na.rm=TRUE),
    co2_100 = weighted.mean(co2, q, na.rm=TRUE),
    excess_95 = co2_95-first(target_pool),
    excess_95_sc = co2_95_sc-first(target_pool),
    excess_100 = co2_100-first(target_pool)
  ) %>%
  mutate(
    pool8 = fct_reorder(pool8, co2_100)
  ) %>%
  tidyr::pivot_longer(
    cols = c("excess_95","excess_95_sc", "excess_100"),
    names_to = "scheme", values_to = "excess"
  ) %>%
  mutate(fee = excess*95)

ggplot(df, aes(pool8, fee, color=scheme)) +
  geom_line(aes(group=pool8), color="black") +
  geom_point(size=3) +
  coord_flip()

While the omission of the 5% worst cars helps to reduce fees, all car companies would still pay substantial fees with their 2018 fleets in 2020.

Small manufacturers

The following code shows the 5 cars in our data set from 2018 with the highest CO2 emissions:

# 5 entries with highest fees
dat2018 %>%
  arrange(-co2) %>%
  slice(1:5) %>%
  select(pool, firm, country, cn, mass, co2, excess_co2, fee)

## # A tibble: 5 x 8
##   pool        firm    country cn              mass   co2 excess_co2    fee
## 1 Other       Other   IT      PHANTOM VIII    2635   548       451. 39064.
## 2 vw group pc bugatti AT      BUGATTI CHIRON  2068   516       420. 37818.
## 3 vw group pc bugatti BG      CHIRON          2070   516       420. 37812.
## 4 vw group pc bugatti DE      ""              2070   516       420. 37812.
## 5 vw group pc bugatti DE      BUGATTI CHIRON  2070   516       420. 37812.

The Phantom VIII at the top is actually produced by Rolls-Royce and, if it did not fall under the worst 5% or were not otherwise exempted, would increase the CO2 fee by 39064 Euro in 2020. The Bugattis on ranks 2-5 would correspond to similarly high fees. Will these fees materialize from 2021 onward, when the 5% rule expires? That is not clear.

Article 11 of the EU regulation allows manufacturers that sell fewer than 10000 new cars per year in the EU to apply for a derogation. This essentially means that they are exempt from the general regulation but have to meet individual reduction targets based on their historic CO2 efficiency.

Rolls-Royce and Bugatti also sell far fewer than 10000 cars a year in the EU. However, they are subsidiaries of BMW and VW, respectively, and the EU regulation grants such small subsidiaries a derogation only if they operate their own production facilities and design centers. I could not find data on which manufacturers were granted a derogation and under which conditions.

Let us look at a list of derogation candidates with high CO2 emissions (above the pre-2020 target of 130 g CO2/km):

dat = dat %>%
  group_by(firm, year) %>%
  mutate(
    firm.q = sum(q, na.rm=TRUE),
    firm.co2 = weighted.mean(co2, q, na.rm=TRUE),
    dero.cand = firm.q < 10000 & firm.co2 > 130 | tolower(firm) == "other"
  )

dero.cand = dat %>%
  group_by(pool, firm, year) %>%
  filter(dero.cand) %>%
  summarize(
    co2 = weighted.mean(co2, q, na.rm=TRUE),
    q = sum(q, na.rm=TRUE)
  ) %>%
  group_by(pool, firm) %>%
  summarize(
    co2 = weighted.mean(co2, q, na.rm=TRUE),
    q = round(mean(q, na.rm=TRUE)),
  ) %>%
  arrange(-q)

head(dero.cand, 10)

## # A tibble: 10 x 4
## # Groups:   pool [8]
##    pool          firm        co2     q
##  1 ssangyong     ssangyong  182.  5932
##  2 Other         Other      148.  5161
##  3 fca italy spa maserati   211.  5098
##  4 vw group pc   quattro    227.  5069
##  5 subaru        subaru     166.  3501
##  6 fca italy spa jeep       146.  3309
##  7 mg            mg         137.  3124
##  8 vw group pc   bentley    291.  2587
##  9 ferrari       ferrari    299.  2413
## 10 renault       vaz        193.  2023

# Mean number of maximal derogated cars per year
sum(dero.cand$q)
## [1] 53671

# Percentage of average cars per year
round(100*sum(dero.cand$q) / (sum(dat$q, na.rm=TRUE)/7), 2)
## [1] 0.39

So on average no more than 0.39 percent of newly registered cars should fall under a derogation. The following plot compares the average CO2 emissions of the largest pools when we remove all derogation candidates to the case where we don't.

pool.dero.dat = dat %>%
  filter(!dero.cand) %>%
  group_by(pool8, year) %>%
  summarize(
    co2_pool = weighted.mean(co2, q, na.rm = TRUE),
    mass_pool = weighted.mean(mass, q, na.rm=TRUE)
  ) %>%
  left_join(reg.df, by="year") %>%
  mutate(
    target_pool = target + a*(mass_pool-M0),
    target2020_pool = 95 + 0.033*(mass_pool-1379.88)
  )

ggplot(pool.dat, aes(x=year, y=co2_pool)) +
  facet_wrap(~pool8) +
  geom_line() + geom_point() +
  geom_line(data=pool.dero.dat, color="blue")

It makes almost no visual difference (except a bit for FCA) whether these small subsidiaries that are derogation candidates are removed from the pools or not. Even removing these cars from the category Other only matters marginally.

Eco innovations

Car producers may implement some Eco innovations that reduce CO2 emissions in a way that is not measured by the standardized test cycles. The following code shows, for the 8 largest pools, the CO2 reductions in g/km through such Eco innovations:

dat2018 %>%
  mutate_rows(!is.finite(reduction_nedc), reduction_nedc=0) %>%
  group_by(pool8) %>%
  summarize(eco_innovation = weighted.mean(reduction_nedc, q))

## # A tibble: 9 x 2
##   pool8           eco_innovation
## 1 vw group pc         0.00922
## 2 psa-opel            0.00171
## 3 renault             0.00000788
## 4 ford-werke gmbh     0.00396
## 5 bmw group           0.356
## 6 daimler ag          0.493
## 7 toyota-mazda        0.0134
## 8 Other               0.00683
## 9 fca-tesla           0.00118

Even for the pool with the largest Eco innovations (Daimler), the average CO2 reduction is a bit less than 0.5 g CO2/km and thus so far almost negligible.

Appendix: Who understands the details of the cap on super credits?

Article 5a on super credits states that there is a cap of at most 7.5 g CO2/km of reductions through super credits for the total period 2020-2022. I am not sure how exactly that is implemented. If Article 5a referred to pools, it would probably mean that over the three years together a pool's average CO2 emissions with super credits can be at most 7.5 g CO2/km lower than without super credits. But what if the pool composition changes during those 3 years? Moreover, Article 5a refers to a manufacturer, not a pool, and Article 7 about pools only says that pools are relevant for the obligations in Article 4; it says nothing about Article 5a.

Assume Article 5a refers only to manufacturers, not to pools. Let us take a look at the reduction of average CO2 emissions due to super credits at the firm level in 2020, using the 2018 fleet:

dat2018 %>%
  mutate(
    q_sc = ifelse(co2<50, q*2, q)
  ) %>%
  group_by(firm) %>%
  summarize(
    co2_firm = weighted.mean(co2, q, na.rm=TRUE),
    co2_sc_firm = weighted.mean(co2, q_sc, na.rm=TRUE),
    reductions_sc = co2_firm - co2_sc_firm
  ) %>%
  filter(firm %in% c("tesla", levels(dat2018$firm11))) %>%
  arrange(-reductions_sc)

## # A tibble: 13 x 4
##    firm       co2_firm co2_sc_firm reductions_sc
##  1 bmw            128.        124.       3.74
##  2 renault        111.        108.       3.05
##  3 daimler        134.        132.       2.05
##  4 volkswagen     120.        118.       1.93
##  5 Other          146.        145.       1.51
##  6 audi           129.        128.       0.769
##  7 citroen        108.        108.       0.286
##  8 toyota         103.        102.       0.280
##  9 peugeot        108.        108.       0.205
## 10 opel           125.        125.       0.177
## 11 ford           123.        123.       0.00226
## 12 fiat           120.        120.       0
## 13 tesla            0           0        0

Already with the 2018 fleet, the super credit reduction for BMW would be 3.74 g CO2/km in 2020. Given that there are strong incentives to sell more electric cars, the cap will probably be binding for BMW and possibly some other manufacturers.

But I am still not sure how the cap will be computed. What happens if Tesla joins one pool in 2020 and another in 2021? For Tesla itself, super credits don't reduce CO2 emissions, as they are always 0. Yet inside a pool, Tesla's super credits may reduce CO2 emissions substantially. Maybe it is clear to lawyers (and Tesla probably employs clever people who have figured it out), but as an economist I really think that some example calculation would make this regulation easier to understand.
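To give one example calculation under my own reading (this is just my assumption, not the official method): if the cap applies to a manufacturer's cumulative super credit reduction over 2020-2022 and BMW kept its 2018 fleet in all three years, the cap would clearly bind:

# Sketch of one possible reading of the Article 5a cap (my assumption):
# cumulative super credit reduction over 2020-2022 is limited to 7.5 g CO2/km.
yearly_reduction = 3.74                  # BMW's reduction with the 2018 fleet (computed above)
total_reduction  = 3 * yearly_reduction  # 11.22 g CO2/km over 2020-2022
total_reduction > 7.5                    # TRUE, so the cap would bind
min(total_reduction, 7.5) / 3            # at most 2.5 g CO2/km per year on average
# (This ignores that the super credit multiplier may be lower after 2020.)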


To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Business Case Analysis with R (Guest Post)

Tue, 01/14/2020 - 09:00

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Learning Machines proudly presents a fascinating guest post by decision and risk analyst Robert D. Brown III with a great application of R in the business and especially startup arena! I encourage you to visit his blog too: Thales' Press. Have fun!

Introduction

While Excel remains the tool of choice among analysts for business case analysis, presumably because of its low barrier to entry and ease of use, it nonetheless continues to be the source of many vexing and disastrous errors. As the research of Panko and others has revealed, these errors arise in part for the following reasons:

  • Analysts persistently lack QA standards and practices to ensure that spreadsheets are as error-free as possible.
  • Cell referencing and replicating formulas across relevant dimensions propagate errors at exponential rates.
  • Effective auditing is obscured because the conceptual meaning of formulas constructed from grid coordinates is difficult to interpret.

To suggest a means to remedy that situation, in the spring of 2018 I published Business Case Analysis with R: Simulation Tutorials to Support Complex Business Decisions (BCAwR). I also wanted to showcase how the popular programming language R can be employed to evaluate business case opportunities that are fraught with uncertainty and not supported by an abundance of well-structured, pedigreed, and relevant data, a situation that plagues just about all business decisions faced in the real world. BCAwR takes the reader through the process of analyzing a rather complex commercial investment problem of bringing a chemical processing plant on line. For this post, I will present a simplified model to give a preview of the contents of the book. I will also provide a little surprise at the end that I hope will contribute to bridging the efforts of the decision and data science communities.

But before we delve into any coding or quantitative activity related to the analysis of any business case, we ought to thoroughly frame our problem: narrowing the scope of our inquiry to the appropriate level for the context of the problem at hand, identifying the goals and objectives we want to achieve when we eventually commit to action, partitioning the set of actionable decisions we can make to achieve our goals, and creating an inventory of the influences and relationships that connect decisions and uncertainties to the measures of success. Although we will look at a much more simplified business problem here, a little framing will still serve us well.

Imagine that we are considering investing in a project that solves a particular market problem with the ultimate desire to generate a financial return for our organization. We identify three possible strategic pathways to compare for the investment, each with a different level of capital commitment required to address certain nuances of the market problem. Each (1) investment, in turn, will generate (2) an initial post-investment net cash flow, (3) a peak cash flow, and (4) a varying duration of time to reach the peak from the initial cash flow.

To complicate the matter, for each strategy, we do not know the precise quantitative value of the four parameters that characterize the potential cash flow. For this reason, when we evaluate the three strategies, we will employ a Monte Carlo simulation approach that samples a large number of potential realizations for each parameter from probability distributions. With these ideas in mind, we can now set up the coding to handle the simulation problem.

library(leonRdo)    # https://www.incitedecisiontech.com/packages/packages.html
library(tidyverse)  # (on CRAN)

# Set the simulation seed.
set.seed(42)

# The number of simulation runs per uncertain variable.
trials <- 1000

# The time frame across which our cash flow simulation will run.
# The initial investment will occur in year 0.
year <- 0:15

# The weighted average cost of capital without any additional risk premium.
discount.rate <- 0.1

# The distinctive capital allocation strategic decisions.
strategy <- c("Strategy 1", "Strategy 2", "Strategy 3")

R employs simple Monte Carlo techniques, which typically don't yield stable means for result distributions until many thousands of trials are used. However, when developing a model like this, I suggest setting trials to 50-100 to make sure that the logic runs correctly in short, responsive time frames. Then, when you're ready to produce real results, increase the value to more than 10,000. Given that Monte Carlo in R is noisy even for reasonably large samples (~1000) and for different seed settings, I recently developed the leonRdo package (R Packages: leonRdo & inteRest) to provide a median Latin hypercube approach to sampling that produces much more stable means with approximately 1/10th the number of trials, regardless of the seed value used.
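As a rough illustration of that stability claim (a sketch that is not from BCAwR; it only assumes the rlnorm_mlhs() sampler used below, with leonRdo loaded above, and base R's rlnorm()):

# Compare the run-to-run spread of the sample mean for plain Monte Carlo
# vs. median Latin hypercube sampling at a small sample size.
mc.means   <- replicate(50, mean(rlnorm(100, meanlog = log(800), sdlog = log(1 + 200 / 800))))
mlhs.means <- replicate(50, mean(rlnorm_mlhs(n = 100, meanlog = log(800), sdlog = log(1 + 200 / 800))))
sd(mc.means)    # noticeable spread across runs
sd(mlhs.means)  # should be far smaller (possibly zero) across runs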

In this particular showcase discussion, I have included the tidyverse package (on CRAN) to help with data manipulation and graphing. In the original BCAwR, I based all the code on base R so that readers new to R and simulation could focus on learning the key principles rather than the idiosyncratic syntax of secondary packages.

The set.seed() function ensures that we will produce the same sample trial set on each run of our model. This will help us debug problems as we develop code. Eventually as others might interact with our code, they will be able to observe the same set of results we work with. Later, we can reset the seed to other values to make sure that our inferences on a given set of trials remain consistent on a different trial set.

Model Functions

Next, we need to declare some important functions that we will call in our model. The first is a function that provides an analytic solution to the differential equation that relates the proportional absorption of an entity into a fixed environment. This ramp up function takes as parameters the amount of time Tp required to go from, say, Y = 1% absorption to Y = 99% across Time. We will use this function to model the growth of our cash flow over the year index.

# Ramp up function.
calcRampUp <- function(Y0, Time, Tp) {
  # Y0 = initial absorption
  # Time = time index
  # Tp = time to Y = 1 - Y0, the time to peak.
  Y <- 1 / (1 + (Y0 / (1 - Y0)) ^ (2 * (Time - 1) / (Tp - 1) - 1))
  return(Y)
}
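For intuition, a quick illustrative call (not from the original post) shows the S-shaped ramp over the year index:

# Absorption rises from 1% in year 1 to 99% in the peak year (here Tp = 8).
round(calcRampUp(Y0 = 0.01, Time = 1:10, Tp = 8), 3)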

The second function we need is the actual business model we want to represent. The business model logic will remain fixed across strategies, and the results will vary only on the basis of the strategy-conditional parameters we supply. By "functionalizing" our model, we can iterate the entire model over variable ranges to understand the sensitivity of the output function to those variables, as we shall see later. Please note, this is a toy model we are using for illustration purposes only. BCAwR demonstrates a much more complex consideration for the implementation of a chemical processing plant.

# Business model cash flow.
calcCashFlow <- function(I, SCF, PCF, YP, Y) {
  # I = initial investment
  # SCF = starting cash flow
  # PCF = peak cash flow
  # YP = year of peak cash flow
  # Y = year index
  cf <- (Y == 0) * (-I) +
    (Y > 0) * (SCF + (PCF - SCF) * calcRampUp(0.01, Y, YP))
  return(cf)
}

The last function we will employ is the net present value (NPV) function that we will apply to the individual trial results of the cash flow simulation across strategies.

# The net present value of the cash flow function.
calcNPV <- function(CF, Y, DR) {
  # CF = cash flow vector
  # Y = year index
  # DR = discount rate
  npv <- sum(CF / (1 + DR) ^ Y)
  return(npv)
}

Initialize Simulation Trial Samples

Our next task is to create trial samples for the business model cash flow function parameters for the three project strategy investments. For this particular example, I have chosen to use canonical distributions as if their parameters were based on empirical or historical data and to utilize a simple method for generating trials. However, typically I would use a combination of both empirical data and subject matter expert guidance reflected as cumulative probabilities across the range of the assessed variables’ potential values. I explain how to take this latter approach in greater detail in Section III of BCAwR.

investment1 <- rlnorm_mlhs(n = trials, meanlog = log(800), sdlog = log(1 + 200 / 800))
investment2 <- rlnorm_mlhs(n = trials, meanlog = log(700), sdlog = log(1 + 50 / 700))
investment3 <- rlnorm_mlhs(n = trials, meanlog = log(1000), sdlog = log(1 + 150 / 1000))

start.cash.flow1 <- rnorm_mlhs(n = trials, mean = -100, sd = 20)
start.cash.flow2 <- rnorm_mlhs(n = trials, mean = -90, sd = 5)
start.cash.flow3 <- rnorm_mlhs(n = trials, mean = -120, sd = 15)

peak.cash.flow1 <- rnorm_mlhs(n = trials, mean = 300, sd = 20)
peak.cash.flow2 <- rnorm_mlhs(n = trials, mean = 280, sd = 5)
peak.cash.flow3 <- rnorm_mlhs(n = trials, mean = 375, sd = 15)

yr.peak1 <- rnorm_mlhs(n = trials, mean = 8.5, sd = 0.75)
yr.peak2 <- rnorm_mlhs(n = trials, mean = 10, sd = 0.85)
yr.peak3 <- rnorm_mlhs(n = trials, mean = 11, sd = 1)

# Store the business model parameter samples in a list of data frames.
proj.data <- list(
  investment = data.frame(investment1, investment2, investment3),
  start.cash.flow = data.frame(start.cash.flow1, start.cash.flow2, start.cash.flow3),
  peak.cash.flow = data.frame(peak.cash.flow1, peak.cash.flow2, peak.cash.flow3),
  yr.peak = data.frame(yr.peak1, yr.peak2, yr.peak3)
)

Model Results

For each strategy, we apply the samples from each parameter to the business model cash flow function. This will result in a list of cash flows for the three project strategies. For each strategy, there will be as many cash flows as defined by trials, and each cash flow trial will be as long as the year index. We can use the lapply() and sapply() functions to avoid using for loops [see also Learning R: A gentle introduction to higher-order functions].

proj.cf <- lapply(1:length(strategy), function(s) {
  sapply(1:trials, function(t) {
    calcCashFlow(
      I = proj.data$investment[[s]][t],
      SCF = proj.data$start.cash.flow[[s]][t],
      PCF = proj.data$peak.cash.flow[[s]][t],
      YP = proj.data$yr.peak[[s]][t],
      Y = year
    )
  })
})
names(proj.cf) <- strategy

By running the first five trials of Strategy 1, we can see what the cash flows look like.

head(round(proj.cf[[1]][, 1:5], digits = 1), n = length(year))

##         [,1]   [,2]   [,3]   [,4]   [,5]
##  [1,] -852.7 -852.2 -766.7 -974.3 -912.6
##  [2,]  -85.8 -103.1 -117.2 -116.1  -88.1
##  [3,]  -77.8  -93.6 -107.3 -106.9  -79.9
##  [4,]  -55.3  -64.3  -76.7  -79.9  -57.1
##  [5,]    0.0    9.4    0.2  -13.1   -1.6
##  [6,]   98.0  128.7  123.7  102.2   97.9
##  [7,]  201.5  230.2  227.2  215.6  206.8
##  [8,]  265.3  279.2  276.2  279.8  276.9
##  [9,]  292.5  296.5  293.3  305.3  308.0
## [10,]  302.3  301.8  298.5  314.1  319.5
## [11,]  305.6  303.4  300.1  316.9  323.5
## [12,]  306.7  303.9  300.5  317.8  324.8
## [13,]  307.1  304.1  300.7  318.1  325.2
## [14,]  307.2  304.1  300.7  318.2  325.4
## [15,]  307.2  304.1  300.7  318.2  325.4
## [16,]  307.2  304.1  300.7  318.2  325.5

We can calculate some summary results for the cash flows for each strategy and plot them. First, we might like to plot the mean cash flow over time.

proj.cf.mean <- as.data.frame(lapply(strategy, function(s) {
  rowMeans(proj.cf[[s]])
}))
names(proj.cf.mean) <- strategy
proj.cf.mean <- cbind(year, proj.cf.mean)

# Plot the mean strategy cash flows.
gg.mean.cf <- ggplot(gather(proj.cf.mean, "strategy", "mean", -year)) +
  geom_line(aes(x = year, y = mean, color = strategy)) +
  geom_point(aes(x = year, y = mean, color = strategy)) +
  labs(title = "Mean Strategy Cash Flow", y = "[$M]")
plot(gg.mean.cf)

Figure 1: The mean cash flows over time show tradeoffs in initial outlays, times to peak, level of peak, and the timing of breakeven.

Then we can calculate the cumulative mean cash flow and plot that for each strategy, too.

proj.ccf.mean <- proj.cf.mean %>%
  mutate_at(vars(strategy), list(~ cumsum(.)))

# Plot the cumulative mean strategy cash flows.
gg.mean.ccf <- ggplot(gather(proj.ccf.mean, "strategy", "mean", -year)) +
  geom_line(aes(x = year, y = mean, color = strategy)) +
  geom_point(aes(x = year, y = mean, color = strategy)) +
  labs(title = "Cumulative Mean Strategy Cash Flow", y = "[$M]")
plot(gg.mean.ccf)

Figure 2: The cumulative mean cash flows over time show tradeoffs in drawdown, nominal payback timing, and terminal cash recovery level.

Now we can observe the risk profile of the cash flows by calculating the trial NPVs of the cash flows.

proj.npv <- as.data.frame(lapply(1:length(strategy), function(s) {
  sapply(1:trials, function(t) {
    calcNPV(CF = proj.cf[[s]][, t], Y = year, DR = discount.rate)
  })
}))
names(proj.npv) <- strategy

# Plot the CDF of the strategies' sample NPVs.
gg.ecdf <- ggplot(gather(proj.npv, "strategy", "NPV"),
                  aes(x = NPV, color = strategy)) +
  stat_ecdf(geom = "point") +
  labs(title = "Strategy NPV Risk Profile",
       x = "Strategy NPV [$M]",
       y = "Cumulative Probability")
plot(gg.ecdf)

Figure 3: The cumulative probability chart shows tradeoffs in strategy dominance, range of value, and relative value volatility.

And we can calculate the mean NPV of the cash flows.

proj.npv.mean <- round(colMeans(proj.npv), digits = 1)
print(proj.npv.mean)

## Strategy 1 Strategy 2 Strategy 3
##      157.4       59.3     -123.6

The first observation we make about the risk profiles of the strategies is that we can dismiss Strategy 3 immediately: it presents a negative mean NPV, the probability that it produces a negative economic value is ~75% (i.e., P(NPV <= 0) ≈ 0.75), and practically none of its trials present any opportunity for dominance over any other strategy.

When we observe the relative volatility and dominance of the remaining strategies, we realize that we face a bit of ambiguity about how to choose the best pathway forward. Strategy 1 exhibits the best overall mean NPV, but it does so with the greatest relative volatility. While Strategy 2 exhibits approximately the same risk of failure (~25%) as Strategy 1 (~27%), it also exhibits the least maximum exposure and relative volatility. To reduce the ambiguity of choosing, we might like to know which uncertainty, given the overall quality of the information we possess about it, contributes most to switching dominance from Strategy 1 to Strategy 2. Knowing which uncertainty our strategy values are most sensitive to gives us the ability to stop worrying about the other uncertainties for the purpose of choosing clearly and to focus only on improving our understanding of those that matter most.
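These failure probabilities can be read off the risk profile chart or computed directly from the simulated NPVs (a quick check using the proj.npv data frame created above):

# Probability of a non-positive NPV for each strategy.
round(colMeans(proj.npv <= 0), 2)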

To accomplish that latter feat, we run our functionalized model to test the sensitivity of the strategy means to the 80th percentile range in each of the variables. We start by initializing the lists to hold the sensitivity responses.

data.temp <- proj.data
proj.npv.sens <- list(p10 = proj.data, p90 = proj.data)

d <- data.frame(0, 0, 0)
dd <- list(
  investment = d,
  start.cash.flow = d,
  peak.cash.flow = d,
  yr.peak = d
)
proj.npv.sens.mean <- list(p10 = dd, p90 = dd)

We calculate the sensitivity of the strategies’ mean NPVs by fixing each variable to its p10 and p90 quantile values sequentially while all the other variables run according to their defined variation. The result is a chart that shows how sensitive we might be to changing our decision to pursue the best strategy over the next best strategy on the outcome of a given uncertainty.

p1090 <- c(0.1, 0.9)

for (q in 1:length(p1090)) {
  for (v in 1:length(proj.data)) {
    for (s in 1:length(strategy)) {
      data.temp[[v]][s] <- rep(quantile(unlist(proj.data[[v]][s]), probs = p1090[q]), trials)
      for (t in 1:trials) {
        proj.npv.sens[[q]][[v]][t, s] <- calcNPV(
          CF = calcCashFlow(
            I = data.temp$investment[[s]][t],
            SCF = data.temp$start.cash.flow[[s]][t],
            PCF = data.temp$peak.cash.flow[[s]][t],
            YP = data.temp$yr.peak[[s]][t],
            Y = year
          ),
          Y = year,
          DR = discount.rate
        )
      }
      proj.npv.sens.mean[[q]][[v]][s] <- mean(proj.npv.sens[[q]][[v]][, s])
      data.temp <- proj.data
    }
  }
}

# Recast the sensitivity values to a form that can be plotted.
variable.names <- c("investment", "start.cash.flow", "peak.cash.flow", "yr.peak")

proj.npv.sens.mean2 <- as.data.frame(sapply(1:length(p1090), function(q) {
  sapply(1:length(proj.data), function(v) {
    sapply(1:length(strategy), function(s) {
      proj.npv.sens.mean[[q]][[v]][[s]]
    })
  })
})) %>%
  rename(p10 = V1, p90 = V2) %>%
  mutate(
    variable = rep(variable.names, each = length(strategy)),
    strategy = rep(strategy, times = length(proj.data)),
    strategy.mean = rep(t(proj.npv.mean), times = length(proj.data))
  ) %>%
  select(variable, strategy, strategy.mean, p10, p90)

gg.sens <- ggplot(proj.npv.sens.mean2,
                  aes(x = variable, y = p10, yend = p90, color = strategy)) +
  geom_segment(aes(x = variable, xend = variable, y = p10, yend = p90),
               color = "gray50", size = 0.75) +
  geom_point(aes(x = variable, y = strategy.mean, color = strategy),
             size = 1, color = "black", fill = "black", shape = 23) +
  geom_point(aes(x = variable, y = p90), size = 3) +
  geom_point(aes(x = variable, y = p10), size = 3) +
  labs(title = "Sensitivity of Strategy Mean NPV to Uncertain Variable Ranges",
       y = "Strategy NPV [$M]") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(~ strategy)
plot(gg.sens)

Figure 4: The sensitivity chart shows the range of variation in a strategy mean due to an 80th percentile variation in a given uncertainty. The mean value of each strategy is marked by the black diamond within the range of each variable. According to our results here, we should focus additional research attention on the investment range first and then the year to peak.

This sensitivity chart is valuable not only to the decision analyst, but to the data analyst as well, for one simple reason: now that we know which uncertainties potentially expose us to the greatest risk or reward, we can calculate the value of information on those uncertainties. This value of information represents the rational maximum budget we should allocate to acquire better data and insight on those uncertainties. While a typical business case analysis of sufficient complexity may contain 10-100 uncertainties, we should focus our research budget only on those that are critical to making a decision. This simple "stopping function" helps us ensure that our data science resources are allocated properly and most efficiently. Section IV of BCAwR provides a means to calculate the value of information on continuous variables like the ones we have utilized in our example here.
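To give a taste of what such a calculation can look like, here is a minimal sketch of the expected value of perfect information (EVPI) over all uncertainties at once; it is not the per-variable method described in Section IV of BCAwR:

# EVPI over all uncertainties: the mean value of always picking the best
# strategy in each trial minus the value of committing to the strategy that
# is best on average. This bounds what any further information is worth.
best.per.trial <- apply(proj.npv, 1, max)
evpi <- mean(best.per.trial) - max(colMeans(proj.npv))
evpi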

Conclusion

Business decision analysts can definitely use R to conduct many of their business case analyses in a more transparent and less error-prone environment than their current tools typically allow, especially through the use of clearly named variables and functionalized structural models. However, the real power that R provides is the means to simulate information when high-quality, large-scale data may be scarce, infeasible, or impossible to obtain. Then, by using such concepts as the value of information, data scientists can be called upon to refine the quality of information on those critical uncertainties that may frustrate success when commitment to action turns into real decisions. I hope the simple example presented here inspires you to learn more about it.


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

eRum2020: call for submissions open!

Tue, 01/14/2020 - 08:00

[This article was first published on MilanoR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The new year has started and we at MilanoR are deep into the frantic organization of the third European R User Meeting (eRum). eRum will take place next May (27-30) in Milan; it is a nonprofit event and completely community-driven. An impressive array of experts, enthusiasts and R professionals will animate the conference. We expect more than 500 R users and a rich showcase of talks in six different tracks. MilanoR is the organizing partner, in collaboration with Università di Milano Bicocca and Politecnico di Milano.

Submit your work to eRum!

The call for contributed sessions is now officially open. You can now submit abstracts for the following types of contributed sessions (the deadline for submissions is January 29, 2020):

  • Workshops
  • Regular Talks
  • Lightning Talks
  • Shiny Demos
  • Posters

If you are an R developer or lover, this is a great way to showcase your work and find peers and opportunities. The whole team and collaborators of MilanoR are dedicating plenty of time to building a great event. Don't miss this chance and submit your work! You can easily do so by clicking this link.

Become a Sponsor

eRum2020 is also a great sponsorship opportunity. Besides our current sponsors, there is still room for other companies and associations working with R in the data world that are willing to sponsor the conference. By becoming a sponsor, you will gain high visibility and access to an audience of R professionals from all over Europe. You could be part of the conference with a talk or workshop, and you will certainly be able to find and recruit skilled and motivated talent. Download the sponsorship packages brochure or write to us (sponsors@erum.io).

Book your agenda, submit your R-work and see you at eRum2020!
(Do not miss our social activity on Twitter, Facebook or LinkedIn.)

The post eRum2020: call for submissions open! appeared first on MilanoR.


To leave a comment for the author, please follow the link and comment on their blog: MilanoR. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Biomedical Data Science Textbook Available

Tue, 01/14/2020 - 02:34

[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

By Bob Hoyt & Bob Muenchen

Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work more effectively with data scientists. The textbook content and data exercises do not require programming skills or higher math. We introduce open source tools such as R and Python, as well as easy-to-use interfaces to them such as BlueSky Statistics, jamovi, R Commander, and Orange. Chapter exercises are based on healthcare data, and supplemental YouTube videos are available in most chapters.

For instructors, we provide PowerPoint slides for each chapter, exercises, quiz questions, and solutions. Instructors can download an electronic copy of the book, the Instructor Manual, and PowerPoints after first registering on the instructor page.

The book is available in print and various electronic formats. Because it is self-published, we plan to update it more rapidly than would be possible through traditional publishers.

Below you will find a detailed table of contents and a list of the textbook authors.

Table of Contents

​OVERVIEW OF BIOMEDICAL DATA SCIENCE

  1. Introduction
  2. Background and history
  3. Conflicting perspectives
    1. the statistician’s perspective
    2. the machine learner’s perspective
    3. the database administrator’s perspective
    4. the data visualizer’s perspective
  4. Data analytical processes
    1. raw data
    2. data pre-processing
    3. exploratory data analysis (EDA)
    4. predictive modeling approaches
    5. types of models
    6. types of software
  5. Major types of analytics
    1. descriptive analytics
    2. diagnostic analytics
    3. predictive analytics (modeling)
    4. prescriptive analytics
    5. putting it all together
  6. Biomedical data science tools
  7. Biomedical data science education
  8. Biomedical data science careers
  9. Importance of soft skills in data science
  10. Biomedical data science resources
  11. Biomedical data science challenges
  12. Future trends
  13. Conclusion
  14. References

​​SPREADSHEET TOOLS AND TIPS

  1. Introduction
    1. basic spreadsheet functions
    2. download the sample spreadsheet
  2. Navigating the worksheet
  3. Clinical application of spreadsheets
    1. formulas and functions
    2. filter
    3. sorting data
    4. freezing panes
    5. conditional formatting
    6. pivot tables
    7. visualization
    8. data analysis
  4. Tips and tricks
    1. Microsoft Excel shortcuts – windows users
    2. Google sheets tips and tricks
  5. Conclusions
  6. Exercises
  7. References

​​BIOSTATISTICS PRIMER

  1. Introduction
  2. Measures of central tendency & dispersion
    1. the normal and log-normal distributions
  3. Descriptive and inferential statistics
  4. Categorical data analysis
  5. Diagnostic tests
  6. Bayes’ theorem
  7. Types of research studies
    1. observational studies
    2. interventional studies
    3. meta-analysis
    4. correlation
  8. Linear regression
  9. Comparing two groups
    1. the independent-samples t-test
    2. the wilcoxon-mann-whitney test
  10. Comparing more than two groups
  11. Other types of tests
    1. generalized tests
    2. exact or permutation tests
    3. bootstrap or resampling tests
  12. Stats packages and online calculators
    1. commercial packages
    2. non-commercial or open source packages
    3. online calculators
  13. Challenges
  14. Future trends
  15. Conclusion
  16. Exercises
  17. References

​​DATA VISUALIZATION

  1. Introduction
    1. historical data visualizations
    2. visualization frameworks
  2. Visualization basics
  3. Data visualization software
    1. Microsoft Excel
    2. Google sheets
    3. Tableau
    4. R programming language
    5. other visualization programs
  4. Visualization options
    1. visualizing categorical data
    2. visualizing continuous data
  5. Dashboards
  6. Geographic maps
  7. Challenges
  8. Conclusion
  9. Exercises
  10. References

​​INTRODUCTION TO DATABASES

  1. Introduction
  2. Definitions
  3. A brief history of database models
    1. hierarchical model
    2. network model
    3. relational model
  4. Relational database structure
  5. Clinical data warehouses (CDWs)
  6. Structured query language (SQL)
  7. Learning SQL
  8. Conclusion
  9. Exercises
  10. References

BIG DATA

  1. Introduction
  2. The seven v’s of big data related to health care data
  3. Technical background
  4. Application
  5. Challenges
    1. technical
    2. organizational
    3. legal
    4. translational
  6. Future trends
  7. Conclusion
  8. References

​​BIOINFORMATICS and PRECISION MEDICINE

  1. Introduction
  2. History
  3. Definitions
  4. Biological data analysis – from data to discovery
  5. Biological data types
    1. genomics
    2. transcriptomics
    3. proteomics
    4. bioinformatics data in public repositories
    5. biomedical cancer data portals
  6. Tools for analyzing bioinformatics data
    1. command line tools
    2. web-based tools
  7. Genomic data analysis
  8. Genomic data analysis workflow
    1. variant calling pipeline for whole exome sequencing data
    2. quality check
    3. alignment
    4. variant calling
    5. variant filtering and annotation
    6. downstream analysis
    7. reporting and visualization
  9. Precision medicine – from big data to patient care
  10. Examples of precision medicine
  11. Challenges
  12. Future trends
  13. Useful resources
  14. Conclusion
  15. Exercises
  16. References

​​PROGRAMMING LANGUAGES FOR DATA ANALYSIS

  1. Introduction
  2. History
  3. R language
    1. installing R & rstudio
    2. an example R program
    3. getting help in R
    4. user interfaces for R
    5. R’s default user interface: rgui
    6. Rstudio
    7. menu & dialog guis
    8. some popular R guis
    9. R graphical user interface comparison
    10. R resources
  4. Python language
    1. installing Python
    2. an example Python program
    3. getting help in Python
    4. user interfaces for Python
  5. reproducibility
  6. R vs. Python
  7. Future trends
  8. Conclusion
  9. Exercises
  10. References

​​MACHINE LEARNING

  1. Brief history
  2. Introduction
    1. data refresher
    2. training vs test data
    3. bias and variance
    4. supervised and unsupervised learning
  3. Common machine learning algorithms
  4. Supervised learning
  5. Unsupervised learning
    1. dimensionality reduction
    2. reinforcement learning
    3. semi-supervised learning
  6. Evaluation of predictive analytical performance
    1. classification model evaluation
    2. regression model evaluation
  7. Machine learning software
    1. Weka
    2. Orange
    3. Rapidminer studio
    4. KNIME
    5. Google TensorFlow
    6. honorable mention
    7. summary
  8. Programming languages and machine learning
  9. Machine learning challenges
  10. Machine learning examples
    1. example 1 classification
    2. example 2 regression
    3. example 3 clustering
    4. example 4 association rules
  11. Conclusion
  12. Exercises
  13. References

​​ARTIFICIAL INTELLIGENCE

  1. Introduction
    1. definitions
  2. History
  3. AI architectures
  4. Deep learning
  5. Image analysis (computer vision)
    1. Radiology
    2. Ophthalmology
    3. Dermatology
    4. Pathology
    5. Cardiology
    6. Neurology
    7. Wearable devices
    8. Image libraries and packages
  6. Natural language processing
    1. NLP libraries and packages
    2. Text mining and medicine
    3. Speech recognition
  7. Electronic health record data and AI
  8. Genomic analysis
  9. AI platforms
    1. deep learning platforms and programs
  10. Artificial intelligence challenges
    1. General
    2. Data issues
    3. Technical
    4. Socio economic and legal
    5. Regulatory
    6. Adverse unintended consequences
    7. Need for more ML and AI education
  11. Future trends
  12. Conclusion
  13. Exercises
  14. References
Authors

Brenda Griffith
Technical Writer
Data.World
Austin, TX

Robert Hoyt MD, FACP, ABPM-CI, FAMIA
Associate Clinical Professor
Department of Internal Medicine
Virginia Commonwealth University
Richmond, VA

David Hurwitz MD, FACP, ABPM-CI
Associate CMIO
Allscripts Healthcare Solutions
Chicago, IL

Madhurima Kaushal MS
Bioinformatics
Washington University at St. Louis, School of Medicine
St. Louis, MO

Robert Leviton MD, MPH, FACEP, ABPM-CI, FAMIA
Assistant Professor
New York Medical College
Department of Emergency Medicine
Valhalla, NY

Karen A. Monsen PhD, RN, FAMIA, FAAN
Professor
School of Nursing
University of Minnesota
Minneapolis, MN

Robert Muenchen MS, PSTAT
Manager, Research Computing Support
University of Tennessee
Knoxville, TN

Dallas Snider PhD
Chair, Department of Information Technology
University of West Florida
Pensacola, FL

​A special thanks to Ann Yoshihashi MD for her help with the publication of this textbook.


To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Biomedical Data Science Textbook Available

Tue, 01/14/2020 - 02:34

[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

By Bob Hoyt
& Bob Muenchen

Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and to work more effectively with data scientists. The textbook content and data exercises do not require programming skills or higher math. We introduce open source tools such as R and Python, as well as easy-to-use interfaces to them such as BlueSky Statistics, jamovi, R Commander, and Orange. Chapter exercises are based on healthcare data, and supplemental YouTube videos are available in most chapters.

For instructors, we provide PowerPoint slides for each chapter, exercises, quiz questions, and solutions. Instructors can download an electronic copy of the book, the Instructor Manual, and PowerPoints after first registering on the instructor page.

The book is available in print
and various electronic formats
. Because it is self-published, we plan to update it more rapidly than would be
possible through traditional publishers.

Below you will find a detailed table of contents and a list
of the textbook authors.

Table of Contents​

​OVERVIEW OF BIOMEDICAL DATA SCIENCE

  1. Introduction
  2. Background and history
  3. Conflicting perspectives
    1. the statistician’s perspective
    2. the machine learner’s perspective
    3. the database administrator’s perspective
    4. the data visualizer’s perspective
  4. Data analytical processes
    1. raw data
    2. data pre-processing
    3. exploratory data analysis (EDA)
    4. predictive modeling approaches
    5. types of models
    6. types of software
  5. Major types of analytics
    1. descriptive analytics
    2. diagnostic analytics
    3. predictive analytics (modeling)
    4. prescriptive analytics
    5. putting it all together
  6. Biomedical data science tools
  7. Biomedical data science education
  8. Biomedical data science careers
  9. Importance of soft skills in data science
  10. Biomedical data science resources
  11. Biomedical data science challenges
  12. Future trends
  13. Conclusion
  14. References

​​SPREADSHEET TOOLS AND TIPS

  1. Introduction
    1. basic spreadsheet functions
    1. download the sample spreadsheet
  2. Navigating the worksheet
  3. Clinical application of spreadsheets
    1. formulas and functions
    2. filter
    3. sorting data
    4. freezing panes
    5. conditional formatting
    6. pivot tables
    7. visualization
    8. data analysis
  4. Tips and tricks
    1. Microsoft Excel shortcuts – windows users
    2. Google sheets tips and tricks
  5. Conclusions
  6. Exercises
  7. References

​​BIOSTATISTICS PRIMER

  1. Introduction
  2. Measures of central tendency & dispersion
    1. the normal and log-normal distributions
  3. Descriptive and inferential statistics
  4. Categorical data analysis
  5. Diagnostic tests
  6. Bayes’ theorem
  7. Types of research studies
    1. observational studies
    2. interventional studies
    3. meta-analysis
    4. correlation
  8. Linear regression
  9. Comparing two groups
    1. the independent-samples t-test
    2. the wilcoxon-mann-whitney test
  10. Comparing more than two groups
  11. Other types of tests
    1. generalized tests
    2. exact or permutation tests
    3. bootstrap or resampling tests
  12. Stats packages and online calculators
    1. commercial packages
    2. non-commercial or open source packages
    3. online calculators
  13. Challenges
  14. Future trends
  15. Conclusion
  16. Exercises
  17. References

DATA VISUALIZATION

  1. Introduction
    1. historical data visualizations
    2. visualization frameworks
  2. Visualization basics
  3. Data visualization software
    1. Microsoft Excel
    2. Google sheets
    3. Tableau
    4. R programming language
    5. other visualization programs
  4. Visualization options
    1. visualizing categorical data
    2. visualizing continuous data
  5. Dashboards
  6. Geographic maps
  7. Challenges
  8. Conclusion
  9. Exercises
  10. References

INTRODUCTION TO DATABASES

  1. Introduction
  2. Definitions
  3. A brief history of database models
    1. hierarchical model
    2. network model
    3. relational model
  4. Relational database structure
  5. Clinical data warehouses (CDWs)
  6. Structured query language (SQL)
  7. Learning SQL
  8. Conclusion
  9. Exercises
  10. References

BIG DATA

  1. Introduction
  2. The seven v’s of big data related to health care data
  3. Technical background
  4. Application
  5. Challenges
    1. technical
    2. organizational
    3. legal
    4. translational
  6. Future trends
  7. Conclusion
  8. References

BIOINFORMATICS AND PRECISION MEDICINE

  1. Introduction
  2. History
  3. Definitions
  4. Biological data analysis – from data to discovery
  5. Biological data types
    1. genomics
    2. transcriptomics
    3. proteomics
    4. bioinformatics data in public repositories
    5. biomedical cancer data portals
  6. Tools for analyzing bioinformatics data
    1. command line tools
    2. web-based tools
  7. Genomic data analysis
  8. Genomic data analysis workflow
    1. variant calling pipeline for whole exome sequencing data
    2. quality check
    3. alignment
    4. variant calling
    5. variant filtering and annotation
    6. downstream analysis
    7. reporting and visualization
  9. Precision medicine – from big data to patient care
  10. Examples of precision medicine
  11. Challenges
  12. Future trends
  13. Useful resources
  14. Conclusion
  15. Exercises
  16. References

PROGRAMMING LANGUAGES FOR DATA ANALYSIS

  1. Introduction
  2. History
  3. R language
    1. installing R & rstudio
    2. an example R program
    3. getting help in R
    4. user interfaces for R
    5. R’s default user interface: rgui
    6. Rstudio
    7. menu & dialog guis
    8. some popular R guis
    9. R graphical user interface comparison
    10. R resources
  4. Python language
    1. installing Python
    2. an example Python program
    3. getting help in Python
    4. user interfaces for Python
  5. reproducibility
  6. R vs. Python
  7. Future trends
  8. Conclusion
  9. Exercises
  10. References

MACHINE LEARNING

  1. Brief history
  2. Introduction
    1. data refresher
    2. training vs test data
    3. bias and variance
    4. supervised and unsupervised learning
  3. Common machine learning algorithms
  4. Supervised learning
  5. Unsupervised learning
    1. dimensionality reduction
    2. reinforcement learning
    3. semi-supervised learning
  6. Evaluation of predictive analytical performance
    1. classification model evaluation
    2. regression model evaluation
  7. Machine learning software
    1. Weka
    2. Orange
    3. Rapidminer studio
    4. KNIME
    5. Google TensorFlow
    6. honorable mention
    7. summary
  8. Programming languages and machine learning
  9. Machine learning challenges
  10. Machine learning examples
    1. example 1 classification
    2. example 2 regression
    3. example 3 clustering
    4. example 4 association rules
  11. Conclusion
  12. Exercises
  13. References

ARTIFICIAL INTELLIGENCE

  1. Introduction
    1. definitions
  2. History
  3. AI architectures
  4. Deep learning
  5. Image analysis (computer vision)
    1. Radiology
    2. Ophthalmology
    3. Dermatology
    4. Pathology
    5. Cardiology
    6. Neurology
    7. Wearable devices
    8. Image libraries and packages
  6. Natural language processing
    1. NLP libraries and packages
    2. Text mining and medicine
    3. Speech recognition
  7. Electronic health record data and AI
  8. Genomic analysis
  9. AI platforms
    1. deep learning platforms and programs
  10. Artificial intelligence challenges
    1. General
    2. Data issues
    3. Technical
    4. Socio economic and legal
    5. Regulatory
    6. Adverse unintended consequences
    7. Need for more ML and AI education
  11. Future trends
  12. Conclusion
  13. Exercises
  14. References

Authors

Brenda Griffith
Technical Writer
Data.World
Austin, TX

Robert Hoyt MD, FACP, ABPM-CI, FAMIA
Associate Clinical Professor
Department of Internal Medicine
Virginia Commonwealth University
Richmond, VA

David Hurwitz MD, FACP, ABPM-CI
Associate CMIO
Allscripts Healthcare Solutions
Chicago, IL

Madhurima Kaushal MS
Bioinformatics
Washington University in St. Louis, School of Medicine
St. Louis, MO

Robert Leviton MD, MPH, FACEP, ABPM-CI, FAMIA
Assistant Professor
New York Medical College
Department of Emergency Medicine
Valhalla, NY

Karen A. Monsen PhD, RN, FAMIA, FAAN
Professor
School of Nursing
University of Minnesota
Minneapolis, MN

Robert Muenchen MS, PSTAT
Manager, Research Computing Support
University of Tennessee
Knoxville, TN

Dallas Snider PhD
Chair, Department of Information Technology
University of West Florida
Pensacola, FL

A special thanks to Ann Yoshihashi MD for her help with the publication of this textbook.

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

How to import Python classes into R

Tue, 01/14/2020 - 02:10

[This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Background

This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes.

To install reticulate, we can run:

install.packages("reticulate")

Creating a Python class

Let’s create a simple class in Python.

import pandas as pd

# define a python class
class explore:
    def __init__(self, df):
        self.df = df

    def metrics(self):
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)

Porting the Python class to R

There are a couple of ways we can port our Python class to R. One way is by sourcing a Python script that defines the class. Let’s suppose our class is defined in a script called “sample_class.py”. We can use the reticulate function source_python.

# load reticulate package
library(reticulate)

# inside R, source Python script
source_python("sample_class.py")

Running the command above not only makes the class we defined available in our R session, but also makes any other variables or functions defined in the script available as well (if they exist). Thus, if we define 10 classes in the script, all 10 classes will be available in the R session. We can refer to any specific method defined in a class using R’s “$” notation, rather than the “dot” notation of Python.

result <- explore(iris)

# get summary stats of data
result$metrics()

# create dummy variables from factor / character fields
# (just one in this case)
result$dummify_vars()

One other note is that when you import a class from Python, the class becomes a closure in R. You can see this by running R’s typeof function:

typeof(explore) # closure

Using R markdown to switch between R and Python

Another way of using a Python class in R is through R Markdown. This feature is available in RStudio v. 1.2+ and allows us to write chunks of R code followed by chunks of Python code, and vice versa. It also lets us pass variables, or even classes, from Python to R. Below, we write the same code as above, but this time using chunks in an R Markdown file. When we switch between R and Python chunks in R Markdown, we can reference Python objects (including classes) by typing py$name_of_object, replacing name_of_object with whatever you’re trying to reference from the Python code. In the example below, we reference the explore class we created by typing py$explore.

```{r}
library(reticulate)
```

```{python}
import pandas as pd

# define a python class
class explore:
    def __init__(self, df):
        self.df = df

    def metrics(self):
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)
```

```{r}
py$explore
```

Example with class and instance variables

Now, let’s look at another example. Below, we create a Python class in a file called “sample_class2.py” that has an instance variable (value) and a class variable (num).

class test:
    def __init__(self, value):
        self.value = value

    def class_var(self):
        test.num = 10

source_python("sample_class2.py")

check = test(5)
check$value
check$num # error because class_var hasn't been called yet
check$class_var()
check$num # works now
test$num
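As a side note (my own addition, not from the original post), reticulate can also load a Python package directly from R via its import() function, without writing a separate script. A minimal sketch, assuming pandas is installed in whichever Python environment reticulate picks up; pd, df and the toy columns x and y are just illustrative names:

library(reticulate)

# import the pandas module; its functions and classes are then reachable with $
pd <- import("pandas")

# build a small pandas DataFrame from an R list and summarize it
df <- pd$DataFrame(list(x = c(1, 2, 3), y = c(10, 20, 30)))
df$describe()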

That’s it for now! If you enjoyed this post, please follow my blog on Twitter.

To learn more about reticulate, check out its official documentation here.

The post How to import Python classes into R appeared first on Open Source Automation.

To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

On the conjugate function

Tue, 01/14/2020 - 01:36

[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the MAT7381 course (a graduate course on regression models), we will talk about optimization, and a classical tool is the so-called conjugate. Given a function f:\mathbb{R}^p\to\mathbb{R}, its conjugate is the function f^{\star}:\mathbb{R}^p\to\mathbb{R} such that f^{\star}(\boldsymbol{y})=\max_{\boldsymbol{x}}\lbrace\boldsymbol{x}^\top\boldsymbol{y}-f(\boldsymbol{x})\rbrace so, long story short, f^{\star}(\boldsymbol{y}) is the maximum gap between the linear function \boldsymbol{x}^\top\boldsymbol{y} and f(\boldsymbol{x}).

Just to visualize, consider a simple parabolic function (in dimension 1), f(x)=x^2/2; then f^{\star}(\color{blue}{2}) is the maximum gap between the line x\mapsto\color{blue}{2}x and the function f(x).

x = seq(-100,100,length=6001)
f = function(x) x^2/2
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
vfstar = Vectorize(fstar)(x)

We can see it on the figure below.

viz = function(x0=1, YL=NA){
  idx = which(abs(x) <= 3)
  par(mfrow = c(1,2))
  plot(x[idx], vf[idx], type="l", xlab="", ylab="", col="blue", lwd=2)
  abline(h=0, col="grey")
  abline(v=0, col="grey")
  idx2 = which(x0*x >= vf)
  polygon(c(x[idx2], rev(x[idx2])), c(vf[idx2], rev(x0*x[idx2])), col=rgb(0,1,0,.3), border=NA)
  abline(a=0, b=x0, col="red")
  i = which.max(x0*x - vf)
  segments(x[i], x0*x[i], x[i], f(x[i]), lwd=3, col="red")
  if(is.na(YL)) YL = range(vfstar[idx])
  plot(x[idx], vfstar[idx], type="l", xlab="", ylab="", col="red", lwd=1, ylim=YL)
  abline(h=0, col="grey")
  abline(v=0, col="grey")
  segments(x0, 0, x0, fstar(x0), lwd=3, col="red")
  points(x0, fstar(x0), pch=19, col="red")
}
viz(1)

or

viz(1.5)

In that case, we can actually compute f^{\star}, since f^{\star}(y)=\max_{x}\lbrace xy-f(x)\rbrace=\max_{x}\lbrace xy-x^2/2\rbrace. The first-order condition is here x^{\star}=y, and thus f^{\star}(y)=\max_{x}\lbrace xy-x^2/2\rbrace=x^{\star}y-(x^{\star})^2/2=y^2-y^2/2=y^2/2. And actually, that can be related to two results. The first one is to observe that f(\boldsymbol{x})=\|\boldsymbol{x}\|_2^2/2 and in that case f^{\star}(\boldsymbol{y})=\|\boldsymbol{y}\|_2^2/2, from the following general result: if f(\boldsymbol{x})=\|\boldsymbol{x}\|_p^p/p with p>1, where \|\cdot\|_p denotes the standard \ell_p norm, then f^{\star}(\boldsymbol{y})=\|\boldsymbol{y}\|_q^q/q where \frac{1}{p}+\frac{1}{q}=1. The second one is the conjugate of a quadratic function. More specifically, if f(\boldsymbol{x})=\boldsymbol{x}^{\top}\boldsymbol{Q}\boldsymbol{x}/2 for some positive definite matrix \boldsymbol{Q}, then f^{\star}(\boldsymbol{y})=\boldsymbol{y}^{\top}\boldsymbol{Q}^{-1}\boldsymbol{y}/2. In our case, it was a univariate problem with \boldsymbol{Q}=1.
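As a quick numerical sanity check (my own addition, reusing the x, vf and fstar objects defined above for f(x)=x^2/2), the grid-based conjugate does match y^2/2:

y = seq(-2, 2, by = .5)
# compare the numerically computed conjugate with the closed form y^2/2
cbind(y, numerical = Vectorize(fstar)(y), closed_form = y^2/2)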

For the conjugate of the \ell_p norm, we can use the following code to visualize it

p = 3
f = function(x) abs(x)^p/p
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
vfstar = Vectorize(fstar)(x)
viz(1.5)

or

p = 1.1
f = function(x) abs(x)^p/p
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
vfstar = Vectorize(fstar)(x)
viz(1, YL=c(0,10))

Actually, in that case, we can almost visualize that if f(x)=|x|, then \displaystyle{f^{\star}\left(y\right)={\begin{cases}0,&\left|y\right|\leq 1\\\infty ,&\left|y\right|>1.\end{cases}}}
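To check this numerically (my own addition, reusing the grid x defined at the start), we can set f to the absolute value; the computed conjugate is then zero for |y|\leq 1 and blows up (here capped by the finite grid [-100,100]) as soon as |y|>1:

f = function(x) abs(x)
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
Vectorize(fstar)(c(.5, 1, 1.5))
# returns 0, 0 and 50: the last value is only finite because the grid stops at 100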

To conclude, another popular case is f(x)=\exp(x), for which {\displaystyle f^{\star}\left(y\right)={\begin{cases}y\log(y)-y,&y>0\\0,&y=0\\\infty ,&y<0.\end{cases}}} We can visualize that case below

f = function(x) exp(x)
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
vfstar = Vectorize(fstar)(x)
viz(1,YL=c(-3,3))
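Again as a rough check (my own addition, reusing the fstar just computed for f(x)=\exp(x)), the numerical values agree with y\log(y)-y for positive y, up to the resolution of the grid:

y = c(1, 2, 5)
cbind(y, numerical = Vectorize(fstar)(y), closed_form = y*log(y) - y)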

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Le Monde puzzle [#1120]

Tue, 01/14/2020 - 00:20

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A board game as Le weekly Monde current mathematical puzzle:

11 players in a circle and 365 tokens first owned by a single player. Players with at least two tokens can either remove one token and give another one left or move two right and one left. How quickly does the game stall, how many tokens are left, and where are they?

The run of an R simulation like

od = function(i) (i-1)%%11 + 1

muv <- function(bob){
  if (max(bob) > 1){
    i = sample(rep((1:11)[bob > 1], 2), 1)
    dud = c(0,-2,1)
    if((runif(1) < .5) & (bob[i] > 2)) dud = c(2,-3,1)
    bob[c(od(i+10), i, od(i+1))] = bob[c(od(i+10), i, od(i+1))] + dud
  }
  bob
}

always provides a solution

> bob
[1] 1 0 1 1 0 1 1 0 1 0 0

with six ones at these locations. However, the time it takes to reach this frozen configuration varies, depending on the sequence of random choices.
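For completeness, here is a minimal driver of my own (a sketch, assuming the od() and muv() definitions above, with the seed and starting configuration chosen arbitrarily) that iterates until the configuration freezes and counts the number of moves:

set.seed(42)                  # for reproducibility of this particular run
bob = c(365, rep(0, 10))      # the 365 tokens start in the hands of a single player
moves = 0
while (max(bob) > 1){         # the game stalls once no player holds two tokens
  bob = muv(bob)
  moves = moves + 1
}
bob    # frozen configuration
moves  # number of moves for this particular run

Changing the seed gives a different number of moves, illustrating the variability mentioned above.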

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
