rOpenSci 2019 Code of Conduct Transparency Report
[This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.) Want to share your content on R-bloggers? Click here if you have a blog, or here if you don't.
In January 2019, we announced the release of rOpenSci’s Code of Conduct version 2.0. This includes a named Committee, greater detail about unacceptable behaviors, instructions on how to make a report, and information on how reports are handled. We are committed to transparency with our community while upholding the privacy of victims and people who report incidents.
Our Code of Conduct applies to all people participating in the rOpenSci community, including rOpenSci staff and leadership. It applies to all modes of interaction online including GitHub project repositories, the rOpenSci discussion forum, Slack, Community Calls, and in person at rOpenSci-hosted events or events officially endorsed by rOpenSci, including social gatherings affiliated with an event.
In 2019, we received one Code of Conduct incident report. We published a brief, anonymized summary in our forum in a dedicated, read-only category.
Summary of Reported Incidents
We received a code of conduct report for demeaning comments made on rOpenSci’s Slack. One committee member was recused from handling this report; the remaining members reviewed the messages and determined that they violated the code of conduct. We contacted the sender and informed them that such comments are not acceptable under our CoC. We also informed the reporter of our decision and the outcome.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci – open tools for open science. R-bloggers.com offers daily email updates about R news and tutorials about learning R and many other topics.
rOpenSci Code of Conduct Annual Review
[This article was first published on rOpenSci – open tools for open science, and kindly contributed to R-bloggers].
One year ago, we released our new Code of Conduct. At that time, the Code of Conduct Committee (authors of this post) agreed to do an annual review of the text, reporting form, and our internal guidelines.
We have made one change to the Code of Conduct text. Because some people who have experienced abuse prefer not to label themselves as a victim, in “We are committed to transparency with our community while upholding the privacy of victims” we edited to “… upholding the privacy of victims and people who report incidents”.
With this change, this will be version 2.1, dated January 15, 2020.
We are pleased to report that we, the Code of Conduct Committee members, Stefanie Butland (rOpenSci Community Manager), Scott Chamberlain (rOpenSci Co-founder and Technical Lead), and Kara Woo (independent community member), all wish to continue serving in 2020. We are responsible for receiving, investigating, deciding, enforcing and reporting on all reports of potential violations of our Code.
We welcome your feedback by email to conduct at ropensci.org, and we thank you for working with us to make rOpenSci a safe, enjoyable, friendly and enriching experience for everyone who participates.
On Cochran Theorem (and Orthogonal Projections)
[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers].
Cochran’s Theorem – from The distribution of quadratic forms in a normal system, with applications to the analysis of covariance, published in 1934 – is probably the most important one in a regression course. It is an application of a nice result on quadratic forms of Gaussian vectors. More precisely, we can prove that if \boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{0},\mathbb{I}_d) is a random vector with d \mathcal{N}(0,1) variables then (i) if A is a (square) idempotent matrix, \boldsymbol{Y}^\top A\boldsymbol{Y}\sim\chi^2_r where r is the rank of matrix A, and (ii) conversely, if \boldsymbol{Y}^\top A\boldsymbol{Y}\sim\chi^2_r then A is an idempotent matrix of rank r. And just in case, A being an idempotent matrix means that A^2=A, and a lot of results can be derived (for instance on the eigenvalues). The proof of that result (at least the (i) part) is nice: we diagonalize matrix A, so that A=P\Delta P^\top, with P orthonormal. Since A is an idempotent matrix, observe thatA^2=P\Delta P^\top P\Delta P^\top=P\Delta^2 P^\topwhere \Delta is some diagonal matrix such that \Delta^2=\Delta, so terms on the diagonal of \Delta are either 0‘s or 1‘s. And because the rank of A (and \Delta) is r, there should be r 1‘s and d-r 0‘s. Now write\boldsymbol{Y}^\top A\boldsymbol{Y}=\boldsymbol{Y}^\top P\Delta P^\top\boldsymbol{Y}=\boldsymbol{Z}^\top \Delta\boldsymbol{Z}where \boldsymbol{Z}=P^\top\boldsymbol{Y}, which satisfies\boldsymbol{Z}\sim\mathcal{N}(\boldsymbol{0},PP^\top) i.e. \boldsymbol{Z}\sim\mathcal{N}(\boldsymbol{0},\mathbb{I}_d). Thus \boldsymbol{Z}^\top \Delta\boldsymbol{Z}=\sum_{i:\Delta_{i,i}=1}Z_i^2\sim\chi^2_rNice, isn’t it? And there is more (that will actually be strongly connected to Cochran’s theorem). Let A=A_1+\dots+A_k; then the two following statements are equivalent: (i) A is idempotent and \text{rank}(A)=\text{rank}(A_1)+\dots+\text{rank}(A_k); (ii) the A_i‘s are idempotent and A_iA_j=0 for all i\neq j.
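To illustrate part (i) numerically, here is a quick simulation sketch (the dimensions d and r, and the way A is built, are arbitrary choices for illustration): we construct an idempotent matrix of rank r as an orthogonal projection and check the moments of \boldsymbol{Y}^\top A\boldsymbol{Y} against those of a \chi^2_r.

```r
set.seed(42)
d <- 5; r <- 2
# build an idempotent matrix A of rank r as an orthogonal projection
V <- matrix(rnorm(d * r), d, r)           # r random directions in R^d
A <- V %*% solve(t(V) %*% V) %*% t(V)     # A^2 = A, rank(A) = r
max(abs(A %*% A - A))                     # ~ 0: A is idempotent
n <- 1e5
Y <- matrix(rnorm(n * d), n, d)           # rows are N(0, I_d) vectors
q <- rowSums((Y %*% A) * Y)               # quadratic forms Y' A Y, row by row
c(mean(q), var(q))                        # chi^2_r has mean r = 2, variance 2r = 4
```

The empirical mean and variance should be close to r and 2r, respectively.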
Now, let us talk about projections. Let \boldsymbol{y} be a vector in \mathbb{R}^n. Its projection on the space \mathcal V(\boldsymbol{v}_1,\dots,\boldsymbol{v}_p) (generated by those p vectors) is the vector \hat{\boldsymbol{y}}=\boldsymbol{V} \hat{\boldsymbol{a}} that minimizes \|\boldsymbol{y}- \boldsymbol{V} \boldsymbol{a}\| (in \boldsymbol{a}). The solution is\hat{\boldsymbol{a}}=( \boldsymbol{V}^\top \boldsymbol{V})^{-1} \boldsymbol{V}^\top \boldsymbol{y} \text{ and } \hat{\boldsymbol{y}} = \boldsymbol{V} \hat{\boldsymbol{a}}
Matrix P=\boldsymbol{V} ( \boldsymbol{V}^\top \boldsymbol{V})^{-1} \boldsymbol{V}^\top is the orthogonal projection on \{\boldsymbol{v}_1,\dots,\boldsymbol{v}_p\} and \hat{\boldsymbol{y}} = P\boldsymbol{y}.
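A quick sanity check of these identities (a minimal sketch, with arbitrary small dimensions): P is symmetric and idempotent, and P\boldsymbol{y} coincides with \boldsymbol{V}\hat{\boldsymbol{a}}.

```r
set.seed(1)
n <- 4; p <- 2
V <- matrix(rnorm(n * p), n, p)            # columns span the subspace
P <- V %*% solve(t(V) %*% V) %*% t(V)      # orthogonal projection matrix
y <- rnorm(n)
a_hat <- solve(t(V) %*% V, t(V) %*% y)     # least-squares coefficients
max(abs(P %*% y - V %*% a_hat))            # ~ 0: P y equals V a_hat
max(abs(P %*% P - P))                      # ~ 0: idempotent
max(abs(P - t(P)))                         # ~ 0: symmetric
```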
Now we can recall Cochran’s theorem. Let \boldsymbol{Y}\sim\mathcal{N}(\boldsymbol{\mu},\sigma^2\mathbb{I}_d) for some \sigma>0 and \boldsymbol{\mu}. Consider orthogonal subspaces F_1,\dots,F_m, with dimensions d_i. Let P_{F_i} be the orthogonal projection matrix on F_i; then (i) the vectors P_{F_1}\boldsymbol{Y},\dots,P_{F_m}\boldsymbol{Y} are independent, with respective distributions \mathcal{N}(P_{F_i}\boldsymbol{\mu},\sigma^2\mathbb{I}_{d_i}), and (ii) the random variables \|P_{F_i}(\boldsymbol{Y}-\boldsymbol{\mu})\|^2/\sigma^2 are independent and \chi^2_{d_i} distributed.
We can try to visualize those results. For instance, the orthogonal projection of a random vector has a Gaussian distribution. Consider a two-dimensional Gaussian vector
library(mnormt)
r = .7
s1 = 1
s2 = 1
Sig = matrix(c(s1^2,r*s1*s2,r*s1*s2,s2^2),2,2)
Sig
Y = rmnorm(n = 1000,mean=c(0,0),varcov = Sig)
plot(Y,cex=.6)
vu = seq(-4,4,length=101)
vz = outer(vu,vu,function (x,y) dmnorm(cbind(x,y),
mean=c(0,0), varcov = Sig))
contour(vu,vu,vz,add=TRUE,col='blue')
abline(a=0,b=2,col="red")
Consider now the projection of points \boldsymbol{y}=(y_1,y_2) on the straight line with direction vector \overrightarrow{\boldsymbol{u}} of slope a (say a=2). To get the projected point \boldsymbol{x}=(x_1,x_2), recall that x_2=ax_1 and \overrightarrow{\boldsymbol{xy}}\perp\overrightarrow{\boldsymbol{u}}. Hence, the following code will give us the orthogonal projections
p = function(a){
x0=(Y[,1]+a*Y[,2])/(1+a^2)
y0=a*x0
cbind(x0,y0)
}
with
P = p(2)
for(i in 1:20) segments(Y[i,1],Y[i,2],P[i,1],P[i,2],lwd=4,col="red")
points(P[,1],P[,2],col="red",cex=.7)
Now, if we look at the distribution of points on that line, we get… a Gaussian distribution, as expected,
z = sqrt(P[,1]^2+P[,2]^2)*c(-1,+1)[(P[,1]>0)*1+1]
vu = seq(-6,6,length=601)
vv = dnorm(vu,mean(z),sd(z))
hist(z,probability = TRUE,breaks = seq(-4,4,by=.25))
lines(vu,vv,col="red")
Of course, we can use the matrix representation to get the projection on \overrightarrow{\boldsymbol{u}}, or a normalized version of that vector actually
a=2
U = c(1,a)/sqrt(a^2+1)
U
[1] 0.4472136 0.8944272
matP = U %*% solve(t(U) %*% U) %*% t(U)
matP %*% Y[1,]
[,1]
[1,] 0.1120555
[2,] 0.2241110
P[1,]
x0 y0
0.1120555 0.2241110
(which is consistent with our manual computation). Now, in Cochran theorem, we start with independent random variables,
Y = rmnorm(n = 1000,mean=c(0,0),varcov = diag(c(1,1)))
Then we consider the projection on \overrightarrow{\boldsymbol{u}} and \overrightarrow{\boldsymbol{v}}=\overrightarrow{\boldsymbol{u}}^\perp
U = c(1,a)/sqrt(a^2+1)
matP1 = U %*% solve(t(U) %*% U) %*% t(U)
P1 = Y %*% matP1
z1 = sqrt(P1[,1]^2+P1[,2]^2)*c(-1,+1)[(P1[,1]>0)*1+1]
V = c(a,-1)/sqrt(a^2+1)
matP2 = V %*% solve(t(V) %*% V) %*% t(V)
P2 = Y %*% matP2
z2 = sqrt(P2[,1]^2+P2[,2]^2)*c(-1,+1)[(P2[,1]>0)*1+1]
We can plot those two projections
plot(z1,z2)
and observe that the two are indeed independent Gaussian variables. And (of course) their squared norms are \chi^2_{1} distributed.
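A self-contained check of that claim (regenerating the two projected coordinates from scratch, with the same slope a=2 used above): the signed coordinates along \overrightarrow{\boldsymbol{u}} and \overrightarrow{\boldsymbol{u}}^\perp are uncorrelated, and their squares have mean 1, consistent with independent \chi^2_1 variables.

```r
set.seed(7)
n <- 1e5; a <- 2
Y <- matrix(rnorm(2 * n), n, 2)        # independent N(0,1) coordinates
U <- c(1, a) / sqrt(1 + a^2)           # unit vector along the line
W <- c(a, -1) / sqrt(1 + a^2)          # orthogonal unit vector
z1 <- as.vector(Y %*% U)               # signed coordinate on u
z2 <- as.vector(Y %*% W)               # signed coordinate on u-perp
c(cor(z1, z2), mean(z1^2), mean(z2^2)) # ~ 0, ~ 1, ~ 1 (chi^2_1 has mean 1)
```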
Appsilon Data Science is now an RStudio Full Service Certified Partner
[This article was first published on r – Appsilon Data Science – End to End Data Science Solutions, and kindly contributed to R-bloggers].
In November 2019, Appsilon Data Science became an RStudio Full Service Certified Partner.
We now officially provide support services for existing Shiny applications, Plumber APIs, and RStudio infrastructure. Further, we now provide Managed Services for the R environment. This means that we can handle ongoing support for your servers and infrastructure to keep applications running smoothly. If you are interested in learning more about these services, please reach out!
This partnership solidifies our position as one of the leaders in the R/Shiny space. Building advanced Decision Support Systems has been a part of our path since we started.
RStudio Full Service Certified Partners resell RStudio professional products and employ RStudio certified R administrators and/or R user instructors who offer training, implementation, management, and development services that help customers succeed with R and RStudio.
This is a natural step for us, as we have been R evangelists since the first days of our company. We have even worked side by side with RStudio on many occasions with different clients.
We also share values with RStudio — a passion for community, open source, and empowering data scientists and companies everywhere. We both love to conference and share what we’ve learned in our day to day work. The broader community benefits from this cooperation as well — our open source packages have been downloaded over 39,000 times and are praised by companies like Viacom and BCG.
Becoming a certified partner with RStudio fortifies the communication between our companies so that we can better serve our clients and help them with every step of the RStudio product implementation process. Some RStudio products that we can help you with include:
RStudio Connect is a publishing platform that will boost your R workflow with a click of a button. Publish your Shiny applications, R Markdown reports & dashboards, Plumber APIs, Python notebooks, and more in one convenient IT-managed location. Periodic reports can be scheduled with just a few clicks.
RStudio Package Manager is a repository management server optimized for R and RStudio. It allows you to organize, search, and browse R packages within your team and across your organization. Access all of CRAN or define curated subsets. Add your local packages and packages from GitHub. Control and manage all of the packages that your data scientists and data products depend on.
Shiny Server Pro is intended for projects requiring pure Shiny work. This is a field-tested solution which can be readily scaled up and down. Shiny Server Pro doesn’t have all the features that you get with Connect, but if your use case doesn’t require RStudio Connect, then Shiny Server Pro might be a more economical solution. As with all other R products, we are happy to help with Shiny Server Pro.
Moving forward, we will continue to develop R Shiny applications for the data science and business communities alongside our new Partner. This will translate into a new, higher level of quality for our clients. Now, we can not only advise on solutions but also optimize their license and maintenance costs.
As you might know, we build decision support systems for enterprise and other types of organizations, such as the International Institute for Applied Systems Analysis (IIASA). We recently built a Shiny dashboard that will help IIASA understand and visualize natural disaster risk in Madagascar. Last year we also built R Shiny projects for Fortune 500 companies (including companies in the pharmaceutical, insurance, and logistics industries) and for universities and educational organizations (natural language processing). We intend to continue these endeavors in the future.
Plus we are happy to continue to offer services associated with setting up a production environment and a set of postimplementation support options such as development support, hosting, and administration services.
I am very excited about our collaboration with the RStudio team, and we look forward to fruitful cooperation in the future!
Meet us
We hope to see you at rstudio::conf in San Francisco on January 27. Please drop by the Appsilon Data Science booth in the Yosemite Room and say hello. Damian Rodziewicz is giving a presentation “Building a native iPad dashboard using plumber and RStudio Connect in Pharma” (Jan. 29, 4:23pm, Room 4). He will also give an e-poster “What does it take to make a Shiny prototype ready for production?” Pedro Coutinho Silva will give an e-poster “From Front-End to Shiny development. Augmenting the power of your dashboards with modern front-end tools.”
E-posters will be shown during the opening reception on Tuesday (Jan. 28) evening from 5:30 to 7:30pm. You can also find us at the Bay Area useR Group meetup (conference tickets not required) at 6:30pm on January 27 at the Hilton San Francisco Union Square.
Thanks for reading. For more, follow me on Linkedin and Twitter.
Follow Appsilon Data Science on Social Media Follow @Appsilon on Twitter
 Follow Appsilon on LinkedIn
 Sign up for our company newsletter
 Try out our R Shiny open source packages
 Sign up for the AI for Good newsletter
Article Appsilon Data Science is now an RStudio Full Service Certified Partner comes from Appsilon Data Science – End to End Data Science Solutions.
RQuantLib 0.4.11: More polish
[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers].
New year, new RQuantLib! A new release 0.4.11 of RQuantLib arrived overnight on CRAN; and a Debian upload will follow shortly.
QuantLib is a very comprehensive free/open-source library for quantitative finance; RQuantLib connects it to the R environment and language.
This version does three new things. First, we fixed an oversight on our end and now allow a null calendar (as the C++ API does). Second, the package switched to tinytest as a few of my other packages have done, allowing for very flexible testing during development and deployment—three cheers for easily testing installed packages too. Third, and per a kind nag from Kurt Hornik, I updated a few calls which the current QuantLib 1.17 marks as deprecated. That led to a compile issue with 1.16, so the change is conditional in one part. The complete set of changes is listed below:
Changes in RQuantLib version 0.4.11 (2020-01-15)
Changes in RQuantLib code:
Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the RQuantLib page. Questions, comments etc should go to the new rquantlib-devel mailing list. Issue tickets can be filed at the GitHub repo.
If you like this or other opensource work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Mapping World Languages’ Difficulty Relative to English
[This article was first published on Educators R Learners, and kindly contributed to R-bloggers].
I was reading r/MapPorn and saw the image below:
As fate would have it, the very first comment was from Guridkt who said, “I’d like to see asia too, i wonder if there’s a map of that. (sic)”
So would I, so let’s do it.
About the Data
This map is based on data published by the US Foreign Service Institute and, according to their website, it is based on the institute’s “70 years of experience in teaching languages to U.S. diplomats, and illustrates the time usually required for a student to reach “Professional Working Proficiency” in the language, or a score of “Speaking-3/Reading-3” on the Interagency Language Roundtable scale” (emphasis added).
And how would you describe someone with “professional working proficiency?” Their website says someone with a 3 in speaking can “use the language to satisfy professional needs in a wide range of sophisticated and demanding tasks.”
Fair enough, but how do they measure whether someone can actually achieve those tasks? Through a 15 – 30 minute TELEPHONE interview. Yikes!!!!
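For reference, the FSI scale behind the map groups languages into difficulty categories by approximate training time. A small lookup table along these lines can drive the map colouring (the week counts are approximate, and the language-to-category assignments shown here are a few illustrative examples, not the full FSI list):

```r
# approximate weeks of full-time training per FSI difficulty category
fsi_weeks <- c(I = 30, II = 36, III = 44, IV = 88)
# a few illustrative language-to-category assignments
lang_cat <- c(Spanish = "I", Indonesian = "II", Russian = "III",
              Mandarin = "IV", Japanese = "IV", Korean = "IV")
data.frame(language = names(lang_cat),
           category = unname(lang_cat),
           weeks = unname(fsi_weeks[lang_cat]))
```

Joining a table like this against each country's (single) official language is all the map needs.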
Now for a few caveats about the map.
First, for the sake of ease and simplicity, I scraped Wikipedia to get the official languages of each country.
Second, in order to simplify the task of displaying the relative ease/difficulty for English speakers to learn the official language of a country, I selected just one official language for each country.
Finally, in the instances when a country had multiple official languages but only one is listed by the FSI, (e.g. Afghanistan with Pashto and Dari) I selected the one listed by FSI for display purposes.
Now, for the map:
Clearly, according to the FSI, officials being sent to countries where Arabic, Mandarin, Cantonese, Korean or Japanese is the official language should receive more language training than those being sent to areas where Bahasa is spoken (i.e., Malaysia and Indonesia).
Finally, if you’re interested in the code I used to create the world map in the thumbnail and the map of Asia above, please find it here.
Happy coding!
an elegant sampler
[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].
Following an X validated question on how to simulate a multinomial with fixed average, W. Huber produced a highly elegant and efficient resolution with the compact R code
tabulate(sample.int((k-1)*n, s-n) %% n + 1, n) + 1
where k is the number of classes, n the number of draws, and s equal to n times the fixed average. The R function sample.int is an alternative to sample that seems faster. Breaking the outcome of sample.int((k-1)*n, s-n) as non-zero positions in an n x (k-1) matrix and adding a row of n 1’s leads to a simulation of integers between 1 and k by counting the 1’s in each of the n columns, which is the meaning of the above picture, where the colour code is added after counting the number of 1’s. Since there are s 1’s in this matrix, the sum is automatically equal to s. Since the s-n positions are chosen uniformly over the n x (k-1) locations, the outcome is uniform. The rest of the R code is a brutally efficient way to translate the idea into a function. (By comparison, I brute-forced the question by suggesting a basic Metropolis algorithm.)
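To see the one-liner in action, here is a small run with hypothetical values: k=4 classes, n=10 draws and target average m=2 (so s = n*m = 20). By the counting argument above, the total is exactly s on every run.

```r
set.seed(123)
k <- 4; n <- 10; m <- 2   # classes, draws, target average
s <- n * m                # required total of the n draws
x <- tabulate(sample.int((k - 1) * n, s - n) %% n + 1, n) + 1
x                         # n integers between 1 and k
sum(x)                    # always exactly s = 20
```

Each of the n columns holds at most k-1 sampled positions plus the added 1, so every draw lands in 1..k, and the total is n + (s-n) = s deterministically.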
sklearn Pipe Step Interface for vtreat
[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].
We’ve been experimenting with this for a while, and the next R vtreat package will have a backport of the Python vtreat package sklearn pipe step interface (in addition to the standard R interface).
This means the user can easily express modeling intent by choosing between coder$fit_transform(train_data), coder$fit(train_data_cal)$transform(train_data_model), and coder$fit(application_data).
We have also regenerated the current task-oriented vtreat documentation to demonstrate the new nested bias warning feature:
 Regression: R regression example, Python regression example.
 Classification: R classification example, Python classification example.
 Unsupervised data preparation: R unsupervised example, Python unsupervised example.
 Multinomial classification: R multinomial classification example, Python multinomial classification example.
And we now have new versions of these documents showing the sklearn $fit_transform() style notation in R.
 Regression: R $fit_transform() regression example.
 Classification: R $fit_transform() classification example.
 Unsupervised data preparation: R $fit_transform() unsupervised example.
 Multinomial classification: R $fit_transform() multinomial classification example.
The original R interface is going to remain the standard interface for vtreat. It is more idiomatic R, and is taught in chapter 8 of Zumel, Mount; Practical Data Science with R, 2nd Edition, Manning 2019.
In contrast, the $fit_transform() notation will always just be an adaptor over the primary R interface. However, there is a lot to be learned from sklearn’s organization and ideas, so we felt we would make their naming convention available as a way of showing appreciation and giving credit. Some more of my notes on the grace of the sklearn interface in being a good way to manage state and generative effects (see Brendan Fong, David I. Spivak; An Invitation to Applied Category Theory, Cambridge University Press, 2019) can be found here.
LondonR & R-Ladies to host ‘watch party’ at RStudio Conference, Jan 29th
[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers].
Mango Solutions, a full service certified partner of RStudio, are delighted to be supporting the rstudio::conf 2020 conference in San Francisco, on January 27th to 30th. This annual tech conference promises to be ‘all things R’ and includes an impressive agenda covering tutorials, presentations and lightning talks from recognised R experts and developers.
As long-time champions of the R community, and organisers of the LondonR user group, Mango has been invited to host the London version of the rstudio::conf 2020 ‘around the world’ watch party, which we are hosting in conjunction with R-Ladies London. This is the first year that watch parties are being run; they will allow the London-based R community the opportunity to see livestreams of the keynote sessions and interact with the San Francisco-based hosts and guests in real time.
Liz Matthews, Head of Community & Education at Mango Solutions, said “this will be the first event of its kind and promises to be an exciting opportunity for the London R community. The watch party will be the next best thing to actually being in San Francisco and hearing about the latest developments in R.”
Celebrating a 10-year strategic relationship with RStudio, Liz said, “sponsoring the RStudio conference is a pivotal part of our relationship as an RStudio full service certified partner. Mango offers a full service for the RStudio suite of products, from the installation of an RStudio environment and best practice operating models, to training and supporting the usage of R in production environments”.
Samantha Toet, Partner Marketing Specialist at RStudio Inc, said “I’m thrilled that Mango Solutions will be joining us both in person at the conference and remotely for our watch parties. This is a great way that we can support not only our partnership, but the data science community as a whole.”
Both the RStudio conference and Mango’s annual EARL conference provide the growing R community with an abundance of networking opportunities and learning experiences centred around R.
Having developed the R capability for some of the world’s largest organisations, Mango’s extensive knowledge of the R ecosystem, in addition to ancillary technologies such as Python and Spark, makes us a well-placed and trusted partner for data science and for scaling an appropriate R environment using RStudio platforms. At the conference, Mango Solutions will be providing details of their new support packages to assist enterprise clients with accelerating in the use of RStudio products.
We look forward to meeting you in San Francisco and welcoming you to our live stream at LondonR in conjunction with R-Ladies.
The post LondonR & R-Ladies to host ‘watch party’ at RStudio Conference, Jan 29th appeared first on Mango Solutions.
Get Better: early career researcher development
[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers].
How can we contribute to the development of early career researchers in a lab environment?
I’m talking about how people in the lab acquire “soft skills” or “get better” in ways that are parallel to doing research. This sort of training can get overlooked in the chase for new results and the excitement of doing biomedical research.
I’m testing out a strategy to develop the skills of people in the lab. It’s an experiment. I’ll share what we’re trying out here in case others want to pursue a similar approach.
My idea was to take a place in the lab meeting rota and use it for sessions aimed at developing the skills of the people in the lab. This is a way to force me to schedule something at a set time. The list of things we’ve covered is below. I will update the links as and when the posts are ready.
 Creating a research profile [link]
 CV clinic
 Writing a paper
 Learning LaTeX and Overleaf
 Learning to code in R
—
The post title comes from Get Better by The New Fast Automatic Daffodils. The version I have is on a compilation of Martin Hannett-produced tracks called “And Here Is The Young Man”.
To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed.
Exploring the tightened EU CO2 emission standards for cars in 2020 – Why now selling an electric car can be worth an extra 18000€ for producers.
[This article was first published on Economics and R – R posts, and kindly contributed to R-bloggers].
In this blog post I want to explore the EU regulation for average CO2 emissions of manufacturers’ car fleets, using the EU data set of newly registered cars. The data was already studied on an aggregate level in my earlier post. Here we explore, in particular, the monetary implications of the regulatory details for different car manufacturers and car types. Commenting is possible at the bottom of the original post.
Let’s first have a look at the data set, which is available on a Github repository:
```r
dat = fst::read_fst("cars2.fst")
NROW(dat)
## [1] 260639

# Show 2 example rows
dat[c(260566, 100789),]
##        year        pool firm country   cn              ft   q it   co2
## 260566 2018 vw group pc seat      SK LEON          diesel   5    110.6
## 100789 2014   bmw group  bmw      SE   I3 petrol-electric 132     13.0
##          mass  cm3 power elec wltp reduction_nedc
## 260566 1390.8 1820   105  NaN  NaN            NaN
## 100789 1390.0  647   NaN  115   NA             NA
```

The data contains one row for each combination of year, country where the car was registered, manufacturing firm, car name cn and fuel type ft. The variable q measures the number of such registered cars and co2 the CO2 emissions in gram per km using the NEDC procedure.
Before 2020, the EU target was that newly registered cars should on average not emit more than 130 g CO2/km; for 2020 the target has been reduced to 95 g CO2/km, with further reductions in subsequent years. Let us compare those targets with the average tested emissions of all newly registered cars:
```r
dat %>%
  group_by(year) %>%
  summarize(co2 = weighted.mean(co2, q, na.rm=TRUE)) %>%
  ggplot(aes(x=year, y=co2)) +
  geom_line() + geom_point() + ylab("g CO2/km") +
  # CO2 target before 2020
  geom_hline(yintercept = 130, linetype="dashed") +
  # CO2 target in 2020
  geom_hline(yintercept = 95, linetype="dashed")
```

Unless there is a big jump, it does not look like the new target will be reached in 2020. Let us take a look at the fleet averages for the 11 biggest car firms (with respect to the number of new cars sold in Europe).
```r
library(dplyrExtras)
library(forcats)
dat = dat %>%
  mutate_rows(is.na(firm), firm="Other") %>%
  mutate(
    firm = fct_reorder(firm, q, sum, .desc=TRUE),
    firm11 = firm %>% fct_other(., levels(.)[1:11])
  )

dat %>%
  group_by(year, firm11) %>%
  summarize(co2 = weighted.mean(co2, q, na.rm=TRUE)) %>%
  ggplot(aes(x=year, y=co2)) +
  geom_line() + geom_point() + ylab("g CO2/km") +
  facet_wrap(~firm11) +
  geom_hline(yintercept = 130, linetype="dashed") +
  geom_hline(yintercept = 95, linetype="dashed")
```

We see quite a bit of diversity. Toyota has the lowest fleet averages and may well manage to reach the new 2020 target. The French car producers Renault, Peugeot and Citroen also have low average emissions, even though their emissions increased from 2016 to 2018. The highest emissions are found for the 3 German premium car manufacturers BMW, Daimler and Audi.
If its fleet of newly registered cars produces too much CO2 on average, the manufacturer has to pay a fine to the EU. From 2019 onward the total fine is given by
Excess emissions * 95 €/(g CO2/km) * number of new passenger cars

Let us for the moment ignore further adjustments specified in the regulation. Then the excess emissions are simply the average emissions per car in the fleet minus the target (which is 95 g CO2/km in 2020). If all cars were produced by a single firm and the fleet stayed the same as in 2018 (with average CO2 emissions of 121 g CO2/km), the fine per car under the new emission target in 2020 would be given by:
(121 - 95) * 95€ = 26 * 95€ = 2470€

Quite a substantial amount. If the fleets are differentiated by firm, the fine would range from (102 - 95) * 95€ = 665€ for an average Toyota to (133 - 95) * 95€ = 3610€ for an average Daimler.
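This simple, unadjusted fine calculation can be wrapped in a small helper; the function name `fine_per_car` is mine, not from the analysis code, and it floors the fine at zero since undershooting the target earns no rebate:

```r
# Hypothetical helper: unadjusted fine per car, in EUR.
# Fines apply only to emissions above the target; undershooting earns nothing.
fine_per_car <- function(avg_co2, target = 95, rate = 95) {
  max(avg_co2 - target, 0) * rate
}

fine_per_car(121)  # overall 2018 fleet average: (121 - 95) * 95 = 2470
fine_per_car(102)  # roughly an average Toyota: 665
fine_per_car(133)  # roughly an average Daimler: 3610
```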
However, there are a series of further adjustments that affect the emission targets and fines. We will explore these adjustments step by step in this post. All details are taken from the official EU regulation here.
Manufacturer Pools

The EU allows groups of manufacturers to form pools such that every newly registered car of a pool becomes part of a single fleet. If one manufacturer’s fleet is below the target and another’s is above, forming a joint pool reduces the sum of the fines that have to be paid.
Firms belonging to the same mother company naturally belong to the same pool (e.g. Seat, Skoda, Audi and Volkswagen all belong to a common pool). But unrelated firms can also form a joint pool; e.g. Toyota and Mazda formed a common pool in 2018. The rationale for allowing pools of unrelated firms is that otherwise the regulation could create incentives for mergers (e.g. Tesla would be an extremely attractive target) that increase market concentration and may harm consumers.
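To see the arithmetic behind the pooling benefit, here is a toy sketch with made-up numbers (the `fine` helper and the two example fleets are illustrative assumptions, not from the EU data):

```r
# Toy illustration of why pooling lowers total fines: fines apply only to
# emissions above the target, so undershooting alone earns nothing.
fine <- function(avg_co2, n_cars, target = 95, rate = 95) {
  max(avg_co2 - target, 0) * rate * n_cars
}

# Firm A: 85 g/km on 100000 cars (below target); firm B: 110 g/km on 100000 cars.
separate <- fine(85, 1e5) + fine(110, 1e5)  # only firm B pays
pooled   <- fine((85 + 110) / 2, 2e5)       # joint fleet averages 97.5 g/km
separate  # 142.5 million EUR
pooled    # 47.5 million EUR
```

Firm A's slack below the target is wasted when the firms are fined separately, but offsets part of firm B's excess once the fleets are combined.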
The following plot shows the development of average CO2 emissions of newly registered cars for the 8 biggest pools:
```r
library(dplyrExtras)
library(forcats)
dat = dat %>%
  mutate_rows(is.na(pool), pool="Other") %>%
  mutate(
    pool = fct_reorder(pool, q, sum, .desc=TRUE),
    pool8 = pool %>% fct_other(., levels(.)[1:8])
  )

dat %>%
  group_by(year, pool8) %>%
  summarize(co2 = weighted.mean(co2, q, na.rm=TRUE)) %>%
  ggplot(aes(x=year, y=co2)) +
  geom_line() + geom_point() + ylab("g CO2/km") +
  facet_wrap(~pool8) +
  # CO2 target before 2020
  geom_hline(yintercept = 130, linetype="dashed") +
  # CO2 target in 2020
  geom_hline(yintercept = 95, linetype="dashed")
```

Qualitatively, the picture looks similar to the one for individual firms. The Japanese Toyota-Mazda pool leads, followed by the pools led by French companies (psa-opel and renault), while the pools led by German or Italian companies (vw, ford, bmw, fca, daimler) perform worse.
Even though climate protection plays an important role in EU policy, it would still seem quite surprising if the EU passed a regulation that imposes on average substantially higher fines on European car companies than on Japanese ones without some adjustments to narrow this gap.
Weight Adjustment

One implemented adjustment is that heavier cars have a more generous emission target. In 2019 and 2020 the formula for a pool’s fleet target is given by
target_pool = 95 + a*(mass_pool - M0)
where mass_pool is the average mass (in kg) of the cars in the pool and M0 is the average mass of all cars sold in previous years. The factor a is set from 2019 onward to 0.0333, and M0 is 1379.88 kg in 2019. This means every additional kg of mass of a pool’s average car relaxes the CO2 target by 0.0333 g CO2/km.
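As a sketch, the weight-adjusted 2020 target can be written as a one-line function (the name `adjusted_target` is mine; the constants a = 0.0333 and M0 = 1379.88 kg come from the regulation as described above):

```r
# Weight-adjusted 2020 fleet target in g CO2/km, per the formula above.
adjusted_target <- function(mass_pool, base = 95, a = 0.0333, M0 = 1379.88) {
  base + a * (mass_pool - M0)
}

adjusted_target(1379.88)  # a fleet of average mass keeps the 95 g/km target
adjusted_target(1600)     # a heavier fleet gets a more generous target (~102.3)
```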
The following code shows the average fleet emissions together with the adjusted targets of the largest pools. The parameters a and M0 changed over the years; the corresponding values are in regulation.csv, which can also be found in my Github repository:
```r
reg.df = read.csv("regulation.csv")

# By pool
pool.dat = dat %>%
  group_by(pool8, year) %>%
  summarize(
    co2_pool = weighted.mean(co2, q, na.rm = TRUE),
    mass_pool = weighted.mean(mass, q, na.rm=TRUE)
  ) %>%
  left_join(reg.df, by="year") %>%
  mutate(
    target_pool = target + a*(mass_pool - M0),
    target2020_pool = 95 + 0.0333*(mass_pool - 1379.88)
  )

ggplot(pool.dat, aes(x=year, y=co2_pool)) +
  facet_wrap(~pool8) +
  # The unadjusted target for 2020 of 95 g
  geom_hline(yintercept = 95, linetype="dashed") +
  # Target of the corresponding year
  geom_line(aes(x=year, y=target_pool), color="#5555aa") +
  # Target using formula for 2020
  geom_line(aes(x=year, y=target2020_pool), color="#5555aa") +
  geom_line() + geom_point()
```

We see how the weight adjustment mainly benefits the BMW and Daimler pools. The following plot shows the average fee per car with and without weight adjustment (still ignoring any further adjustments):
```r
df = pool.dat %>%
  ungroup() %>%
  filter(year==2018) %>%
  mutate(
    non_adjusted = (co2_pool - 95)*95,
    weight_adjusted = (co2_pool - target2020_pool)*95,
    pool8 = fct_reorder(pool8, weight_adjusted),
    benefit = weight_adjusted < non_adjusted
  ) %>%
  tidyr::pivot_longer(
    cols = c("non_adjusted","weight_adjusted"),
    names_to = "fee_type", values_to = "fee"
  )

ggplot(df, aes(pool8, fee, color=fee_type)) +
  geom_line(aes(group=pool8), color="black") +
  geom_point(size=2) +
  coord_flip()
```

Without weight adjustment, Daimler would pay in 2020 a fine of (133 - 95) * 95€ = 3610€ per car (given its 2018 fleet), while with weight adjustment the fine reduces to (133 - 102) * 95€ = 2945€.
I am not privy to the internal bargaining that precedes EU law making, but it is not hard to imagine that the weight adjustment was a compromise to make Germany agree to a regulation with tight fleet standards on CO2 emissions. The argument was probably along the lines that large premium cars are quite different products from small cars and that there should be a way to account for that, in particular since distinguishing product classes is common practice. E.g. the EU also has different targets for vans (light commercial vehicles), which have an average target of 147 g CO2/km in 2020.
One can have different views on the weight adjustment compromise, but it does not seem surprising at all to me, given the importance of the car industry for many parts of Germany. (Besides such compromises and the factors leading to them, there are several other reasons that make me personally doubt whether very tight fleet standards are a good instrument for emission reduction compared to the alternative of higher CO2 prices for fuel. I will discuss some pros and cons later in this post.)
One obvious drawback of the weight adjustment is that it reduces the incentives to build lighter cars, which is an important channel for making cars more CO2-efficient. Let us take a look at how in 2018 the CO2 emissions of a car relate to its mass (controlling for fuel type, engine power and year fixed effects):
```r
library(lfe)
library(stargazer)
reg1 = felm(co2 ~ mass + power + ft + year, data=dat)
stargazer(reg1, type = "html")
```

Dependent variable: co2
mass: 0.057*** (0.0002)
power: 0.274*** (0.001)
Observations: 189,192; R2: 0.762; Adjusted R2: 0.762
Residual Std. Error: 26.993 (df = 189171)
Note: *p<0.1; **p<0.05; ***p<0.01

If we are willing to interpret the estimated coefficient in a causal fashion, it suggests that 1 kg of extra mass increases emissions on average by 0.0574 g CO2/km. This means the weight adjustment factor of a = 0.0333 does not fully cancel the incentives to reduce CO2 emissions by building cars with lower mass, but it cancels more than half of the effect.
Distribution of excess emissions over car models

The following plot provides more insight into how the excess emissions above each pool’s target are distributed over the different car types sold in 2018:
```r
dat2018 = dat %>%
  filter(year == 2018, ft != "unknown") %>%
  mutate(
    model = paste0(firm, " ", cn),
    fuel = fct_collapse(ft,
      petrol = "petrol",
      diesel = "diesel",
      plugin_hybrid = c("petrol-electric","diesel-electric"),
      electric = "electric",
      other = c("lpg","hydrogen","e85","ng","ng-biomethane")
    ) %>% fct_reorder(q, sum, na.rm=TRUE)
  ) %>%
  group_by(pool8) %>%
  mutate(
    q_pool = sum(q, na.rm=TRUE),
    mass_pool = weighted.mean(mass, q, na.rm=TRUE),
    co2_pool = weighted.mean(co2, q, na.rm=TRUE),
    target_pool = first(95 + 0.0333*(mass_pool - 1379.88)),
    excess_co2 = co2 - target_pool,
    fee = (co2 - target_pool - 0.0333*(mass - mass_pool))*95,
    share.in.pool = q / q_pool
  ) %>%
  ungroup()

pool.df = dat2018 %>%
  group_by(pool8) %>%
  summarize(excess_pool = first(co2_pool) - first(target_pool))

ggplot(dat2018 %>% filter(co2 <= 300), aes(y=excess_co2, x=fct_rev(pool8))) +
  geom_jitter(aes(color=fuel, size=q), alpha=0.15) +
  geom_hline(aes(yintercept=0)) +
  geom_violin(aes(weight=share.in.pool), alpha=0.7) +
  geom_point(data=pool.df, aes(y=excess_pool), size=2) +
  coord_flip() + theme_bw()
```

Except for Toyota-Mazda, no pool has a substantial share of petrol or diesel cars whose emissions are below the pool’s 2020 target. Volkswagen has a few cars (classified as other) driving on natural gas slightly below the target.
Far below the target are the electric cars of all firms. That is because the EU regulation only counts emissions that are directly emitted by the car but ignores the emissions of the power plants that generate the electricity used by the car. Thus electric cars (as well as the very few hydrogen cars) count as zero-emission cars. For the same reason, plug-in hybrids, which can drive with an electric engine or on petrol, are also solidly below the emission targets.
Implicit fee for each car and huge implicit subsidies for electric cars

As far as we have specified the regulation for 2020, selling an additional car with given emissions co2 and mass changes the total fees of a pool whose emissions are above target by
(co2 - target_pool - a*(mass - mass_pool)) * 95 Euro

The following plot shows the averages of these fees over fuel types for the different pools. (We still ignore an important additional feature of the regulation: super credits.)
```r
dat2018 = dat2018 %>%
  group_by(pool8) %>%
  mutate(
    fee = (co2 - target_pool - 0.0333*(mass - mass_pool))*95
  ) %>%
  ungroup()

sum.dat = dat2018 %>%
  group_by(pool8, fuel) %>%
  summarize(
    fee = weighted.mean(fee, q, na.rm=TRUE),
    q = sum(q, na.rm=TRUE),
    share.fuel = q / first(q_pool),
    label = paste0(round(fee), "€ ", round(100*share.fuel,1),"%")
  ) %>%
  filter(fuel != "other")

ggplot(sum.dat, aes(y=fee, x=fuel, fill=fuel, label=label)) +
  geom_col() +
  geom_text(size=4) +
  facet_wrap(~pool8) +
  coord_flip() +
  ylim(c(-13000, 7000))
```

According to this calculation, selling an additional electric car would on average be worth between 8000€ and 10000€ for each manufacturer in 2020. These gains in terms of fee reduction from selling an electric car are huge for all pools.
(The differences between pools in the gains from selling an electric car are due to different masses of electric and other cars in the pools’ fleets. For example, we find the lowest gain from selling an electric car for Daimler, because most of its electric cars are the lightweight Smarts, whose mass is substantially below that of an average Daimler.)
Super Credits

Article 5a of the EU regulation makes the gains from selling electric cars even larger:
Article 5a: Super-credits for 95 g CO2/km target

In calculating the average specific emissions of CO2, each new passenger car with specific emissions of CO2 of less than 50 g CO2/km shall be counted as:
— 2 passenger cars in 2020,
— 1,67 passenger cars in 2021,
— 1,33 passenger cars in 2022,
— 1 passenger car from 2023,
for the year in which it is registered in the period from 2020 to 2022, subject to a cap of 7,5 g CO2/km over that period for each manufacturer.

I will discuss the cap of 7,5 g CO2/km in the last sentence of the regulation at the end of this blog post in the Appendix. Assume for the moment that it is not binding. Then a car with emissions below 50 g CO2/km saves a pool in 2020, via super credits, an additional
(target_pool - co2) * 95 Euro

Let us look at the fee change of selling a particular car type in 2020, assuming a factor-2 super credit (and a non-binding cap):
```r
dat2018 = dat2018 %>%
  group_by(pool8) %>%
  mutate(
    fee_sc = fee - ifelse(co2 < 50, target_pool - co2, 0)*95
  ) %>%
  ungroup()

sum.dat = dat2018 %>%
  group_by(pool8, fuel) %>%
  summarize(
    fee_sc = weighted.mean(fee_sc, q, na.rm=TRUE),
    q = sum(q, na.rm=TRUE),
    share.fuel = q / first(q_pool),
    label = paste0(round(fee_sc), "€ ", round(100*share.fuel,1),"%")
  ) %>%
  filter(fuel != "other")

ggplot(sum.dat, aes(y=fee_sc, x=fuel, fill=fuel, label=label)) +
  geom_col() +
  geom_text(size=4) +
  facet_wrap(~pool8) +
  coord_flip() +
  ylim(c(-22000, 7000))
```

These are gigantic implicit subsidies for electric cars in 2020!
To put this in perspective, we can compute that, with full super credits, a VW E-Golf would reduce Volkswagen’s fees by 18906€:
```r
library(stringtools)
dat2018 %>%
  filter(firm=="volkswagen", ft=="electric") %>%
  select(cn, fuel, fee, fee_sc, q) %>%
  head(1)
## # A tibble: 1 x 5
##   cn    fuel        fee  fee_sc     q
## 1 GOLF  electric -9769. -18906.  1836
```

A quick Google search finds historical list prices of the E-Golf of 35900€ in 2018 and 31000€ in 2019. In both years the old CO2 target of 130 g was still in place, which VW overachieved with a secure margin. This means VW did not pay CO2 fleet fees in 2018 and 2019. Thus selling an additional E-Golf did not reduce any CO2 fleet fees and should not have affected those historical prices.
If VW does not reach the new target in 2020 (which seems likely), it now has tremendously higher incentives to sell electric cars than in previous years: the fee reduction of 18906€ is more than half of the original price of an E-Golf, and even though super credits phase out in later years, the fee reduction of 9769€ without super credits is still almost a third of the 2019 price.
Given these numbers, it does not seem surprising that Volkswagen has made massive investments into electric cars, even though some observers consider it a risky gamble (see e.g. here or here).
Thoughts on pricing

An interesting question is by how much electric car prices will fall in 2020. Probably by considerably less than the producers’ marginal benefit from fee reduction, for several reasons:

1. Basic microeconomics suggests that firms with some market power will typically not fully pass on marginal cost changes to consumers.

2. Limited production capacity: it makes no sense to stimulate more demand through price reductions than can actually be produced.

3. Consumers who bought electric cars in 2019 may feel cheated if car prices drop too drastically this year.

4. The marginal fee reduction effect of an electric car gets smaller in future years as super credits phase out and the emission target tightens further. It is thus risky to induce in consumers expectations of very low electric car prices.

5. The fee reduction effect only occurs for a car registered in the EU. But setting substantially lower prices in the EU than in other markets can be risky, e.g. because of arbitrage opportunities.
Points 3–5 suggest that car companies that are not yet capacity constrained may make attractive leasing offers for electric cars to large business clients or to employees. If the manufacturers own car sharing or car rental companies, it is probably also quite attractive to increase the number of electric cars in those fleets in 2020.
I wonder whether, for a firm like ShareNow that has fleets in both the EU and North America, it could now become attractive to swap US and EU electric cars after one year, to generate more new BMW or Daimler electric car registrations in the EU. It would of course not be desirable from a climate point of view to ship cars across the ocean for regulatory arbitrage. The monetary incentives might be there, but the net gains are probably too small compared to the risk of image damage.
With whom does Tesla go swimming?

Tesla was not part of any pool in 2018. Yet, if it joins a pool that misses the 2020 target and Tesla sells, as in 2018, roughly 20000 cars (all electric), the other pool members would even without super credits save roughly between 8000*20000 = 160 million € and 10000*20000 = 200 million €, which is more than Tesla’s 2018 profits (but less than 1% of its revenues). With super credits, the savings for the pool could be up to twice as large.
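The range stated above follows directly from the per-car fee reductions computed earlier; a quick back-of-the-envelope check:

```r
# 20000 Tesla registrations, each saving pool partners roughly
# 8000-10000 EUR in fees (without super credits).
savings <- c(low = 8000, high = 10000) * 20000
savings / 1e6  # in millions of EUR: 160 and 200
```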
That is a nice time for a quiz:
[Interactive quiz embedded in the original post.]
You can find the answer, for example, here.
If you think about it, it makes a lot of sense that Tesla joined a pool with FCA (Fiat and Chrysler). First, FCA had the highest excess CO2 emissions in 2018, i.e. FCA seems very likely to miss the 2020 and later targets. Second, FCA has not yet built electric cars itself, i.e. it is not a fierce competitor of Tesla.
It will be interesting to see how much Tesla benefits in total from the tighter new EU regulation. On the one hand, it is nice to get money from FCA, and the higher attractiveness of electric cars may generally be good for Tesla. On the other hand, Tesla will most likely face much fiercer competition, given that all pools have strong incentives to sell more electric cars and may thus strongly reduce prices. A lot probably depends on the maximum production capacities for electric cars of Tesla’s competitors.
What are the actual CO2 emissions of electric cars?

Of course, the regulation’s assumption that electric cars have zero CO2 emissions does not fit reality, since much electricity is still generated by burning coal and gas. The column elec in our data set contains, for electric cars and plug-in hybrids, the estimated electricity usage in Wh per driven km.
```r
# Average Wh/km over all electric cars
dat %>%
  filter(ft == "electric") %>%
  summarize(Wh = weighted.mean(elec, q, na.rm=TRUE))
##         Wh
## 1 156.4098

# Average Wh/km over years
dat %>%
  filter(ft == "electric") %>%
  group_by(year) %>%
  summarize(Wh = weighted.mean(elec, q, na.rm=TRUE))
## # A tibble: 7 x 2
##    year    Wh
## 1  2012  148.
## 2  2013  153.
## 3  2014  152.
## 4  2015  160.
## 5  2016  156.
## 6  2017  147.
## 7  2018  163.
```

There is no clear time trend, so we just use the overall average of 156 Wh/km for our further calculations. We now use a data set on the yearly CO2 emission intensity of electricity generation in different EU countries (until 2016), provided by the EU here, to estimate the corresponding CO2 emissions in g of driving an electric car for 1 km.
```r
co2_int = read.csv("co2intensityelectricitygeneration.csv")
co2_int = co2_int %>%
  filter(year >= 2010) %>%
  mutate(
    co2_Wh = g_co2_kWH / 1000,
    co2_el_car = 156*co2_Wh,
    country = forcats::fct_reorder(country, co2_Wh, mean)
  )

co2_int %>% filter(country=="EU")
##   year country                   type g_co2_kWH co2_Wh co2_el_car
## 1 2010      EU Electricity generation     346.9 0.3469    54.1164
## 2 2011      EU Electricity generation     352.7 0.3527    55.0212
## 3 2012      EU Electricity generation     353.2 0.3532    55.0992
## 4 2013      EU Electricity generation     333.7 0.3337    52.0572
## 5 2014      EU Electricity generation     320.4 0.3204    49.9824
## 6 2015      EU Electricity generation     314.4 0.3144    49.0464
## 7 2016      EU Electricity generation     295.8 0.2958    46.1448

# Show averages for EU
ggplot(co2_int %>% filter(country=="EU"), aes(x=year, y=co2_el_car)) +
  geom_line() + geom_point() +
  geom_hline(yintercept = 130, linetype="dashed") +
  geom_hline(yintercept = 95, linetype="dashed")
```

We see that the average newly registered electric car has emissions of around 46 g CO2/km, given the average CO2 intensity of EU electricity generation in 2016. While this is far above 0, it is only about 37% of the average CO2 emissions of petrol and diesel cars between 2012 and 2018 (123 g CO2/km) and about half the new 2020 target of 95 g CO2/km.
Compare Golf and E-Golf

Of course, the average electric car may differ in size and many other dimensions from the average petrol or diesel car in our sample. For a better comparison, let us compare Volkswagen Golfs from 2018 with different fuel types:
```r
df = dat2018 %>%
  filter(firm=="volkswagen", has.substr(cn,"GOLF")) %>%
  group_by(fuel) %>%
  summarize(
    co2_el = weighted.mean(elec*0.2958, q, na.rm=TRUE),
    co2 = weighted.mean(co2, q, na.rm=TRUE),
    mass = weighted.mean(mass, q, na.rm=TRUE),
    q = sum(q, na.rm=TRUE)
  )
df
## # A tibble: 5 x 5
##   fuel          co2_el   co2  mass      q
## 1 electric        42.3   0   1615.  13886
## 2 plugin_hybrid   35.7  38.9 1615.   7876
## 3 other          NaN   118.  1327.  12065
## 4 diesel         NaN   110.  1378. 140701
## 5 petrol         NaN   120.  1323. 309126
```

The results are similar to the aggregate averages. The CO2 emissions of an average 2018 E-Golf with the average 2016 EU electricity mix are 38% of a diesel and 35% of a petrol Golf. (The data also illustrates one problem of electric cars: they are much heavier than diesel or petrol cars. Furthermore, the average E-Golf has a quite short range. Longer range means more batteries and thus an even heavier car, which needs more electricity to move.)
The following plot shows how counting electricity as zero emissions, plus super credits (at the 2020 factor of 2 and no binding cap), changes the CO2 emissions of the 2018 fleets of the 8 largest pools, compared to using the CO2 emissions of the average EU electricity mix without super credits. We already add Tesla to the FCA pool here, since they form a joint pool in 2020.
```r
dat2018 = dat2018 %>%
  mutate_rows(firm == "tesla", pool8 = "fca-tesla") %>%
  mutate_rows(startsWith(as.character(pool), "fca"), pool8 = "fca-tesla")

df = dat2018 %>%
  filter(pool8 != "Other") %>%
  mutate(
    co2_el = replace_na(elec*0.2958, 0),
    co2_fuel = replace_na(co2, 0),
    co2_tot = co2_el + co2_fuel,
    q_sc = ifelse(co2_fuel < 50, 2*q, q)
  ) %>%
  group_by(pool8) %>%
  summarize(
    zero_co2 = weighted.mean(co2_fuel, q, na.rm=TRUE),
    zero_co2_and_sc = weighted.mean(co2_fuel, q_sc, na.rm=TRUE),
    eu_co2 = weighted.mean(co2_tot, q, na.rm=TRUE)
  ) %>%
  mutate(pool8 = fct_reorder(pool8, eu_co2)) %>%
  tidyr::pivot_longer(
    cols = c("zero_co2","zero_co2_and_sc", "eu_co2"),
    names_to = "electricity", values_to = "co2"
  )

ggplot(df, aes(pool8, co2, color=electricity)) +
  geom_line(aes(group=pool8), color="black") +
  geom_point(size=3) +
  coord_flip()
```

If the pools kept their 2018 fleets in 2020, BMW would benefit most from counting electric cars as zero-emission cars, followed by Renault, FCA-Tesla and Daimler.
CO2 intensity of electric cars between countries

The CO2 intensity of electricity production differs quite substantially between countries. The following plot shows the CO2 emissions in g/km of the average sold EU electric car for the different EU countries:
```r
ggplot(co2_int, aes(x=year, y=co2_el_car)) +
  geom_line() + geom_point() +
  facet_wrap(~country) +
  geom_hline(yintercept = 130, linetype="dashed") +
  geom_hline(yintercept = 95, linetype="dashed")
```

There is substantial heterogeneity between countries. While an electric car charged with the average electricity mix in Sweden, France or Austria would have very low emissions, in Poland it would have much higher emissions than the average petrol or diesel car.
Given the EU’s focus on a unified market, it is understandable that the regulation does not differentiate where in the EU a new car is registered. However, it is still hard to see any environmental benefit of electric cars driven in Poland that would justify the enormous implicit subsidies that the current fleet regulation implies.
One should also keep in mind that while electricity production in France and Sweden emits very little carbon, it creates nuclear waste. One can hold different opinions about the relative dangers of climate change and nuclear power plants, though.
Discussion: Fleet CO2 targets vs pricing actual CO2 emissions

Are tight CO2 fleet standards a good instrument to reduce CO2 emissions?
Basic economic analysis rather points to a uniform price on every tonne of actually emitted CO2 (and the CO2 equivalents of other GHG) as the optimal instrument to fight climate change in a cost-efficient way. This price could be imposed by a tax or via a comprehensive CO2 trading scheme that augments the current EU certificate trading with certificates on fossil fuel consumption for transport and household heating.
Compared to such higher CO2 prices, fleet targets have several disadvantages (see also here or here for more detailed discussions):

1. CO2 fleet targets do not take into account how many km a car is driven and thus provide no incentive to drive less. In contrast, higher CO2 prices on fuels provide incentives both for buying more fuel-efficient cars and for driving less.

2. Fleet targets only influence the price of newly sold cars. If new car prices increase substantially due to tighter standards, incentives to buy new cars go down. The average car fleet gets older, which reduces fuel efficiency. Higher CO2 prices on fuel, by contrast, provide incentives to buy new, more fuel-efficient cars.

3. We have seen how the CO2 emissions of electric cars depend strongly on where they are driven. The EU fleet standards do not account for this. With a common CO2 price for car and power plant fuels, it would automatically be relatively more attractive to drive electric cars in countries that produce electricity with little CO2 emissions than in countries that mainly rely on coal power plants.

4. The actual fleet target regulation contains compromises like the weight adjustment that reduce the incentives to build more CO2-efficient cars.
There are also some possible advantages of fleet targets compared to direct CO2 pricing:

1. Consumers could be myopic and not sufficiently account for the fuel cost savings of more efficient cars in their buying decisions. They may react more strongly to a direct increase in a car’s price induced by tighter fleet standards. But e.g. Busse et al. (2013) find little evidence for such myopia.

2. Fleet standards that are not too tight, and that can be satisfied if manufacturers make a reasonable effort to reduce emissions, can provide incentives for car manufacturers to make cars more efficient without actually imposing a tax. This leads to a lower tax burden for car users than higher CO2 prices would, which may make fleet standards politically easier to implement and perhaps considered fairer (in particular by car users).

3. Even tight fleet standards, which effectively imply taxes on new cars, may be politically easier to implement than comprehensive, higher CO2 prices. First, fleet standards are not called a tax, which probably gives less scope for political attacks. Second, one could possibly argue that poorer people who cannot afford new or big cars will not be hurt as much (of course, the weight adjustment partly countervails this argument). On the other hand, one could also support poorer households by redistributing the revenues from higher CO2 prices. Third, we actually observe that quite tight fleet standards have been implemented in the EU, while there is still no comprehensive EU-wide CO2 certificate trading in sight that also encompasses emissions from car fuels and heating systems. This suggests that it was indeed politically easier to implement the tight fleet regulation, even though it seems less sensible than higher CO2 prices from an economic point of view.
There are some additional rules in the EU regulation, which I want to describe in the last 3 sections of this (quite long) blog post.
In 2020 (but no longer from 2021 onward), the EU fleet regulation allows pools to ignore the 5% of cars with the highest CO2 emissions when computing their fleet average. The following plot shows how this exemption reduces the average fees of the 8 largest pools. We also add the further reductions from super credits for comparison.
```r
dat2018 = dat2018 %>%
  group_by(pool8) %>%
  arrange(co2) %>%
  mutate(
    Q_pool = sum(q),
    q_cumsum = cumsum(q),
    perc_start = (q_cumsum - q) / Q_pool,
    perc_end = q_cumsum / Q_pool,
    q_95 = ifelse(perc_end <= 0.95, q, 0)
  ) %>%
  mutate_rows(perc_start < 0.95 & perc_end > 0.95,
    q_95 = q * (0.95 - perc_start)/(perc_end - perc_start)
  ) %>%
  ungroup()

df = dat2018 %>%
  filter(pool8 != "Other") %>%
  mutate(
    q_95_sc = ifelse(co2 < 50, 2*q_95, q_95)
  ) %>%
  group_by(pool8) %>%
  summarize(
    co2_95 = weighted.mean(co2, q_95, na.rm=TRUE),
    co2_95_sc = weighted.mean(co2, q_95_sc, na.rm=TRUE),
    co2_100 = weighted.mean(co2, q, na.rm=TRUE),
    excess_95 = co2_95 - first(target_pool),
    excess_95_sc = co2_95_sc - first(target_pool),
    excess_100 = co2_100 - first(target_pool)
  ) %>%
  mutate(pool8 = fct_reorder(pool8, co2_100)) %>%
  tidyr::pivot_longer(
    cols = c("excess_95","excess_95_sc", "excess_100"),
    names_to = "scheme", values_to = "excess"
  ) %>%
  mutate(fee = excess*95)

ggplot(df, aes(pool8, fee, color=scheme)) +
  geom_line(aes(group=pool8), color="black") +
  geom_point(size=3) +
  coord_flip()
```

While the omission of the 5% worst cars helps to reduce fees, all car companies would still pay substantial positive fees with their 2018 fleets in 2020.
Small manufacturers
The following code shows the 5 cars in our data set from 2018 with the highest CO2 emissions:
# 5 entries with the highest fees
dat2018 %>%
  arrange(-co2) %>%
  slice(1:5) %>%
  select(pool, firm, country, cn, mass, co2, excess_co2, fee)
## # A tibble: 5 x 8
##   pool        firm    country cn              mass   co2 excess_co2    fee
## 1 Other       Other   IT      PHANTOM VIII    2635   548       451. 39064.
## 2 vw group pc bugatti AT      BUGATTI CHIRON  2068   516       420. 37818.
## 3 vw group pc bugatti BG      CHIRON          2070   516       420. 37812.
## 4 vw group pc bugatti DE      ""              2070   516       420. 37812.
## 5 vw group pc bugatti DE      BUGATTI CHIRON  2070   516       420. 37812.

The Phantom VIII on top is actually produced by Rolls-Royce; if it did not fall under the largest 5% or were not otherwise exempted, it would increase the CO2 fee by 39064 Euro in 2020. The Bugattis on ranks 2-5 would correspond to similarly high fees. Will these fees materialize from 2021 onward, when the 5% rule expires? That is not clear.
Article 11 of the EU regulation allows manufacturers that sell fewer than 10,000 new cars per year in the EU to apply for a derogation. This essentially means that they are exempt from the actual regulation but instead have to meet individual reduction targets based on their historic CO2 efficiency.
Rolls-Royce and Bugatti also sell far fewer than 10,000 cars a year in the EU. However, they are subsidiaries of BMW and VW, respectively, and the EU regulation grants such small subsidiaries a derogation only if they operate their own production facilities and design centers. I could not find data on which manufacturers were granted a derogation and under which conditions.
Let us look at a list of derogation candidates with high CO2 emissions (above the pre-2020 target of 130 g CO2/km):
dat = dat %>%
  group_by(firm, year) %>%
  mutate(
    firm.q = sum(q, na.rm=TRUE),
    firm.co2 = weighted.mean(co2, q, na.rm=TRUE),
    dero.cand = (firm.q < 10000 & firm.co2 > 130) | tolower(firm) == "other"
  )

dero.cand = dat %>%
  group_by(pool, firm, year) %>%
  filter(dero.cand) %>%
  summarize(
    co2 = weighted.mean(co2, q, na.rm=TRUE),
    q = sum(q, na.rm=TRUE)
  ) %>%
  group_by(pool, firm) %>%
  summarize(
    co2 = weighted.mean(co2, q, na.rm=TRUE),
    q = round(mean(q, na.rm=TRUE))
  ) %>%
  arrange(-q)

head(dero.cand, 10)
## # A tibble: 10 x 4
## # Groups: pool [8]
##    pool          firm        co2     q
##  1 ssangyong     ssangyong  182.  5932
##  2 Other         Other      148.  5161
##  3 fca italy spa maserati   211.  5098
##  4 vw group pc   quattro    227.  5069
##  5 subaru        subaru     166.  3501
##  6 fca italy spa jeep       146.  3309
##  7 mg            mg         137.  3124
##  8 vw group pc   bentley    291.  2587
##  9 ferrari       ferrari    299.  2413
## 10 renault       vaz        193.  2023

# Mean number of maximally derogated cars per year
sum(dero.cand$q)
## [1] 53671

# Percentage of the average number of new cars per year
round(100*sum(dero.cand$q) / (sum(dat$q, na.rm=TRUE)/7), 2)
## [1] 0.39

So on average no more than 0.39 percent of newly registered cars should fall under a derogation. The following plot compares the average CO2 emissions of the largest pools when we remove all derogation candidates to the case where we don't.
pool.dero.dat = dat %>%
  filter(!dero.cand) %>%
  group_by(pool8, year) %>%
  summarize(
    co2_pool = weighted.mean(co2, q, na.rm=TRUE),
    mass_pool = weighted.mean(mass, q, na.rm=TRUE)
  ) %>%
  left_join(reg.df, by="year") %>%
  mutate(
    target_pool = target + a*(mass_pool - M0),
    target2020_pool = 95 + 0.033*(mass_pool - 1379.88)
  )

ggplot(pool.dat, aes(x=year, y=co2_pool)) +
  facet_wrap(~pool8) +
  geom_line() +
  geom_point() +
  geom_line(data=pool.dero.dat, color="blue")

It makes almost no visual difference (except a bit for FCA) whether these small subsidiaries that are derogation candidates are removed from the pools or not. Even removing these cars from the category Other matters only marginally.
Eco innovations
Car producers may implement some eco innovations that reduce CO2 emissions in a fashion that is not measured by the standardized test cycles. The following code shows, for the 8 largest pools, the CO2 reductions in g/km through such eco innovations:
dat2018 %>%
  mutate_rows(!is.finite(reduction_nedc), reduction_nedc=0) %>%
  group_by(pool8) %>%
  summarize(eco_innovation = weighted.mean(reduction_nedc, q))
## # A tibble: 9 x 2
##   pool8           eco_innovation
## 1 vw group pc         0.00922
## 2 psa-opel            0.00171
## 3 renault             0.00000788
## 4 ford-werke gmbh     0.00396
## 5 bmw group           0.356
## 6 daimler ag          0.493
## 7 toyota-mazda        0.0134
## 8 Other               0.00683
## 9 fca-tesla           0.00118

Even for the pool with the largest eco innovations (Daimler), the CO2 reductions are only a little below 0.5 g CO2/km on average and thus so far almost negligible.
Appendix: Who understands the details of the cap on super credits?
Article 5a on super credits states that there is a cap of at most 7.5 g CO2/km by super credits for the total period 2020-2022. I am not sure how that is exactly implemented. If Article 5a referred to pools, it would probably mean that over the three years together a pool's average CO2 emissions with super credits can at most be 7.5 g CO2/km lower than without super credits. But what if the pool composition changes during those 3 years? Also, Article 5a refers to a manufacturer, not a pool, and Article 7 about pools only says that pools are relevant for the obligations in Article 4; it says nothing about Article 5a.
Assume Article 5a refers only to manufacturers, not to pools. Let us take a look at the reduction of the average CO2 emissions due to super credits on a firm level in 2020 using the 2018 fleet:
dat2018 %>%
  mutate(q_sc = ifelse(co2 < 50, q*2, q)) %>%
  group_by(firm) %>%
  summarize(
    co2_firm = weighted.mean(co2, q, na.rm=TRUE),
    co2_sc_firm = weighted.mean(co2, q_sc, na.rm=TRUE),
    reductions_sc = co2_firm - co2_sc_firm
  ) %>%
  filter(firm %in% c("tesla", levels(dat2018$firm11))) %>%
  arrange(-reductions_sc)
## # A tibble: 13 x 4
##    firm       co2_firm co2_sc_firm reductions_sc
##  1 bmw            128.        124.       3.74
##  2 renault        111.        108.       3.05
##  3 daimler        134.        132.       2.05
##  4 volkswagen     120.        118.       1.93
##  5 Other          146.        145.       1.51
##  6 audi           129.        128.       0.769
##  7 citroen        108.        108.       0.286
##  8 toyota         103.        102.       0.280
##  9 peugeot        108.        108.       0.205
## 10 opel           125.        125.       0.177
## 11 ford           123.        123.       0.00226
## 12 fiat           120.        120.       0
## 13 tesla            0           0        0

Already for the 2018 fleet, the super credit reductions for BMW would be 3.74 g CO2/km in 2020. Given that there are strong incentives to sell more electric cars, the cap will probably be binding for BMW and possibly some other manufacturers.
But I am still not sure how the cap will be computed. What happens if Tesla joins some pool in 2020 and some other in 2021? For Tesla itself super credits don’t reduce the CO2 emissions as they are always 0. Yet inside a pool, Tesla’s super credits may reduce CO2 emissions substantially. How will the cap be computed? Maybe it is clear for lawyers (and probably Tesla employs clever people who have figured it out) but as an economist, I really think that some example calculation would make this regulation easier to understand.
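To make the arithmetic of one possible reading concrete (my own sketch, not a legal interpretation; the yearly numbers are made up): suppose the cap means that a manufacturer's super-credit reductions are credited year by year until their cumulative total over 2020-2022 reaches 7.5 g CO2/km.

```r
# Hypothetical yearly super-credit reductions (g CO2/km) in 2020, 2021, 2022.
reduction <- c(3.7, 4.0, 2.5)

# Credit year by year until the cumulative total hits the 7.5 g cap.
credited <- diff(c(0, pmin(cumsum(reduction), 7.5)))
credited       # 3.7 3.8 0.0: the 2021 credit is partly cut, the 2022 credit fully
sum(credited)  # 7.5, the cap
```

Whether the regulation actually works like this, and how a manufacturer's credits are attributed when it changes pools, is exactly what remains unclear to me.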
To leave a comment for the author, please follow the link and comment on their blog: Economics and R – R posts.
Business Case Analysis with R (Guest Post)
[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].
Learning Machines proudly presents a fascinating guest post by decision and risk analyst Robert D. Brown III, with a great application of R in the business and especially start-up arena! I encourage you to visit his blog too: Thales’ Press. Have fun!
While Excel remains the tool of choice among analysts for business case analysis, presumably because of its low barrier to entry and ease of use, it nonetheless continues to be the source of many vexing and disastrous errors. As the research from Panko and others has revealed, these errors arise in part for the following reasons:
 Analysts persistently lack QA standards and practices to ensure that spreadsheets are as error-free as possible.
 Cell referencing and replicating formulas across relevant dimensions propagate errors at exponential rates.
 Effective auditing is obscured because the conceptual meaning of formulas constructed from grid coordinates is difficult to interpret.
To suggest a means to remedy that situation, in the spring of 2018 I published Business Case Analysis with R: Simulation Tutorials to Support Complex Business Decisions (BCAwR). I also wanted to showcase how the popular programming language R could be employed to evaluate business case opportunities that are fraught with uncertainty and not supported by an abundance of well-structured, pedigreed, and relevant data, a situation that plagues just about all business decisions faced in the real world. BCAwR takes the reader through the process of analyzing a rather complex commercial investment problem of bringing a chemical processing plant online. For this post, I will present a simplified model to give a preview of the contents of the book. I will also provide a little surprise at the end that I hope will contribute to bridging the efforts of the decision and data science communities.
But before we delve into any coding or quantitative activity related to the analysis of any business case, we ought to thoroughly frame our problem: narrowing the scope of our inquiry to the appropriate level for the context of the problem at hand, identifying the goals and objectives we want to achieve when we eventually commit to action, partitioning the set of actionable decisions we can make to achieve our goals, and creating an inventory of the influences and relationships that connect decisions and uncertainties to the measures of success. Although we will look at a much more simplified business problem here, a little framing will still serve us well.
Imagine that we are considering investing in a project that solves a particular market problem with the ultimate desire to generate a financial return for our organization. We identify three possible strategic pathways to compare for the investment, each with a different level of capital commitment required to address certain nuances of the market problem. Each (1) investment, in turn, will generate (2) an initial post-investment net cash flow, (3) a peak cash flow, and (4) a varying duration of time to reach the peak from the initial cash flow.
To complicate the matter, for each strategy, we do not know the precise quantitative value of the four parameters that characterize the potential cash flow. For this reason, when we evaluate the three strategies, we will employ a Monte Carlo simulation approach that samples a large number of potential realizations for each parameter from probability distributions. With these ideas in mind, we can now set up the coding to handle the simulation problem.
library(leonRdo)   # https://www.incitedecisiontech.com/packages/packages.html
library(tidyverse) # (on CRAN)

# Set the simulation seed.
set.seed(42)

# The number of simulation runs per uncertain variable.
trials <- 1000

# The time frame across which our cash flow simulation will run.
# The initial investment will occur in year 0.
year <- 0:15

# The weighted average cost of capital without any additional risk premium.
discount.rate <- 0.1

# The distinctive capital allocation strategic decisions.
strategy <- c("Strategy 1", "Strategy 2", "Strategy 3")

R employs simple Monte Carlo techniques, which don't typically yield stable means for result distributions until many thousands of trials are used. When developing a model like this, I suggest setting trials to 50-100 to make sure that the logic runs correctly in short, responsive time frames. Then, when you're ready to produce real results, change the value to greater than 10,000. Given that Monte Carlo in R is noisy even for reasonably large samples (~1,000) and for different seed settings, I recently developed the leonRdo package (R Packages: leonRdo & inteRest) to provide a median Latin hypercube approach to sampling that produces much more stable means with approximately 1/10th the number of trials, regardless of the seed value used.
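The intuition behind median Latin hypercube sampling can be sketched in a few lines of base R (this is an illustration only, not leonRdo's actual implementation): instead of drawing n independent uniforms, take the midpoints of n equal-probability strata, shuffle them, and push them through the inverse CDF. By construction, the sample mean is then stable across seeds.

```r
# Sketch of 1-D median Latin hypercube sampling for a normal distribution.
rnorm_mlhs_sketch <- function(n, mean = 0, sd = 1) {
  u <- (seq_len(n) - 0.5) / n             # midpoints of n equal-probability strata
  qnorm(sample(u), mean = mean, sd = sd)  # shuffle, then inverse-CDF transform
}

x <- rnorm_mlhs_sketch(100, mean = 100, sd = 20)
mean(x)  # 100 (up to floating point), for any seed
```

Because the stratum midpoints are symmetric around 0.5, their normal quantiles cancel in pairs, so the sample mean matches the target mean exactly; plain rnorm() only approaches this as n grows large.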
In this particular showcase discussion, I have included the tidyverse package (on CRAN) to help with data manipulation and graphing. In the original BCAwR, I based all the code on base R so that readers new to R and simulation could focus more on learning the key principles over the idiosyncratic syntax of secondary packages.
The set.seed() function ensures that we will produce the same sample trial set on each run of our model. This will help us debug problems as we develop code. Eventually as others might interact with our code, they will be able to observe the same set of results we work with. Later, we can reset the seed to other values to make sure that our inferences on a given set of trials remain consistent on a different trial set.
Model Functions
Next, we need to declare some important functions that we will call in our model. The first provides an analytic solution to the differential equation that relates the proportional absorption of an entity into a fixed environment. This ramp-up function takes as a parameter the amount of time Tp required to go from, say, Y = 1% absorption to Y = 99% across Time. We will use this function to model the growth of our cash flow over the year index.
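Before looking at the function itself, the closed form can be checked numerically: with Y0 = 0.01 the curve should pass through 1% at Time = 1 and through 99% at Time = Tp (here Tp = 8.5, a value we will use for Strategy 1 later; the check is my own addition).

```r
# Numerical check of the ramp-up curve's boundary behavior.
Y0 <- 0.01
Tp <- 8.5
ramp <- function(Time) 1 / (1 + (Y0 / (1 - Y0))^(2 * (Time - 1) / (Tp - 1) - 1))
ramp(1)   # 0.01: absorption starts at Y0
ramp(Tp)  # 0.99: absorption reaches 1 - Y0 at the time to peak
```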
# Ramp-up function.
calcRampUp <- function(Y0, Time, Tp) {
  # Y0 = initial absorption
  # Time = time index
  # Tp = time to Y = 1 - Y0, the time to peak
  Y <- 1 / (1 + (Y0 / (1 - Y0)) ^ (2 * (Time - 1) / (Tp - 1) - 1))
  return(Y)
}

The second function we need is the actual business model we want to represent. The business model logic will remain fixed across strategies, and the results will vary only on the basis of the strategy-conditional parameters we supply. By "functionalizing" our model, we can iterate the entire model over variable ranges to understand the sensitivity of the output function to those variables, as we shall see later. Please note, this is a toy model we are using for illustration purposes only. BCAwR demonstrates a much more complex consideration for the implementation of a chemical processing plant.
# Business model cash flow.
calcCashFlow <- function(I, SCF, PCF, YP, Y) {
  # I = initial investment
  # SCF = starting cash flow
  # PCF = peak cash flow
  # YP = year of peak cash flow
  # Y = year index
  cf <- (Y == 0) * (-I) + (Y > 0) * (SCF + (PCF - SCF) * calcRampUp(0.01, Y, YP))
  return(cf)
}

The last function we will employ is the net present value (NPV) function that we will apply to the individual trial results of the cash flow simulation across strategies.
# The net present value of the cash flow function.
calcNPV <- function(CF, Y, DR) {
  # CF = cash flow vector
  # Y = year index
  # DR = discount rate
  npv <- sum(CF / (1 + DR) ^ Y)
  return(npv)
}

Initialize Simulation Trial Samples
Our next task is to create trial samples of the business model cash flow function parameters for the three project strategy investments. For this particular example, I have chosen to use canonical distributions as if their parameters were based on empirical or historical data, and to utilize a simple method for generating trials. However, typically I would use a combination of both empirical data and subject matter expert guidance, reflected as cumulative probabilities across the range of the assessed variables' potential values. I explain how to take this latter approach in greater detail in Section III of BCAwR.
investment1 <- rlnorm_mlhs(n = trials, meanlog = log(800), sdlog = log(1 + 200 / 800))
investment2 <- rlnorm_mlhs(n = trials, meanlog = log(700), sdlog = log(1 + 50 / 700))
investment3 <- rlnorm_mlhs(n = trials, meanlog = log(1000), sdlog = log(1 + 150 / 1000))

start.cash.flow1 <- rnorm_mlhs(n = trials, mean = -100, sd = 20)
start.cash.flow2 <- rnorm_mlhs(n = trials, mean = -90, sd = 5)
start.cash.flow3 <- rnorm_mlhs(n = trials, mean = -120, sd = 15)

peak.cash.flow1 <- rnorm_mlhs(n = trials, mean = 300, sd = 20)
peak.cash.flow2 <- rnorm_mlhs(n = trials, mean = 280, sd = 5)
peak.cash.flow3 <- rnorm_mlhs(n = trials, mean = 375, sd = 15)

yr.peak1 <- rnorm_mlhs(n = trials, mean = 8.5, sd = 0.75)
yr.peak2 <- rnorm_mlhs(n = trials, mean = 10, sd = 0.85)
yr.peak3 <- rnorm_mlhs(n = trials, mean = 11, sd = 1)

# Store the business model parameter samples in a list of data frames.
proj.data <- list(
  investment = data.frame(investment1, investment2, investment3),
  start.cash.flow = data.frame(start.cash.flow1, start.cash.flow2, start.cash.flow3),
  peak.cash.flow = data.frame(peak.cash.flow1, peak.cash.flow2, peak.cash.flow3),
  yr.peak = data.frame(yr.peak1, yr.peak2, yr.peak3)
)

Model Results
For each strategy, we apply the samples from each parameter to the business model cash flow function. This results in a list of cash flows for the three project strategies. For each strategy, there will be as many cash flows as defined by trials, and each cash flow trial will be as long as the year index. We can use the lapply() and sapply() functions to avoid using for loops [see also Learning R: A gentle introduction to higher-order functions].
proj.cf <- lapply(1:length(strategy), function(s) {
  sapply(1:trials, function(t) {
    calcCashFlow(
      I = proj.data$investment[[s]][t],
      SCF = proj.data$start.cash.flow[[s]][t],
      PCF = proj.data$peak.cash.flow[[s]][t],
      YP = proj.data$yr.peak[[s]][t],
      Y = year
    )
  })
})
names(proj.cf) <- strategy

By running the first five trials of Strategy 1, we can see what the cash flows look like.
head(round(proj.cf[[1]][, 1:5], digits = 1), n = length(year))
##         [,1]   [,2]   [,3]   [,4]   [,5]
##  [1,] -852.7 -852.2 -766.7 -974.3 -912.6
##  [2,]  -85.8 -103.1 -117.2 -116.1  -88.1
##  [3,]  -77.8  -93.6 -107.3 -106.9  -79.9
##  [4,]  -55.3  -64.3  -76.7  -79.9  -57.1
##  [5,]    0.0    9.4    0.2   13.1    1.6
##  [6,]   98.0  128.7  123.7  102.2   97.9
##  [7,]  201.5  230.2  227.2  215.6  206.8
##  [8,]  265.3  279.2  276.2  279.8  276.9
##  [9,]  292.5  296.5  293.3  305.3  308.0
## [10,]  302.3  301.8  298.5  314.1  319.5
## [11,]  305.6  303.4  300.1  316.9  323.5
## [12,]  306.7  303.9  300.5  317.8  324.8
## [13,]  307.1  304.1  300.7  318.1  325.2
## [14,]  307.2  304.1  300.7  318.2  325.4
## [15,]  307.2  304.1  300.7  318.2  325.4
## [16,]  307.2  304.1  300.7  318.2  325.5

We can calculate some summary results for the cash flows for each strategy and plot them. First, we might like to plot the mean cash flow over time.
proj.cf.mean <- as.data.frame(lapply(strategy, function(s) {
  rowMeans(proj.cf[[s]])
}))
names(proj.cf.mean) <- strategy
proj.cf.mean <- cbind(year, proj.cf.mean)

# Plot the mean strategy cash flows.
gg.mean.cf <- ggplot(gather(proj.cf.mean, "strategy", "mean", -year)) +
  geom_line(aes(x = year, y = mean, color = strategy)) +
  geom_point(aes(x = year, y = mean, color = strategy)) +
  labs(title = "Mean Strategy Cash Flow", y = "[$M]")
plot(gg.mean.cf)

Figure 1: The mean cash flows over time show tradeoffs in initial outlays, times to peak, level of peak, and the timing of breakeven.

Then we can calculate the cumulative mean cash flow and plot that for each strategy, too.
proj.ccf.mean <- proj.cf.mean %>%
  mutate_at(vars(strategy), list(~ cumsum(.)))

# Plot the cumulative mean strategy cash flows.
gg.mean.ccf <- ggplot(gather(proj.ccf.mean, "strategy", "mean", -year)) +
  geom_line(aes(x = year, y = mean, color = strategy)) +
  geom_point(aes(x = year, y = mean, color = strategy)) +
  labs(title = "Cumulative Mean Strategy Cash Flow", y = "[$M]")
plot(gg.mean.ccf)

Figure 2: The cumulative mean cash flows over time show tradeoffs in drawdown, nominal payback timing, and terminal cash recovery level.

Now we can observe the risk profile of the cash flows by calculating the trial NPVs of the cash flows.
proj.npv <- as.data.frame(lapply(1:length(strategy), function(s) {
  sapply(1:trials, function(t) {
    calcNPV(CF = proj.cf[[s]][, t], Y = year, DR = discount.rate)
  })
}))
names(proj.npv) <- strategy

# Plot the CDF of the strategies' sample NPVs.
gg.ecdf <- ggplot(gather(proj.npv, "strategy", "NPV"),
                  aes(x = NPV, color = strategy)) +
  stat_ecdf(geom = "point") +
  labs(title = "Strategy NPV Risk Profile",
       x = "Strategy NPV [$M]", y = "Cumulative Probability")
plot(gg.ecdf)

Figure 3: The cumulative probability chart shows tradeoffs in strategy dominance, range of value, and relative value volatility.

And we can calculate the mean NPV of the cash flows.
proj.npv.mean <- round(colMeans(proj.npv), digits = 1)
print(proj.npv.mean)
## Strategy 1 Strategy 2 Strategy 3
##      157.4       59.3     -123.6

The first observation we make about the risk profiles of the strategies is that we can dismiss Strategy 3 immediately: it presents a negative mean NPV, the probability that it produces a negative economic value is ~75% (i.e., probability(NPV <= 0) = 0.75), and practically none of its trials presents any opportunity for dominance over any other strategy.
When we observe the relative volatility and dominance of the remaining strategies, we realize that we face a bit of ambiguity about how to choose the best pathway forward. Strategy 1 exhibits the best overall mean NPV, but it does so with the greatest relative volatility. While Strategy 2 exhibits approximately the same risk of failure (~25%) as Strategy 1 (~27%), it also exhibits the least maximum exposure and relative volatility. To reduce the ambiguity of choosing, we might like to know which uncertainty, due to the overall quality of the information we possess about it, contributes most to switching dominance from Strategy 1 to Strategy 2. Knowing which uncertainty our strategy values are most sensitive to gives us the ability to stop worrying about the other uncertainties for the purpose of choosing clearly, and to focus only on improving our understanding of those that matter most.
To accomplish that latter feat, we run our functionalized model to test the sensitivity of the strategy means to the 80th percentile range in each of the variables. We start by initializing the lists to hold the sensitivity responses.
data.temp <- proj.data
proj.npv.sens <- list(p10 = proj.data, p90 = proj.data)
d <- data.frame(0, 0, 0)
dd <- list(
  investment = d,
  start.cash.flow = d,
  peak.cash.flow = d,
  yr.peak = d
)
proj.npv.sens.mean <- list(p10 = dd, p90 = dd)
p1090 <- c(0.1, 0.9)
for (q in 1:length(p1090)) {
  for (v in 1:length(proj.data)) {
    for (s in 1:length(strategy)) {
      data.temp[[v]][s] <- rep(quantile(unlist(proj.data[[v]][s]), probs = p1090[q]), trials)
      for (t in 1:trials) {
        proj.npv.sens[[q]][[v]][t, s] <- calcNPV(
          CF = calcCashFlow(
            I = data.temp$investment[[s]][t],
            SCF = data.temp$start.cash.flow[[s]][t],
            PCF = data.temp$peak.cash.flow[[s]][t],
            YP = data.temp$yr.peak[[s]][t],
            Y = year
          ),
          Y = year,
          DR = discount.rate
        )
      }
      proj.npv.sens.mean[[q]][[v]][s] <- mean(proj.npv.sens[[q]][[v]][, s])
      data.temp <- proj.data
    }
  }
}

# Recast the sensitivity values to a form that can be plotted.
variable.names <- c("investment", "start.cash.flow", "peak.cash.flow", "yr.peak")
proj.npv.sens.mean2 <- as.data.frame(sapply(1:length(p1090), function(q) {
  sapply(1:length(proj.data), function(v) {
    sapply(1:length(strategy), function(s) {
      proj.npv.sens.mean[[q]][[v]][[s]]
    })
  })
})) %>%
  rename(p10 = V1, p90 = V2) %>%
  mutate(
    variable = rep(variable.names, each = length(strategy)),
    strategy = rep(strategy, times = length(proj.data)),
    strategy.mean = rep(t(proj.npv.mean), times = length(proj.data))
  ) %>%
  select(variable, strategy, strategy.mean, p10, p90)

gg.sens <- ggplot(proj.npv.sens.mean2,
                  aes(x = variable, y = p10, yend = p90, color = strategy)) +
  geom_segment(aes(x = variable, xend = variable, y = p10, yend = p90),
               color = "gray50", size = 0.75) +
  geom_point(aes(x = variable, y = strategy.mean),
             size = 1, color = "black", fill = "black", shape = 23) +
  geom_point(aes(x = variable, y = p90), size = 3) +
  geom_point(aes(x = variable, y = p10), size = 3) +
  labs(title = "Sensitivity of Strategy Mean NPV to Uncertain Variable Ranges",
       y = "Strategy NPV [$M]") +
  coord_flip() +
  theme(legend.position = "none") +
  facet_wrap(~ strategy)
plot(gg.sens)

Figure 4: The sensitivity chart shows the range of variation in a strategy mean due to an 80th percentile variation in a given uncertainty.
The mean value of each strategy is marked by the black diamond within the range of each variable. According to our results here, we should focus additional research attention on the investment range first and then on the year to peak.

This sensitivity chart is valuable not only to the decision analyst but to the data analyst as well, for one simple reason: now that we know which uncertainties potentially expose us to the greatest risk or reward, we can calculate the value of information on those uncertainties. This value of information represents the rational maximum budget we should allocate to acquire better data and insight on those uncertainties. While a typical business case analysis of sufficient complexity may contain 10-100 uncertainties, we should focus our research budget only on those that are critical to making a decision. This simple "stopping function" helps us ensure that our data science resources are allocated properly and most efficiently. Section IV of BCAwR provides a means to calculate the value of information on continuous variables like the ones we have utilized in our example here.
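As a concrete illustration of that budget logic (my own sketch with toy numbers, not code from BCAwR): the expected value of perfect information is the difference between the expected NPV when we could pick the best strategy after the uncertainty resolves and the expected NPV of the best single strategy chosen now.

```r
# Sketch: expected value of perfect information (EVPI) from a matrix of
# simulated NPVs (rows = trials, columns = strategies). Toy distributions only.
set.seed(42)
npv <- cbind(s1 = rnorm(10000, 157, 300),   # volatile, higher mean
             s2 = rnorm(10000,  59, 100))   # stable, lower mean

best.now   <- max(colMeans(npv))         # commit to one strategy today
best.later <- mean(apply(npv, 1, max))   # choose per trial with perfect foresight
evpi <- best.later - best.now            # rational upper bound on the research budget
evpi
```

EVPI is always non-negative, and it shrinks toward zero when one strategy dominates in almost every trial, which is exactly the "stopping function" intuition described above.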
Conclusion
Business decision analysts can definitely use R effectively to conduct many business case analyses in a more transparent and less error-prone environment than their current tools typically allow, especially through the use of clearly named variables, array abstraction, and functionalized structural models. However, the real power that R provides is the means to simulate information when high-quality, large-scale data may be scarce, infeasible, or impossible to obtain. Then, by using such concepts as the value of information, data scientists can be called upon to refine the quality of information on those critical uncertainties that may frustrate success when commitment to action turns into real decisions. I hope the simple example presented here inspires you to learn more about it.
To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.
eRum2020: call for submissions open!
[This article was first published on MilanoR, and kindly contributed to R-bloggers].
The new year has started and we at MilanoR are deep into the frantic organization of the third European R User Meeting (eRum). eRum will take place next May (27-30) in Milan; it is a nonprofit event and completely community-driven. An impressive array of experts, enthusiasts and R professionals will animate the conference. We expect more than 500 R users and a rich showcase of talks in six different tracks. MilanoR is an organizing partner, in collaboration with Università di Milano-Bicocca and Politecnico di Milano.
Submit your work to eRum!
The call for contributed sessions is now officially open. You can now submit abstracts for the following types of contributed sessions (the deadline for submissions is January 29, 2020):
 Workshops
 Regular Talks
 Lightning Talks
 Shiny Demos
 Posters
If you are an R developer or lover, this is a great way to showcase your work and find peers and opportunities. The whole team and collaborators of MilanoR are dedicating plenty of time to building a great event. Don’t miss this chance and submit your work! You can easily do it by clicking this link.
Become a Sponsor
eRum2020 is also a great sponsorship opportunity. Besides our sponsors, there is still room for other companies and associations working with R in the data world willing to sponsor the conference. By becoming a sponsor you will have high visibility and access to an audience of R professionals from all over Europe. You could be part of the conference with a talk or workshop, and you will certainly be able to find and recruit skilled and motivated talents. Download the sponsorship packages brochure or write to us (sponsors@erum.io).
Book your agenda, submit your R work and see you at eRum2020!
(Do not miss our social activity on Twitter, Facebook or LinkedIn)
The post eRum2020: call for submissions open! appeared first on MilanoR.
To leave a comment for the author, please follow the link and comment on their blog: MilanoR.
eRum2020: call for submissions open!
[This article was first published on MilanoR, and kindly contributed to Rbloggers]. (You can report issue about the content on this page here) Want to share your content on Rbloggers? click here if you have a blog, or here if you don't.
The new year has started and we at MilanoR are deep into the frantic organization of the third European R User Meeting (eRum). eRum will take place next May (2730) in Milan, it is a nonprofit event and completely communitydriven. An impressive array of experts, enthusiasts and R professionals will animate the conference. We expect more than 500 R users and a rich showcase of talks in six different tracks. MilanoR is organizing partner in collaboration with Università di Milano Bicocca and Politecnico di Milano.
Submit your work to eRum!
The call for contributed sessions is now officially open. You can submit abstracts for the following types of contributed sessions (the deadline for submissions is January 29, 2020):
 Workshops
 Regular Talks
 Lightning Talks
 Shiny Demos
 Posters
If you are an R developer or enthusiast, this is a great way to showcase your work and find peers and opportunities. The whole MilanoR team and its collaborators are dedicating plenty of time to building a great event. Don't miss this chance and submit your work! You can easily do it by clicking this link.
Become a Sponsor
eRum2020 is also a great sponsorship opportunity. Besides our current sponsors, there is still room for other companies and associations working with R in the data world to sponsor the conference. As a sponsor, you will gain high visibility and access to an audience of R professionals from all over Europe. You could take part in the conference with a talk or workshop, and you will certainly be able to find and recruit skilled and motivated talent. Download the sponsorship packages brochure or write to us (sponsors@erum.io).
Book your agenda, submit your R work, and see you at eRum2020!
(Do not miss our social activity on Twitter, Facebook, or LinkedIn.)
The post eRum2020: call for submissions open! appeared first on MilanoR.
To leave a comment for the author, please follow the link and comment on their blog: MilanoR.
Biomedical Data Science Textbook Available
[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers.]
Data science is being used in many ways to improve healthcare and reduce costs. We have written a textbook, Introduction to Biomedical Data Science, to help healthcare professionals understand the topic and work more effectively with data scientists. The textbook content and data exercises do not require programming skills or higher math. We introduce open source tools such as R and Python, as well as easy-to-use interfaces to them such as BlueSky Statistics, jamovi, R Commander, and Orange. Chapter exercises are based on healthcare data, and supplemental YouTube videos are available for most chapters.
For instructors, we provide PowerPoint slides for each chapter, exercises, quiz questions, and solutions. Instructors can download an electronic copy of the book, the Instructor Manual, and PowerPoints after first registering on the instructor page.
The book is available in print and various electronic formats. Because it is self-published, we plan to update it more rapidly than would be possible through traditional publishers.
Below you will find a detailed table of contents and a list of the textbook authors.
OVERVIEW OF BIOMEDICAL DATA SCIENCE
 Introduction
 Background and history
 Conflicting perspectives
 the statistician’s perspective
 the machine learner’s perspective
 the database administrator’s perspective
 the data visualizer’s perspective
 Data analytical processes
 raw data
 data preprocessing
 exploratory data analysis (EDA)
 predictive modeling approaches
 types of models
 types of software
 Major types of analytics
 descriptive analytics
 diagnostic analytics
 predictive analytics (modeling)
 prescriptive analytics
 putting it all together
 Biomedical data science tools
 Biomedical data science education
 Biomedical data science careers
 Importance of soft skills in data science
 Biomedical data science resources
 Biomedical data science challenges
 Future trends
 Conclusion
 References
SPREADSHEET TOOLS AND TIPS
 Introduction
 basic spreadsheet functions
 download the sample spreadsheet
 Navigating the worksheet
 Clinical application of spreadsheets
 formulas and functions
 filter
 sorting data
 freezing panes
 conditional formatting
 pivot tables
 visualization
 data analysis
 Tips and tricks
 Microsoft Excel shortcuts – Windows users
 Google sheets tips and tricks
 Conclusions
 Exercises
 References
BIOSTATISTICS PRIMER
 Introduction
 Measures of central tendency & dispersion
 the normal and log-normal distributions
 Descriptive and inferential statistics
 Categorical data analysis
 Diagnostic tests
 Bayes’ theorem
 Types of research studies
 observational studies
 interventional studies
 meta-analysis
 Correlation
 Linear regression
 Comparing two groups
 the independent-samples t-test
 the Wilcoxon-Mann-Whitney test
 Comparing more than two groups
 Other types of tests
 generalized tests
 exact or permutation tests
 bootstrap or resampling tests
 Stats packages and online calculators
 commercial packages
 noncommercial or open source packages
 online calculators
 Challenges
 Future trends
 Conclusion
 Exercises
 References
DATA VISUALIZATION
 Introduction
 historical data visualizations
 visualization frameworks
 Visualization basics
 Data visualization software
 Microsoft Excel
 Google sheets
 Tableau
 R programming language
 other visualization programs
 Visualization options
 visualizing categorical data
 visualizing continuous data
 Dashboards
 Geographic maps
 Challenges
 Conclusion
 Exercises
 References
INTRODUCTION TO DATABASES
 Introduction
 Definitions
 A brief history of database models
 hierarchical model
 network model
 relational model
 Relational database structure
 Clinical data warehouses (CDWs)
 Structured query language (SQL)
 Learning SQL
 Conclusion
 Exercises
 References
BIG DATA
 Introduction
 The seven v’s of big data related to health care data
 Technical background
 Application
 Challenges
 technical
 organizational
 legal
 translational
 Future trends
 Conclusion
 References
BIOINFORMATICS and PRECISION MEDICINE
 Introduction
 History
 Definitions
 Biological data analysis – from data to discovery
 Biological data types
 genomics
 transcriptomics
 proteomics
 bioinformatics data in public repositories
 biomedical cancer data portals
 Tools for analyzing bioinformatics data
 command line tools
 web-based tools
 Genomic data analysis
 Genomic data analysis workflow
 variant calling pipeline for whole exome sequencing data
 quality check
 alignment
 variant calling
 variant filtering and annotation
 downstream analysis
 reporting and visualization
 Precision medicine – from big data to patient care
 Examples of precision medicine
 Challenges
 Future trends
 Useful resources
 Conclusion
 Exercises
 References
PROGRAMMING LANGUAGES FOR DATA ANALYSIS
 Introduction
 History
 R language
 installing R & rstudio
 an example R program
 getting help in R
 user interfaces for R
 R’s default user interface: rgui
 Rstudio
 menu & dialog guis
 some popular R guis
 R graphical user interface comparison
 R resources
 Python language
 installing Python
 an example Python program
 getting help in Python
 user interfaces for Python
 reproducibility
 R vs. Python
 Future trends
 Conclusion
 Exercises
 References
MACHINE LEARNING
 Brief history
 Introduction
 data refresher
 training vs test data
 bias and variance
 supervised and unsupervised learning
 Common machine learning algorithms
 Supervised learning
 Unsupervised learning
 dimensionality reduction
 reinforcement learning
 semi-supervised learning
 Evaluation of predictive analytical performance
 classification model evaluation
 regression model evaluation
 Machine learning software
 Weka
 Orange
 Rapidminer studio
 KNIME
 Google TensorFlow
 honorable mention
 summary
 Programming languages and machine learning
 Machine learning challenges
 Machine learning examples
 example 1 classification
 example 2 regression
 example 3 clustering
 example 4 association rules
 Conclusion
 Exercises
 References
ARTIFICIAL INTELLIGENCE
 Introduction
 definitions
 History
 AI architectures
 Deep learning
 Image analysis (computer vision)
 Radiology
 Ophthalmology
 Dermatology
 Pathology
 Cardiology
 Neurology
 Wearable devices
 Image libraries and packages
 Natural language processing
 NLP libraries and packages
 Text mining and medicine
 Speech recognition
 Electronic health record data and AI
 Genomic analysis
 AI platforms
 deep learning platforms and programs
 Artificial intelligence challenges
 General
 Data issues
 Technical
 Socioeconomic and legal
 Regulatory
 Adverse unintended consequences
 Need for more ML and AI education
 Future trends
 Conclusion
 Exercises
 References
Brenda Griffith
Technical Writer
Data.World
Austin, TX
Robert Hoyt MD, FACP, ABPMCI, FAMIA
Associate Clinical Professor
Department of Internal Medicine
Virginia Commonwealth University
Richmond, VA
David Hurwitz MD, FACP, ABPMCI
Associate CMIO
Allscripts Healthcare Solutions
Chicago, IL
Madhurima Kaushal MS
Bioinformatics
Washington University at St. Louis, School of Medicine
St. Louis, MO
Robert Leviton MD, MPH, FACEP, ABPMCI, FAMIA
Assistant Professor
New York Medical College
Department of Emergency Medicine
Valhalla, NY
Karen A. Monsen PhD, RN, FAMIA, FAAN
Professor
School of Nursing
University of Minnesota
Minneapolis, MN
Robert Muenchen MS, PSTAT
Manager, Research Computing Support
University of Tennessee
Knoxville, TN
Dallas Snider PhD
Chair, Department of Information Technology
University of West Florida
Pensacola, FL
A special thanks to Ann Yoshihashi MD for her help with the publication of this textbook.
To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.
How to import Python classes into R
[This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers.]
Background
This post is going to talk about how to import Python classes into R, which can be done using a really awesome package in R called reticulate. reticulate allows you to call Python code from R, including sourcing Python scripts, using Python packages, and porting functions and classes.
To install reticulate, we can run:
install.packages("reticulate")
Creating a Python class
Let's create a simple class in Python.
import pandas as pd

# define a python class
class explore:
    def __init__(self, df):
        self.df = df

    def metrics(self):
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)

Porting the Python class to R
There are a couple of ways we can port our Python class to R. One is by sourcing a Python script that defines the class. Let's suppose our class is defined in a script called "sample_class.py". We can use the reticulate function source_python.
# load reticulate package
library(reticulate)

# inside R, source the Python script
source_python("sample_class.py")
Running the command above not only makes the class we defined available to us in our R session, but also makes any other variables or functions defined in the script available as well (if those exist). Thus, if we define 10 classes in the script, all 10 classes will be available in the R session. We can refer to any specific method defined in a class using R's "$" notation, rather than Python's "dot" notation.
result <- explore(iris)

# get summary stats of the data
result$metrics()

# create dummy variables from factor / character fields
# (just one in this case)
result$dummify_vars()
One other note: when you import a class from Python, the class becomes a closure in R. You can see this by running R's typeof function:
typeof(explore)
# closure
Using R Markdown to switch between R and Python
Another way of using a Python class in R is via R Markdown. This feature is available in RStudio v1.2+ and allows us to write chunks of R code followed by chunks of Python code, and vice-versa. It also lets us pass variables, or even classes, from Python to R. Below, we write the same code as above, but this time using chunks in an R Markdown file. When we switch between R and Python chunks in R Markdown, we can reference Python objects (including classes) by typing py$name_of_object, where you just need to replace name_of_object with whatever you're trying to reference from the Python code. In the case below, we reference the explore class we created by typing py$explore.
```{r}
library(reticulate)
```

```{python}
import pandas as pd

# define a python class
class explore:
    def __init__(self, df):
        self.df = df

    def metrics(self):
        desc = self.df.describe()
        return desc

    def dummify_vars(self):
        for field in self.df.columns:
            if isinstance(self.df[field][0], str):
                temp = pd.get_dummies(self.df[field])
                self.df = pd.concat([self.df, temp], axis = 1)
                self.df.drop(columns = [field], inplace = True)
```

```{r}
py$explore
```

Example with class and instance variables
Now, let's look at another example. Below, we create a Python class in a file called "sample_class2.py" that has an instance variable (value) and a class variable (num).
# contents of sample_class2.py
class test:
    def __init__(self, value):
        self.value = value

    def class_var(self):
        test.num = 10

source_python("sample_class2.py")

check = test(5)
check$value

check$num # error because class_var hasn't been called yet

check$class_var()
check$num # works now
test$num
That's it for now! If you enjoyed this post, please follow my blog on Twitter.
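As a pure-Python sanity check of the behaviour above (no reticulate involved; the class is renamed Test here only to follow Python naming convention), the class variable num does not exist anywhere until class_var() has been called:

```python
# Pure-Python version of sample_class2.py's behaviour: `num` is created on
# the class only when class_var() runs, so accessing it earlier would fail.
class Test:
    def __init__(self, value):
        self.value = value          # instance variable, set per object

    def class_var(self):
        Test.num = 10               # class variable, shared by all instances

check = Test(5)
before = hasattr(check, "num")      # False: class_var() has not run yet
check.class_var()
after = (check.num, Test.num)       # now visible via the instance and the class
print(check.value, before, after)
```

This mirrors why `check$num` errors in the R session until `check$class_var()` has been called.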
To learn more about reticulate, check out its official documentation here.
The post How to import Python classes into R appeared first on Open Source Automation.
To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation.
On the conjugate function
[This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers.]
In the MAT7381 course (graduate course on regression models), we will talk about optimization, and a classical tool is the so-called conjugate. Given a function f:\mathbb{R}^p\to\mathbb{R}, its conjugate is the function f^{\star}:\mathbb{R}^p\to\mathbb{R} such that f^{\star}(\boldsymbol{y})=\max_{\boldsymbol{x}}\lbrace\boldsymbol{x}^\top\boldsymbol{y}-f(\boldsymbol{x})\rbrace. So, long story short, f^{\star}(\boldsymbol{y}) is the maximum gap between the linear function \boldsymbol{x}^\top\boldsymbol{y} and f(\boldsymbol{x}).
Just to visualize, consider a simple parabolic function (in dimension 1) f(x)=x^2/2, then f^{\star}(\color{blue}{2}) is the maximum gap between the line x\mapsto\color{blue}{2}x and function f(x).
x = seq(-100,100,length=6001)
f = function(x) x^2/2
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
vfstar = Vectorize(fstar)(x)
We can see it on the figure below.
viz = function(x0=1,YL=NA){
  idx=which(abs(x)<=3)
  par(mfrow=c(1,2))
  plot(x[idx],vf[idx],type="l",xlab="",ylab="",col="blue",lwd=2)
  abline(h=0,col="grey")
  abline(v=0,col="grey")
  idx2=which(x0*x>=vf)
  polygon(c(x[idx2],rev(x[idx2])),c(vf[idx2],rev(x0*x[idx2])),col=rgb(0,1,0,.3),border=NA)
  abline(a=0,b=x0,col="red")
  i=which.max(x0*x-vf)
  segments(x[i],x0*x[i],x[i],f(x[i]),lwd=3,col="red")
  if(is.na(YL)) YL=range(vfstar[idx])
  plot(x[idx],vfstar[idx],type="l",xlab="",ylab="",col="red",lwd=1,ylim=YL)
  abline(h=0,col="grey")
  abline(v=0,col="grey")
  segments(x0,0,x0,fstar(x0),lwd=3,col="red")
  points(x0,fstar(x0),pch=19,col="red")
}
viz(1)
or
viz(1.5)
In that case, we can actually compute f^{\star}, since f^{\star}(y)=\max_{x}\lbrace xy-f(x)\rbrace=\max_{x}\lbrace xy-x^2/2\rbrace. The first order condition is here x^{\star}=y, and thus f^{\star}(y)=x^{\star}y-(x^{\star})^2/2=y^2-y^2/2=y^2/2. And actually, that can be related to two results. The first one is to observe that f(\boldsymbol{x})=\|\boldsymbol{x}\|_2^2/2, and in that case f^{\star}(\boldsymbol{y})=\|\boldsymbol{y}\|_2^2/2, from the following general result: if f(\boldsymbol{x})=\|\boldsymbol{x}\|_p^p/p with p>1, where \|\cdot\|_p denotes the standard \ell_p norm, then f^{\star}(\boldsymbol{y})=\|\boldsymbol{y}\|_q^q/q where \frac{1}{p}+\frac{1}{q}=1. The second one is the conjugate of a quadratic function. More specifically, if f(\boldsymbol{x})=\boldsymbol{x}^{\top}\boldsymbol{Q}\boldsymbol{x}/2 for some positive definite matrix \boldsymbol{Q}, then f^{\star}(\boldsymbol{y})=\boldsymbol{y}^{\top}\boldsymbol{Q}^{-1}\boldsymbol{y}/2. In our case, it was a univariate problem with \boldsymbol{Q}=1.
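Both closed forms are easy to sanity-check numerically with the same grid idea as the R code (a plain-Python sketch; the grid bounds [-100, 100] and the test values of y are my own choices, wide enough for the maximizer to fall inside the grid):

```python
# Approximate f*(y) = max_x { x*y - f(x) } on a fine grid, then compare
# with the closed-form conjugates derived in the text.
n = 6001
xs = [-100 + 200 * k / (n - 1) for k in range(n)]

def conjugate(f, y):
    # grid approximation of the supremum defining the conjugate
    return max(y * x - f(x) for x in xs)

ys = (-2.0, 0.5, 1.5)

# f(x) = x^2/2  =>  f*(y) = y^2/2
err_quad = max(abs(conjugate(lambda x: x * x / 2, y) - y * y / 2) for y in ys)

# f(x) = |x|^p/p  =>  f*(y) = |y|^q/q  with 1/p + 1/q = 1
p, q = 3.0, 1.5
err_norm = max(abs(conjugate(lambda x: abs(x) ** p / p, y) - abs(y) ** q / q)
               for y in ys)
print(err_quad, err_norm)
```

With 6001 grid points, both errors come out well below 1e-3, consistent with the quadratic and \ell_p formulas.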
For the conjugate of the \ell_p norm, we can use the following code to visualize it
p = 3
f = function(x) abs(x)^p/p
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
vfstar = Vectorize(fstar)(x)
viz(1.5)
or
p = 1.1
f = function(x) abs(x)^p/p
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
vfstar = Vectorize(fstar)(x)
viz(1, YL=c(0,10))
Actually, in that case, we can almost visualize that if f(x)=|x|, then

f^{\star}(y)={\begin{cases}0,&\left|y\right|\leq 1\\+\infty,&\left|y\right|>1\end{cases}}
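This limit case follows directly from the definition: writing x=t\,\mathrm{sign}(y) with t\geq 0 (choosing x of the opposite sign can only decrease the objective), the objective becomes linear in t,

```latex
f^{\star}(y)=\sup_{x}\lbrace xy-|x|\rbrace
            =\sup_{t\geq 0}\, t\left(\left|y\right|-1\right)
            =\begin{cases}0,&\left|y\right|\leq 1\\+\infty,&\left|y\right|>1\end{cases}
```

so the supremum is attained at t=0 when \left|y\right|\leq 1 and is unbounded otherwise.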
To conclude, another popular case: if f(x)=\exp(x), then

f^{\star}(y)={\begin{cases}y\log(y)-y,&y>0\\0,&y=0\\+\infty,&y<0\end{cases}}

We can visualize that case below
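For the y>0 branch, the first order condition works exactly as in the quadratic case above:

```latex
f^{\star}(y)=\max_{x}\lbrace xy-e^{x}\rbrace,\qquad
\frac{\partial}{\partial x}\lbrace xy-e^{x}\rbrace=y-e^{x}=0
\quad\Longrightarrow\quad x^{\star}=\log(y)
```

and substituting back gives f^{\star}(y)=y\log(y)-e^{\log(y)}=y\log(y)-y.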
f = function(x) exp(x)
vf = Vectorize(f)(x)
fstar = function(y) max(y*x-vf)
vfstar = Vectorize(fstar)(x)
viz(1,YL=c(-3,3))
To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.
Le Monde puzzle [#1120]
[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].
A board game features as Le Monde's current weekly mathematical puzzle:
11 players sit in a circle and 365 tokens are initially owned by a single player. Players with at least two tokens can either remove one token and give another one to the left, or move two to the right and one to the left. How quickly does the game stall, how many tokens are left, and where are they?
Running an R simulation like

od=function(i) (i-1)%%11+1
muv<-function(bob){
  if (max(bob)>1){
    i=sample(rep((1:11)[bob>1],2),1)
    dud=c(0,-2,1)
    if((runif(1)<.5)&(bob[i]>2)) dud=c(2,-3,1)
    bob[c(od(i+10),i,od(i+1))]=bob[c(od(i+10),i,od(i+1))]+dud
  }
  bob}

always provides a solution
> bob
 [1] 1 0 1 1 0 1 1 0 1 0 0

with six ones at these locations. However, the time it takes to reach this frozen configuration varies, depending on the sequence of random choices.
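A minimal driver loop running the game to its frozen state could look like the sketch below; the od and muv functions are the ones above, while the seed, the starting position of the tokens, and the iteration cap are my own choices for reproducibility.

```r
od=function(i) (i-1)%%11+1         # circular indexing over the 11 seats
muv=function(bob){                  # one random move, as in the post
  if (max(bob)>1){
    i=sample(rep((1:11)[bob>1],2),1)  # rep(...,2) guards against length-one sample()
    dud=c(0,-2,1)                     # [left, self, right] token changes
    if((runif(1)<.5)&(bob[i]>2)) dud=c(2,-3,1)
    bob[c(od(i+10),i,od(i+1))]=bob[c(od(i+10),i,od(i+1))]+dud
  }
  bob
}
set.seed(42)                        # arbitrary seed (an assumption)
bob=c(365,rep(0,10))                # all tokens start with one player (an assumption)
steps=0
while(max(bob)>1 && steps<1e6){     # cap is a safety net, not part of the puzzle
  bob=muv(bob)
  steps=steps+1
}
bob                                 # frozen configuration: only 0s and 1s remain
```

Since the first move type destroys one token net and occurs with probability at least one half at every step, the loop freezes after a few hundred iterations in practice, well under the cap.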
To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.