hitting a wall
(This article was first published on R – Xi'an's Og, and kindly contributed to Rbloggers)
Once in a while, or a wee bit more frequently (!), it proves impossible to communicate with a contributor of a question on X validated. A recent instance was about simulating from a multivariate kernel density estimate where the kernel terms at x¹,x²,… are Gaussian kernels applied to the inverses of the norms xx¹, xx²,… rather than to the norms as in the usual formulation. The reason for using this type of kernel is unclear, as it certainly does not converge to an estimate of the density of the sample x¹,x²,… as the sample size grows, since it excludes a neighbourhood of each point in the sample. Since the kernel term tends to a nonzero constant at infinity, the support of the density estimate is restricted to the hypercube [0,1]x…x[0,1], again with unclear motivations. No mention being made of the bandwidth adopted for this kernel. If one takes this exotic density as a given, the question is rather straightforward as the support is compact, the density bounded and a vanilla acceptreject can be implemented. As illustrated by the massive number of comments on that entry, it did not work as the contributor adopted a fairly bellicose attitude about suggestions from moderators on that site and could not see the point in our requests for clarification, despite plotting a version of the kernel that had its maximum [and not its minimum] at x¹… After a few attempts, including writing a complete answer, from which the above graph is taken (based on an initial understanding of the support being for (xx¹), …), I gave up and deleted all my entries.On that question.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
R null values: NULL, NA, NaN, Inf
(This article was first published on R – TomazTsql, and kindly contributed to Rbloggers)
R language supports several nullable values and it is relatively important to understand how these values behave, when making data preprocessing and data munging.
In general, R supports:
 NULL
 NA
 NaN
 Inf / Inf
NULL is an object and is returned when an expression or function results in an undefined value. In R language, NULL (capital letters) is a reserved word and can also be the product of importing data with unknown data type.
NA is a logical constant of length 1 and is an indicator for a missing value.NA (capital letters) is a reserved word and can be coerced to any other data type vector (except raw) and can also be a product when importing data. NA and “NA” (as presented as string) are not interchangeable. NA stands for Not Available.
NaN stands for Not A Number and is a logical vector of a length 1 and applies to numerical values, as well as real and imaginary parts of complex values, but not to values of integer vector. NaN is a reserved word.
Inf and Inf stands for infinity (or negative infinity) and is a result of storing either a large number or a product that is a result of division by zero. Inf is a reserved word and is – in most cases – product of computations in R language and therefore very rarely a product of data import. Infinite also tells you that the value is not missing and a number!
All four null/missing data types have accompanying logical functions available in base R; returning the TRUE / FALSE for each of particular function: is.null(), is.na(), is.nan(), is.infinite().
General understanding of all values by simply using following code:
#reading documentation on all data types: ?NULL ?NA ?NaN ?Inf #populating variables a < "NA" b < "NULL" c < NULL d < NA e < NaN f < Inf ### Check if variables are same? identical(a,d) # [1] FALSE # NA and NaN are not identical identical(d,e) # [1] FALSE ###checking length of data types length(c) # [1] 0 length(d) # [1] 1 length(e) # [1] 1 length(f) # [1] 1 ###checking data types str(c); class(c); #NULL #[1] "NULL" str(d); class(d); #logi NA #[1] "logical" str(e); class(e); #num NaN #[1] "numeric" str(f); class(f); #num Inf #[1] "numeric" Getting data from RNullable data types can have a different behavior when propagated to e.g.: list or or vectors or data.frame types.
We can test this by creating NULL or NA or NaN vectors and dataframes and observe the behaviour:
#empty vectors for NULL, NA and NaN v1 < c(NULL, NULL, NULL) v2 < NULL str(v1); class(v1); mode(v1) str(v2); class(v2); mode(v2) v3 < c(NA, NA, NA) v4 < NA str(v3); class(v3); mode(v3) str(v4); class(v4); mode(v4) v5 < c(NaN, NaN, NaN) v6 < NaN str(v5); class(v5); mode(v5) str(v6); class(v6); mode(v6)Clearly, it is evident that the NULL vector will always be an empty one, regardless of the elements it can hold. With NA and NaN, it will be the length of the elements it holds, with a slight difference, that NA will be a vector of class Logical, whereas NaN will be a vector of class numeric.
NULL vector will not change the size but class when combined with a mathematical operation:
#operation on NULL Vector v1 < c(NULL, NULL, NULL) str(v1) # NULL v1 < v1+1 str(v1) # num(0)This will only change the class but not the length and still any of the data will not persist in the vector.
With data.frames it is relatively the same behavior.
#empty data.frame df1 < data.frame(v1=NA,v2=NA, v3=NA) df2 < data.frame(v1=NULL, v2=NULL, v3=NULL) df3 < data.frame(v1=NaN, v2=NaN, V3=NaN) str(df1); str(df2);str(df3)Dataframe consisting of NULL values for each of the column will presented as dataframe with 0 observations and 0 variables (0 columns and 0 rows). Dataframe with NA and NaN will be of 1 observation and 3 variables, of logical data type and of numerical data type, respectively.
When adding new observations to data frames, different behavior when dealing with NULL, NA or NaN.
Adding to “NA” data.frame:
# adding new rows to existing dataframe df1 < rbind(df1, data.frame(v1=1, v2=2,v3=3)) #explore data.frame df1it is clear that new row is added, and when adding a new row (vector) of different size, it will generate error, since the dataframe definitions holds the dimensions. Same behavior is expected when dealing with NaN value. On the other hand, different results when using NULL values:
#df2 will get the dimension definition df2 < rbind(df2, data.frame(v1=1, v2=2)) #this will generate error since the dimension definition is set df2 < rbind(df2, data.frame(v1=1, v2=NULL)) #and with NA should be fine df2 < rbind(df2, data.frame(v1=1, v2=NA))with first assignment, the df2 will get the dimension definition, albeit the first construction of df2 was a nullable vector with three elements.
NULLable is also a result when we are looking in the vector element that is not existent, due to the fact that is out of boundaries:
l < list(a=1:10, b=c("a","b","c"), c=seq(0,10,0.5)) l$a # [1] 1 2 3 4 5 6 7 8 9 10 l$c # [1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 l$r # NULLwe are calling the sublist r of list l, which is a NULL value, but is not missing or not existing, it is NULL, which in fact is rather contradictory, since the definition is not set. Different results (Not Available) would be returned when calling a vector element:
v < c(1:3) v[4] #[1] NABoundaries in list and in vector are defined differently for NA and NULL data types.
Getting data from SQL Server
I will use several different datatypes deriving from following SQL table.
USE AzureLanding; GO CREATE TABLE R_Nullables ( ID INT IDENTITY(1,1) NOT NULL ,num1 FLOAT ,num2 DECIMAL(20,10) ,num3 INT ,tex1 NVARCHAR(MAX) ,tex2 VARCHAR(100) ,bin1 VARBINARY(MAX) ) INSERT INTO R_Nullables SELECT 1.22, 21.535, 245, 'This is Nvarchar text','Varchar text',0x0342342 UNION ALL SELECT 3.4534, 25.5, 45, 'This another Nvarchar text','Varchar text 2',0x03423e3434tg UNION ALL SELECT NULL, NULL, NULL, NULL,NULL,NULL UNION ALL SELECT 0, 0, 0, '','',0xBy using RODBC R library, data will be imported in R environment:
library(RODBC) SQLcon < odbcDriverConnect('driver={SQL Server};server=TOMAZK\\MSSQLSERVER2017;database=AzureLanding;trusted_connection=true') # df < sqlQuery(SQLcon, "SELECT * FROM R_Nullables") df < sqlQuery(SQLcon, "SELECT ID ,num1 ,num2 ,num3 ,tex1 ,tex2 FROM R_Nullables") close(SQLcon)When SELECT * query is executed, the varbinary data type from SQL Server is represented as 2GiB binary object in R and most likely, you will receive an error, because R will not be able to allocate memory:
After altering the columns, the df object will be created. The presentation is straightforward, yet somehow puzzling:
ID num1 num2 num3 tex1 tex2 1 1 1.2200 21.535 245 This is Nvarchar text Varchar text 2 2 3.4534 25.500 45 This another Nvarchar text Varchar text 2 3 3 NA NA NA 4 4 0.0000 0.000 0When put sidebyside the output from SQL Server and output in R, there are some differences:
What is presented in SQL Server as NULL value, it is represented in R as NA; which is a logical type, but not the real NA. And only the is logical object, that is the Not Available information. So this means, that handling NA is not only about the “Not Available” but also the type of “Not Available” information and each of these needs special attention, otherwise when doing some calculations or functions, coerce error will be constantly emerging.
Data imported using SQL Server can be used as normal dataset imported in R in any other way:
#making some elementary calculations df$num1 * 2 # [1] 2.4400 6.9068 NA 0.0000 is.na(df$num1) # [1] FALSE FALSE TRUE FALSESame logic applied to text1 and text2 fields. Both are factors, but NULL or NA values can be treated respectively.
# Text df$text2 # NULL df$text1 # NULLThis is rather unexpected, since the SQL Server data types again are not working for R environment. So changing the original SQL query to cast all the values:
df < sqlQuery(SQLcon, "SELECT ID ,num1 ,num2 ,num3 ,CAST(tex1 AS VARCHAR(100)) as text1 ,CAST(tex2 AS VARCHAR(100)) as text2 FROM R_Nullables")and rerunning the df population, the result of df$text1 will be:
[1] This is Nvarchar text This another Nvarchar text Levels: This another Nvarchar text This is Nvarchar text Getting data from TXT / CSV filesI have created a sample txt/csv file that will be imported into R by executing:
setwd("C:\\Users\\Tomaz\\") dft < read.csv("import_txt_R.txt") dftSidebyside; R and CSV file will show that data types are handled great:
but only on the first glance. Let’s check the last observation by examing the tpye:
is.na(dft[5,]) # text1 text2 val1 val2 #5 TRUE FALSE FALSE FALSEThis is the problem, as the factor and the values of each, will be treated differently, although both are same type, but one is real NA, and the other is not.
identical(class(dft[5,2]),class(dft[5,1])) # [1] TRUEBefore going on next voyage, make sure you check all the data types and values.
As always, code is available at Github. Happy coding!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
The Devil is in the Data is moving
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
Dear readers,
I have been consolidating my online presence and the Devil is in the Data blog has moved to the Lucid Manager website.
This will be the last post on this domain and I hope to see you again on The Lucid Manager.
All existing subscriptions and RSS will be transferred to the new website and all pages on this site will be redirected to lucidmanager.org.
Peter
The post The Devil is in the Data is moving appeared first on The Devil is in the Data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Rough looking figures from R
(This article was first published on R – Insights of a PhD, and kindly contributed to Rbloggers)
A recent blog post regarding data visualization had some barplots I liked the look of (aesthetically…for research purposes, they wouldn’t be suitable). They look as if they’ve be coloured in with a pencil, rather than having solid blocks of colour… I wondered whether it’s possible with R, and indeed it is. There’s a github project called ggrough that interacts with ggplot2.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Visualize the World Cup with R! Part 1: Recreating Goals with ggsoccer and ggplot2
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
After posting a couple of my World Cup viz on Twitter, I thought I'll collate some of them into a blog post. This will be Part 1 of a series as the World Cup goes on and I keep improving my viz skills throughout the tournament. I will also explain how I made improvements in each new plot, practice makes perfect!
Let's look at some of the packages I will use!
library(ggplot2) # plotting on top of ggsoccer library(ggsoccer) # create soccer pitch overlay library(dplyr) # data manipulation library(purrr) # create multiple dataframes for tweenr library(tweenr) # build frames for animation library(gganimate) # animate plots library(extrafont) # insert custom fonts into plots library(ggimage) # insert images and emoji into plotsThe important package here is the ggsoccer package made by Ben Torvaney, check out the GitHub repo here.
Showing is better than telling in this instance so let's take a look at the pitch:
library(ggplot2) library(ggsoccer) data < data.frame(x = 1, y = 1) ggplot(data) + annotate_pitch() + theme_pitch() + coord_flip(xlim = c(49, 101), ylim = c(1, 101))Basically, annotate_pitch() creates the markings for the soccer field such as the center circle, 18yard box, penalty spot, etc. while theme_pitch() erases the extraneous axes and background from the default ggplot style. By using the limits arguments in coord_flip(), we can focus on a certain area of the pitch and orient it in a way that we want, as I want to recreate goals I'm going to show only one half of the field and orient the view to face the goal. With this as the base, we can now input positional data and then use a combination of geom_segment() and geom_curve() to show the path of the ball and the players!
The only problem with doing this is manually creating the data points. This is more a problem of access to the data rather than availability as sports analytics firms, most notably Opta, generate a huge amount of data for every player in every match, however it is not easy for a regular guy like me to buy it.
Some people have managed to create some nice heatmaps by scraping WhoScored.com and other sites (that create their viz from purchased data from Opta) with RSelenium or some other JS scrapers but that was a bit out of my expertise so I resorted to creating the coordinate positions by hand. Thankfully, due to the plotting system in ggsoccer and ggplot2, it's very easy to figure out the positions on the soccer field plot and with a little bit of practice it doesn't take too much time.
To save space I don't show the data frames with the coordinate points and labelling data for all of the graphics, however you can find all of them here in the GitHub repo!
Gazinsky Scores The First Goal! ggplot(pass_data) + annotate_pitch() + theme_pitch() + theme(text = element_text(family = "Trebuchet MS")) + coord_flip(xlim = c(49, 101), ylim = c(1, 101)) + geom_segment(aes(x = x, y = y, xend = x2, yend = y2), arrow = arrow(length = unit(0.25, "cm"), type = "closed")) + geom_segment(data = ball_data, aes(x = x, y = y, xend = x2, yend = y2), linetype = "dashed", size = 0.85, color = c("black", "red")) + geom_segment(data = movement_data, aes(x = x, y = y, xend = x2, yend = y2), linetype = "dashed", size = 1.2, color = "darkgreen") + geom_curve(data = curve_data, aes(x = x, y = y, xend = x2, yend = y2), curvature = 0.25, arrow = arrow(length = unit(0.25, "cm"), type = "closed")) + geom_image(data = goal_img, aes(x = x, y = y, image = image), size = 0.035) + ggtitle(label = "Russia (5) vs. (0) Saudi Arabia", subtitle = "First goal, Yuri Gazinsky (12th Minute)") + labs(caption = "By Ryo Nakagawara (@R_by_Ryo)") + geom_label(data = label_data, aes(x = x, y = y, label = label, hjust = hjust, vjust = vjust)) + annotate("text", x = 69, y = 65, family = "Trebuchet MS", label = "After a poor corner kick clearance\n from Saudi Arabia, Golovin picks up the loose ball, \n exchanges a giveandgo pass with Zhirkov\n before finding Gazinsky with a beautiful cross!")Not bad for a first try. Let's take a closer look at how I plotted the soccer ball image into the plot.
goal_img < data.frame(x = 100, y = 47) %>% mutate(image = "https://d30y9cdsu7xlg0.cloudfront.net/png/43563200.png") ## ggplot2 code ## geom_image(data = goal_img, aes(x = x, y = y, image = image), size = 0.035) ## ggplot2 code ##I used the ggimage package to be able to create a geom layer for an image. I created a column called image in a dataframe with the URL link to the soccer ball image I wanted and then in the geom_image() function I specified it in the image argument.
Cristiano's Hattrick!In my excitement after seeing Portugal vs. Spain, a candidate for match of the tournament for the group stages if not for the whole tournament, I drew up Cristiano Ronaldo's hattrick!
ggplot(goals_data) + annotate_pitch() + theme_pitch() + theme(text = element_text(family = "Dusha V5"), legend.position = "none") + coord_flip(xlim = c(55, 112), ylim = c(1, 101)) + geom_segment(x = 80, y = 48, xend = 97, yend = 48) + # 2nd geom_segment(x = 97, y = 48, xend = 100, yend = 45.5, arrow = arrow(length = unit(0.25, "cm"), type = "closed")) + # degea fumble geom_curve(data = curve_data, aes(x = x, y = y, xend = xend, yend = yend), # FREEKICK curvature = 0.3, arrow = arrow(length = unit(0.25, "cm"), type = "closed")) + geom_text(data = annotation_data, family = "Dusha V5", aes(x = x, y = y, hjust = hjust, label = label), size = c(6.5, 4.5, 3, 3.5, 3.5, 3.5)) + geom_flag(data = flag_data, aes(x = x, y = y, image = image), size = c(0.08, 0.08)) + # Portugal + Spain Flag ggimage::geom_emoji(aes(x = 105, y = c(45, 50, 55)), image = "26bd", size = 0.035) + geom_point(aes(x = x, y = y), shape = 21, size = 7, color = "black", fill = "white") + geom_text(aes(x = x, y = y, label = label, family = "Dusha V5"))Compared to the first plot, I increased the xaxis limit so that we could place our geom_text() annotations and flag images together without having to use grobs. This also meant that we put the plot title and subtitle in the geom_text() rather than in the labs() function, which let all the text/label data be organized in one dataframe, annotation_data.
annotation_data < data.frame( hjust = c(0.5, 0.5, 0.5, 0, 0, 0), label = c("Portugal (3) vs. Spain (3)", "Cristiano's Hattrick (4', 44', 88')", "by Ryo Nakagawara (@R_by_Ryo)", "1. Fouled by Nacho in the box,\nCristiano confidently strokes the ball\ninto the right corner from the spot.", "2. Guedes lays it off to Cristiano whose\nstrong shot is uncharacteristically\nfumbled by De Gea into the net.", "In the final minutes of the game,\nCristiano wins a freekick against Pique\nand curls it beautifully over the wall."), x = c(110, 105, 53, 76, 66, 66), y = c(30, 20, 85, 5, 5, 55) )Overall, it's a slightly hacky solution to include a lot of blank spaces between the country name and the score to put the flags in between them, but I don't know of any geoms that can incorporate both text and images at the same time so the hacky solution will do!
To show the flags I use the geom_flag() function from the ggimage package. The function requires you to pass a twodigit ISO code in the image argument for the flags of the countries you want. You can find the ISO codes for countries with a quick google search, Portugal is “PT” and Spain is “ES”.
flag_data < data.frame( image = c("PT", "ES"), x = c(110, 110), y = c(19.1, 50.1) ) ## ggplot2 code ## geom_flag(data = flag_data, aes(x = x, y = y, image = image, size = size)) ## ggplot2 code ##Some other options to do this include using the ggflags package or if you don't like the flags used in geom_flag(), pass an image of a flag of your choosing to geom_image().
There is actually a better way to search for the ISO codes which I will show later!
This time, instead of the soccer ball image, I used the emoji_search() function from the emoGG package to find a soccer ball emoji. Then I can use either emoGG or ggimage's geom_emoji() function to insert it into my ggplot!
library(emoGG) library(ggimage) emoji_search("soccer") # "26bd" ## emoji code keyword ## 2537 soccer 26bd sports ## 2538 soccer 26bd football ## 4130 o 2b55 circle ## 4131 o 2b55 round ## 5234 eritrea 1f1ea\\U0001f1f7 er ## 5956 somalia 1f1f8\\U0001f1f4 so ## ggplot2 code ## ggimage::geom_emoji(aes(x = 105, y = c(45, 50, 55)), image = "26bd", size = 0.035) ## ggplot2 code ##From now on, instead of the soccer ball image in the first graphic, I will be using the emoji version!
The official World Cup font, “Dusha”, was created by a Portugese design agency back in 2014 and has been used in all official World Cup prints and graphics. Some of the letters may look a bit squished but overall I quite like it, so I wanted to incorporate it in my plots. To do so you need to download the .TTF file from here, then rightclick and install it. Now, we need to make sure R can use it, this can be done by using the extrafont package!
font_import() # import font files in your computer font_install() # install any new font files added to your computer loadfonts() # run every new session once! fonts() # to check out what fonts are ready for use in R!For more details check out the package README here. Again, remember to run loadfont() everytime you open up a new session!
Osako's Winner vs. ColombiaI wasn't expecting much from Japan's World Cup journey this time around due to our poor performances in the friendlies (besides the Paraguay game) and the fact that we changed our manager in April! However, with a historic win (our first against South American opposition in the World Cup), I couldn't resist making another R graphic:
library(ggplot2) library(dplyr) library(ggsoccer) library(extrafont) library(ggimage) library(countrycode) cornerkick_data < data.frame(x = 99, y = 0.3, x2 = 94, y2 = 47) osako_gol < data.frame(x = 94, y = 49, x2 = 100, y2 = 55.5) player_label < data.frame(x = c(92, 99), y = c(49, 2)) annotation_data < data.frame( x = c(110, 105, 70, 92, 53), y = c(30, 30, 45, 81, 85), hjust = c(0.5, 0.5, 0.5, 0.5, 0.5), label = c("Japan (2) vs. Colombia (1)", "Kagawa (PEN 6'), Quintero (39'), Osako (73')", "Japan press their man advantage, substitute Honda\ndelivers a delicious corner kick for Osako to (somehow) tower over\nColombia's defense and flick a header into the far corner!", "Bonus: Ospina looking confused and\ndoing a lil' twosteporgodknowswhat.", "by Ryo Nakagawara (@R_by_Ryo)") ) flag_data < data.frame( x = c(110, 110), y = c(13, 53), team = c("japan", "colombia") ) %>% mutate( image = team %>% countrycode(., origin = "country.name", destination = "iso2c") ) %>% select(team) wc_logo < data.frame(x = 107, y = 85) %>% mutate(image = "https://upload.wikimedia.org/wikipedia/en/thumb/6/67/2018_FIFA_World_Cup.svg/1200px2018_FIFA_World_Cup.svg.png") ggplot(osako_gol) + annotate_pitch() + theme_pitch() + theme(text = element_text(family = "Dusha V5"), plot.margin=grid::unit(c(0,0,0,0), "mm")) + coord_flip(xlim = c(55, 112), ylim = c(1, 101)) + geom_curve(data = cornerkick_data, aes(x = x, y = y, xend = x2, yend = y2), curvature = 0.15, arrow = arrow(length = unit(0.25, "cm"), type = "closed")) + geom_segment(aes(x = x, y = y, xend = x2, yend = y2), arrow = arrow(length = unit(0.25, "cm"), type = "closed")) + geom_label(data = player_label, aes(x = x, y = y), label = c("Osako", "Honda"), family = "Dusha V5") + geom_point(aes(x = 98, y = 50), size = 3, color = "green") + geom_text(aes(x = 99.7, y = 50), size = 5, label = "???", family = "Dusha V5") + geom_text(data = annotation_data, family = "Dusha V5", aes(x = x, y = y, hjust = hjust, label = label), size = c(6.5, 4.5, 4, 3.5, 3)) + ggimage::geom_flag(data = flag_data, aes(x = x, y = y, image = image), size = c(0.08, 0.08)) + ggimage::geom_emoji(aes(x = 95, y = 50), image = "26bd", size = 0.035) + geom_image(data = wc_logo, aes(x = x, y = y, image = image), size = 0.17)I could have used the annotate() function to add the little comment about Ospina being stuck in noman'sland but I prefer to have all of my text in a single dataframe. Like before, I again had to expand the xaxis limits in the coord_flip(). This is also so we can insert the World Cup image on the top right without using grobs/Magick and such. To grab that World Cup logo, we do the same things as we did when we added the soccer ball image in the first plot with ggimage.
For finding the ISO codes to input for the geom_flag() function we can do one better than previous attempts by using the countrycode package to find ISO codes without having to manually search online!
By passing country names into countrycode() function and labelling them as “country.name” in the origin argument, the function will know that the input is the regular name for a country. Then you specify the output such as “iso2c” for the twodigit ISO codes such as in our case, “wb” for World Bank codes, “eurostat.name” for country names in the Eurostat database and so on…!
library(countrycode) flag_data < data.frame( x = c(110, 110), y = c(13, 53), team = c("japan", "colombia") ) %>% mutate( image = team %>% countrycode(., origin = "country.name", destination = "iso2c") ) %>% select(team) glimpse(flag_data) ## Observations: 2 ## Variables: 3 ## $ x 110, 110 ## $ y 13, 53 ## $ image "JP", "CO"Although the ISO codes are pretty intuitive for countries like Japan and Colombia, when you're dealing with lots of countries like at the World Cup or the Olympics, having a reproducible workflow for this is very helpful!
In a future part (not necessarily the next part), I want to animate some of these goal graphics using the great gganimate and tweenr packages. I've been slowly working my way through them in the past week so here is a preview:
I'll only show a gganimate version of Gazinsky's goal for now as I'm still figuring out how to interpolate multiple moving objects (the ball and the players) as well as making the green movement lines disappear after the player finished moving.
For Osako's goal, here's a preview of the tweenr version. Working on this was much easier as I had made it so that the only moving bit to interpolate was the path of the ball.
I've been playing around a lot with the different easing functions using this website as a reference but it still doesn't feel 100% right… For this one I used “quadraticout”. I want to make sure that the ball doesn't completely come to a full stop when it reaches Osako but most keep doing that.
These goals are just from the first week of the World Cup and if I had the time I would do more as there have been some fantastic individual and team goals so far!
With the Group Stages done, I am looking forward to an even more exciting Knockout Stage, good luck to all of your favorite teams!
Part 2 will be coming soon!
Related Post
 Creating Slopegraphs with R
 How to use paletteR to automagically build palettes from pictures
 Visualize Market Basket analysis in R
 LatticeLike Forest Plot using ggplot2 in R
 How does visualization in Plotly differ from Seaborn
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Local GoodnessofFit Plots / Wangkardu Explanations – a new DALEX companion
(This article was first published on English – SmarterPoland.pl, and kindly contributed to Rbloggers)
The next DALEX workshop will take place in 4 days at UseR. In the meantime I am working on a new explainer for a single observation.
Something like a diagnostic plot for a single observation. Something that extends Ceteris Paribus Plots. Something similar to Individual Conditional Expectation (ICE) Plots. An experimental version is implemented in ceterisParibus package.
Intro
For a single observation, Ceteris Paribus Plots (WhatIf plots) show how predictions for a model change along a single variable. But they do not tell if the model is well fitted around this observation.
Here is an idea how to fix this:
(1) Take N points from validation dataset, points that are closest to a selected observation (Gower distance is used by default).
(2) Plot N Ceteris Paribus Plots for these points,
(3) Since we know the true y for these points, then we can plot model residuals in these points.
Examples
Here we have an example for a random forest model. The validation dataset has 9000 observations. We use N=18 observations closest to the observation of interest to show the model stability and the local goodnessoffit.
The empty circle in the middle stands for the observation of interest. We may read its surface component (OX axis, around 85 sq meters), and the model prediction (OY axis, around 3260 EUR).
The thick line stands for Ceteris Paribus Plot for the observation of interest.
Grey points stands for 18 closest observations from the validation dataset while grey lines are their Ceteris Paribus Plots.
Red and blue lines stand for residuals for these neighbours. Corresponding true values of y are marked with red and blue circles.
Red and blue intervals are short and symmetric so one may say that the model is well fitted around the observation of interest.
Let’s see an another example for the same model but a different point prediction. This time for a large apartment.
Residuals are larger, all negative (true values are higher than predictions). The local goodnessoffit for the model is pretty bad. Other diagnostic tools (see auditor) show that for this dataset the random forest model is heavily biased for expensive apartments.
Wangkardu
Side story: In AGNSW I found a painting named Wangkardu that looks (to me) like above graphs. If you want to use its colors, just add palette = „wangkardu” to the plot function (R code here).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Data Summary in One Go
(This article was first published on Coastal Econometrician Views, and kindly contributed to Rbloggers)
This function and package is long pending for publishing from my side, this time expecting soon to put as package for quick usage, before that thought releasing it for feedback.
Below function provides R code for getting data description details like missing, distinct, min, max, mean, median, mode in one go for ready to use and for quick interpretation purposes.
This provide regular data summary stats needed (as shown in below image) in *.csv format which can be copied an pasted to excel as per your needs.
To use it follow the syntax:
source(“https://raw.githubusercontent.com/pradeepmav/data_description_function/master/data_description.R”)
data_description(“datasetname”)
Happy R Programming!
Author trains & develops Machine Learning (AI) applications, and can be reached at info@tatvaai.com or besteconometrician@gmail.com for more details.
Find more about author at http://in.linkedin.com/in/pradeepmavuluri
To leave a comment for the author, please follow the link and comment on their blog: Coastal Econometrician Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Tidily evaluated ggplot2
(This article was first published on HighlandR, and kindly contributed to Rbloggers)
Test driving the new release of ggplot2 –
A new release of ggplot2Now that tidy evaluation is baked into ggplot2, as of TODAY, let’s take it for a spin:
ggplot2 3.0.0 %>%
create function %>%
test function %>%
end
Let’s test it:
gtest(newdata,date,value,key)End
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: HighlandR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
anytime 0.3.1
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new minor release of the anytime package is now on CRAN. This is the twelveth release, and the first in a little over a year as the package has stabilized.
anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects – and to do so without requiring a format string. See the anytime page, or the GitHub README.md for a few examples.
This release adds a few minor tweaks. For numeric input, the function is now immutable: arguments that are not already cast to a different type (a common use case) are now cloned so that the input is never changed. We also added two assertion helpers for dates and datetimes, a new formatting helper for the (arguably awful, but common) ‘yyyymmdd’ format, and expanded some unit tests.
Changes in anytime version 0.3.1 (20170605)
Numeric input is now preserved rather than silently cast to the return object type (#69 fixing #68).

New assertion function assertDate() and assertTime().

Unit tests were expanded for the new functions, for conversion from integer as well as for yyyymmdd().
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page.
For questions or comments use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Marginal Effects for Regression Models in R #rstats #dataviz
(This article was first published on R – Strenge Jacke!, and kindly contributed to Rbloggers)
Regression coefficients are typically presented as tables that are easy to understand. Sometimes, estimates are difficult to interpret. This is especially true for interaction or transformed terms (quadratic or cubic terms, polynomials, splines), in particular for more complex models. In such cases, coefficients are no longer interpretable in a direct way and marginal effects are far easier to understand. Specifically, the visualization of marginal effects makes it possible to intuitively get the idea of how predictors and outcome are associated, even for complex models.
The ggeffectspackage (Lüdecke 2018) aims at easily calculating marginal effects for a broad range of different regression models, beginning with classical models fitted with lm() or glm() to complex mixed models fitted with lme4 and glmmTMB or even Bayesian models from brms and rstanarm. The goal of the ggeffectspackage is to provide a simple, userfriendly interface to calculate marginal effects, which is mainly achieved by one function: ggpredict(). Independent from the type of regression model, the output is always the same, a data frame with a consistent structure.
The idea behind this function is to compute (and visualize) the relationship between a model predictor (independent variable) and the model response (dependent variable). The predictor of interest needs to be specified in the termsargument.
data(mtcars) m < lm(mpg ~ hp + wt + cyl + am, data = mtcars) ggpredict(m, "cyl") #> # A tibble: 3 x 5 #> x predicted conf.low conf.high group #> #> 1 4 21.7 19.1 24.4 1 #> 2 6 20.2 19.3 21.1 1 #> 3 8 18.7 16.5 21.0 1The relationship can be differentiated depending on further predictors, which is useful e.g. for interaction terms. Up to two further predictors that indicate the „grouping“ structure can be used to calculate marginal effects. The names of these predictors need to be passed as character vector to ggpredict().
m < lm(mpg ~ wt * cyl + am + wt + cyl, data = mtcars) p < ggpredict(m, c("wt", "cyl")) p #> # A tibble: 27 x 5 #> x predicted conf.low conf.high group #> #> 1 1.5 31.7 28.5 34.9 4 #> 2 1.5 26.3 23.5 29.2 6 #> 3 1.5 21.0 16.8 25.1 8 #> 4 2 28.7 26.6 30.8 4 #> 5 2 24.2 22.2 26.3 6 #> 6 2 19.8 16.4 23.1 8 #> 7 2.5 25.7 24.3 27.2 4 #> 8 2.5 22.1 20.7 23.5 6 #> 9 2.5 18.5 15.9 21.2 8 #> 10 3 22.8 20.8 24.7 4 #> # ... with 17 more rowsThere’s a plot()method, based on ggplot2:
plot(p)The simple approach of ggpredict() can be used for all supported regression models. Thus, to calculate marginal effects with ggpredict(), it makes no differences if the model is a simpel linear model or a negative biniomial multilevel model or a cumulative link model etc. In case of cumulative link models, ggpredict() automatically takes care of proper grouping, in this case for the different levels of the response variable:
library(MASS) library(ordinal) data(housing) m < clm(Sat ~ Type * Cont + Infl, weights = Freq, data = housing) p < ggpredict(m, c("Cont", "Type")) plot(p)ggeffects also allows easily calculating marginal effects at specific levels of other predictors. This is particularly useful for interaction effects with continuous variables. In the following example, both variables of the interaction term have a larger range of values, which obscure the moderating effect:
m < lm(mpg ~ wt * hp + am + wt, data = mtcars) p < ggpredict(m, c("hp", "wt")) plot(p)However, you can directly specify certain values, at which marginal effects should be calculated, or use „shortcuts“ that compute convenient values, like mean +/ 1 SD etc.
p < ggpredict(m, c("hp", "wt [meansd]")) plot(p)The latest update of ggeffects on CRAN introduced some new features. There is a dedicated website that describes all the details of this package, including some vignettes with lots of examples.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Strenge Jacke!. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Presenting survey data
(This article was first published on R – scottishsnow, and kindly contributed to Rbloggers)
I work in a social science and economics department and a lot of our work is based on surveys of opinion. One question style we use a lot is likert, I’ve tried a few ways to present these results. My current favourite is a stacked bar graph, ordered by agreement.
I’ve created an example of this using data from a political blog. I was struggling to interpret the survey results in their table form, so spent a few minutes cleaning them in a spreadsheet and made the plots below. I don’t think the results entirely support the headline, but that’s news and politics for you!
The question asked was: In your personal experience/opinion, how would you rate the overall state of the following public services?
The code to produce this is below. Note the first mutate call is reordering the factor to ensure it appears correctly and not alphabetically.
library(tidyverse) library(RColorBrewer) df = read_csv("Downloads/voters.csv") %>% gather(Impression, Value, Voters, Area) %>% mutate(Impression=factor(Impression, levels=c("Good", "Neither", "Bad"))) ggplot(df, aes(Voters, Value, fill=Impression)) + geom_col(position="fill") + coord_flip() + facet_wrap(~Area) + scale_y_continuous(labels=scales::percent) + scale_fill_brewer(type="div") + labs(x="", y="Voters")I’ve also experimented with displaying proportion data as heat maps, but feel there is too much work for the reader to interpret the results. Here’s an example, with code below.
ggplot(df, aes(Voters, Area, fill=Value)) + geom_raster() + facet_wrap(~Impression) + scale_fill_distiller(palette="Greens", direction=1) + labs(x="", y="", fill="% of\nvoters") var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – scottishsnow. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Predict Customer Churn with Gradient Boosting
(This article was first published on R – Displayr, and kindly contributed to Rbloggers)
Customer churn is a key predictor of the long term success or failure of a business. But when it comes to all this data, what’s the best model to use? This post shows that gradient boosting is the most accurate way of predicting customer attrition. I’ll show you how you can create your own data analysis using gradient boosting to identify and save those at risk customers!
Why predict customer churn?Customer retention should be a top priority of any business as acquiring new customers is often far more expensive that keeping existing ones. It is no longer a given that long standing customers will remain loyal given the numerous options in the market. Therefore, it is vital that companies can proactively determine the customers most at risk of leaving and take preventative measures against this.
Predictive models for customer churn can show the overall rate of attrition, while knowledge of how the churn rate varies over a period of time, customer cohort, product lines and other changes can provide numerous valuable insights. Yet, customers also vary enormously in their behaviors and preferences which means that applying a simple “rule of thumb” analysis will not work. Here’s where a predictive model using gradient boosting can help you.
First, I’m going to describe the data. Then I’ll use gradient boosting to predict who will churn and who will stay. Finally I’ll benchmark my result against other models.
The dataI’ll aim to predict Churn, a binary variable indicating whether a customer of a telecoms company left in the last month or not.
To do this I’ll use 19 variables including:
 Length of tenure in months.
 Types of services signed up for such as phone, internet and movie streaming.
 Demographic information.
 Monthly charges, type of contract and billing.
The full data set is available here.
The breakdown of Churn is shown below. If we predict No (a customer will not churn) for every case, we can establish a baseline. Our baseline establishes that 73% is the minimum accuracy that we should improve on.
Gradient boostingIn this earlier post I explained gradient boosting. Gradient boosting sits alongside regression, decision trees, support vector machines and random forests. They are all supervised learning algorithms capable of fitting a model to train data and make predictions.
A common strategy when working with any of these models is to split the data into a training sample and a testing sample. The model learns the associations between the predictor variables and the target outcome from the training sample. The testing sample is used to provide an unbiased estimate of the prediction accuracy on unseen data.
I randomly split the data into a a 70% training sample and a 30% testing sample. I then perform gradient boosting with an underlying tree model. The chart below shows the 10 most important variables. We learn that having a monthly contract, length of tenure and amount of charges are useful predictors of churn.
How accurately can we predict customer churn with gradient boosting?Gradient boosting has various internal parameters know generically as hyperparameters. These settings determine the size of the underlying trees and the impact that each round of boosting has on the overall model. It can be time consuming to explore all of the possibilities to find the best values. To create the model below I automatically performed a grid search of 36 different combinations of hyperparameters. I selected the best set by 5fold cross validation.
We’ve already established that our baseline when always predicting that a customer will not churn is 73%. This amounts to a shot in the dark when trying to determine whether or not a customer will churn. Not great, right? However, when we can input more information like different variables such as a person’s bills, their contract length or tenure and their comfort level with the technology, we can learn much more about this customer.
From this information, we can more accurately pinpoint who will churn and our prediction accuracy rises by 8% to 80.87%. This gives us a much greater edge in being able to identify the factors that may lead to customer attrition and more of an edge when it comes to the crucial business of customer retention!
Why choose gradient boosting over other models?In the same way that I just fitted a gradient boosting model, we can fit other models. I tried 3 other approaches. Each time I followed the same procedure as above, selecting the same variables, fitting with the training sample and calculating accuracy from the testing sample. The results are:
Model Accuracy Gradient Boosted Tree 80.87% CART 79.21% Random Forest 79.94% Linear Discriminant Analysis 79.97%Whilst this is not a comprehensive comparison, gradient boosting performs the best amongst these models with the highest accuracy score.
TRY IT OUT
The analysis in this post was performed in Displayr using R. The flipMultivariates package, which uses the xgboost package, performs the machine learning calculations. You can try this analysis for yourself in Displayr.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A Comparative Review of the Rattle GUI for R
(This article was first published on R – r4stats.com, and kindly contributed to Rbloggers)
IntroductionRattle is a popular free and open source Graphical User Interface (GUI) for the R software, one that focuses on beginners looking to pointandclick their way through data mining tasks. Such tasks are also referred to as machine learning or predictive analytics. Rattle’s name is an acronym for “R Analytical Tool To Learn Easily.” Rattle is available on Windows, Mac, and Linux systems.
This post is one of a series of reviews which aim to help nonprogrammers choose the GUI that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.
Terminology
There are various definitions of user interface types, so here’s how I’ll be using these terms:
GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.
IDE = Integrated Development Environment which helps programmers write code. I do not include pointandclick style menus and dialog boxes when using this term. IDE usersare people who prefer to write R code to perform their analyses.
Installation
The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others install in multiple steps, such as R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The Help Desks at most universities are flooded with such calls at the beginning of each semester!
The steps to install Rattle are:
 Install R
 In R, install the toolkit that Rattle is written in by executing the command: install.packages(“RGtk2”)
 Also in R, install Rattle itself by executing the command:
install.packages(“rattle”, dependencies=TRUE)
The very latest development version is available here.
Note that while Rattle’s name is capitalized, the name of the rattle package is spelled in all lowercase letters!  If you wish to take advantage of interactive visualization (highly recommended) then install the GGobi software from: http://www.ggobi.org/downloads/.
Plugin Modules
When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes builtin, it’s good to know how active the development community is. They contribute “plugins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).
Rattle’s complete capability was designed and programmed by Graham Williams of Togaware. As a result, it doesn’t have plugins, but it does include a comprehensive set of data mining tools.
Startup
Some user interfaces for R, such as jamovi, start by doubleclicking on a single icon, which is great for people who prefer to not write code. Others, such as R commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.
Rattle is run as a part of R itself, so the steps to start it begin with starting R:
 Start R.
 Load Rattle from your library by executing the command: library(“rattle”)
 Start Rattle by executing the command: rattle()
Data Editor
A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.
Rattle’s data editor is unique for a GUI in that it does not offer a way to create a data set. It lets you edit any data set you open using R’s builtin edit function, but that function offers very few features. Clicking on a variable name will cause a dialog to open, offering to change the variable’s name or type as numeric or character (see Figure 2). Rattle automatically converts variables that have fewer than 10 values into “categorical” ones. R would call these factors. You can always recode variables from numeric to categorical (or vice versa) in the “Transform” tab (see Data Management section).
Data Import
Since R GUIs are using R to do the work behind the scenes, they often include the ability to read a wide range of files, including SAS, SPSS, and Stata. Some, like BlueSky Statistics, also include the ability to read directly from SQL databases. Of course you can always use R code to import data from any source and then continue to analyze it using any GUI, but the point of GUIs is to avoid programming.
Rattle skips many common statistical data formats, but it includes a couple exclusive ones, such as the AttributeRelation File Format used by other data mining tools. It also includes “corpus” which reads in text documents, and it then it performs the popular tfidfcalculation to prepare them for analysis using the other numericallybased analysis methods.
On its “Data” tab, Rattle offers several formats:
 File: CSV
 File: TXT
 File: Excel
 AttributeRelation File Format (ARFF)
 Open Database Connectivity (ODBC)
 R Dataset
 RData File
 Library
 Corpus (for text analysis)
 Script
Data Management
It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard handle only a few of these functions. Others, such as BlueSky Statistics or the R Commander can handle all, or nearly all, of these tasks.
Rattle provides minimal data management tools. Its designer chose to focus on reading a single data set, and making transformations that are common in data mining projects quick and easy. More complex data management tasks are left to other tools such as SQL in a database before the data set is read in, or using R programming.
Rattle’s “Transform” tab cycles through various data management “types.” The way it works is quite unique. As you can see in Figure X, I have selected the Transform tab by clicking on it. I then held the CTRL key down to select several variables that are highlighted in blue. If the variables had been next to one another, I could have clicked on the first one, then shiftclicked on the last to select them all. Next I chose my transformation, by choosing “Recode” and then “Recenter.” Finally, I clicked the “Execute” button (or F2) to complete the process by adding three new recoded variables to the data set. Original variables are never changed, and you never have the ability to choose the name of the new variable(s). A prefix is appended to the variable name(s) automatically to speed the process. In this case, my Rainfall variable was transformed into “RRC_Rainfall”. The RRC prefix stands for “Recoded, ReCentered.”
Whenever a variable is transformed, its status in the “Data” tab switches from “Input” to “Ignore”, while the transformed version of variable enters the data with an “Input” role.
As easy as some transformations are, other transformations are impossible. For example, if you had a formula to calculate recommended daily allowances of vitamins, there’s no way to do it. Conditional transformations, those which have different formulas for different subsets of the observations (e.g. daily allowances of vitamins calculated differently for men and women) are also not possible. Here are the available transformations:
Transform> Rescale> Normalize
 Recenter (Zscore)
 Scale 0 to 1
 (Var – Median)/Mean Absolute Deviation (MAD)
 Natural Log
 Log 10
 Matrix (divide all by a constant)
Transform> Impute
 Replace missing with zeros (e.g. requesting nothing gets you nothing)
 Mean
 Median
 Mode
 Constant
Transform> Recode
 Binning> Quantiles
 Binning> KMeans clusters
 Binning> Equal width intervals
 Binning> N Equally spaced intervals
 Indicator variables
 Join Categorics
 As Categoric
 As Numeric
Continued here…
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Elo and EloBeta models in snooker
(This article was first published on QuestionFlow , and kindly contributed to Rbloggers)
Research about adequacy of Elo based models applied to snooker match results. Contains a novel approach (EloBeta) targeted for sport results with variable “best of N” format.
For many years I’ve been following snooker as a sport. It has it all: hypnotic beauty of smart play, elegance of cue strikes and psychological tension of competition. The one thing I don’t like that much is its ranking system.
Generally speaking, current snooker ranking is based on player accomplishments in tournaments (events) with different “weights” for different tournaments. Long time ago, it just used World Championships. Then, after more events had emerged, there was a table of points player could earn for winning at certain stage of tournament. Now it has the form of “rolling” sum of prize money player won during (roughly) past two calendar years.
This system has two main advantages: it is simple (win more money > rise in rankings) and predictabile (want to get certain ranking > win certain amount of money, other things being equal). The problem is that this type of rankings doesn’t account the strength (skill, form) of player’s opponents. The usual counterargument for this is that if player reached high stage of tournament then he/she is “strong” at this moment of time (“weak players don’t win tournaments”). Well, it does sound quite convincing. However, in snooker, as in any sport, the role of chance should be taken into account: if player is “weaker” it doesn’t mean that he can’t ever win in a match with “stronger” player. It means that this happens less often then the other way around. Here where Elo model comes into play.
The idea behind Elo model is that each player is associated with numerical rating. The assumption is that a result of a game between two players can be predicted based on difference of their ratings: more value indicates more probability of “stronger” (with higher rating) player to win. Elo ranking is based on current player “strength” derived by wins against other players. This eliminates main disadvantage of current official ranking system. It is also capable of updating player rating during tournament to numerically react to player’s strong tournament performance.
Having some practical experience with Elo ratings, I think it can do well in snooker too. However, there is one obstacle: it is devised for competitions with uniform type of games. Yes, there are some variations to account for home field advantage in football or first move advantage in chess (both by adding fixed amount of rating points to “less advantageous” player). In snooker, however, matches are played in the “best of \(N\)” format: the first one to win \(n = \frac{N + 1}{2}\) frames wins a match. We will also call this format “\(n\) to win”.
Intuitively, winning a “10 to win” match (final of major tournament) should be harder for “weaker” player then “4 to win” match (first rounds of current Home Nations tournaments). This is taken into account by my proposed EloBeta model.
To celebrate actual start of 2018/19 snooker season I decided to write this post in which I explore adequacy of both Elo and EloBeta models on snooker match results. Note that the goal is not to build models for forecasting and gambling purposes but for assessing players “strength” and creating “fair” ranking.
The idea of using Elo rating in snooker is not new at all. There are works on this topic, for example:
 Snooker Analyst provides “Elo inspired” (more like Bradley–Terry model) rating system based on the idea of updating rating based on difference between “actual frames won” and “expected frames won”. This approach is a little bit questionable. Surely, more frame difference should indicate more difference in strength, however, achieving that is not player’s goal. In snooker aim is “just” to win match, i.e. get certain amount of frame wins before the opponent.
 This forum discussion with implementation of basic Elo model.
 Possibly, there are other works that I’ve missed. I will highly appreciate any information on this topic.
This post is intended for both R users interested in Elo ratings and snooker analysts and fans. All experiments are written with intention to be reproducible. All code is hidden under spoilers (text appears after clicking on its summary, usually beginning with “Code for …”). It has commentaries and uses tidyverse packages, so it might be an interesting read for R users and programmers.
This post is organized as follows:
 Models describes Elo and EloBeta models with their R implementations.
 Experiment describes details and intentions of computations: which data and methodology were used (and why) and what are the results.
 Exploration of EloBeta ratings has application results of EloBeta model to actual snooker data. This section is written more for Snooker fans than for R enthusiasts.
We will need the following setup:
Code for setup # Data wrangling packages suppressPackageStartupMessages(library(dplyr)) library(tidyr) library(purrr) # Visualization package library(ggplot2) # Package for ratings suppressPackageStartupMessages(library(comperank)) theme_set(theme_bw()) # Shouldn't be needed. Added just in case. set.seed(20180703) ModelsBoth models are based on the following assumptions:
 There is a fixed set of players which should be ranked from “strongest” (first place) to “weakest” (last place).
 Ranking is done by associating player \(i\) with numerical rating \(r_i\): a number indicating the “strength” of player (more value > “stronger”).
 The more difference in player ratings before the match the less favorable is “weaker” player to win it.
 Ratings are updated after every match based on its result and the ratings before it.
 Winning against “stronger” opponent should lead to bigger increase in rating than winning against “weaker” opponent. The opposite should be true for losing.
Elo model updates ratings by the following steps:

Compute probability (before the match) of certain player to win the match. Probability of one player (we will call him/her “first”) with “identifier” \(i\) and rating \(r_i\) winning against the other player (“second”) with “identifier” \(j\) and rating \(r_j\) is equal to
\[Pr(r_i , r_j) = \frac{1}{1 + 10^{(r_j – r_i)/400}}\]
This way of computing probability is aligned with third model assumption.
Difference normalization by 400 is a mathematical way to say which difference is considered “big”. This can be replaced by a model parameter \(\xi\), however this only affects the spread of future ratings and is often an overkill. Number 400 is fairly standard in chess world.
In general approach probability is equal to \(L(r_j – r_i)\) where \(L(x)\) is some strictly increasing function with values from 0 to 1. We will use logistic curve to compute winning probability. More thorough study can be found in this article.
 Obtain match result \(S\). In basic model it is 1 if first player wins (second player loses), 0.5 in case of draw and 0 if second player wins.
 Update ratings:
 \(\delta = K \cdot (S – Pr(r_i , r_j))\). This is the value by which ratings will change. It introduces the “K factor” (only model parameter). Less \(K\) (ratings being equal) means less change in ratings – model is more conservative, i.e. more wins is needed to “prove” increase in “strength”. On the other hand, more \(K\) means more “trust” to the current results than current ratings. Choosing “optimal” \(K\) is a way to produce “good” ranking system.
 \(r_i^{(new)} = r_i + \delta\), \(r_j^{(new)} = r_j – \delta\).
Notes:
 As one can see from rating update formulas, the sum of ratings for all ranked players doesn’t change over time: rating increase of one rating can be only done by taking this amount from another player.
 Players without any matches played are associated with initial rating 0. The usual value is 1500, however I don’t see any other reason except psychological for this. With previous note, using 0 means that sum of all ratings will always be 0, which is kind of beautiful.
 It is needed some matches to be played in order for rating to represent player’s “strength”. This introduces a problem: newly added players start with rating 0 which is almost surely not the lowest among current players. In other words, newcomers are considered to be “stronger” than some other players. This should be dealt with by external procedures of rating updates when introducing new player: maybe he/she should start with some low rating while compensating overall sum decrease by uniform increasing of other players’ ratings.
 Why this procedure makes sense? In case of equal ratings \(\delta\) always equals \(0.5 \cdot K\). Let’s say, for example, that \(r_i = 0\) and \(r_j = 400\). It means that probability of first player winning is \(\frac{1}{1 + 10} \approx 0.0909\), i.e. he/she will win 1 out of 11 matches.
 In case of win, he/she will be awarded with approximately \(0.909 \cdot K\) increase, which is more than in case of equal ratings.
 If he/she is defeated, then rating is decreased by approximately \(0.0909 \cdot K\), which is less than in case of equal ratings.
This shows that Elo model is aligned with fifth model assumption: winning against “stronger” opponent leads to bigger increase in rating than winning against “weaker” opponent and vice versa.
Of course, Elo model has its (fairly highlevel) practical issues. However, the most important for our research is that “it thinks” that all matches are played in uniform conditions. It means, that match length isn’t taken into account: winning in “4 to win” match is rewarded the same as winning in “10 to win” match. That is where EloBeta comes into play.
EloBeta Code for EloBeta model #' @details This function is vectorized by all its arguments. #' #' @return A probability of player 1 (rating `rating1`) wins in a match with #' player 2 (rating `rating2`). Match is played until either player wins #' `frames_to_win` frames. Here difference in ratings directly affects #' the probability of winning one frame. elobeta_win_prob < function(rating1, rating2, frames_to_win, ksi = 400, ...) { prob_frame < elo_win_prob(rating1 = rating1, rating2 = rating2, ksi = ksi) # Probability that first player wins `frames_to_win` frames sooner than second # player based on probability of first player to win one frame `prob_frame`. # Frames are treated as independent games. pbeta(prob_frame, frames_to_win, frames_to_win) } #' @return Match result in terms of player 1 win: 1 if he/she wins, 0.5 in case #' of a draw, and 0 if he/she loses. get_match_result < function(score1, score2) { # There are no ties in snooker but this handles general case near_score < dplyr::near(score1, score2) dplyr::if_else(near_score, 0.5, as.numeric(score1 > score2)) } #' @return A rating function for EloBeta model that can be supplied to #' `comperank::add_iterative_ratings()`. elobeta_fun_gen < function(K, ksi = 400) { function(rating1, score1, rating2, score2) { prob_win < elobeta_win_prob( rating1 = rating1, rating2 = rating2, frames_to_win = pmax(score1, score2), ksi = ksi ) match_result < get_match_result(score1, score2) delta < K * (match_result  prob_win) c(rating1 + delta, rating2  delta) } }In Elo model difference in ratings directly affects the outcome probability of winning the whole match. The main idea behind EloBeta model is to make difference in ratings directly affect the outcome of one frame and actually compute the probability of a player winning \(n\) frames before his opponent.
The question is: how to compute this probability? Turns out, this is one of the oldest task in the history of probability theory and has its own name – Problem of points. Very nice description can be found in this post. Using its notation, the outcome probability equals to:
\[
P(n, n) = \sum\limits_{j = n}^{2n1}{{{2n1} \choose {j}} p^j (1p)^{2n1j}}
\]
Here \(P(n, n)\) is a probability of first player to win in “\(n\) to win” match; \(p\) is his/her probability to win one frame (\(1p\) – opponents probability). With this approach one assumes that result of frames inside a match is independent one from another. This is arguable, but necessary assumption (in terms of this model).
Is there a way to compute this faster? It turns out, that the answer is yes. After hours of formula wrangling, practical experiments and internet search I found the following property of regularized incomplete beta function \(I_x(a, b)\). By substituting \(m = k,~ n = 2k – 1\) into that property and changing \(k\) into \(n\) we obtain that \(P(n, n) = I_p(n, n)\).
This is also very good news for R users, because \(I_p(n, n)\) can be computed as simply as pbeta(p, n, n). Note that the general case probability of winning \(n\) frames before opponent wins \(m\) can also be computed as \(I_p(n, m)\) and pbeta(p, n, m) respectively. This introduces the reach field of updating winning probabilities during the mach.
So the procedure of updating ratings within EloBeta model is as follows (given ratings \(r_i, r_j\), number of frames \(n\) to win in match and match outcome \(S\), as in Elo model):
 Compute probability of first player to win the frame: \(p = Pr(r_i , r_j) = \frac{1}{1 + 10^{(r_j – r_i)/400}}\).
 Compute probability of this player to win the match: \(Pr^{Beta}(r_i, r_j) = I_p(n, n)\). For example, if \(p\) equals 0.4 then probability to win “4 to win” match drops to 0.29 and with “18 to win” to 0.11.
 Update ratings:
 \(\delta = K \cdot (S – Pr^{Beta}(r_i , r_j))\).
 \(r_i^{(new)} = r_i + \delta\), \(r_j^{(new)} = r_j – \delta\).
Note that, as difference in ratings affects probability of winning one frame (not the whole match), we should expect lower optimal \(K\): part of \(\delta\)’s value comes from amplifying effect of \(Pr^{Beta}(r_i, r_j)\).
The idea of computation match result based on probability of winning one frame is not very novel. On this site, authored by François Labelle, you can find online computation of probability of winning a “best of \(N\)” match (with some other functionality). I was very glad to notice that results of our computations match. However, I didn’t find any sources of incorporating this into Elo updating procedure. I will highly appreciate any information on this topic.
I’ve only managed to found this post and this description of Elo system on backgammon internet server (FIBS). Here different matches are handled by multiplying rating difference by square root of match length. However, there seems to be no strong theoretical reason for this.
ExperimentThe experiment has several goals. Based on snooker data:
 Derive best value of “K factor” for both models.
 Study stability of models in terms of prediction probability accuracy.
 Study the effect of using “invitational” events on ratings.
 Produce “best” rating history of all professional players from 2017/18 season.
We will use snooker data from comperank package. The original source is snooker.org site. Results are taken from the following matches:
 Match is from 2016/17 or 2017/18 seasons.
 Match is a part of “professional” snooker event. That is:
 It has “Invitational”, “Qualifying”, or “Ranking” type. We will also differ two sets of matches: “all matches” (from all these events) and “official matches” (not from invitational events). There are two main reasons behind it:
 In invitational events not all players are given opportunity to change their ratings (which isn’t a clearly bad thing under Elo and EloBeta models).
 Belief that players “take seriously” only official ranking events. Note that most of “Invitational” events are from “Championship League” which, I think, are treated by most players as practice with opportunity to win money, i.e. “not very seriously”. Their presence can affect outcome ratings.
Besides “Championship League”, other invitational events are: “2016 China Championship”, both “Champion of Champions”, both “Masters”, “2017 Hong Kong Masters”, “2017 World games”, “2017 Romanian Masters”.
 It describes traditional snooker (not 6 Red or Power Snooker) between individual players (not teams).
 Both genders can take part (not only men or women).
 Players of all ages can take part (not only seniors or under 21).
 It is not “ShootOut” as those events are treated differently in snooker.org database.
 It has “Invitational”, “Qualifying”, or “Ranking” type. We will also differ two sets of matches: “all matches” (from all these events) and “official matches” (not from invitational events). There are two main reasons behind it:
 Match actually happened: its result is due to actual play from both players.
 Match is between two professionals (pros). List of professionals is taken as for 2017/18 season (131 players). This seems like the most controversial decision, as removing “proama” (“ama” for “amateur”) and “amaama” matches leads to “closing eyes” on pros’ losses to amateurs, and thus giving unfair advantage to those pros. I think this choice is necessary to ensure absence of rating inflation which will happen if matches with amateurs are taken into account. Another possibility would be to study pros and amas together, but this seems unreasonable to me within this research. Professional’s loss to amateur is treated as loss of opportunity to increase rating.
The final numbers of used matches are 4118 for “all matches” and 3644 for “official matches” (62.9 and 55.6 matches per player respectively).
Methodology Code for methodology functions #' @param matches A `longcr` or `widecr` object with column `matchType` (with #' type of match for the experiment: "train", "validation", or "test"). #' @param test_type A type of match to be used for computing goodness of fit. #' For experiment correctness, all matches with this type should happen later #' than all other ("warmup") matches. This means having bigger values in #' `game` column. #' @param k_vec Vector of "K factor" values to compute goodness of fit. #' @param rate_fun_gen Function that, given "K factor" value, returns rating #' function that can be supplied to `comperank::add_iterative_ratings()`. #' @param get_win_prob Function to compute rating probability based on #' ratings of players (`rating1`, `rating2`) and number of frames needed to #' win in a match (`frames_to_win`). __Note__ that it should be vectorized by #' all its arguments. #' @param initial_ratings Initial ratings in format ready for #' `comperank::add_iterative_ratings()`. #' #' @details This function computes: #'  History of iterative ratings after arranging `matches` by `game` column. #'  For matches with type equals to `test_type`: #'  Probability of player 1 winning. #'  Match result in terms of player 1 win: 1 if he/she wins, 0.5 in case of #' a draw, and 0 if he/she loses. #'  Goodness of fit in the form of RMSE: square root of mean square error, #' where "error" is difference between predicted probability and match result. #' #' @return A tibble with columns 'k' for "K factor" and 'goodness' for RMSE #' goodness of fit. compute_goodness < function(matches, test_type, k_vec, rate_fun_gen, get_win_prob, initial_ratings = 0) { cat("\n") map_dfr(k_vec, function(cur_k) { # Track execution cat(cur_k, " ") matches %>% arrange(game) %>% add_iterative_ratings( rate_fun = rate_fun_gen(cur_k), initial_ratings = initial_ratings ) %>% left_join(y = matches %>% select(game, matchType), by = "game") %>% filter(matchType %in% test_type) %>% mutate( # Number of frames needed to win in a match framesToWin = pmax(score1, score2), # Probability of player 1 winning a match with `frame_to_win` frames # needed to win. winProb = get_win_prob( rating1 = rating1Before, rating2 = rating2Before, frames_to_win = framesToWin ), result = get_match_result(score1, score2), squareError = (result  winProb)^2 ) %>% summarise(goodness = sqrt(mean(squareError))) }) %>% mutate(k = k_vec) %>% select(k, goodness) } #' A wrapper for `compute_goodness()` to be used with design matrix data. compute_goodness_wrap < function(matches_name, test_type, k_vec, rate_fun_gen_name, win_prob_fun_name, initial_ratings = 0) { matches_tbl < get(matches_name) rate_fun_gen < get(rate_fun_gen_name) get_win_prob < get(win_prob_fun_name) compute_goodness( matches_tbl, test_type, k_vec, rate_fun_gen, get_win_prob, initial_ratings ) } #' Function to perform experiment. #' #' @param test_type Vector of values for `test_type` for `compute_goodness()`. #' @param rating_type Names of rating models. #' @param data_type Suffixes of data types. #' @param k_vec,initial_ratings Values for `compute_goodnes()` #' #' @details This function generates design matrix and computes multiple values #' of goodness of fit for different combinations of rating and data types. For #' this to work, variables with the following combinations of names should be #' created in the global environment: #'  "pro_matches_" + `` + `` for matches data. #'  `` + "_fun_gen" for rating function generators. #'  `` + "_win_prob" for functions that compute win probability. #' #' @return A tibble with columns: #'  __testType__ : Test type identifier. #'  __ratingType__ : Rating type identifier. #'  __dataType__ : Data type identifier. #'  __k__ : Value of "K factor". #'  __goodness__ : Value of goodness of fit. do_experiment < function(test_type = c("validation", "test"), rating_type = c("elo", "elobeta"), data_type = c("all", "off"), k_vec = k_grid, initial_ratings = 0) { crossing( testType = test_type, ratingType = rating_type, dataType = data_type ) %>% mutate( dataName = paste0("pro_matches_", testType, "_", dataType), kVec = rep(list(k_vec), n()), rateFunGenName = paste0(ratingType, "_fun_gen"), winProbFunName = paste0(ratingType, "_win_prob"), initialRatings = rep(list(initial_ratings), n()), experimentData = pmap( list(dataName, testType, kVec, rateFunGenName, winProbFunName, initialRatings), compute_goodness_wrap ) ) %>% unnest(experimentData) %>% select(testType, ratingType, dataType, k, goodness) }To find “optimal” value of \(K\) we will use the even grid \(K = 1, 2, …, 100\). Accounting for greater values seems to be unreasonable which is confirmed by the experiment. The following procedure is used to find it:
 For every \(K\):
 Compute history of iterative ratings of certain model based on certain data set. It means that ratings of players would be known before every match. This is done with add_iterative_ratings() function from comperank package. This procedure corresponds to “live ratings” which update after every match.
 Based on data, starting from a certain (distant from the beginning) moment in time, compute goodness of model fit. We will use RMSE between probability of first player to win (computed based on model) and match result. That is \(RMSE = \sqrt{\frac{1}{T} \sum\limits_{t \in T}{(S_t – P_t)^2}}\), where \(T\) – indices of used matches, \(T\) – number of used matches, \(S_t\) – result of match for first player, \(P_t\) – probability of first player to win the match (computed based on model). Not including matches from the beginning of data is needed for ratings to “catch up” to “current strength” from initial ratings.
 The value of \(K\) with stable minimal RMSE is said to be optimal. Here by “stable” we mean that small RMSE values is present in some neighborhood of optimal \(K\) (will be controlled not very strictly by looking at graphs). Values of RMSE lower 0.5 (value for “model” with constant 0.5 probability) will be considered a success.
As one of the goals is to study stability of models, data will be split into three subsets: “train”, “validation” and “test”. They are ordered in time, i.e. any “train”/“validation” match has ending time earlier than any “validation”/“test” match. I decided to do actual split in 50/25/25 proportion for “all matches”. Split of “official matches” is done by removing from “all matches” invitational events. It gives split not totally in desired proportion, but rather 49.7/27.8/22.5. However, this approach ensures that matches with same type in both match data represent identical time periods.
Experiment will be performed for all combinations of the following variables:
 Type of model: Elo and EloBeta.
 Type of match data: “All matches” and “official matches”.
 Experiment type: “Validation” (“validation” matches are used for computing RMSE after “warming up” on “train” matches) and “Test” (“test” matches are used after “warming up” on both “train” and “validation” ones).
From the experiment we can make the following conclusions:
 As it was expected, optimal \(K\) values for EloBeta are lower than for Elo.
 Using official matches (without invitational events) gives more stable results (“Validation” and “Test” results differ less). This shouldn’t be considered as a point that professionals take invitational events not seriously. Probably, this is due to quality of match results from “Championship League”: it has rather unpredictable “3 to win” format and tight schedule.
 Change in RMSE for optimal \(K\) is not substantial. That is, RMSE didn’t change drastically after computing optimal \(K\) in “Validation” and applying it in “Test” experiment. Moreover, on “official matches” it even decreased.
 All optimal K values from test experiment (with longer “warm up”) are lower than from validation experiment. This may be the result of longer “warm up” or just a feature of particular data.
 Best goodness of Elo and EloBeta fits are almost the same. Both are stable and below 0.5. Data for “official matches” (as they demonstrate stable behavior) is presented below. As results don’t differ that much, we will round optimal \(K\) to a factor of 5: for Elo model it is 30 and for EloBeta – 10.
Based on these results, I am inclined to conclude that Elo model with \(K = 30\) and EloBeta model with \(K = 10\) can be usefully applied to officially ranked snooker matches. However, EloBeta model accounts for different “\(n\) to win” matches, so it should be preferred over Elo.
Exploration of EloBeta ratingsThe following results are computed based on “official matches” with EloBeta model (\(K = 10\)). All possible conclusions shouldn’t be viewed as personal to any player.
Top 16 by the end of 2017/18 Code for 2017/18 top 16 # Helper function gather_to_longcr < function(tbl) { bind_rows( tbl %>% select(matches("2")) %>% rename_all(funs(gsub("1", "", .))), tbl %>% select(matches("1")) %>% rename_all(funs(gsub("2", "", .))) ) %>% arrange(game) } # Extract best "K factor" value best_k < experiment_tbl %>% filter(testType == "test", ratingType == "elobeta", dataType == "off") %>% slice(which.min(goodness)) %>% pull(k) #!!! Round to "pretty" number as it doesn't affect result that much!!! best_k < round(best_k / 5) * 5 # Compute ratings at the end of the data elobeta_ratings < rate_iterative( pro_matches_test_off, elobeta_fun_gen(best_k), initial_ratings = 0 ) %>% rename(ratingEloBeta = rating_iterative) %>% arrange(desc(ratingEloBeta)) %>% left_join( y = snooker_players %>% select(id, playerName = name), by = c(player = "id") ) %>% mutate(rankEloBeta = order(ratingEloBeta, decreasing = TRUE)) %>% select(player, playerName, ratingEloBeta, rankEloBeta) elobeta_top16 < elobeta_ratings %>% filter(rankEloBeta <= 16) %>% mutate( rankChr = formatC(rankEloBeta, width = 2, format = "d", flag = "0"), ratingEloBeta = round(ratingEloBeta, 1) )Top 16 by EloBeta model at the end of 2017/18 season looks like this (official data is also taken from snooker.org site):
Player EloBeta rank EloBeta rating Official rank Official rating EloBeta rank increase Ronnie O’Sullivan 1 128.8 2 905 750 1 Mark J Williams 2 123.4 3 878 750 1 John Higgins 3 112.5 4 751 525 1 Mark Selby 4 102.4 1 1 315 275 3 Judd Trump 5 92.2 5 660 250 0 Barry Hawkins 6 83.1 7 543 225 1 Ding Junhui 7 82.8 6 590 525 1 Stuart Bingham 8 74.3 13 324 587 5 Ryan Day 9 71.9 16 303 862 7 Neil Robertson 10 70.6 10 356 125 0 Shaun Murphy 11 70.1 8 453 875 3 Kyren Wilson 12 70.1 9 416 250 3 Jack Lisowski 13 68.8 26 180 862 13 Stephen Maguire 14 63.7 17 291 025 3 Mark Allen 15 63.7 12 332 450 3 Yan Bingtao 16 61.6 23 215 125 7Some observations:
 Current official #1 Mark Selby is ranked 3 places lower in EloBeta. This might be a sign that current distribution of prize money doesn’t quite reflect efforts needed to win them (on average).
 Most “underrated” players according to official ranking are Jack Lisowski (astonishing 13 place difference), Ryan Day and Yan Bingtao (both have 7 place difference).
 Stuart Bingham is ranked 5 positions higher by EloBeta probably because he didn’t play for six month due to WPBSA ban. His EloBeta rating didn’t change during this period but in official rating he lost points because of its “rolling” nature. This case demonstrates one important differences between two approaches: official system is good to account for “not playing” and EloBeta is good to account for “playing”.
 Judd Trump and Neil Robertson are ranked the same under both methods.
 With EloBeta model Allister Carter (officially ranked #11), Anthony McGill (#14) and Luca Brecel (#15) are not in top 16. Instead, Jack Lisowski (#26), Yan Bingtao (#23) and Stephen Maguire (#17) are in.
Here is an example of predictions Based on EloBeta model. The probability of 16th player (Yan Bingtao) win one frame in a match against first player (Ronnie O’Sullivan) equals to 0.404. In “4 to win” match it drops to 0.299, in “10 to win” – 0.197 and in World Championship final “18 to win” – 0.125. In my opinion, these numbers might be close to reality.
Collective evolution of EloBeta ratings Code for rating evolution # Helper data seasons_break < ISOdatetime(2017, 5, 2, 0, 0, 0, tz = "UTC") # Compute evolution of ratings elobeta_history < pro_matches_test_off %>% add_iterative_ratings(elobeta_fun_gen(best_k), initial_ratings = 0) %>% gather_to_longcr() %>% left_join(y = pro_matches_test_off %>% select(game, endDate), by = "game") # Generate plot plot_all_elobeta_history < function(history_tbl) { history_tbl %>% mutate(isTop16 = player %in% elobeta_top16$player) %>% ggplot(aes(endDate, ratingAfter, group = player)) + geom_step(data = . %>% filter(!isTop16), colour = "#C2DF9A") + geom_step(data = . %>% filter(isTop16), colour = "#22A01C") + geom_hline(yintercept = 0, colour = "#AAAAAA") + geom_vline( xintercept = seasons_break, linetype = "dotted", colour = "#E41A1C", size = 1 ) + geom_text( x = seasons_break, y = Inf, label = "End of 2016/17", colour = "#E41A1C", hjust = 1.05, vjust = 1.2 ) + scale_x_datetime(date_labels = "%Y%m") + labs( x = NULL, y = "EloBeta rating", title = paste0( "Most of current top 16 established at the end of 2016/17 season" ), subtitle = paste0( "Winning of event is well noticable as rapid increase without ", "descrease at the end." ) ) + theme(title = element_text(size = 14)) } plot_all_elobeta_history(elobeta_history) Evolution of EloBeta top 16 Code for rating evolution of top 16 # Compute plot data top16_rating_evolution < elobeta_history %>% # Using `inner_join` to leave only players from `elobeta_top16` inner_join(y = elobeta_top16 %>% select(ratingEloBeta), by = "player") %>% # Leave games only from 2017/18 season semi_join( y = pro_matches_test_off %>% filter(season == 2017), by = "game" ) %>% mutate(playerLabel = paste(rankChr, playerName)) # Generate plot plot_top16_elobeta_history < function(elobeta_history) { ggplot(elobeta_history) + geom_step(aes(endDate, ratingAfter, group = player), colour = "#22A01C") + geom_hline(yintercept = 0, colour = "#AAAAAA") + geom_rug( data = elobeta_top16, mapping = aes(y = ratingEloBeta), sides = "r" ) + facet_wrap(~ playerLabel, nrow = 4, ncol = 4) + scale_x_datetime(date_labels = "%Y%m") + labs( x = NULL, y = "EloBeta rating", title = "Rating evolution for EloBeta top 16 (as of 2017/18 end)", subtitle = paste0( "Ronnie O'Sullivan and Mark J Williams did very well in 2017/18 ", "season.\n", "As did Jack Lisowski: rise from negative rating to place 13." ) ) + theme(title = element_text(size = 14), strip.text = element_text(size = 12)) } plot_top16_elobeta_history(top16_rating_evolution) Conclusions Solving “Problem of points” in R is as easy as pbeta(p, n, m).
 EloBeta model is a modification of Elo for “best of \(N\)” (or “\(n\) to win”) matches. It can make predictions for different match length.
 Elo model with \(K = 30\) and EloBeta model with \(K = 10\) can be usefully applied to officially ranked snooker matches.
 Snooker related:
 Most “underrated” players from EloBeta top 16 according to official ranking (as of end of 2017/18 season) are Jack Lisowski, Ryan Day and Yan Bingtao.
 Season 2017/18 was very productive for Ronnie O’Sullivan, Mark J Williams and Jack Lisowski.
To leave a comment for the author, please follow the link and comment on their blog: QuestionFlow . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
seplyr 0.5.8 Now Available on CRAN
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
We are pleased to announce that seplyr version 0.5.8 is now available on CRAN.
seplyr is an R package that provides a thin wrapper around elements of the dplyr package and (now with version 0.5.8) the tidyr package. The intent is to give the part time R user the ability to easily program over functions from the popular dplyr and tidyr packages. Our assumption is always that a data scientist most often comes to R to work with data, not to tinker with the programming language itself.
Tools such as seplyr, wrapr or rlang are needed when you (the data scientist temporarily working on a programming subtask) do not know the names of the columns you want your code to be working with. These are situations where you expect the column names to be made available later, in additional variables or parameters.
For an example: suppose we have following data where for two rows (identified by the “id” column) we have two measurements each (identified by the column names “measurement1” and “measurement2”).
library("wrapr") d < build_frame( 'id', 'measurement1', 'measurement2'  1 , 'a' , 10  2 , 'b' , 20 ) print(d) # id measurement1 measurement2 # 1 1 a 10 # 2 2 b 20Further suppose we wished to have each measurement in its own row (which is often required, such as when using the ggplot2 package to produce plots). In this case we need a tool to convert the data format. If we are doing this as part of an adhoc analysis (i.e. we can look at the data and find the column names at the time of coding) we can use tidyr to perform the conversion:
library("tidyr") gather(d, key = value_came_from_column, value = value_was, measurement1, measurement2) # id value_came_from_column value_was # 1 1 measurement1 a # 2 2 measurement1 b # 3 1 measurement2 10 # 4 2 measurement2 20Notice, however, all column names are specified in gather() without quotes. The names are taken from unexecuted versions of the actual source code of the arguments to gather(). This is somewhat convenient for the analyst (they can skip writing a few quote marks), but a severe limitation imposed on the script writer or programmer (they have problems taking the names of columns from other sources).
seplyr now supplies a standard value oriented interface for gather(). With seplyr we can write code such as the following:
library("seplyr") gather_se(d, key = "value_came_from_column", value = "value_was", columns = c("measurement1", "measurement2")) # id value_came_from_column value_was # 1 1 measurement1 a # 2 2 measurement1 b # 3 1 measurement2 10 # 4 2 measurement2 20This sort of interface is handy when the names of the columns are coming from elsewhere, in variables. Here is an example of that situation:
# pretend these assignments are done elsewhere # by somebody else key_col_name < "value_came_from_column" value_col_name < "value_was" value_columns < c("measurement1", "measurement2") # we can use the above values with # code such as this gather_se(d, key = key_col_name, value = value_col_name, columns = value_columns) # id value_came_from_column value_was # 1 1 measurement1 a # 2 2 measurement1 b # 3 1 measurement2 10 # 4 2 measurement2 20There are ways to use gather() with “to be named later” column names directly, but it is not simple as it neeedlessly forces the user to master a number of internal implementation details of rlang and dplyr. From documentation and “help(gather)” we can deduce at least 3 related “pure tidyeval/rlang” programming over gather() solutions:
# possibly the solution hinted at in help(gather) gather(d, key = !!key_col_name, value = !!value_col_name, dplyr::one_of(value_columns)) # concise rlang solution gather(d, key = !!key_col_name, value = !!value_col_name, !!!value_columns) # fully qualified rlang solution gather(d, key = !!rlang::sym(key_col_name), value = !!rlang::sym(value_col_name), !!!rlang::syms(value_columns))In all cases the user must prepare and convert values for use. Really this is showing gather() does not conveniently expect parametric columns (column names supplied by variables or parameters), but will accept a workaround if the user recodes column names in some way (some combination of quoting and dequoting). With “gather_se()” the tool expects to take values and the user does not have to make special arrangements (or remember special notation) to supply them.
Our advice for analysts is:
 If your goal is to work with data: use a combination of wrapr::let() (a preferred user friendly solution that in fact predates rlang) and seplyr (a datafriendly wrapper over dplyr and tidyr functions).
 If your goal is to write an article about rlang: then use rlang.
 If you are interested in more advanced data manipulation, please check out our cdata package (video introduction here). The cdata equivilant
of the above transform is. library("cdata") control_table < build_unpivot_control( nameForNewKeyColumn = key_col_name, nameForNewValueColumn = value_col_name, columnsToTakeFrom = value_columns) rowrecs_to_blocks(d, control_table, columnsToCopy = "id") # id value_came_from_column value_was # 1 1 measurement1 a # 2 1 measurement2 10 # 3 2 measurement1 b # 4 2 measurement2 20  If you need high inmemory performance: try data.table.
In addition to wrapping a number of dplyr functions and tidyr::gather()/tidyr::spread(), seplyr 0.5.8 now also wraps tidyr::complete() (thanks to a contribution from Richard Layton).
We hope you try seplyr out both in your work and in your teaching.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Handling Outliers with R
(This article was first published on R – MADStat, and kindly contributed to Rbloggers)
Recently, I attended a presentation where the following graph was shown illustrating the response to stimulation with Thalidomide among a cohort of HIV1 patients. The biomarker used to measure the response in this case was TNF (tumor necrosis factor) and the response was measured at four time points: time of drug administration and 3, 11 and 23 weeks after administration. This graph appears in a published article, which I won’t cite directly because except for this one statistical problem, the research is outstanding.
Part of the problem in this graph comes from the software being used. Clearly not R and clearly not ggplot. It comes from one of the highly automated programs that assumes you, the scientist, are a total jerk and incapable of specifying a graph on your own. I have highlighted the problem I saw. That very high value in week 23 seems questionable and could be distorting the summary measures that are being used in the inferences in the article. The following table of the data gives some detail to the problem:
Beside the large number of missing values, we can see that the value for Patient 25 in week 23 (visit 4), 1339.552, is 10 times higher than the next value (130.997). The fact that there a number of similar digits between it and the next lower value also raises suspicion that this could simply be a data entry error or an error translating the data from its original home in a spreadsheet to the statistical program used. (I will also not mention the name of the program used, since I try to follow the English comedian Frankie Howerd’s advice: “Don’t mock the afflicted”.)
Is such a high value reasonable? And, if it it is simply a data error, could the use of this value lead to the wrong treatment recommendation based on this cohort.
First Question — Is It an Outlier?The first question that we need to address is whether this value, 1339.552, is really an outlier that needs attention and, if so, how should we modify it?
What’s an outlier? The most common definition is that it is a value that lies far away from the main body of observations for a variable and could distort summaries of the distribution of values. In practice this translates to a value that is 1.5 times the IQR (interquartile range) more extreme than the quartiles of the distribution.
The most effective way to see an outlier is to use a boxplot. The following figure relates the parts of a boxplot to a distribution and its histogram. I have taken it from the excellent book on R by Hadley Wickham and Garrett Grolemund, R for Data Science, which is available for reading here.
Boxplots are an excellent way to identify outliers and other data anomalies. We can draw them either with the base R function boxplot() or the ggplot2 geometry geom_boxplot(). Here, I am going to use the ggboxplot() function from the ggpubr package. I find that the functions from ggpubr keep me from making many mistakes in specifying parameters for the equivalent ggplot2 functions. It makes life a little simpler.
Here is the code that read in the original data and tidied the data to facilitate the graphing and subsequent analysis.
library(tidyverse) library(DescTools) library(ggpubr) tnf < readRDS("../tnf.rds") tnf_long < tnf %>% gather(vis0:vis4, key = "visit", value = "value") %>% group_by(visit)The next step is to look at the boxplot, which I have created with the following code:
gr_tnf_vis2 < tnf_long %>% filter(!is.na(visit) & !is.na(value)) %>% ggboxplot(x = "visit", y = "value", palette = "aaas", title = "TNF Levels by Visit  Thalidomide Group", xlab = "Visit", ylab = "TNF level", ggtheme = theme_gray()) suppressMessages(gr_tnf_vis2)While the graph shows three other outliers, they are not well away from the quartile boxes. However, the value at 1339.552 really stands out as an outlier.
So, what are we going to do about it?
What to Do about OutliersClearly, if a value has a reasonable scientific basis, it should not be modified even if it lies well outside the range of the other values for the sample or cohort. I normally refer to this as needing to pass the “compelling reason” test. There should be a compelling biological or medical reason to retain the data point.
If not, there are three commonly accepted ways of modifying outlier values.
 Remove the case. If you have many cases and there does not appear to be an explanation for the appearance of this value, or if the explanation is that it is in error, you can simply get rid of it.
 Assign the next value nearer to the median in place of the outlier value. Here that would be 130.997, the next lower value. You will find that this approach leaves the distribution close to what it would be without the value. You can use this approach if you have few cases and are trying to maintain the number of observations you do have.
 Calculate the mean of the remaining values without the outlier and assign that to the outlier case. While I have seen this frequently, I don’t really understand its justification and I think it distorts the distribution more than the previous solution.
Because we want to keep as many of the few values that we have (a commonplace concern in medical studies), I would choose the second option in this case and assign the value 130.997 to the outlier case. Here is the code that would do that. Patient 7 was the person who represented the value we seek and we are putting the value of that case into the slot for case 25.
# substitute tnf2 < tnf_long # case no. of next case tnf2$value[tnf2$id == "25" & tnf2$visit == "vis4"] < tnf2$value[tnf$id == "7" & tnf2$visit == "vis4"]The boxplot now shows that the extreme outlier has been controlled.
Problem Resolved?Yes and No. We now have values for tnf that fit in the distribution more comfortably. However, we needed to manipulate a value to cause that to happen.
There are three conclusions I would suggest from this type of analysis:
 Data always needs to be checked for outliers. Without doing this, you are likely to introduce a bias that could distort the results of your study.
 Conduct your analysis on the data both with and without the outlier. If the results are very close, you can use the original data without too many qualms. If not, you need to follow the next recommendation even more closely.
 When you find an outlier, you should report it in your findings and everything you have done to correct for it including differences between analyses that you conducted with and without the outlier value.
A last note is that canned statistical or data analysis software tends not to alert you to the presence and danger of outliers. You need to be especially vigilant when using these programs to avoid getting lulled into believing that everything is ok when it’s not. By forcing you to write your program, R avoids this problem but does not totally eliminate. You still need to take responsibility to going on an outlier hunt as you move through the data cleaning process.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – MADStat. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Freeing PDF Data to Account for the Unaccounted
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
I’ve mentioned @stiles before on the blog but for those new to my blatherings, Matt is a topnotch data journalist with the @latimes and currently stationed in South Korea. I can only imagine how much busier his life has gotten since that fateful, awful November 2016 Tuesday, but I’m truly glad his eyes, pen and R console are covering the important events there.
When I finally jumped on Twitter today, I saw this:
@hrbrmstr Do you have an R rig to convert this large PDF to csv? I tried xpdf and @TabulaPDF but I don't trust the results.
— Matt Stiles (@stiles) July 2, 2018
and went into action and figured I should blog the results as one can never have too many “convert this PDF to usable data” examples.
The ProblemThe U.S. Defense POW/MIA Accounting Agency maintains POW/MIA data for all our nation’s service members. Matt is working with data from Korea (the “All US UnaccountedFor” PDF direct link is in the code below) and needed to get the PDF into a usable form and (as you can see if you read through the Twitter thread) both Tabulizer and other tools were introducing sufficient errors that the resultant extracted data was either not complete or trustworthy enough to rely on (handchecking nearly 8,000 records is not fun).
There PDF in question was pretty uniform, save for the first and last pages. Here’s a sample:
We just need a reproducible way to extract the data with sufficient veracity to ensure we can use it faithfully.
The SolutionWe’ll need some packages and the file itself, so let’s get that bit out of the way first:
library(stringi) library(pdftools) library(hrbrthemes) library(ggpomological) library(tidyverse) # grab the PDF text mia_url < "http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf" mia_fil < "~/Data/pmkor_una_all.pdf" if (!file.exists(mia_fil)) download.file(mia_url, mia_fil) # read it in doc < pdf_text(mia_fil)Let's look at those three example pages:
cat(doc[[1]]) ## Defense POW/MIA Accounting Agency ## Personnel Missing  Korea (PMKOR) ## (Reported for ALL Unaccounted For) ## Total Unaccounted: 7,699 ## Name Rank/Rate Branch Year State/Territory ## ABBOTT, RICHARD FRANK M/Sgt UNITED STATES ARMY 1950 VERMONT ## ABEL, DONALD RAYMOND Pvt UNITED STATES ARMY 1950 PENNSYLVANIA ## ... ## AKERS, HERBERT DALE Cpl UNITED STATES ARMY 1950 INDIANA ## AKERS, JAMES FRANCIS Cpl UNITED STATES MARINE CORPS 1950 VIRGINIA cat(doc[[2]]) ## Name Rank/Rate Branch Year State/Territory ## AKERS, RICHARD ALLEN 1st Lt UNITED STATES ARMY 1951 PENNSYLVANIA ## AKI, CLARENCE HALONA Sgt UNITED STATES ARMY 1950 HAWAII ... ## AMIDON, DONALD PRENTICE PFC UNITED STATES MARINE CORPS 1950 TEXAS ## AMOS, CHARLES GEARL Cpl UNITED STATES ARMY 1951 NORTH CAROLINA cat(doc[[length(doc)]]) ## Name Rank/Rate Branch Year State/Territory ## ZAVALA, FREDDIE Cpl UNITED STATES ARMY 1951 CALIFORNIA ## ZAWACKI, FRANK JOHN Sgt UNITED STATES ARMY 1950 OHIO ## ... ## ZUVER, ROBERT LEONARD Pfc UNITED STATES ARMY 1950 CALIFORNIA ## ZWILLING, LOUIS JOSEPH Cpl UNITED STATES ARMY 1951 ILLINOIS ## This list of Korean War missing personnel was prepared by the Defense POW/MIA Accounting Agency (DPAA). ## Please visit our web site at http://www.dpaa.mil/OurMissing/KoreanWarPOWMIAList/ for updates to this list and other official missing personnel data lists. ## Report Prepared: 06/19/2018 11:25The poppler library's "layout" mode (which pdftools uses brilliantly) combined with the author of the PDF not being evil will help us make short work of this since:
 there's a uniform header on each page
 the "layout" mode returned uniform perpage, fixedwidth columns
 there's no "special column tweaks" that some folks use to make PDFs more readable by humans
There are plenty of comments in the code, so I'll refrain from too much blathering about it, but the general plan is to go through each of the 119 pages and:
 convert the text to lines
 find the header line
 find the column start/end positions from the header on the page (since they are different for each page)
 reading it in with readr::read_fwf()
 remove headers, preamble and epilogue cruft
 turn it all into one data frame
Now the data is both usable and sobering:
title < "Defense POW/MIA Accounting Agency Personnel Missing  Korea" subtitle < "Reported for ALL Unaccounted For" caption < "Source: http://www.dpaa.mil/portals/85/Documents/KoreaAccounting/pmkor_una_all.pdf" mutate(xdf, year = factor(year)) %>% mutate(branch = stri_trans_totitle(branch)) > xdf ordr < count(xdf, branch, sort=TRUE) mutate(xdf, branch = factor(branch, levels = rev(ordr$branch))) %>% ggplot(aes(year)) + geom_bar(aes(fill = branch), width=0.65) + scale_y_comma(name = "# POW/MIA") + scale_fill_pomological(name=NULL, ) + labs(x = NULL, title = title, subtitle = subtitle) + theme_ipsum_rc(grid="Y") + theme(plot.background = element_rect(fill = "#fffeec", color = "#fffeec")) + theme(panel.background = element_rect(fill = "#fffeec", color = "#fffeec")) var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Pushing Ordinary Least Squares to the limit with Xy()
(This article was first published on rbloggers – STATWORX, and kindly contributed to Rbloggers)
Introduction to Xy()Simulation is mostly about answering particular research questions. Whenever the word simulation appears somewhere in a discussion, everyone knows that this means additional effort. At STATWORX we are using simulations as a first step to proof concepts we are developing. Sometimes such a simulation is simple, in other cases a simulation is plenty of work. Though, research questions are always pretty unique, which complicates the construction of a simulation framework. With the Xy() function I try to establish such a simulation framework for supervised learning tasks by trying to find the right balance between capability and simplicity.
I have already introduced this function in a recent blog post. Since then more and more functionalities were implemented and the overall interface was improved. If you are interested in the motivation behind the function, I strongly encourage you to read my last blog post. In the following we are going to analyze OLS coefficients under weak signal to noise ratios and get to know all revisions of the function.
Key FunctionalitiesThe purpose of Xy() is to simulate supervised learning data with maximum degrees of freedom. The user can independently choose how features generate the target by choosing interaction depths, userdefined nonlinearity transformations, categorical variables and all that jazz.
In the recent update I wrapped a package around the function which can be conveniently forked on GitHub or installed via library(devtools):
By default, the function simulates regression learning data with 1000 observations, two linear and two nonlinear features as well as five randomly generated variables, which are not used to generate the target. One main goal of the function is to challenge feature selection algorithms, thus there is an option to generate random variables. Let us create a simple example:
# Simulating regression data with four numeric effects, # one categorical effect and five irrelevant features reg_sim = Xy(n = 1000, numvars = c(2, 2), catvars = c(1, 2), noisevars = 5, family = Xy_task(), nlfun = function(x) {x^2}, interactions = 1, sig = c(1,10), cor = c(0.1,0.3), weights = c(5,10), sigma = NULL, stn = 4, noise.coll = FALSE, intercept = TRUE) print(reg_sim) Xy Simulation   + task regression  + observations 1000  + interactions 1D  + signal to noise ratio 4  + effects   linear 2   nonlinear 2   categorical 1   noise 5  + intervals   correlation [0.1,0.3]   weights [5,10]   sd [1,10] Target generating process: y = 416.07 + 9.81NLIN_1 + 5.23NLIN_2 + 5.53LIN_1 + 5.69LIN_2 + 6.75DUMMY_1__2 + e ~ N(0,487.15)The function comes with a print(), plot() and coef() method to further analyze the simulation object. The print method summarizes the settings that were used for the simulation. On the bottom you can see the target generating process, where you can see the coefficients and their respective weights as well as the noise. There are four key arguments in the function. The n argument controls the number of observations. The user can choose a desired amount of linear and nonlinear features with the numvars argument. Further, categorical variables and their respective levels can be determined with catvars. Of course there are many other functionalities, however going through all these would be a little dry, so why not use a toy example?
Pushing Ordinary Least Squares to the limitSuppose we are interested how OLS can cope with different signal strengths. The signal (strength) is defined as
Hence the signal strength controls how many information about the target can be extracted from our features. Higher values mean that there is relatively more variation in our features opposed to the noise. In such cases we can be more confident about the estimation results of our statistical model. In the following simulation we will try to find the sweet spot for the signal strength. In the Xy() function the signal strength can be manipulated with the stn argument.
# Simulating one supervised learning task with signal to noise ratio of 0.01 sig_sim = Xy(n = 1000, numvars = c(2, 0), catvars = 0, noisevars = 0, family = Xy_task(), sig = c(1,30), cor = 0, weights = c(10,10), stn = 0.01, intercept = TRUE) # extract the training data training < sig_sim$data # underlying effects coef(sig_sim) (Intercept) LIN_1 LIN_2 44.68159 6.07000 3.50000 coef(lm(y ~ ., data = training)) `(Intercept)` LIN_1 LIN_2 47.455572 4.807992 2.936850So in this setup we simulate a regression target which is generated by 2 linear features and dominating noise, since the signal to noise ratio is very weak. We can see that the OLS estimators are pretty off, which is not surprising with such a weak ratio.
In the following we are going to analyze the accuracy of the estimator by repeating the above simulation a thousand times for a single signal to noise ratio. We are going to analyze a total of nine different signal to noise ratios.
You can find the complete code of this simulation on my GitHub. I’ve summarized the results in the following figure.
You can observe the mean squared error of estimated coefficients against the real underlying coefficients of the Xy() simulation. The dotted line marks the breakeven point of a one to one signal to noise ratio. The marginal decrease in the MSE is saturates from a signal of onwards. Out of curiosity I ran this simulation with a signal to noise ratio of , to get a sense of the absolute differences in terms of MSE between a signal of and . The overall MSE of the OLS coefficients estimated with such a strong signal lies within the fourth decimal place, whereas it is only in the second decimal place at a signal to noise ratio of .
I hope you got a sense of the function and by now identified some applications for yourself. Feel free to drop me some feedback on the functionalities and on the overall interface. I am glad to improve the function and I will try to implement new functionalities on request as soon as possible.
In my next blog post I will introduce an upcoming feature of Xy() which will be especially useful in the field of machine learning.
Über den Autor André BleierDer Beitrag Pushing Ordinary Least Squares to the limit with Xy() erschien zuerst auf STATWORX.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: rbloggers – STATWORX. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Update on Polynomial Regression in Lieu of Neural Nets
(This article was first published on Mad (Data) Scientist, and kindly contributed to Rbloggers)
There was quite a reaction to our paper, “Polynomial Regression as an Alternative to Neural Nets” (by Cheng, Khomtchouk, Matloff and Mohanty), leading to discussions/debates on Twitter, Reddit, Hacker News and so on. Accordingly, we have posted a revised version of the paper. Some of the new features:
 Though originally we had made the disclaimer that we had not yet done any experiments with image classification, there were comments along the lines of “If the authors had included even one example of image classification, even the MNIST data, I would have been more receptive.” So our revision does exactly that, with the result that polynomial regression does well on MNIST even with only very primitive preprocessing (plain PCA).
 We’ve elaborated on some of the theory (still quite informal, but could be made rigorous).
 We’ve added elaboration on other aspects, e.g. overfitting.
 We’ve added a section titled, “What This Paper Is NOT.” Hopefully those who wish to comment without reading the paper (!) this time will at least read this section.
 Updated and expanded results of our data experiments, including more details on how they were conducted.
We are continuing to add features to our associated R package, polyreg. More news on that to come.
Thanks for the interest. Comments welcome!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
One year as a subscriber to Stack Overflow
(This article was first published on Florian Privé, and kindly contributed to Rbloggers)
In this post, I follow up on a previous post describing how last year in July, I spent one month mostly procrastinating on Stack Overflow (SO). We’re already in July so it’s time to get back to one year of activity on Stack Overflow.
Am I still as much active as before? What is my strategy for answering questions on SO?
My activity on Stack OverflowAgain, we’ll use David Robinson’s package {stackr} to get data from Stack Overflow API in R.
# devtools::install_github("dgrtwo/stackr") suppressMessages({ library(stackr) library(tidyverse) library(lubridate) }) Evolution of my SO reputation myID < "6103040" myRep < stack_users(myID, "reputationhistory", num_pages = 40, fromdate = today()  years(1)) myRep %>% arrange(creation_date) %>% ggplot(aes(creation_date, cumsum(reputation_change))) + geom_point() + labs(x = "Date", y = "Reputation (squared transformed)", title = "Evolution of my SO reputation over the last year") + bigstatsr::theme_bigstatsr()So, it seems that my activity is slowing gently (my reputation is almost proportional to the square root of time). Yet, it is still increasing steadily; so what is my strategy for answering questions on SO?
Tags I’m involved inYou’ll have to wait for the answer to what is my strategy for answering questions on SO. For a hint, let’s analyze the tags I’m involved in.
If we don’t count my first month of activity:
stack_users(myID, "tags", num_pages = 40, fromdate = today()  months(11)) %>% select(name, count) %>% as_tibble() ## # A tibble: 155 x 2 ## name count ## ## 1 r 187 ## 2 performance 36 ## 3 rcpp 34 ## 4 parallelprocessing 33 ## 5 foreach 19 ## 6 rbigmemory 14 ## 7 vectorization 12 ## 8 forloop 11 ## 9 matrix 11 ## 10 doparallel 10 ## # ... with 145 more rowsI’m obviously answering only R questions. The tags I’m mostly answering questions from are “performance”, “rcpp”, “parallelprocessing”, “foreach”, “rbigmemory” and “vectorization”.
PerformanceAs you can see, all these tags are about performance of code. I really enjoy performance problems (get the same result but much faster).
I can spend hours on a question about performance and am sometimes rewarded with a solution that is 23 order of magnitude faster (see e.g. this other post).
I hope I could share my knowledge about performance through a tutorial in Toulouse next year.
Conclusion and answerSo, the question was “What is my strategy for answering questions on SO?”. And the answer is.. in the title: I am a subscriber.
I subscribe to tags on Stack Overflow. It has many benefits:

you don’t have to rush to answer because questions you receive by mail are 30minold (unanswered?) ones, so the probability that someone will answer at the same time as you is low.

you can focus and what you’re good at, what you’re interested in, or just what you want to learn. For example, I subscribed to the very new tag “rfuture” (for the R package {future}) because I’m interested in this package, even if I don’t know how to use it yet. I had the chance to meet with its author, Henrik Bengtsson, at eRum2018 and he actually already knew me through parallel questions on SO :D.
However, some tags (like “performance” or “foreach”) are relevant to many programming languages so that you would be flooded with irrelevant questions if subscribing directly to these tags. A simple solution to this problem is to subscribe to a feed of a combination of tags, like https://stackoverflow.com/feeds/tag?tagnames=r+and+foreach&sort=newest. I use this website to subscribe to feeds.
I will continue answering questions on SO, so see you there!
PS: I’m not sure you would get only unanswered questions with this technique.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Florian Privé. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...