R-bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 6 hours 47 min ago

The use of R in official statistics conference 2018

Tue, 05/22/2018 - 17:57

(This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

On September 12-14, the 6th international conference on the use of R in official statistics (#uRos2018) will take place at the Dutch national statistical office in Den Haag, the Netherlands. The conference is aimed at producers and users of official statistics from government, academia, and industry. It is modeled after the useR! conference and will consist of one day of tutorials (12 September 2018) followed by two conference days (13-14 September 2018). Topics include:

  • Examples of applying R in statistical production.
  • Examples of applying R in dissemination of statistics (visualisation, apps, reporting).
  • Analyses of big data and/or application of machine learning for official statistics.
  • Implementations of statistical methodology in the areas of sampling, editing, modelling and estimation, or disclosure control.
  • R packages connecting R to other standard tools/technical standards.
  • Organisational and technical aspects of introducing R to the statistical office.
  • Teaching R to users in the office.
  • Examples of accessing or using official statistics publications with R in other fields.

    Keynote speakers
    We are very happy to announce that we confirmed two fantastic keynote speakers.

  • Alina Matei is a professor of statistics at the University of Neuchâtel and maintainer of the important sampling package.
  • Jeroen Ooms is a postdoc at UC Berkeley, author of many infrastructural R packages and maintainer of R and Rtools for Windows.

    Call for abstracts

    The call for abstracts is open until 31 May. You can contribute to the conference by proposing a 20-minute talk, or a 3-hour tutorial. Also, authors have the opportunity to submit a paper for one of the two journals that will devote a special issue to the conference. Read all about it over here.

    Pointers

  • conference website
  • Follow uRos2018 on twitter


    To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo.

  • The myth that AI or Cognitive Analytics will replace data scientists: There is no easy button

    Tue, 05/22/2018 - 17:38

    (This article was first published on Scott Mutchler, and kindly contributed to R-bloggers)

    First, let me say I would love to have an “easy button” for predictive analytics. Unfortunately, predictive analytics is hard work that requires deep subject matter expertise in both the business problem and data science. I’m not trying to be provocative with this post. I just want to share my viewpoint, based on many years of experience.

    ————-

    There are a couple of myths that I see more and more these days. Like many myths they seem plausible on the surface, but experienced data scientists know that the reality is more nuanced (and sadly requires more work).

    Myths:

    • Deep learning (or Cognitive Analytics) is an easy button. You can throw massive amounts of data at it and the algorithm will deliver a near-optimal model.
    • Big data is always better than small data. More rows of data always result in a significantly better model than fewer rows.

    Both of these myths lead some (lately, it seems, many) people to conclude that data scientists will eventually become superfluous. With enough data and advanced algorithms, maybe we don’t need these expensive data scientists…

    In my experience, the two most important phases of a predictive/prescriptive analytics implementation are:

    1. Data preparation: getting the right universe of possible variables, transforming these variables into useful model inputs (feature creation) and finalizing a parsimonious set of variables (feature selection)
    2. Deployment: making the predictive (and/or optimization) model part of an operational process (making micro-level decisions)

    Note that I didn’t say anything about picking the “right” predictive model.  There are circumstances where the model type makes a big difference but in general data preparation and deployment are much, much more important.

    The Data Scientist Role in Data Preparation

    Can you imagine trying to implement a predictive insurance fraud detection solution without deeply understanding the insurance industry and fraud?  You might think that you could rely on the SIU (insurance team that investigates fraud) to give you all the information your model will need.  I can tell you from personal experience that you would be sadly mistaken.  The SIU relies on tribal knowledge, intuition and years of personal experience to guide them when detecting fraud.  Trying to extract the important variables to detect fraud from the SIU can take weeks or even months.  In the past, I’ve asked to go through the training that SIU team members get just to start this process. As a data scientist, it’s critical to understand every detail of both how fraud is committed and what bread crumbs are left for you to follow.  Some examples for personal injury fraud are:

    • Text mining of unstructured text (claim notes, police reports, etc.) can reveal the level of injury and damage to the car; when compared to the medical bills, discrepancies can be found
    • Developing a fraud neighborhood using graph theory can determine if a claim/provider is surrounded by fraud or not
    • Entity analytics can determine if a claimant was trying to hide their identity across multiple claims (and possibly part of a crime ring)

    All of these created features are key predictors for a fraud model. None of them are present in the raw data. Sadly, no spray-and-pray approach of throwing massive amounts of data at a deep learning (or ensemble machine learning) algorithm will ever uncover these patterns. (A toy sketch of the graph-based feature from the list above follows below.)
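
    As a toy illustration of the second bullet above, here is a minimal sketch of a graph-based “fraud neighborhood” feature. The data, the column names, and the simple “share of known-fraudulent neighbors” metric are all hypothetical; the post does not specify how this feature was actually built.

    library(igraph)

    # Toy claim-provider edge list (hypothetical data)
    edges <- data.frame(
      claim    = c("C1", "C2", "C3", "C4", "C5"),
      provider = c("P1", "P1", "P2", "P2", "P3")
    )
    known_fraud <- c("C2", "C4")   # claims already confirmed by the SIU

    # Bipartite claim-provider graph
    g <- graph_from_data_frame(edges, directed = FALSE)

    # Engineered feature: for each claim, the share of claims two hops away
    # (i.e. sharing a provider) that are already known to be fraudulent
    fraud_share <- sapply(edges$claim, function(cl) {
      nbrs       <- ego(g, order = 2, nodes = cl)[[1]]$name
      nbr_claims <- setdiff(intersect(nbrs, edges$claim), cl)
      if (length(nbr_claims) == 0) 0 else mean(nbr_claims %in% known_fraud)
    })
    fraud_share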

    Example: IoT Driven Predictive Maintenance

    IoT-connected devices like your car, manufacturing equipment and heavy machinery (to name a few) generate massive amounts of “low information density” data. The IoT sensor data is represented as massive time series that fall into the “Big Data” category. I’ve seen people try to analyze this data directly (the spray-and-pray method) with massive Spark clusters and advanced algorithms, with very weak results. Again, the reality is that you must have a deep understanding of the business problem. In many cases, you need to understand how parts are engineered, how they are used and how they fail. With this understanding, you can start to perform feature engineering on the low information density data and transform it into high information metrics (or anomalies) that can be fed into a predictive model. For example, vibration time series data can be analyzed with an FFT to determine the intensity of vibration in specific frequency ranges. Which frequency ranges should be examined is, again, driven by subject matter expertise. Only after this feature engineering process can you generate a metric that will predict failures from vibration data. (A toy sketch of this kind of FFT-based feature follows below.)
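
    As a toy illustration of that last step, here is a minimal sketch of turning a raw vibration time series into a band-intensity feature with an FFT. The simulated signal, the sampling rate and the 100-140 Hz band are made up for the example; knowing which band actually matters is exactly the subject matter expertise the post argues for.

    # Simulated vibration signal: 10 seconds sampled at 1 kHz, with a weak
    # 120 Hz component that (hypothetically) indicates bearing wear
    fs <- 1000
    t  <- seq(0, 10, by = 1 / fs)
    x  <- 0.2 * sin(2 * pi * 120 * t) + rnorm(length(t), sd = 1)

    # One-sided amplitude spectrum via the FFT
    spec  <- Mod(fft(x)) / length(x)
    freqs <- (seq_along(x) - 1) * fs / length(x)

    # Engineered feature: energy concentrated in the 100-140 Hz band
    band <- freqs >= 100 & freqs <= 140
    band_intensity <- sum(spec[band]^2)
    band_intensity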

    Hopefully, I’ve made the case for how data scientists are crucial to the predictive solution implementation process. AI and cognitive analytics are very important tools in the data scientist’s tool chest, but the data scientist is still needed to bridge the gap with subject matter understanding.

    If you are interested in feature creation check out this book (as a start):

    Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists


    To leave a comment for the author, please follow the link and comment on their blog: Scott Mutchler.

    How Has Taylor Swift’s Word Choice Changed Over Time?

    Tue, 05/22/2018 - 15:30

    (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


    Sunday night was a big night for Taylor Swift – not only was she nominated for multiple Billboard Music Awards, she took home Top Female Artist and Top Selling Album. So I thought it was a good time for some more Taylor Swift-themed statistical analysis.

    When I started this blog back in 2011, my goal was to write deep thoughts on trivial topics – specifically, to overthink and overanalyze pop culture and related topics that appear fluffy until you really dig into them. Recently, I’ve been blogging more about statistics, research, R, and data science, and I’ve loved getting to teach and share.

    But sometimes, you just want to overthink and overanalyze pop culture.

    So in a similar vein to the text analysis I’ve been demonstrating on my blog, I decided to answer a question I’m sure we all have – as Taylor Swift moved from country sweetheart to mega pop star, how have the words she uses in her songs changed?

    I’ve used the geniusR package in a couple of posts, and I’ll be using it again today to answer this question. I’ll be pulling in some additional code: some based on code from the Text Mining with R: A Tidy Approach book I recently devoured, and some written to tackle the problem I’ve created for myself to solve. I’ve shared all my code and tried to credit those who helped me write it where I can.

    First, we want to pull in the names of Taylor Swift’s 6 studio albums. I found these and their release dates on Wikipedia. While there are only 6 and I could easily copy and paste them to create my data frame, I wanted to pull that data directly from Wikipedia, to write code that could be used on a larger set in the future. Thanks to this post, I could, with a couple small tweaks.

    library(rvest)
    ## Loading required package: xml2
    TSdisc <- 'https://en.wikipedia.org/wiki/Taylor_Swift_discography'

    disc <- TSdisc %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[2]') %>%
    html_table(fill = TRUE)

    Since html() is deprecated, I replaced it with read_html(), and I got errors if I didn’t add fill = TRUE. The result is a list of 1, with an 8 by 14 data frame within that single list object. I can pull that out as a separate data frame.

    TS_albums <- disc[[1]]

    The data frame requires a little cleaning. First up, there are 8 rows, but only 6 albums. Because the Wikipedia table had a double header, the second header was read in as a row of data, so I want to delete that, because I only care about the first two columns anyway. The last row contains a footnote that was included with the table. So I removed those two rows, first and last, and dropped the columns I don’t need. Second, the information I want with release date was in a table cell along with record label and formats (e.g., CD, vinyl). I don’t need those for my purposes, so I’ll only pull out the information I want and drop the rest. Finally, I converted year from character to numeric – this becomes important later on.

    library(tidyverse)
    TS_albums<-TS_albums[2:7,1:2]

    TS_albums <- TS_albums %>%
    separate(`Album details`, c("Released","Month","Day","Year"),
    extra='drop') %>%
    select(c("Title","Year"))

    TS_albums$Year<-as.numeric(TS_albums$Year)

    I asked geniusR to download lyrics for all 6 albums. (Note: this code may take a couple minutes to run.) It nests all of the individual album data, including lyrics, into a single column, so I just need to unnest that to create a long file, with album title and release year applied to each unnested line.

    library(geniusR)

    TS_lyrics <- TS_albums %>%
    mutate(tracks = map2("Taylor Swift", Title, genius_album))
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    ## Joining, by = c("track_title", "track_n", "track_url")
    TS_lyrics <- TS_lyrics %>%
    unnest(tracks)

    Now we’ll tokenize our lyrics data frame, and start doing our word analysis.

    library(tidytext)

    tidy_TS <- TS_lyrics %>%
    unnest_tokens(word, lyric) %>%
    anti_join(stop_words)
    ## Joining, by = "word"
    tidy_TS %>%
    count(word, sort = TRUE)
    ## # A tibble: 2,024 x 2
    ## word n
    ## <chr> <int>
    ## 1 time 198
    ## 2 love 180
    ## 3 baby 118
    ## 4 ooh 104
    ## 5 stay 89
    ## 6 night 85
    ## 7 wanna 84
    ## 8 yeah 83
    ## 9 shake 80
    ## 10 ey 72
    ## # ... with 2,014 more rows

    There are a little over 2,000 unique words across TS’s 6 albums. But how have they changed over time? To examine this, I’ll create a dataset that counts word use by year (or album, really). Then I’ll use a binomial regression model to look at changes over time, one model per word. In their book, Julia Silge and David Robinson demonstrated how to use binomial regression to examine word use on the authors’ Twitter accounts over time, including an adjustment to the p-values to correct for multiple comparisons. So I based my code off that.

    words_by_year <- tidy_TS %>%
    count(Year, word) %>%
    group_by(Year) %>%
    mutate(time_total = sum(n)) %>%
    group_by(word) %>%
    mutate(word_total = sum(n)) %>%
    ungroup() %>%
    rename(count = n) %>%
    filter(word_total > 50)

    nested_words <- words_by_year %>%
    nest(-word)

    word_models <- nested_words %>%
    mutate(models = map(data, ~glm(cbind(count, time_total) ~ Year, .,
    family = "binomial")))

    This nests our regression results in a data frame called word_models. While I could unnest and keep all, I don’t care about every value the GLM gives me. What I care about is the slope for Year, so the filter selects only that slope and the associated p-value. I can then filter to select the significant/marginally significant slopes for plotting (p < 0.1).

    library(broom)

    slopes <- word_models %>%
    unnest(map(models, tidy)) %>%
    filter(term == "Year") %>%
    mutate(adjusted.p.value = p.adjust(p.value))

    top_slopes <- slopes%>%
    filter(adjusted.p.value < 0.1) %>%
    select(-statistic, -p.value)

    This gives me five words that show changes in usage over time: bad, call, dancing, eyes, and yeah. We can plot those five words to see how they’ve changed in usage over her 6 albums. And because I still have my TS_albums data frame, I can use that information to label the axis of my plot (which is why I needed year to be numeric). I also added a vertical line and annotations to note where TS believes she shifted from country to pop.

    library(scales)
    words_by_year %>%
    inner_join(top_slopes, by = "word") %>%
    ggplot(aes(Year, count/time_total, color = word, lty = word)) +
    geom_line(size = 1.3) +
    labs(x = NULL, y = "Word Frequency") +
    scale_x_continuous(breaks=TS_albums$Year,
    labels=TS_albums$Title) +
    scale_y_continuous(labels=scales::percent) +
    geom_vline(xintercept = 2014) +
    theme(panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_blank()) +
    annotate("text", x = c(2009.5,2015.5), y = c(0.025,0.025),
    label = c("Country", "Pop") , size=5)

    The biggest change appears to be in the word “call,” which she didn’t use at all in her self-titled album, and used at low rates until “1989” and, especially, “Reputation.” I can ask for a few examples of “call” in her song lyrics, with grep.

    library(expss)
    callsubset <- TS_lyrics[grep("call", TS_lyrics$lyric),]
    callsubset <- callsubset %>%
    select(Title, Year, track_title, lyric)
    set.seed(2012)
    callsubset<-callsubset[sample(nrow(callsubset), 3), ]
    callsubset<-callsubset[order(callsubset$Year),]
    as.etable(callsubset, rownames_as_row_labels = FALSE)
    Title       Year  track_title                  lyric
    Speak Now   2010  Back to December (Acoustic)  When your birthday passed, and I didn’t call
    Red         2012  All Too Well                 And you call me up again just to break me like a promise
    Reputation  2017  Call It What You Want        Call it what you want, call it what you want, call it

    On the other hand, she doesn’t sing about “eyes” as much now that she’s moved from country to pop.

    eyessubset <- TS_lyrics[grep("eyes", TS_lyrics$lyric),]
    eyessubset <- eyessubset %>%
    select(Title, Year, track_title, lyric)
    set.seed(415)
    eyessubset<-eyessubset[sample(nrow(eyessubset), 3), ]
    eyessubset<-eyessubset[order(eyessubset$Year),]
    as.etable(eyessubset, rownames_as_row_labels = FALSE)
    Title         Year  track_title             lyric
    Taylor Swift  2006  A Perfectly Good Heart  And realized by the distance in your eyes that I would be the one to fall
    Speak Now     2010  Better Than Revenge     I’m just another thing for you to roll your eyes at, honey
    Red           2012  State of Grace          Just twin fire signs, four blue eyes

    Bet you’ll never listen to Taylor Swift the same way again.

    A few notes: I opted to examine any slopes with p < 0.10, which is greater than conventional levels of significance; if you look at the adjusted p-value column, though, you'll see that 4 of the 5 are < 0.05 and one is only slightly greater than 0.05. But I made the somewhat arbitrary choice to include only words used more than 50 times across her 6 albums, so I could get different results by changing that filtering value when I create the words_by_year data frame. Feel free to play around and see what you get by using different values!


    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

    Why R? 2018 Conf – CfP ends May 25th

    Tue, 05/22/2018 - 10:00

    (This article was first published on http://r-addict.com, and kindly contributed to R-bloggers)

    We are pleased to announce the upcoming Why R? 2018 conference, which will take place in central-eastern Europe (Wroclaw, Poland) this July (2-5th). This is the last week of the call for papers! Submit your talk here.

    About

    More about the conference can be found on the conference website whyr2018.pl and in the previous blog post we’ve prepared: Why R? 2018 Conference – Registration and Call for Papers Opened.

    Pre-meetings

    We are organizing pre-meetings in many European cities to cultivate the R experience of knowledge sharing. You are more than welcome to visit upcoming events and check photos and presentations from previous ones. If you are interested in co-organizing a Why R? pre-meeting in your city, let us know (under kontakt_at_whyr.pl) and the Why R? Foundation can provide speakers for the venue!

    Past event

    The Why R? 2017 edition, organized in Warsaw, gathered 200 participants. The Facebook reach of the conference page exceeds 15 000 users, with almost 800 subscribers. Our official web page had over 8000 unique visitors and over 12 000 visits overall. To learn more about Why R? 2017, see the conference aftermovie (https://vimeo.com/239259242).

    Why R? 2017 Conference from Kinema Indigo on Vimeo.


    To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com.

    Edward Tufte’s Slopegraphs and political fortunes in Ontario

    Mon, 05/21/2018 - 16:21

    (This article was first published on eKonometrics, and kindly contributed to R-bloggers)

    With fewer than three weeks left before the June 7 provincial elections in Ontario, Canada’s most populous province with 14.2 million persons, the expected outcome is far from certain. The weekly opinion polls reflect the volatility in public opinion. The Progressive Conservatives (PC), one of the main opposition parties, are in the lead with the support of roughly 40 percent of the electorate. The incumbent Ontario Liberals are trailing, with their support hovering in the lower 20s. The real story in these elections is the unexpected rise in the fortunes of the New Democratic Party (NDP), which has seen a sustained increase in its popularity from less than 20 percent a few weeks ago to the mid 30s.

    As a data scientist/journalist, I have been concerned with how best to represent this information. A scatter plot of sorts would do. However, I would like to demonstrate the change in political fortunes over time with the x-axis representing time. Hence, a time series chart would be more appropriate. Ideally, I would like to plot what Edward Tufte called a Slopegraph. Tufte, in his 1983 book The Visual Display of Quantitative Information, explained that “Slopegraphs compare changes usually over time for a list of nouns located on an ordinal or interval scale”.

    But here’s the problem. No software offers a readymade solution to draw a Slopegraph. Luckily, I found a way, in fact two ways, around the challenge with help from colleagues at Stata and R (plotrix). So what follows in this blog is the story of the elections in Ontario described with data visualized as Slopegraphs. I tell the story first with Stata and then with the plotrix package in R.

    My interest in Slopegraphs grew when I wanted to demonstrate the steep increase in highly leveraged mortgage loans in Canada from 2014 to 2016. I generated the chart in Excel and sent it to Stata requesting help to recreate it. Stata assigned my request to Derek Wagner, whose excellent programming skills resulted in the following chart. Derek based the chart on the linkplot command written by the uber Stata guru, Professor Nicholas J. Cox. However, a straightforward application of linkplot still required a lot of tweaks that Derek very ably managed. For comparison, see the initial version of the chart generated by linkplot below.

    We made the following modifications to the base linkplot:

    1. Narrow the plot by reducing the space between the two time periods.
    2. Label the entities and their respective values at the primary and secondary y-axes.
    3. Add a title and footnotes (if necessary).
    4. Label time periods with custom names.
    5. Colour lines and symbols to match preferences.

    Once we apply these tweaks, a Slopegraph with the latest poll data for Ontario’s election is drawn as follows. Notice that in fewer than two weeks, the NDP has jumped from 29 percent to 34 percent, almost tying with the leading PC party, whose support has remained steady at 35 percent. The incumbent Ontario Liberals appear to be in free fall, from 29 percent to 24 percent.

    I must admit that I have sort of cheated in the above chart. Note that both the Liberals and the NDP secured 29 percent of the support in the poll conducted on May 06. In the original chart drawn with Stata’s code, their labels overlapped, resulting in unintelligible text. I fixed this manually by manipulating the image in PowerPoint.

    I wanted to replicate the above chart in R. I tried a few packages, but nothing really worked until I landed on the plotrix package, which carries the bumpchart command.
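
    Here is a minimal sketch of what the basic plotrix call looks like, before any of the labelling tweaks discussed below. The two-column matrix of poll numbers is only an approximation of the figures quoted in this post, the column labels are placeholders, and the custom bumpchart2 function that Jim Lemon later wrote is not shown.

    library(plotrix)

    # Approximate support (percent) in two polls, as quoted in the text
    polls <- matrix(c(35, 35,    # PC
                      29, 34,    # NDP
                      29, 24),   # Liberal
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("PC", "NDP", "Liberal"),
                                    c("May 06", "Latest poll")))

    # rank = FALSE plots the raw percentages rather than ranks, which is what
    # turns a bumpchart into a slopegraph-like display
    bumpchart(polls, rank = FALSE,
              top.labels = colnames(polls), labels = rownames(polls),
              lwd = 2)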
    In fact, Edward Tufte in Beautiful Evidence (2006) mentions that bumpcharts may be considered as slopegraphs. A straightforward application of bumpchart from the plotrix package labelled the party names but not the respective percentages of support each party commanded. Dr. Jim Lemon authored bumpchart, so I turned to him for help. Jim was kind enough to write a custom function, bumpchart2, that I used to create a Slopegraph like the one I generated with Stata. For comparison, see the chart below. As with the Slopegraph generated with Stata, I manually adjusted the labels to prevent the NDP and Liberal labels from overlapping.

    Data scientists must dig even deeper

    The job of a data scientist, unlike that of a computer scientist or a statistician, is not done once the models are estimated and the figures drawn. A data scientist must tell a story, with all the caveats that might apply. So, here’s the story about what can go wrong with polls.

    The most important lesson about forecasting from Brexit and the last US Presidential election is that one cannot rely on polls to determine future electoral outcomes. Most polls in the UK predicted a NO vote for Brexit. In the US, most polls forecasted Hillary Clinton to be the winner. Both forecasts went horribly wrong.

    When it comes to polls, one must determine who sponsored the poll, what methods were used, and how representative the sample is of the underlying population. Asking the wrong question to the right people or posing the right question to the wrong people (a non-representative sample) can deliver problematic results.

    Polling is as much science as it is art. The late Warren Mitofsky, who pioneered exit polls and innovated political survey research, remains a legend in political polling. His painstakingly cautious approach to polling is why he remains a respected name in market research. Today, the advances in communication and information technologies have made survey research easier to conduct but more difficult to make precise. No longer can one rely on random digit dialling, a Mitofsky innovation, to reach a representative sample. Younger cohorts sparingly subscribe to landline telephones. The attempts to catch them online pose the risk of fishing for opinions in echo chambers. Add political polarization to the technological challenges, and one realizes the true scope of the difficulties inherent in taking the political pulse of an electorate, where a motivated pollster may be after not the truth, but a convenient version of it.

    Polls also differ by survey instrument, methodology, and sample size. The Abacus Data poll presented above is essentially an online poll of 2,326 respondents. In comparison, a poll by Mainstreet Research used an Interactive Voice Response (IVR) system with a sample size of 2,350 respondents. IVR uses automated computerized responses over the telephone to record responses. Abacus Data and Mainstreet Research use quite different methods with similar sample sizes. Professor Dan Cassino of Fairleigh Dickinson University explained the challenges with polling techniques in a 2016 article in the Harvard Business Review. He favours live telephone interviewers who “are highly experienced and college educated and paying them is the main cost of political surveys.”

    Professor Cassino believes that techniques like IVR make “polling faster and cheaper,” but these systems are hardly foolproof, with lower response rates, and they cannot legally reach cellphones. “IVR may work for populations of older, whiter voters with landlines, such as in some Republican primary races, but they’re not generally useful,” explained Professor Cassino. Similarly, online polls are limited in the sense that in the US alone 16 percent of Americans don’t use the Internet.

    With these caveats in mind, a plot of the Mainstreet Research data reveals quite a different picture, in which the NDP doesn’t seem to pose an immediate and direct challenge to the PC party.

    So, here’s the summary. The Slopegraph is a useful tool to summarize change over time between distinct entities. Ontario is likely to have a new government on June 7. It is, though, far from certain whether the PC Party or the NDP will assume office. Nevertheless, Slopegraphs generate visuals that expose the uncertainty in the forthcoming elections.

    Note: To generate the charts in this blog, you can download data and code for Stata and Plotrix (R) by clicking HERE.


    To leave a comment for the author, please follow the link and comment on their blog: eKonometrics.

    ML models: What they can’t learn?

    Sun, 05/20/2018 - 23:03

    (This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

    What I love about conferences are the people who come up after your talk and say: it would be cool to add XYZ to your package/method/theorem.

    After eRum (a great conference, by the way) I was lucky to hear from Tal Galili: it would be cool to use DALEX for teaching, to show how different ML models learn relations.

    Cool idea. So let’s see what can and what cannot be learned by the most popular ML models. Here we will compare random forests against linear models and SVMs.
    Find the full example here. We simulate variables from the uniform U[0,1] distribution and calculate y from the following equation:

    In all figures below we compare the PDP (partial dependence plot) model responses against the true relation between the variable x and the target variable y (pink color). All of these plots are created with the DALEX package; a minimal sketch of the workflow is given below.
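
    The sketch below assumes a single uniform predictor with a quadratic effect; the simulation formula is made up (the equation from the post is not reproduced here), and it uses the current DALEX interface (explain() plus model_profile()), whereas the original post was written against an earlier version of the package.

    library(DALEX)
    library(randomForest)
    library(e1071)

    # Hypothetical simulation: one U[0,1] predictor, quadratic relation plus noise
    set.seed(1)
    n  <- 1000
    df <- data.frame(x1 = runif(n))
    df$y <- (df$x1 - 0.5)^2 + rnorm(n, sd = 0.05)

    # Three models fitted to the same data
    m_lm  <- lm(y ~ x1, data = df)
    m_rf  <- randomForest(y ~ x1, data = df)
    m_svm <- svm(y ~ x1, data = df)

    # Explainers and partial dependence profiles for x1
    ex <- lapply(list(lm = m_lm, rf = m_rf, svm = m_svm), function(m)
      explain(m, data = df["x1"], y = df$y, verbose = FALSE))
    profiles <- lapply(ex, model_profile, variables = "x1")

    # Overlay the three PDP curves to compare what each model has learned
    plot(profiles$lm, profiles$rf, profiles$svm)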

    For x1 we can check how different models deal with a quadratic relation. The linear model fails without prior feature engineering, the random forest guesses the shape, but the best fit is found by the SVM.

    With sine-like oscillations the story is different. The SVM is not that flexible, while the random forest gets much closer.

    It turns out that monotonic relations are not easy for these models. The random forest is close, but even here we cannot guarantee monotonicity.

    The linear model is the best one when it comes to a truly linear relation, but the other models are not that far behind.

    The abs(x) relation is not an easy case for any of these models.

    Find the R codes here.

    Of course, the behavior of all these models depends on the number of observations, the noise-to-signal ratio, the correlation among variables and the interactions.
    Yet it may be educational to use PDP curves to see how different models learn relations – what they can grasp easily and what they cannot.


    To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

    Rcpp 0.12.17: More small updates

    Sun, 05/20/2018 - 16:26

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    Another bi-monthly update and the seventeenth release in the 0.12.* series of Rcpp landed on CRAN late on Friday, following nine (!!) days in gestation in the incoming/ directory of CRAN. And no complaints: we just wish CRAN were a little more forthcoming about what is happening when, and/or would let us help by supplying additional test information. I do run a fairly insane amount of backtests prior to releases; to then have to wait another week or more is … not ideal. But again, we all owe CRAN an immense amount of gratitude for all they do, and do so well.

    So once more, this release follows the 0.12.0 release from July 2015, the 0.12.1 release in September 2015, the 0.12.2 release in November 2015, the 0.12.3 release in January 2016, the 0.12.4 release in March 2016, the 0.12.5 release in May 2016, the 0.12.6 release in July 2016, the 0.12.7 release in September 2016, the 0.12.8 release in November 2016, the 0.12.9 release in January 2017, the 0.12.10 release in March 2017, the 0.12.11 release in May 2017, the 0.12.12 release in July 2017, the 0.12.13 release in late September 2017, the 0.12.14 release in November 2017, the 0.12.15 release in January 2018 and the 0.12.16 release in March 2018, making it the twenty-first release at the steady and predictable bi-monthly release frequency.

    Rcpp has become the most popular way of enhancing GNU R with C or C++ code. As of today, 1362 packages on CRAN depend on Rcpp for making analytical code go faster and further, along with another 138 in the current BioConductor release 3.7.

    Compared to other releases, this release again contains a relatively small change set, but between them Kevin and Romain cleaned a few things up. Full details are below.

    Changes in Rcpp version 0.12.17 (2018-05-09)
    • Changes in Rcpp API:

      • The random number Generator class no longer inherits from RNGScope (Kevin in #837 fixing #836).

      • A spurious parenthesis was removed to please gcc8 (Dirk fixing #841)

      • The optional Timer class header now undefines FALSE which was seen to have side-effects on some platforms (Romain in #847 fixing #846).

      • Optional StoragePolicy attributes now also work for string vectors (Romain in #850 fixing #849).

    Thanks to CRANberries, you can also look at a diff to the previous release. As always, details are on the Rcpp Changelog page and the Rcpp page which also leads to the downloads page, the browseable doxygen docs and zip files of doxygen output for the standard formats. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

    Statistics Sunday: Welcome to Sentiment Analysis with “Hotel California”

    Sun, 05/20/2018 - 15:30

    (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


    Welcome to the Hotel California

    As promised in last week’s post, this week: sentiment analysis, also with song lyrics.

    Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether a word is positive or negative, or whether it is related to a set of basic human emotions; the exact results differ based on the sentiment analysis method selected. The tidytext R package has 4 different sentiment analysis methods:

    • “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
    • “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
    • “loughran” for Loughran-McDonald – mostly for financial and nonfiction works, which classifies as positive or negative, as well as topics of uncertainty, litigious, modal, and constraining
    • “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment

    Sentiment analysis works on unigrams – single words – but you can aggregate across multiple words to look at sentiment across a text.

    To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.

    I know, I know.

    Using similar code as last week, let’s pull in the lyrics of the song.

    library(geniusR)
    library(tidyverse)
    hotel_calif <- genius_lyrics(artist = "Eagles", song = "Hotel California") %>%
    mutate(line = row_number())

    First, we’ll chop up these 43 lines into individual words, using the tidytext package and unnest_tokens function.

    library(tidytext)
    tidy_hc <- hotel_calif %>%
    unnest_tokens(word,lyric)

    This is also probably the point where I would remove stop words with anti_join. But these common words are very unlikely to have a sentiment attached to them, so I’ll leave them in, knowing they’ll be filtered out anyway by this analysis. We have 4 lexicons to choose from. Loughran is geared more toward financial and nonfiction texts, but we’ll still see how well it can classify the words. First, let’s create a data frame of our 4 sentiment lexicons.

    new_sentiments <- sentiments %>%
    mutate( sentiment = ifelse(lexicon == "AFINN" & score >= 0, "positive",
    ifelse(lexicon == "AFINN" & score < 0,
    "negative", sentiment))) %>%
    group_by(lexicon) %>%
    mutate(words_in_lexicon = n_distinct(word)) %>%
    ungroup()

    Now, we’ll see how well the 4 lexicons match up with the words in the lyrics. Big thanks to Debbie Liske at Data Camp for this piece of code (and several other pieces used in this post):

    my_kable_styling <- function(dat, caption) {
    kable(dat, "html", escape = FALSE, caption = caption) %>%
    kable_styling(bootstrap_options = c("striped", "condensed", "bordered"),
    full_width = FALSE)
    }


    library(kableExtra)
    library(formattable)
    library(yarrr)
    tidy_hc %>%
    mutate(words_in_lyrics = n_distinct(word)) %>%
    inner_join(new_sentiments) %>%
    group_by(lexicon, words_in_lyrics, words_in_lexicon) %>%
    summarise(lex_match_words = n_distinct(word)) %>%
    ungroup() %>%
    mutate(total_match_words = sum(lex_match_words),
    match_ratio = lex_match_words/words_in_lyrics) %>%
    select(lexicon, lex_match_words, words_in_lyrics, match_ratio) %>%
    mutate(lex_match_words = color_bar("lightblue")(lex_match_words),
    lexicon = color_tile("lightgreen","lightgreen")(lexicon)) %>%
    my_kable_styling(caption = "Lyrics Found In Lexicons")
    ## Joining, by = "word"
    Lyrics Found In Lexicons

    lexicon   lex_match_words  words_in_lyrics  match_ratio
    AFINN     18               175              0.1028571
    bing      18               175              0.1028571
    loughran  1                175              0.0057143
    nrc       23               175              0.1314286

    NRC offers the best match, classifying about 13% of the words in the lyrics. (It’s not unusual to have such a low percentage. Not all words have a sentiment.)

    hcsentiment <- tidy_hc %>%
    inner_join(get_sentiments("nrc"), by = "word")

    hcsentiment
    ## # A tibble: 103 x 4
    ## track_title line word sentiment
    ## <chr> <int> <chr> <chr>
    ## 1 Hotel California 1 dark sadness
    ## 2 Hotel California 1 desert anger
    ## 3 Hotel California 1 desert disgust
    ## 4 Hotel California 1 desert fear
    ## 5 Hotel California 1 desert negative
    ## 6 Hotel California 1 desert sadness
    ## 7 Hotel California 1 cool positive
    ## 8 Hotel California 2 smell anger
    ## 9 Hotel California 2 smell disgust
    ## 10 Hotel California 2 smell negative
    ## # ... with 93 more rows

    Let’s visualize the counts of different emotions and sentiments in the NRC lexicon.

    theme_lyrics <- function(aticks = element_blank(),
    pgminor = element_blank(),
    lt = element_blank(),
    lp = "none")
    {
    theme(plot.title = element_text(hjust = 0.5), #Center the title
    axis.ticks = aticks, #Set axis ticks to on or off
    panel.grid.minor = pgminor, #Turn the minor grid lines on or off
    legend.title = lt, #Turn the legend title on or off
    legend.position = lp) #Turn the legend on or off
    }

    hcsentiment %>%
    group_by(sentiment) %>%
    summarise(word_count = n()) %>%
    ungroup() %>%
    mutate(sentiment = reorder(sentiment, word_count)) %>%
    ggplot(aes(sentiment, word_count, fill = -word_count)) +
    geom_col() +
    guides(fill = FALSE) +
    theme_minimal() + theme_lyrics() +
    labs(x = NULL, y = "Word Count") +
    ggtitle("Hotel California NRC Sentiment Totals") +
    coord_flip()

    Most of the words appear to be positively-valenced. How do the individual words match up?

    library(ggrepel)

    plot_words <- hcsentiment %>%
    group_by(sentiment) %>%
    count(word, sort = TRUE) %>%
    arrange(desc(n)) %>%
    ungroup()

    plot_words %>%
    ggplot(aes(word, 1, label = word, fill = sentiment)) +
    geom_point(color = "white") +
    geom_label_repel(force = 1, nudge_y = 0.5,
    direction = "y",
    box.padding = 0.04,
    segment.color = "white",
    size = 3) +
    facet_grid(~sentiment) +
    theme_lyrics() +
    theme(axis.text.y = element_blank(), axis.line.x = element_blank(),
    axis.title.x = element_blank(), axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    panel.grid = element_blank(), panel.background = element_blank(),
    panel.border = element_rect("lightgray", fill = NA),
    strip.text.x = element_text(size = 9)) +
    xlab(NULL) + ylab(NULL) +
    ggtitle("Hotel California Words by NRC Sentiment") +
    coord_flip()

    It looks like some words are being misclassified. For instance, “smell” as in “warm smell of colitas” is being classified as anger, disgust, and negative. But that doesn’t explain the overall positive bent being applied to the song. If you listen to the song, you know it’s not really a happy song. It starts off somewhat negative – or at least, ambiguous – as the narrator is driving on a dark desert highway. He’s tired and having trouble seeing, and notices the Hotel California, a shimmering oasis on the horizon. He stops in and is greeted by a “lovely face” in a “lovely place.” At the hotel, everyone seems happy: they dance and drink, they have fancy cars, they have pretty “friends.”

    But the song is in a minor key. Though not always a sign that a song is sad, it is, at the very least, a hint of something ominous, lurking below the surface. Soon, things turn bad for the narrator. The lovely-faced woman tells him they are “just prisoners here of our own device.” He tries to run away, but the night man tells him, “You can check out anytime you like, but you can never leave.”

    The song seems to be a metaphor for something, perhaps fame and excess, which was also the subject of another song on the same album, “Life in the Fast Lane.” To someone seeking fame, life is dreary, dark, and deserted. Fame is like an oasis – beautiful and shimmering, an escape. But it isn’t all it appears to be. You may be surrounded by beautiful people, but you can only call them “friends.” You trust no one. And once you join that lifestyle, you might be able to check out, perhaps through farewell tour(s), but you can never leave that life – people know who you are (or were) and there’s no disappearing. And it could be about something even darker that is hard to escape from, like substance abuse. Whatever meaning you ascribe to the song, the overall message seems to be that things are not as wonderful as they appear on the surface.

    So if we follow our own understanding of the song’s trajectory, we’d say it starts off somewhat negatively, becomes positive in the middle, then dips back into the negative at the end, when the narrator tries to escape and finds he cannot.

    We can chart this, using the line number, which coincides with the location of the word in the song. We’ll stick with NRC since it offered the best match, but for simplicity, we’ll only pay attention to the positive and negative sentiment codes.

    hcsentiment_index <- tidy_hc %>%
    inner_join(get_sentiments("nrc")%>%
    filter(sentiment %in% c("positive",
    "negative"))) %>%
    count(index = line, sentiment) %>%
    spread(sentiment, n, fill = 0) %>%
    mutate(sentiment = positive - negative)
    ## Joining, by = "word"

    This gives us a data frame that aggregates sentiment by line. If a line contains more positive than negative words, its overall sentiment is positive, and vice versa. Because not every word in the lyrics has a sentiment, not every line has an associated aggregate sentiment. But it gives us a sort of trajectory over the course of the song. We can visualize this trajectory like this:

    hcsentiment_index %>%
    ggplot(aes(index, sentiment, fill = sentiment > 0)) +
    geom_col(show.legend = FALSE)

    As the chart shows, the song starts somewhat positive, with a dip soon after into the negative. The middle of the song is positive, as the narrator describes the decadence of the Hotel California. But it turns dark at the end, and stays that way as the guitar solo soars in.

    Sources

    This awesome post by Debbie Liske, mentioned earlier, for her code and custom functions to make my charts pretty.

    Text Mining with R: A Tidy Approach by Julia Silge and David Robinson


    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

    Beautiful and Powerful Correlation Tables in R

    Sun, 05/20/2018 - 02:00

    (This article was first published on Dominique Makowski, and kindly contributed to R-bloggers)

    Another correlation function?!

    Yes, the correlation function from the psycho package.

    devtools::install_github("neuropsychology/psycho.R")  # Install the newest version

    library(psycho)
    library(tidyverse)

    cor <- psycho::affective %>%
      correlation()

    This function automatically selects numeric variables and runs a correlation analysis. It returns a psychobject.

    A table

    We can then extract a formatted table that can be saved and pasted into reports and manuscripts by using the summary function.

    summary(cor)
    # write.csv(summary(cor), "myformattedcortable.csv")

                       Age    Life_Satisfaction  Concealing  Adjusting
    Age
    Life_Satisfaction  0.03
    Concealing         -0.05  -0.06
    Adjusting          0.03   0.36***            0.22***
    Tolerating         0.03   0.15***            0.07        0.29***

    A Plot

    It integrates a plot done with ggcorplot.

    plot(cor)

    A print

    It also includes a pairwise correlation printing method.

    print(cor)

    Pearson Full correlation (p value correction: holm):
    - Age / Life_Satisfaction: Results of the Pearson correlation showed a non significant and weak negative association between Age and Life_Satisfaction (r(1249) = 0.030, p > .1).
    - Age / Concealing: Results of the Pearson correlation showed a non significant and weak positive association between Age and Concealing (r(1249) = -0.050, p > .1).
    - Life_Satisfaction / Concealing: Results of the Pearson correlation showed a non significant and weak positive association between Life_Satisfaction and Concealing (r(1249) = -0.063, p > .1).
    - Age / Adjusting: Results of the Pearson correlation showed a non significant and weak negative association between Age and Adjusting (r(1249) = 0.027, p > .1).
    - Life_Satisfaction / Adjusting: Results of the Pearson correlation showed a significant and moderate negative association between Life_Satisfaction and Adjusting (r(1249) = 0.36, p < .001***).
    - Concealing / Adjusting: Results of the Pearson correlation showed a significant and weak negative association between Concealing and Adjusting (r(1249) = 0.22, p < .001***).
    - Age / Tolerating: Results of the Pearson correlation showed a non significant and weak negative association between Age and Tolerating (r(1249) = 0.031, p > .1).
    - Life_Satisfaction / Tolerating: Results of the Pearson correlation showed a significant and weak negative association between Life_Satisfaction and Tolerating (r(1249) = 0.15, p < .001***).
    - Concealing / Tolerating: Results of the Pearson correlation showed a non significant and weak negative association between Concealing and Tolerating (r(1249) = 0.074, p = 0.05°).
    - Adjusting / Tolerating: Results of the Pearson correlation showed a significant and weak negative association between Adjusting and Tolerating (r(1249) = 0.29, p < .001***).

    Options

    You can also cutomize the type (pearson, spearman or kendall), the p value correction method (holm (default), bonferroni, fdr, none…) and run partial, semi-partial or glasso correlations.

    psycho::affective %>%
      correlation(method = "pearson", adjust = "bonferroni", type = "partial") %>%
      summary()

                       Age    Life_Satisfaction  Concealing  Adjusting
    Age
    Life_Satisfaction  0.01
    Concealing         -0.06  -0.16***
    Adjusting          0.02   0.36***            0.25***
    Tolerating         0.02   0.06               0.02        0.24***

    Fun with p-hacking

    In order to prevent people from running many uncorrected correlation tests (promoting p-hacking and result-fishing), we included the i_am_cheating parameter. If FALSE (the default), the function will help you find interesting results!

    df_with_11_vars <- data.frame(replicate(11, rnorm(1000)))
    cor <- correlation(df_with_11_vars, adjust = "none")
    ## Warning in correlation(df_with_11_vars, adjust = "none"): We've detected that you are running
    ## a lot (> 10) of correlation tests without adjusting the p values. To help you in your
    ## p-fishing, we've added some interesting variables: You never know, you might find something
    ## significant! To deactivate this, change the 'i_am_cheating' argument to TRUE.
    summary(cor)

                               X1       X2       X3       X4       X5       X6       X7       X8       X9       X10      X11
    X1
    X2                        -0.04
    X3                        -0.04    -0.02
    X4                         0.02     0.05    -0.02
    X5                        -0.01    -0.02     0.05    -0.03
    X6                        -0.03     0.03     0.08*    0.02     0.02
    X7                         0.03    -0.01    -0.02    -0.04    -0.03    -0.04
    X8                         0.01    -0.07*    0.04     0.02    -0.01    -0.01     0.00
    X9                        -0.02     0.03    -0.03    -0.02     0.00    -0.04     0.03    -0.02
    X10                       -0.03     0.00     0.00     0.01     0.01    -0.01     0.01    -0.02     0.02
    X11                        0.01     0.01    -0.03    -0.05     0.00     0.05     0.01     0.00    -0.01     0.07*
    Local_Air_Density          0.26***  -0.02   -0.44*** -0.15*** -0.25*** -0.50***  0.57*** -0.11***  0.47***  0.06     0.01
    Reincarnation_Cycle       -0.03    -0.02     0.02     0.04     0.01     0.00     0.05    -0.04    -0.05    -0.01     0.03
    Communism_Level            0.58*** -0.44***  0.04     0.06    -0.10**  -0.18***  0.10**   0.46*** -0.50*** -0.21*** -0.14***
    Alien_Mothership_Distance  0.00    -0.03     0.01     0.00    -0.01    -0.03    -0.04     0.01     0.01    -0.02     0.00
    Schopenhauers_Optimism     0.11***  0.31*** -0.25***  0.64*** -0.29*** -0.15*** -0.35*** -0.09**   0.08*   -0.22*** -0.47***
    Hulks_Power                0.03     0.00     0.02     0.03    -0.02    -0.01    -0.05    -0.01     0.00     0.01     0.03

    As we can see, Schopenhauer’s Optimism is strongly related to many variables!!!

    Credits

    This package was useful? You can cite psycho as follows:

    • Makowski, (2018). The psycho Package: an Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

    To leave a comment for the author, please follow the link and comment on their blog: Dominique Makowski.

    R/exams @ eRum 2018

    Sun, 05/20/2018 - 00:00

    (This article was first published on R/exams, and kindly contributed to R-bloggers)

    Keynote lecture about R/exams at eRum 2018 (European R Users Meeting) in Budapest: Slides, video, e-learning, replication materials.

    Keynote lecture at eRum 2018

    R/exams was presented in a keynote lecture by Achim Zeileis at eRum 2018, the European R Users Meeting, this time organized by a team around Gergely Daróczi in Budapest. It was a great event with many exciting presentations, reflecting the vibrant R community in Europe (and beyond).

    This blog post provides various resources accompanying the presentation which may be of interest to those who did not attend the meeting as well as those who did and who want to explore the materials in more detail.

    Most importantly the presentation slides are available in PDF format (under CC-BY):

    Video

    The eRum organizers did a great job in making the meeting accessible to those useRs who could not make it to Budapest. All presentations were available in a livestream on YouTube, where videos of all lectures were also made available after the meeting (Standard YouTube License):

    E-Learning

    To illustrate the e-learning capabilities supported by R/exams, the presentation started with a live quiz using the audience response system ARSnova. The original version of the quiz was hosted on the ARSnova installation at Universität Innsbruck. To encourage readers to try out ARSnova for their own purposes, a copy of the quiz was also posted on the official ARSnova server at Technische Hochschule Mittelhessen (where ARSnova is developed under the General Public License, GPL):

    The presentation briefly also showed an online test generated by R/exams and imported into OpenOLAT, an open-source learning management system (available under the Apache License). The online test is made available again here for anonymous guest access. (Note however, that the system only has one guest user so that when you start the test there may already be some test results from a previous guest session. In that case you can finish the test and also start it again.)

    Replication code

    The presentation slides show how to set up an exam using the R package and then rendering it into different output formats. In order to allow the same exam to be rendered into a wide range of different output formats, only single-choice and multiple-choice exercises were employed (see the choice list below). However, in the e-learning test shown in OpenOLAT all exercises types are supported (see the elearn list below). All these exercises are readily provided in the package and also introduced online: deriv/deriv2, fruit/fruit2, ttest, boxplots, cholesky, lm, function. The code below uses the R/LaTeX (.Rnw) version but the R/Markdown version (.Rmd) could also be used instead.

## package
library("exams")

## single-choice and multiple-choice only
choice <- list("deriv2.Rnw", "fruit2.Rnw", c("ttest.Rnw", "boxplots.Rnw"))

## e-learning test (all exercise types)
elearn <- c("deriv.Rnw", "fruit.Rnw", "ttest.Rnw", "boxplots.Rnw",
            "cholesky.Rnw", "lm.Rnw", "function.Rnw")

    First, the exam with the choice-based questions can be easily turned into a PDF exam in NOPS format using exams2nops, here using Hungarian language for illustration. Exams in this format can be easily scanned and evaluated within R.

set.seed(2018-05-16)
exams2nops(choice, institution = "eRum 2018", language = "hu")

    Second, the choice-based exam version can be exported into the JSON format for ARSnova: Rexams-1.json. This contains an entire ARSnova session that can be directly imported into the ARSnova system as shown above. It employs a custom exercise set up just for eRum (conferences.Rmd) as well as a slightly tweaked exercise (fruit3.Rmd) that displays better in ARSnova.

set.seed(2018-05-16)
exams2arsnova(list("conferences.Rmd", choice[[1]], "fruit3.Rmd", choice[[3]]),
  name = "R/exams", abstention = FALSE, fix_choice = TRUE)

    Third, the e-learning exam can be generated in QTI 1.2 format for OpenOLAT, as shown above: eRum-2018.zip. The exams2openolat command below is provided starting from the current R/exams version 2.3-1. It essentially just calls exams2qti12 but slightly tweaks the MathJax output from pandoc so that it is displayed properly by OpenOLAT.

set.seed(2018-05-16)
exams2openolat(elearn, name = "eRum-2018", n = 10, qti = "1.2")

What else?

    In the last part of the presentation a couple of new and ongoing efforts within the R/exams project are highlighted. First, the natural language support in NOPS exams is mentioned which was recently described in more detail in this blog. Second, the relatively new “stress tester” was illustrated with the following example. (A more detailed blog post will follow soon.)

s <- stresstest_exercise("deriv2.Rnw")
plot(s)

    Finally, a psychometric analysis illustrated how to examine exams with respect to exercise difficulty, student performance, unidimensionality, and fairness. The replication code for the results from the slides is included below (omitting some graphical details for simplicity, e.g., labeling or color).

## load data and exclude extreme scorers
library("psychotools")
data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)

## raw data
plot(mex$solved)

## Rasch model parameters
mr <- raschmodel(mex$solved)
plot(mr, type = "profile")

## points per student
MathExam14W <- transform(MathExam14W,
  points = 2 * nsolved - 0.5 * rowSums(credits == 1)
)
hist(MathExam14W$points, breaks = -4:13 * 2 + 0.5, col = "lightgray")
abline(v = 12.5, lwd = 2, col = 2)

## person-item map
plot(mr, type = "piplot")

## principal component analysis
pr <- prcomp(mex$solved, scale = TRUE)
plot(pr)
biplot(pr, col = c("transparent", "black"),
  xlim = c(-0.065, 0.005), ylim = c(-0.04, 0.065))

## differential item functioning
mr1 <- raschmodel(subset(mex, group == 1)$solved)
mr2 <- raschmodel(subset(mex, group == 2)$solved)
ma <- anchortest(mr1, mr2, adjust = "single-step")

## anchored item difficulties
plot(mr1, parg = list(ref = ma$anchor_items), ref = FALSE, ylim = c(-2, 3), pch = 19)
plot(mr2, parg = list(ref = ma$anchor_items), ref = FALSE, add = TRUE, pch = 19, border = 4)
legend("topleft", paste("Group", 1:2), pch = 19, col = c(1, 4), bty = "n")

## simultaneous Wald test for pairwise differences
plot(ma$final_tests)

    To leave a comment for the author, please follow the link and comment on their blog: R/exams. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    RcppGSL 0.3.5

    Sat, 05/19/2018 - 22:11

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A maintenance update of RcppGSL just brought version 0.3.5 to CRAN, a mere twelve days after the RcppGSL 0.3.4 release. Just like yesterday’s upload of inline 0.3.15 it was prompted by a CRAN request to update the per-package manual page; see the inline post for details.

    The RcppGSL package provides an interface from R to the GNU GSL using the Rcpp package.

    No user-facing new code or features were added. The NEWS file entries follow below:

    Changes in version 0.3.5 (2018-05-19)
    • Update package manual page using references to DESCRIPTION file [CRAN request].

    Courtesy of CRANberries, a summary of changes to the most recent release is available.

    More information is on the RcppGSL page. Questions, comments etc should go to the issue tickets at the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    openrouteservice – geodata!

    Sat, 05/19/2018 - 20:56

    (This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

    The openrouteservice provides a new way to get geodata into R. It offers an API (or a set of them), and an R package that communicates with said API(s) is available from GitHub. I’ve just been playing around with the examples on this page, with the thought of using it for a project (more on that later if I get anywhere with it).

    Anyways…onto the code…which is primarily a modification from the examples page I mentioned earlier (see that page for more examples).

    devtools::install_github("GIScience/openrouteservice-r")

    Load some libraries

library(openrouteservice)
library(leaflet)

    Set the API key

    ors_api_key("your-key-here")

    Define the locations of interest and send the request to the API, asking for the region that is accessible within a 15-minute drive of each coordinate.

coordinates <- list(c(8.55, 47.23424), c(8.34234, 47.23424), c(8.44, 47.4))

x <- ors_isochrones(coordinates,
                    range = 60*15,          # maximum time to travel (15 mins)
                    interval = 60*15,       # results in bands of 60*15 seconds (15 mins)
                    intersections = FALSE)  # no intersection of polygons

    By changing the interval to, say, 60*5, three regions per coordinate are returned representing regions accessible within 5, 10 and 15 minutes drive. Changing the intersections argument would produce a separate polygon for any overlapping regions. The information of the intersected polygons is limited though, so it might be better to do the intersection with other tools afterwards.
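    For example, a minimal variation of the call above with 5-minute bands (the plotting further below still uses the original x object):

x_bands <- ors_isochrones(coordinates,
                          range = 60*15,          # still 15 minutes in total
                          interval = 60*5,        # bands of 5, 10 and 15 minutes
                          intersections = FALSE)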

    The results can be plotted with leaflet…

    leaflet() %>% addTiles() %>% addGeoJSON(x) %>% fitBBox(x$bbox)

    The blue regions are the three regions accessible within 15 minutes. A few overlapping regions are evident, each of which would be saved to a unique polygon had we set intersections to TRUE.

    The results from the API come down in a GeoJSON format, which is given a class (in this case ors_isochrones) that isn’t recognised by many other tools, so you might want to convert it to an sp object, giving access to all of the tools for that format. That’s easy enough to do via the geojsonio package…

library(geojsonio)
library(sp)

class(x) <- "geo_list"
y <- geojson_sp(x)
plot(y)

    You can also derive coordinates from (partial) addresses. Here is an example for a region of Bern in Switzerland, using the postcode.

    coord <- ors_geocode("3012, Switzerland")

    This resulted in 10 hits, the first of which was correct…the others were in different countries…

unlist(lapply(coord$features, function(x) x$properties$label))
 [1] "3012, Bern, Switzerland"
 [2] "A1, Bern, Switzerland"
 [3] "Bremgartenstrasse, Bern, Switzerland"
 [4] "131 Bremgartenstrasse, Bern, Switzerland"
 [5] "Briefeinwurf Bern, Gymnasium Neufeld, Bern, Switzerland"
 [6] "119 Bremgartenstrasse, Bern, Switzerland"
 [7] "Gym Neufeld, Bern, Switzerland"
 [8] "131b Bremgartenstrasse, Bern, Switzerland"
 [9] "Gebäude Nord, Bern, Switzerland"
[10] "113 Bremgartenstrasse, Bern, Switzerland"

    The opposite (coordinate to address) is also possible, again returning multiple hits…

address <- ors_geocode(location = c(7.425898, 46.961598))
unlist(lapply(address$features, function(x) x$properties$label))
 [1] "3012, Bern, Switzerland"
 [2] "A1, Bern, Switzerland"
 [3] "Bremgartenstrasse, Bern, Switzerland"
 [4] "131 Bremgartenstrasse, Bern, Switzerland"
 [5] "Briefeinwurf Bern, Gymnasium Neufeld, Bern, Switzerland"
 [6] "119 Bremgartenstrasse, Bern, Switzerland"
 [7] "Gym Neufeld, Bern, Switzerland"
 [8] "131b Bremgartenstrasse, Bern, Switzerland"
 [9] "Gebäude Nord, Bern, Switzerland"
[10] "113 Bremgartenstrasse, Bern, Switzerland"

    Other options are distances/times/directions between points and places of interest (POI) near a point or within a region.
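    As a rough sketch of the directions option (untested here; ors_directions() is part of the same package, and the profile value follows the openrouteservice API, so treat the details as assumptions):

# driving directions between the first two coordinates defined above
route <- ors_directions(coordinates[1:2], profile = "driving-car")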

    Hope that helps someone! Enjoy!

     

     


    To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Create Code Metrics with cloc

    Sat, 05/19/2018 - 20:31

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    The cloc Perl script (yes, Perl!) by Al Danial (https://github.com/AlDanial/cloc) has been one of the go-to tools for generating code metrics. Given a single file, directory tree, archive, or git repo, cloc can speedily give you metrics on the count of blank lines, comment lines, and physical lines of source code in a vast array of programming languages.

    I don’t remember the full context but someone in the R community asked about this type of functionality and I had tossed together a small script-turned-package to thinly wrap the Perl cloc utility. Said package was and is unimaginatively named cloc. Thanks to some collaborative input from @ma_salmon, the package gained more features. Recently I added the ability to process R Markdown (Rmd) files (i.e. only count lines in code chunks) to the main cloc Perl script and was performing some general cleanup when the idea to create some RStudio addins hit me.

    cloc Basics

    As noted, you can cloc just about anything. Here are some metrics for dplyr::group_by:

cloc("https://raw.githubusercontent.com/tidyverse/dplyr/master/R/group-by.r")
## # A tibble: 1 x 10
##   source language file_count file_count_pct   loc loc_pct blank_lines blank_line_pct comment_lines comment_line_pct
## 1 group… R                 1             1.    44      1.          13             1.           110               1.

    and, here’s a similar set of metrics for the whole dplyr package:

cloc_cran("dplyr")
## # A tibble: 7 x 11
##   source language file_count file_count_pct   loc loc_pct blank_lines blank_line_pct comment_lines comment_line_pct
## 1 dplyr… R               148        0.454   13216  0.442         2671        0.380            3876          0.673
## 2 dplyr… C/C++ H…        125        0.383    6687  0.223         1836        0.261             267          0.0464
## 3 dplyr… C++              33        0.101    4724  0.158          915        0.130             336          0.0583
## 4 dplyr… HTML             11        0.0337   3602  0.120          367        0.0522             11          0.00191
## 5 dplyr… Markdown          2        0.00613  1251  0.0418         619        0.0880              0          0.
## 6 dplyr… Rmd               6        0.0184    421  0.0141         622        0.0884           1270          0.220
## 7 dplyr… C                 1        0.00307    30  0.00100          7        0.000995            0          0.
## # ... with 1 more variable: pkg

    We can also measure (in bulk) from afar, such as measuring the dplyr git repo:

cloc_git("git://github.com/tidyverse/dplyr.git")
## # A tibble: 12 x 10
##    source    language     file_count file_count_pct   loc loc_pct  blank_lines blank_line_pct comment_lines
##  1 dplyr.git HTML                108        0.236   21467 0.335           3829       0.270             1114
##  2 dplyr.git R                   156        0.341   13648 0.213           2682       0.189             3736
##  3 dplyr.git Markdown             12        0.0263  10100 0.158           3012       0.212                0
##  4 dplyr.git C/C++ Header        126        0.276    6891 0.107           1883       0.133              271
##  5 dplyr.git CSS                   2        0.00438  5684 0.0887          1009       0.0711              39
##  6 dplyr.git C++                  33        0.0722   5267 0.0821          1056       0.0744             393
##  7 dplyr.git Rmd                   7        0.0153    447 0.00697          647       0.0456            1309
##  8 dplyr.git XML                   1        0.00219   291 0.00454            0       0.                   0
##  9 dplyr.git YAML                  6        0.0131    212 0.00331           35       0.00247             12
## 10 dplyr.git JavaScript            2        0.00438    44 0.000686          10       0.000705             4
## 11 dplyr.git Bourne Shell          3        0.00656    34 0.000530          15       0.00106             10
## 12 dplyr.git C                     1        0.00219    30 0.000468           7       0.000493             0
## # ... with 1 more variable: comment_line_pct

All in on Addins

    The Rmd functionality made me realize that some interactive capabilities might be handy, so I threw together three of them.

    Two of them handle extraction of code chunks from Rmd documents. One uses cloc, the other uses knitr::purl() (h/t @yoniceedee). The knitr one adds some very nice functionality if you want to preserve chunk options and have “eval=FALSE” chunks commented out.

    The final one will gather up code metrics for all the sources in an active project.
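    Outside of the addins, the knitr-based chunk extraction can also be done directly; a minimal sketch ("analysis.Rmd" is a hypothetical file name):

library(knitr)

# documentation = 1 keeps the chunk headers (and their options) as comments
purl("analysis.Rmd", output = "analysis.R", documentation = 1)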

    FIN

    If you’d like additional features or want to contribute, give (https://github.com/hrbrmstr/cloc) a visit and drop an issue or PR.


    To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    An East-West less divided?

    Sat, 05/19/2018 - 19:41

    (This article was first published on R – thinkr, and kindly contributed to R-bloggers)

    With tensions heightened recently at the United Nations, one might wonder whether we’ve drawn closer, or farther apart, over the decades since the UN was established in 1945.

    We’ll see if we can garner a clue by performing cluster analysis on the General Assembly voting of five of the founding members. We’ll focus on the five permanent members of the Security Council. Then later on we can look at whether Security Council vetoes corroborate our findings.

    A prior article, entitled the “cluster of six“, employed unsupervised machine learning to discover the underlying structure of voting data. We’ll use related techniques here to explore the voting history of the General Assembly, the only organ of the United Nations in which all 193 member states have equal representation.

    By dividing the voting history into two equal parts, which we’ll label as the “early years” and the “later years”, we can assess how our five nations cluster in the two eras.

    During the early years, France, the UK and the US formed one cluster, whilst Russia stood apart.

    Although the Republic of China (ROC) joined the UN at its founding in 1945, it’s worth noting that the People’s Republic of China (PRC), commonly called China today, was admitted into the UN in 1971. Hence its greater distance in the clustering.

    Through the later years, France and the UK remained close. Not surprising given our EU ties. Will Brexit have an impact going forward?

    The US is slightly separated from its European allies, but what’s more striking is the shorter distance between these three and China / Russia. Will globalization continue to bring us closer together, or is the tide about to turn?

    The cluster analysis above focused on General Assembly voting. By web-scraping the UN’s Security Council Veto List, we can acquire further insights on the voting patterns of our five nations.

    Russia dominated the early vetoes before these dissipated in the late 60s. Vetoes picked up again in the 70s with the US dominating through to the 80s. China has been the most restrained throughout.

    Since the 90s, there would appear to be less dividing us, supporting our finding from the General Assembly voting. But do the vetoes in 2017, and so far in 2018, suggest a turning of the tide? Or just a temporary divergence?

    R toolkit

    R packages and functions (excluding base) used throughout this analysis.

    Packages and functions:

    purrr: map_dbl[3]; map[1]; map2_df[1]; possibly[1]; set_names[1]
    XML: readHTMLTable[1]
    dplyr: if_else[15]; mutate[9]; filter[6]; select[5]; group_by[3]; summarize[3]; distinct[2]; inner_join[2]; slice[2]; arrange[1]; as_data_frame[1]; as_tibble[1]; data_frame[1]; desc[1]; rename[1]
    tibble: as_data_frame[1]; as_tibble[1]; data_frame[1]; enframe[1]; rowid_to_column[1]
    stringr: str_c[8]; str_detect[6]; str_replace[3]; fixed[2]; str_remove[2]; str_count[1]
    rebus: dgt[1]; literal[1]; lookahead[1]; lookbehind[1]
    lubridate: year[7]; dmy[1]; today[1]; ymd[1]
    dummies: dummy.data.frame[2]
    tidyr: spread[3]; gather[2]; unnest[1]
    cluster: pam[3]
    ggplot2: aes[6]; ggplot[5]; ggtitle[5]; scale_x_continuous[5]; element_blank[4]; geom_text[4]; geom_line[3]; geom_point[3]; ylim[3]; element_rect[2]; geom_col[2]; labs[2]; scale_fill_manual[2]; theme[2]; coord_flip[1]
    factoextra: fviz_cluster[3]; fviz_dend[1]; fviz_silhouette[1]; hcut[1]
    cowplot: draw_plot[2]; ggdraw[1]
    ggthemes: theme_economist[1]
    kableExtra: kable[1]; kable_styling[1]
    knitr: kable[1]
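    As a flavour of how these pieces fit together, here is a rough sketch of the clustering step (not the post's actual code; votes_wide is a hypothetical country-by-resolution matrix of numeric vote codes):

library(cluster)
library(factoextra)

# partition the member states into two clusters and visualise the result
pam_fit <- pam(votes_wide, k = 2)
fviz_cluster(pam_fit)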

    View the code here.

    Citations / Attributions

    R Development Core Team (2008). R: A language and environment for
    statistical computing. R Foundation for Statistical Computing,
    Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

    Erik Voeten “Data and Analyses of Voting in the UN General Assembly” Routledge Handbook of International Organization, edited by Bob Reinalda (published May 27, 2013)

    The post An East-West less divided? appeared first on thinkr.


    To leave a comment for the author, please follow the link and comment on their blog: R – thinkr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Do Clustering by “Dimensional Collapse”

    Sat, 05/19/2018 - 13:39

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

    Problem

    Imagine that someone in a bank wants to find out whether some of the bank’s credit card holders are actually the same person. According to his experience, he sets a rule: people who share either the same address or the same phone number can reasonably be regarded as the same person. Just as in the example:

library(tidyverse)

a <- data_frame(id = 1:16,
                addr = c("a", "a", "a", "b", "b", "c", "d", "d", "d", "e", "e", "f", "f", "g", "g", "h"),
                phone = c(130L, 131L, 132L, 133L, 134L, 132L, 135L, 136L, 137L, 136L, 138L, 138L, 139L, 140L, 141L, 139L),
                flag = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 3L))

head(a)
## id  addr  phone  flag
## 1   a     130    1
## 2   a     131    1
## 3   a     132    1
## 4   b     133    2
## 5   b     134    2
## 6   c     132    1

    In the dataframe a, the letters in column addr stand for address information, the numbers in column phone stand for phone numbers, and the integers in column flag are what he wants: the CLUSTER flag which marks “really” different persons.

    In the above plot, each point stands for an “identity” which has an address, shown on the horizontal axis, and a phone number, shown on the vertical axis. The red dotted lines present the “connections” between identities, which actually mean a shared address or phone number. So the wanted result is the blue rectangles circling out the different flags, which represent really different persons.

    Goal

    The “finding the same person” problem is typically a clustering process, and I am very sure there are many ways to do it, such as a disjoint-set data structure. But I can not help thinking maybe we can make it in a simple way with R. That’s my goal.
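    (As an aside, here is a sketch of one such alternative, using graph connected components via igraph rather than the approach developed below; it reuses the data frame a from above:)

library(igraph)

# connect each id to its address node and its phone node
edges <- rbind(data.frame(from = paste0("id_", a$id), to = paste0("addr_", a$addr)),
               data.frame(from = paste0("id_", a$id), to = paste0("phone_", a$phone)))
g <- graph_from_data_frame(edges, directed = FALSE)

# connected components give the "really different persons"
memb <- components(g)$membership
a$flag_graph <- memb[paste0("id_", a$id)]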

    “Dimensional Collapse”

    When I stared at the plot, I asked myself: why not map the x-axis information of the points to the very first one in each group, according to the y-axis “connections”? When everything goes well and all is done, all the grey points should be mapped along the red arrows to the first marks of their groups, and there should be only 4 marks left on the x-axis: a, b, d and g, instead of the 9 marks in the first place. And the y-axis information, after contributing all the “connection rules”, can be put away, since the remaining x-axis marks are exactly what I want: the final flags. That is why I like to call it “Dimensional Collapse”.

    Furthermore, in order to take advantage of R properties, I also:
    1. Treat both dimensions as integers by factoring them.
    2. Use “integer subsetting” to map and collapse.
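    A tiny illustration of these two ideas (as.integer(factor(...)) is used here for readability; the function below obtains the same codes via c(factor(...))):

f <- factor(c("a", "a", "b", "c"))
as.integer(f)          # the labels become integer codes: 1 1 2 3

rule <- c(1L, 1L, 3L)  # a made-up rule: code 2 collapses onto code 1
rule[as.integer(f)]    # integer subsetting applies the rule: 1 1 3 3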

axis_collapse <- function(df, .x, .y) {
    .x <- enquo(.x)
    .y <- enquo(.y)

    # Turn the address and phone number into integers.
    df <- mutate(df,
                 axis_x = c(factor(!!.x)),
                 axis_y = c(factor(!!.y)))

    oldRule <- seq_len(max(df$axis_x))

    mapRule <- df %>%
      select(axis_x, axis_y) %>%
      group_by(axis_y) %>%
      arrange(axis_x, .by_group = TRUE) %>%
      mutate(collapse = axis_x[1]) %>%
      ungroup() %>%
      select(-axis_y) %>%
      distinct() %>%
      group_by(axis_x) %>%
      arrange(collapse, .by_group = TRUE) %>%
      slice(1) %>%
      ungroup() %>%
      arrange(axis_x) %>%
      pull(collapse)

    # Use integer subsetting to collapse x-axis.
    # In case of indirect "connections", we should do it recursively.
    while (TRUE) {
        newRule <- mapRule[oldRule]
        if (identical(newRule, oldRule)) {
            break
        } else {
            oldRule <- newRule
        }
    }

    df <- df %>%
      mutate(flag = newRule[axis_x],
             flag = c(factor(flag))) %>%
      select(-starts_with("axis_"))

    df
}

    Let’s see the result.

a %>%
  rename(flag_t = flag) %>%
  axis_collapse(addr, phone) %>%
  mutate_at(.vars = vars(addr:flag), factor) %>%
  ggplot(aes(factor(addr), factor(phone), shape = flag_t, color = flag)) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", shape = "Target Flag:", color = "Cluster Flag:")

    Not bad so far.

    Calculation Complexity

    Let’s make a simple test of the time complexity.

test1 <- data_frame(addr = sample(1:1e4, 1e4), phone = sample(1:1e4, 1e4))
test2 <- data_frame(addr = sample(1:1e5, 1e5), phone = sample(1:1e5, 1e5))

bm <- microbenchmark::microbenchmark(n10k  = axis_collapse(test1, addr, phone),
                                     n100k = axis_collapse(test2, addr, phone),
                                     times = 30)
summary(bm)
## expr    min        lq        mean       median     uq        max        neval  cld
## n10k    249.2172   259.918   277.0333   266.9297   279.505   379.4292   30     a
## n100k   2489.1834  2581.731  2640.9394  2624.5741  2723.390  2839.5180  30     b

    It seems that the time consumed grows linearly with the size of the data, holding the other conditions unchanged. That is acceptable.

    More Dimensions?

    To me, since this method collapses one dimension by transferring its clustering information to the other dimension, the method should be usable recursively on more than 2 dimensions. But I am not 100% sure. Let’s do a simple test.

a %>%
  # I deliberately add a column which connects group 2 and 4 only.
  mutate(other = c(LETTERS[1:14], "D", "O")) %>%
  # use axis_collapse recursively
  axis_collapse(other, phone) %>%
  axis_collapse(flag, addr) %>%
  ggplot(aes(x = factor(addr), y = factor(phone), color = factor(flag))) +
  geom_point(size = 3) +
  labs(x = "Address", y = "Phone Number", color = "Cluster Flag:")


    To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Decision Modelling in R Workshop in The Netherlands!

    Sat, 05/19/2018 - 13:39

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

    The Decision Analysis in R for Technologies in Health (DARTH) workgroup is hosting a two-day workshop on decision analysis in R in Leiden, The Netherlands from June 7-8, 2018. A one-day introduction to R course will also be offered the day before the workshop, on June 6th.

    Decision models are mathematical simulation models that are increasingly being used in health sciences to simulate the impact of policy decisions on population health. New methodological techniques around decision modeling are being developed that rely heavily on statistical and mathematical techniques. R is becoming increasingly popular in decision analysis as it provides a flexible environment where advanced statistical methods can be combined with decision models of varying complexity. Also, the fact that R is freely available improves model transparency and reproducibility.

    The workshop will guide participants on building probabilistic decision trees, Markov models and microsimulations, creating publication-quality tabular and graphical output, and will provide a basic introduction to value of information methods and model calibration using R.

    For more information and to register, please visit: http://www.cvent.com/d/mtqth1


    To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    wrapr 1.4.1 now up on CRAN

    Sat, 05/19/2018 - 04:57

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    wrapr 1.4.1 is now available on CRAN. wrapr is a really neat R package for organizing, meta-programming, and debugging R code. This update generalizes the dot-pipe feature’s dot S3 features.

    Please give it a try!

    wrapr, is an R package that supplies powerful tools for writing and debugging R code.

    Introduction

    Primary wrapr services include:

    • let() (let block)
    • %.>% (dot arrow pipe)
    • build_frame()/draw_frame()
    • := (named map builder)
    • DebugFnW() (function debug wrappers)
    • λ() (anonymous function builder)
    let()

    let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for names as with base::substitute() or base::with()).

    The function is simple and powerful. It treats strings as variable names and re-writes expressions as if you had used the denoted variables. For example the following block of code is equivalent to having written "a + a".

library("wrapr")

a <- 7

let(
  c(VAR = 'a'),
  VAR + VAR
)
# [1] 14

    This is useful in re-adapting non-standard evaluation interfaces (NSE interfaces) so one can script or program over them.
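    For instance, a minimal sketch of programming over dplyr's NSE interface with let() (assumes dplyr and the built-in iris data; the column name arrives in a variable rather than being typed literally):

library("dplyr")

col_to_use <- "Sepal.Length"   # chosen at run time

let(
  c(COL = col_to_use),
  iris %>% summarize(mean_val = mean(COL))
)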

    We are trying to make let() self-teaching and self-documenting (to the extent that makes sense). For example, try the argument eval=FALSE to prevent execution and see what would have been executed, or debugPrint=TRUE to have the replaced code printed in addition to being executed:

let(
  c(VAR = 'a'),
  eval = FALSE,
  {
    VAR + VAR
  }
)
# {
#   a + a
# }

let(
  c(VAR = 'a'),
  debugPrint = TRUE,
  {
    VAR + VAR
  }
)
# $VAR
# [1] "a"
#
# {
#   a + a
# }
# [1] 14

    Please see vignette('let', package='wrapr') for more examples. Some formal documentation can be found here.

    For working with dplyr 0.7.* we strongly suggest wrapr::let() (or even an alternate approach called seplyr).

    %.>% (dot pipe or dot arrow)

    %.>% dot arrow pipe is a pipe with intended semantics:

    "a %.>% b" is to be treated approximately as if the user had written "{ . <- a; b };" with "%.>%" being treated as left-associative.

    Other R pipes include magrittr and pipeR.

    The following two expressions should be equivalent:

cos(exp(sin(4)))
# [1] 0.8919465

4 %.>% sin(.) %.>% exp(.) %.>% cos(.)
# [1] 0.8919465

    The notation is quite powerful as it treats pipe stages as expression parameterized over the variable ".". This means you do not need to introduce functions to express stages. The following is a valid dot-pipe:

1:4 %.>% .^2
# [1]  1  4  9 16

    The notation is also very regular as we show below.

1:4 %.>% sin
# [1]  0.8414710  0.9092974  0.1411200 -0.7568025

1:4 %.>% sin(.)
# [1]  0.8414710  0.9092974  0.1411200 -0.7568025

1:4 %.>% base::sin
# [1]  0.8414710  0.9092974  0.1411200 -0.7568025

1:4 %.>% base::sin(.)
# [1]  0.8414710  0.9092974  0.1411200 -0.7568025

1:4 %.>% function(x) { x + 1 }
# [1] 2 3 4 5

1:4 %.>% (function(x) { x + 1 })
# [1] 2 3 4 5

1:4 %.>% { .^2 }
# [1]  1  4  9 16

1:4 %.>% ( .^2 )
# [1]  1  4  9 16

    Regularity can be a big advantage in teaching and comprehension. Please see "In Praise of Syntactic Sugar" for more details. Some formal documentation can be found here.

    • Some obvious "dot-free" right-hand sides are rejected. Pipelines are meant to move values through a sequence of transforms, and not just for side-effects. Example: 5 %.>% 6 deliberately stops as 6 is a right-hand side that obviously does not use its incoming value. This check is only applied to values, not functions on the right-hand side.
    • Trying to pipe into a "zero argument function evaluation expression" such as sin() is prohibited as it looks too much like the user declaring sin() takes no arguments. One must pipe into either a function, function name, or a non-trivial expression (such as sin(.)). A useful error message is returned to the user: wrapr::pipe does not allow direct piping into a no-argument function call expression (such as "sin()" please use sin(.)).
    • Some reserved words can not be piped into. One example is 5 %.>% return(.) is prohibited as the obvious pipe implementation would not actually escape from user functions as users may intend.
    • Obvious de-references (such as $, ::, @, and a few more) on the right-hand side are performed (example: 5 %.>% base::sin(.)).
    • Outer parenthesis on the right-hand side are removed (example: 5 %.>% (sin(.))).
    • Anonymous function constructions are evaluated so the function can be applied (example: 5 %.>% function(x) {x+1} returns 6, just as 5 %.>% (function(x) {x+1})(.) does).
    • Checks and transforms are not performed on items inside braces (example: 5 %.>% { function(x) {x+1} } returns function(x) {x+1}, not 6).
    build_frame()/draw_frame()

    build_frame() is a convenient way to type in a small example data.frame in natural row order. This can be very legible and saves having to perform a transpose in one’s head. draw_frame() is the complementary function that formats a given data.frame (and is a great way to produce neatened examples).

x <- build_frame(
   "measure"                   , "training", "validation" |
   "minus binary cross entropy", 5         , -7           |
   "accuracy"                  , 0.8       , 0.6          )

print(x)
#                      measure training validation
# 1 minus binary cross entropy      5.0       -7.0
# 2                   accuracy      0.8        0.6

str(x)
# 'data.frame': 2 obs. of 3 variables:
#  $ measure   : chr "minus binary cross entropy" "accuracy"
#  $ training  : num 5 0.8
#  $ validation: num -7 0.6

cat(draw_frame(x))
# build_frame(
#    "measure"                   , "training", "validation" |
#    "minus binary cross entropy", 5         , -7           |
#    "accuracy"                  , 0.8       , 0.6          )

:= (named map builder)

    := is the "named map builder". It allows code such as the following:

'a' := 'x'
#   a
# "x"

    The important property of named map builder is it accepts values on the left-hand side allowing the following:

name <- 'variableNameFromElsewhere'
name := 'newBinding'
# variableNameFromElsewhere
#              "newBinding"

    A nice property is that := commutes (in the sense of algebra or category theory) with R‘s concatenation function c(). That is, the following two statements are equivalent:

c('a', 'b') := c('x', 'y')
#   a   b
# "x" "y"

c('a' := 'x', 'b' := 'y')
#   a   b
# "x" "y"

    The named map builder is designed to synergize with seplyr.
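    A rough sketch of that synergy (assuming seplyr's rename_se(), which takes a "new name" := "old name" character mapping; treat the exact signature as an assumption):

library("seplyr")

mapping <- c("cylinders" := "cyl", "gears" := "gear")

datasets::mtcars %.>%
  rename_se(., mapping) %.>%
  head(.)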

    DebugFnW()

    DebugFnW() wraps a function for debugging. If the function throws an exception the execution context (function arguments, function name, and more) is captured and stored for the user. The function call can then be reconstituted, inspected and even re-run with a step-debugger. Please see our free debugging video series and vignette('DebugFnW', package='wrapr') for examples.
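    A minimal sketch of the wrapping pattern (the capture destination lastError is an arbitrary name; see the vignette for the exact contents of the saved capture):

library("wrapr")

f <- function(x) { if (x > 5) stop("too big") else x * 2 }

# wrap f; if a call to fd() throws, the failing call is saved into lastError
fd <- DebugFnW(as.name('lastError'), f)

fd(3)     # works as usual: 6
# fd(10)  # would signal the error and capture the call for inspection/re-running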

    λ() (anonymous function builder)

    λ() is a concise abstract function creator or "lambda abstraction". It is a placeholder that allows the use of the λ-character for very concise function abstraction.

    Example:

# Make sure lambda function builder is in our environment.
wrapr::defineLambda()

# square numbers 1 through 4
sapply(1:4, λ(x, x^2))
# [1]  1  4  9 16

Installing

    Install with either:

    install.packages("wrapr")

    or

# install.packages("devtools")
devtools::install_github("WinVector/wrapr")

More Information

    More details on wrapr capabilities can be found in the following two technical articles:

    Note

    Note: wrapr is meant only for "tame names", that is: variables and column names that are also valid simple (without quotes) R variables names.


    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    inline 0.3.15

    Sat, 05/19/2018 - 03:04

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A maintenance release of the inline package arrived on CRAN today. inline facilitates writing code in-line in simple string expressions or short files. The package is mature and in maintenance mode: Rcpp used it greatly for several years but then moved on to Rcpp Attributes, so we have a much more limited need for extensions to inline. But a number of other packages have a hard dependence on it, so we do of course look after it as part of the open source social contract (which is a name I just made up, but you get the idea…)
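    As a reminder of what inline is for, a small sketch (not from the release notes) that compiles a C function from a string:

library(inline)

# double every element of a numeric vector, using the .Call convention
body <- "
  SEXP out = PROTECT(duplicate(x));
  for (int i = 0; i < LENGTH(out); i++) REAL(out)[i] *= 2;
  UNPROTECT(1);
  return out;
"
dbl <- cfunction(signature(x = "numeric"), body, language = "C")
dbl(c(1, 2, 3))   # 2 4 6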

    This release was triggered by a (as usual very reasonable) CRAN request to update the per-package manual page which had become stale. We now use Rd macros, you can see the diff for just that file at GitHub; I also include it below. My pkgKitten package-creation helper uses the same scheme; I wholeheartedly recommend it — as the diff shows, it makes things a lot simpler.

    Some other changes reflect two user-contributed pull requests, as well as standard minor package update issues. See below for a detailed list of changes extracted from the NEWS file.

    Changes in inline version 0.3.15 (2018-05-18)
    • Correct requireNamespace() call (thanks to Alexander Grueneberg in #5).

    • Small simplification to .travis.yml; also switch to https.

    • Use seq_along instead of seq(along=...) (Watal M. Iwasaki in #6).

    • Update package manual page using references to DESCRIPTION file [CRAN request].

    • Minor packaging updates.

    Courtesy of CRANberries, there is a comparison to the previous release.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    What Makes a Song (More) Popular

    Fri, 05/18/2018 - 16:42

    (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

    Earlier this week, the Association for Psychological Science sent out a press release about a study examining what makes a song popular:

    Researchers Jonah Berger of the University of Pennsylvania and Grant Packard of Wilfrid Laurier University were interested in understanding the relationship between similarity and success. In a recent study published in Psychological Science, the authors describe how a person’s drive for stimulation can be satisfied by novelty. Cultural items that are atypical, therefore, may be more liked and become more popular.

    “Although some researchers have argued that cultural success is impossible to predict,” they explain, “textual analysis of thousands of songs suggests that those whose lyrics are more differentiated from their genres are more popular.”

    The study, which was published online ahead of print, used a method of topic modeling called latent Dirichlet allocation. (Side note: this analysis is available in the R topicmodels package, as the function LDA. It requires a document-term matrix, which can be created in R; a small sketch follows the topic list below. Perhaps a future post!) The LDA extracted 10 topics from the lyrics of songs spanning seven genres (Christian, country, dance, pop, rap, rock, and rhythm and blues):

    • Anger and violence
    • Body movement
    • Dance moves
    • Family
    • Fiery love
    • Girls and cars
    • Positivity
    • Spiritual
    • Street cred
    • Uncertain love
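    Picking up the side note above, a minimal sketch of fitting such a topic model (lyrics_text is a hypothetical character vector of song lyrics; this is not the study's own pipeline):

library(tm)
library(topicmodels)

corpus <- VCorpus(VectorSource(lyrics_text))
dtm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE,
                                                 stopwords = TRUE))

lda_fit <- LDA(dtm, k = 10, control = list(seed = 1234))
terms(lda_fit, 5)   # top 5 terms per topic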

    Overall, they found that songs with lyrics that differentiated them from other songs in their genre were more popular. However, this wasn’t the case for the specific genres of pop and dance, where lyrical differentiation appeared to be harmful to popularity. Finally, being lyrically different by being more similar to a different genre (a genre to which the song wasn’t assigned) had no impact. So it isn’t about writing a rock song that sounds like a rap song to gain popularity; it’s about writing a rock song that sounds different from other rock songs.

    I love this study idea, especially since I’ve started doing some text and lyric analysis on my own. (Look for another one Sunday, tackling the concept of sentiment analysis!) But I do have a criticism. This research used songs listed in the Billboard Top 50 by genre. While it would be impossible to analyze every single song that comes out at a given time, this study doesn’t really answer the question of what makes a song popular, but rather what determines how popular an already popular song is. The advice in the press release (To Climb the Charts, Write Lyrics That Stand Out) may be true for established artists who are already popular, but it doesn’t help that young artist trying to break onto the scene. They’re probably already writing lyrics to try to stand out. They just haven’t been noticed yet.


    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    How To Plot With Dygraphs: Exercises

    Fri, 05/18/2018 - 15:41

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    INTRODUCTION

    The dygraphs package is an R interface to the dygraphs JavaScript charting library. It provides rich facilities for charting time-series data in R, including:

    1. Automatically plots xts time-series objects (or any object convertible to xts.)

    2. Highly configurable axis and series display (including optional second Y-axis.)

    3. Rich interactive features, including zoom/pan and series/point highlighting.

    4. Display upper/lower bars (ex. prediction intervals) around the series.

    5. Various graph overlays, including shaded regions, event lines, and point annotations.

    6. Use at the R console, just like conventional R plots (via RStudio Viewer.)

    7. Seamless embedding within R Markdown documents and Shiny web applications.
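    For orientation, the basic call is just dygraph() on a time series; a minimal example using the built-in ldeaths series (not one of the exercise data-sets below):

library(dygraphs)

# ldeaths is a monthly ts object; dygraph() accepts anything convertible to xts
dygraph(ldeaths)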

    Before proceeding, please follow our short tutorial.

    Look at the examples given and try to understand the logic behind them. Then, try to solve the exercises below using R, without looking at the answers. Afterwards, check the solutions to see how your answers compare.

    Exercise 1

    Unite the two time series data-sets mdeaths and fdeaths and create a time-series dygraph of the new data-set.

    Exercise 2

    Insert a date range selector into the dygraph you just created.

    Exercise 3

    Change the label names of “mdeaths” and “fdeaths” to “Male” and “Female.”

    Exercise 4

    Make the graph stacked.

    Exercise 5

    Set the date range selector height to 20.

    Exercise 6

    Add a main title to your graph.

    Exercise 7

    Use the tutorial’s predicted data-set to create a dygraph of “lwr”, “fit”, and “upr”, but display the label as the summary of them.

    Exercise 8

    Set the colors to red.

    Exercise 9

    Remove the x-axis grid lines from your graph.

    Exercise 10

    Remove the y-axis grid lines from your graph.

    Related exercise sets:
    1. Spatial Data Analysis: Introduction to Raster Processing (Part 1)
    2. Advanced Techniques With Raster Data: Part 1 – Unsupervised Classification
    3. Spatial Data Analysis: Introduction to Raster Processing: Part-3
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory
    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
