Because it’s Friday: History of Australia

This one’s mainly for my Australian friends and family who, if they’re anything like me, didn’t realize that there had been so many changes since the colonial borders of Australia were first drawn. The state where I was born, South Australia, changed shape no less than four times @— I had no idea! (via Max Galka)

That’s all from the blog for this week. Monday’s a holiday here in the States, so we’ll be back on Tuesday. See you then!

from Revolutions

Who is the caretaker? Evidence-based probability estimation with the bnlearn package

by Juan M. Lavista Ferres , Senior Director of Data Science at Microsoft
In what was one of the most viral episodes of 2017, political science Professor Robert E Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad’s interview. 

The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe via Facebook, Twitter and reporters from reliable sources like thought the woman that went after the children was her nanny, when in fact, the woman in the video was Robert’s wife, Jung-a Kim, who is Korean. 
The confusion over this episode caused a second viral wave calling out that people that thought she was the nanny should feel bad for being stereotypical.
We decided to embrace the uncertainty and take a data science based approach to estimating the chances that the person was the nanny or the mother of the child, based on the evidence people had from watching the news.

@David_Waddell What would that mean, please? Re-broadcasting it on BBC TV, or just here on Twitter? Is this kinda thing that goes ‘viral’ and gets weird?
— Robert E Kelly (@Robert_E_Kelly) March 10, 2017

What evidence did viewers of the video have?

the person is American Caucasian
the person is professional
there are two kids
the caretaker is Asian

We then look for probability values for these statistics. (Given that Professor Kelly is American, all statistics are based on US data.)

Probability (Asian Wife | Caucasian Husband) = 1% [Married couples in the United States in 2010]
Probability of (Household has Nanny | husband is professional) = 3.5% [The Three Faces of Work-Family Conflict, page 9, Figure 3]
Probability of (Asian | Nanny) = 6% [Caregiver Statistics: Demographics]
Probability of (Stay at home mom) = 14% and Probability of (Stay at home mom | Asian Wife) = 30%  [Stay-at-Home Mothers by Demographic Group ]

We define the following Bayesian network using the bnlearn package for R. We create the network using the model2network function and then we input the conditional probability tables (CPTs) that we know at each node.

net <- model2network("[HusbandDemographics][HusbandIsProfessional][NannyDemographics][WifeDemographics|HusbandDemographics][StayAtHomeMom|HusbandIsProfessional:WifeDemographics][HouseholdHasNanny|StayAtHomeMom:HusbandIsProfessional][Caretaker|StayAtHomeMom:HouseholdHasNanny][CaretakerEthnicity|WifeDemographics:Caretaker:NannyDemographics]")


The last step is to fit the parameters of the Bayesian network conditional on its structure, the function runs the EM algorithm to learn CPT for all different nodes in the above graph.

yn <- c("yes", "no")
ca <- c("caucacian","other")
ao <- c("asian","other")
nw <- c("nanny","wife")

cptHusbandDemographics <- matrix(c(0.85, 0.15), ncol=2, dimnames=list(NULL, ca)) #[1]
cptHusbandIsProfessional <- matrix(c(0.81, 0.19), ncol=2, dimnames=list(NULL, yn)) #[2]
cptNannyDemographics <- matrix(c(0.06, 0.94), ncol=2, dimnames=list(NULL, ao)) # [3]
cptWifeDemographics <- matrix(c(0.01, 0.99, 0.33, 0.67), ncol=2, dimnames=list("WifeDemographics"=ao, "HusbandDemographics"=ca)) #[1]
cptStayAtHomeMom <- c(0.3, 0.7, 0.14, 0.86, 0.125, 0.875, 0.125, 0.875) #[4]

dim(cptStayAtHomeMom) <- c(2, 2, 2)
dimnames(cptStayAtHomeMom) <- list("StayAtHomeMom"=yn, "WifeDemographics"=ao, "HusbandIsProfessional"=yn)

cptHouseholdHasNanny <- c(0.01, 0.99, 0.035, 0.965, 0.00134, 0.99866, 0.00134, 0.99866) #[5]
dim(cptHouseholdHasNanny) <- c(2, 2, 2)
dimnames(cptHouseholdHasNanny) <- list("HouseholdHasNanny"=yn, "StayAtHomeMom"=yn, "HusbandIsProfessional"=yn)

cptCaretaker <- c(0.5, 0.5, 0.999, 0.001, 0.01, 0.99, 0.001, 0.999)
dim(cptCaretaker) <- c(2, 2, 2)
dimnames(cptCaretaker) <- list("Caretaker"=nw, "StayAtHomeMom"=yn, "HouseholdHasNanny"=yn)

cptCaretakerEthnicity <- c(0.99, 0.01, 0.99, 0.01, 0.99, 0.01, 0.01, 0.99, 0.01,0.99,0.99,0.01,0.01,0.99,0.01,0.99)
dim(cptCaretakerEthnicity) <- c(2, 2, 2,2)
dimnames(cptCaretakerEthnicity) <- list("CaretakerEthnicity"=ao,"Caretaker"=nw, "WifeDemographics"=ao ,"NannyDemographics"=ao)

net.disc <-, dist=list(HusbandDemographics=cptHusbandDemographics, HusbandIsProfessional=cptHusbandIsProfessional, WifeDemographics=cptWifeDemographics, StayAtHomeMom=cptStayAtHomeMom, HouseholdHasNanny=cptHouseholdHasNanny, Caretaker=cptCaretaker, NannyDemographics=cptNannyDemographics,CaretakerEthnicity=cptCaretakerEthnicity))

Once we have the model, we can query the network using cpquery to estimate the probability of the events and calculate the probability that the person is the nanny or the wife based on the evidence we have (husband is Caucasian and professional, caretaker is Asian). Based on this evidence the output is that the probability that she is the wife is 90% vs. 10% that she is the nanny.

probWife <- cpquery(net.disc, (Caretaker=="wife"),HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian",n=1000000)
probNanny <- cpquery(net.disc, (Caretaker=="nanny"),HusbandDemographics=="caucacian" & HusbandIsProfessional=="yes" & CaretakerEthnicity=="asian",n=1000000)

[1] "The probability that the caretaker is his wife = 0.898718647764449"
[1] "The probability that the caretaker is the nanny = 0.110892031547457"

In conclusion, if you thought the woman in the video was the nanny, you may need to review your priors!
The bnlearn package is available on CRAN. You can find the R code behind this post here on GitHub or here as a Jupyter Notebook.

from Revolutions

Love is all around: Popular words in pop hits

Data scientist Giora Simchoni recently published a fantastic analysis of the history of pop songs on the Billboard Hot 100 using the R language. Giora used the rvest package in R to scrape data from the Ultimate Music Database site for the 350,000 chart entries (and 35,000 unique songs) since 1940, and used those data to create and visualize several measures of song popularity over time.
A novel measure that Giora calculates is “area under the song curve”: the sum of all the ranks above 100 for every week the song is in the Hot 100. By that measure, the most popular (and also longest-charting) song of all time is Radioactive by Imagine Dragons:

It’s turns out that calculating this “song integral” is pretty simple in R when you use the tidyverse:

calculateSongIntegral %
filter(EntryDate >= date_decimal(1960)) %>%
group_by(Artist, Title) %>%
summarise(positions = list(ThisWeekPosition)) %>%
mutate(integral = map_dbl(positions, calculateSongIntegral)) %>%
group_by(Artist, Title) %>%
tally(integral) %>%

Another fascinating chart included in Giora’s post is this analysis of the most frequent words to appear in song titles, by decade. He used the tidytext package to extract individual words from song titles and then rank them by frequency of use:

So it seems as though Love Is All Around (#41, October 1994) after all! For more analysis of the Billboard Hot 100 data, including Top-10 rankings for various measures of song popularity and the associated R code, check out Giora’s post linked below.
Sex, Drugs and Data: Billboard Bananas

from Revolutions

Microsoft R Open 3.4.0 now available

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.4.0 and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.4.0, reduces the size of the installer image, and updates the bundled packages.
R 3.4.0 (upon which MRO 3.4.0 is based) is a major update to the R language, with many fixes and improvements. Most notably, R 3.4.0 introduces a just-in-time (JIT) compiler to improve performance of the scripts and functions that you write. There have been a few minor tweaks to the language itself, but in general functions and packages written for R 3.3.x should work the same in R 3.4.0. As usual, MRO points to a fixed CRAN snapshot from May 1 2017, but you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).   
MRO is supported on Windows, Mac and Linux (Ubuntu, RedHat/CentOS, and SUSE). MRO 3.4.0 is 100% compatible with R 3.4.0, and you can use it with any of the 10,000+ packages available on CRAN. Here are some highlights of new packages released since the last MRO update.
We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.
MRAN: Download Microsoft R Open

from Revolutions

Create smooth animations in R with the tweenr package

There are several tools available in R for creating animations (movies) from statistical graphics. The animation package by Yihui Xie will create an animated GIF or video file, using a series of R charts you generate as the frames. And the gganimate package is an extension to ggplot2 that will create a movie from charts created using the ggplot2 syntax: in much the same way that you can create multiple panels using faceting, you can instead an animation with multiple frames.
But from a storytelling perspective, such animations can sometimes seem rather disjointed. For example, here’s the example (from the gganimate documentation) of crating an animated bubble chart from the gapminder data. (NOTE: to use the gganimate package, you will need to install ImageMagick. On Windows, be sure to select the “install legacy utilities” option during install: gganimate requires the “convert” utility to operate.)

The data are all there, but the presentation lacks the style of the original Gapminder presentation, which smoothly moved the bubbles between the known data points separated by five-year intervals. You can achieve a similar effect in R by using the tweenr package, available on CRAN. The tweenr package doesn’t actually perform any animation itself; rather, it calculates smooth interpolations between data points to create new rows in the data which can be used as frames in the animation created by the gganimate package. Here’s the same chart, with data processed by tweenr: 

To achieve this affect, I used the tween_elements function to augment the original gapminder data set to make 300 frames of animation, smoothing the transitions of Life Expectancy, GDP per capita, and population, and then animating the result with gganimate. This snimation tutorial by Peter Aldhous was most helpful in developing the code shown below.


gapminder_edit %
arrange(country, year) %>%
select(gdpPercap,lifeExp,year,country, continent, pop) %>%
rename(x=gdpPercap,y=lifeExp,time=year,id=country) %>%

gapminder_tween %
mutate(year = round(time), country = .group) %>%
left_join(gapminder, by=c(“country”,”year”,”continent”)) %>%
rename(population = pop.x)

p2 <- ggplot(gapminder_tween,
aes(x=x, y=y, frame = .frame)) +
geom_point(aes(size=population, color=continent),alpha=0.8) +
xlab("GDP per capita") +
ylab("Life expectancy at birth") +

gganimate(p2, filename="gapminder-tween.gif", title_frame = FALSE, interval = 0.05)

For another nice example using the tweenr package, check out this blog post by Jake Thompson replicating the Datasaurus Dozen as an animation in R. (Note that every frame of the animation has almost the exact same set of summary statistics.)

For more on the tweenr package, check out its homepage on Github from its author, Thomas Lin Pedersen.
Github (thomasp85): tweenr

from Revolutions

Preview of EARL San Francisco

The first ever EARL (Enterprise Applications of the R Language) conference in San Francisco will take place on June 5-7 (and it’s not too late to register).  The EARL conference series is now in its fourth year, and the prior conferences in London and Boston have each been a fantastic way to learn how R is used in real-world applications. Judging from the speaker lineup in San Francisco, next month’s event looks like it will feature some great R stores as well. Included on the program will be stories from the following companies:

AirBnB (Keynote presentation by Ricardo Bion)
StitchFix (Keynote presentation by Hilary Parker)
Uber (Keynote presentation by Prakhar Mehrota)
Hitachi Solutions (David Bishop)
Amgen (Daniel Saldana)
Pfizer (Luke Fostveldt)
Fred Hutch Cancer Research Center (Thomas Vaughan)
Domino Data Lab (Eduardo  Ariño de la Rubia)
Genentech (Gabriel Becker)

… and many more. On a personal note, I’m looking forward to presenting with my colleague Bharath Sankaranarayan on using Microsoft R to predict hospital stays.
For more information on EARL 2017 San Francisco or to register, follow the link below. (And if you can’t make San Francisco, check out EARL 2017 London, coming September 12-14.)
EARL: San Francisco 2017

from Revolutions

The history of the universe, in 20 minutes

Not content with merely documenting the history of Japan, Bill Wurtz is back with a history of the entire universe. It gets rather Earth-centric once it gets going, but I guess that was inevitable. There’s some NSFW language, but it’s very funny and packs a lot of information into its 20 minute runtime:

That’s all from us for this week. Have a great weekend, and we’ll be back on Monday.

from Revolutions