The 10 Data Science Crack Commandments

 It’s the ten crack commandments, what? homie can’t tell me nothing about this code Can’t tell me nothing about these #rstats

Number 1, make a function from a script. Everyone knows we’re too busy to be copy/pasting shit

Number 2, never let ’em know your data manipulation moves. Don’t you know Bad Boys move in silence and violence?

Number 3: never trust point-o-five p’s, your moms’ll set that ass up, properly gassed up, hoodie to mask up, for that fast buck

Number 4: I know you heard this before “Never compute high on your own CPU supply”

Number 5: never store PII where you rest at

Number 6: that goddamn STATA*? Dead it You think a crackhead paying you back, shit forget it! (*STATA/SAS/SPSS)

Numero Siete: this rule is so underrated Keep your training and test set completely separated Money and blood don’t mix like two…

Number 8, always keep survey weights on you. Them cats that squeeze your guns can ask what population your stats generalize to

Number 9 shoulda been Number 1 to me: If you ain’t gettin’ representative samples stay the fuck from police data

Number 10, a strong word called Bayes-i-an Strictly for live men, not for freshmen




Be a BigBallR in #rstats : Stayeth or Switcheth

If you’re a Big Baller, you know when to stayeth in your lane but also when to switcheth lanes.

The Big Baller Brand brothers just inked a deal to play professional basketball in Europe. They switched into a different lane to achieve their original goal.

The knee-jerk reaction from the non-globally minded is that this spells doom for any NBA hoop dreams. Not so.

Four score and ten years ago, your only shot at making it to the NBA was usually through the NCAA. Since then, the NBA has adopted a global perspective: you see professionals from different continental leagues finding a way into the NBA. The 2016 to 2017 diagram is interesting; we’re seeing a darker edge from the NBA to the G-League. The NBA Players Association worked hard on installing ‘2-way contracts’ letting NBA players play in (and get paid in) both the NBA and the G-League. A great move, a no-brainer.

Options are great, but keep in mind that the NBA is the crème de la crème. You see the lightest edges enter the NBA node while you see many dark edges enter the ‘none’ node.

When I started making the above network plot of basketball league-to-league migrations, I looked into off-the-shelf R packages. There’s a fragmented handful of R packages that do this. The kicker: none of them do directed edges the way I wanted.

If you’re an #rstats user and want to adopt the BigBallR philosophy, find multiple ways to achieve your goal.

I went back to basics and hand-rolled my own network plot with vanilla ggplot2. At the end of the day, a network plot needs two things, nodes and edges, e.g. points (stored in dat_nodes) and lines (stored in dat_edges). Once you have your data in that format, you can make the network migration plot above with the code snippet below

 ggplot(dat_nodes, aes(x = X1, y = X2)) +
 geom_point(aes(size = 10), show.legend = FALSE) +
 geom_text(aes(label = league), vjust = -1) +
 geom_curve(data = dat_edges,
 aes(x = X1_send, y = X2_send,
 xend = X1_rec, yend = X2_rec),
 arrow = arrow(angle = 30, length = unit(0.2, "inches")))

I hand-rolled the network diagram cuz the other packages didn’t have the custom features I needed. In my hand-rolled plot there is still one thing missing. I want to place the ‘arrow head’ along some other part of the curve (say the mid-point), other than the end-point. This is probably hard to do, since arrow() looks like it just needs the receiving coordinate and plops the arrow head there. For what I want, ggplot2 needs to know the desired placement-coordinates output from geom_curve(). So somehow, the internal curve calculations need to be returned in order to pass into arrow().
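One workaround sketch while we wait on ggplot2 internals: skip geom_curve() entirely, interpolate a quadratic Bézier yourself, and drop the arrow head at whichever interpolated point you like. Here bezier_pts() is my own made-up helper, not ggplot2 API, and its control-point offset only approximates what geom_curve(curvature=) draws:

```r
library(ggplot2)

# toy edge in the same shape as dat_edges (column names assumed from above)
dat_edges = data.frame(X1_send = 0, X2_send = 0, X1_rec = 1, X2_rec = 1)

# hand-rolled quadratic Bezier: control point offset perpendicular to the chord
bezier_pts = function(e, curvature = 0.3, n = 50){
  t  = seq(0, 1, length.out = n)
  mx = (e$X1_send + e$X1_rec)/2 - curvature * (e$X2_rec - e$X2_send)
  my = (e$X2_send + e$X2_rec)/2 + curvature * (e$X1_rec - e$X1_send)
  data.frame(x = (1 - t)^2 * e$X1_send + 2*(1 - t)*t*mx + t^2 * e$X1_rec,
             y = (1 - t)^2 * e$X2_send + 2*(1 - t)*t*my + t^2 * e$X2_rec)
}

pts = bezier_pts(dat_edges[1, ])
mid = nrow(pts) / 2   # arrow head at the curve's mid-point, not the end-point

ggplot(pts, aes(x, y)) +
  geom_path() +
  geom_segment(data = data.frame(x = pts$x[mid - 1], y = pts$y[mid - 1],
                                 xend = pts$x[mid], yend = pts$y[mid]),
               aes(x = x, y = y, xend = xend, yend = yend),
               arrow = arrow(angle = 30, length = unit(0.2, "inches")))
```

Since you control every interpolated point, the arrow head can sit anywhere along the curve, not just where arrow() plops it.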


Weird Al Yankovise() a 2 Chainz Lyric

I have a character string snipped from a lyric by Daniel Son the Necklace Don.

Let’s get all Weird Al Yankovic with it.

I’ll revise some words with the datzen::yankovise() function. You can swap out words using a dictionary of name-value pairs.

paste0(snip <<- "Suede sun roof, hanging out the big top We leave the dealership, head to the rim shop", " - @2chainz aka the Hair Weave Killer") 
#> [1] "Suede sun roof, hanging out the big top We leave the dealership, head to the rim shop - @2chainz aka the Hair Weave Killer"

Above, we have a bar from 2 Chainz aka the Hair Weave Killer.

Below, we have the yankovised bar from @2Chainz_PhD aka the CPU Core Killer aka ProbaBittyBoi aka Daniel Son the Data Don aka El Efron Jr.

# user supplied dictionary
dict_outin = c(
datzen::dictate_outin('dealership','SERVER ROOM'),
datzen::dictate_outin('big','LAP'),
datzen::dictate_outin('rim','RAM')
)

yankovise(snip,suffix="- @2Chainz_PhD",dict_outin = dict_outin)
#> [1] "suede sun roof hanging out da LAP top we leave da SERVER ROOM head 2 da RAM shop - @2Chainz_PhD"

You might ask, isn’t this just a wrapper to gsub() or stringr::str_replace_all() with some added flavor? I might say, yes, yes it is… only with a narrower scope and whose output is streamlined as tweet-ready text.
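For the curious, a minimal sketch of what that gsub() core could look like. This is not the real datzen internals, just a hypothetical yankovise_lite() that lowercases the bar and swaps words via a name-value dictionary:

```r
# hypothetical sketch of a yankovise-style word swapper (not datzen source)
yankovise_lite = function(snip, dict, suffix = ""){
  out = tolower(snip)                          # flatten the bar to lowercase
  for(w in names(dict)){
    out = gsub(w, dict[[w]], out, fixed = TRUE)  # swap each dictionary word
  }
  paste(out, suffix)                           # tack on the tweet-ready tag
}

dict = c('dealership' = 'SERVER ROOM', 'rim' = 'RAM')
yankovise_lite("We leave the dealership, head to the rim shop",
               dict, suffix = "- @2Chainz_PhD")
#> [1] "we leave the SERVER ROOM, head to the RAM shop - @2Chainz_PhD"
```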

Get out of my way! Dunk thru #rstats errors like the Big Shaq-istician

Ahh, leaves falling, parents crying, collegicians biking uphill with a bag of in-n-out in between their teeth. Must be the new academic school year!

I figured it’s a good time to introduce my work-in-progress datzen package of miscellaneous #rstats functions.  You can bee-line straight to the github readme with more examples.

Or stick around and I’ll highlight the Shaq example showcasing datzen::itersave()

In #rstats, if you want to iterate, you can go about it in many different ways: for loops, lapply, purrr::map, and friends. They work pretty well for “homogeneous” iterations.

As good as they are, the standard approaches hit snags for “non-homogeneous” iterations, eg data from the web.

Go ahead, try them. I dare you.

You in 5 hours

“Aw shit, my brute force for loop crapped the bed during iteration 69. Now I have to manually restart it. I hope it doesn’t do it again. I’m running out of patience, and linen.”

Let’s take a look. The Big Aristotle, Dr. Shaq, was a notorious brute on the hardwood. Here he is, contemplating how he should score in the paint:

shaq = function(meatbag){
if(meatbag %in% 'scrub'){return('dunk on em')}
if(meatbag %in% 'sabonis'){return('elbow his face')}
if(!(meatbag %in% c('scrub','sabonis'))){
stop('shaq is confused')}
}

meatbags = c('scrub','sabonis','scrub','kobe')
names(meatbags) = paste0('arg_',seq_along(meatbags))

lapply(meatbags, shaq)
#> Error in FUN(X[[i]], ...): shaq is confused

Uh, some error confused Shaq.

enter, stage trap door
“Meet itersave()

front row faints
“It’s… hideously beautiful”

In a nutshell, itersave() works like lapply(), but when it meets an ugly, unskilled, unqualified, and ungraceful error, it keeps trucking along, like Shaquille “The Diesel” O’Neal hitchhiking a ride on Chris Dudley’s back.
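The gist of the trick, in a hedged sketch (assumed behavior, not the actual datzen source), reusing shaq() and meatbags from above: wrap each call in tryCatch(), and save every iteration to its own .rds so a blow-up at arg_4 never costs you args 1 through 3:

```r
# itersave-style sketch: apply f to each named arg, persist every result,
# and never let one error halt the loop
itersave_lite = function(f, args, dir = tempdir()){
  for(i in seq_along(args)){
    print(paste(i, 'of', length(args)))           # progress log
    res = tryCatch(f(args[[i]]), error = function(e) e)
    fname = file.path(dir, paste0(names(args)[i], '.rds'))
    saveRDS(res, fname)   # errors get saved too, as condition objects
  }
  invisible(dir)
}

itersave_lite(shaq, meatbags)  # keeps trucking past 'shaq is confused'
```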


#> [1] "1 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_1"
#> [1] "2 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_2"
#> [1] "3 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_3"
#> [1] "4 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_4"

The meatbags that Shaq successfully put into bodybags:

print('the successes')
#> [1] "the successes"
#> [1] "arg_1.rds" "arg_2.rds" "arg_3.rds" "failed"

It’ll also bookkeep any errors along the way via purrr::safely() and R.utils::withTimeout().

print('the failures')
#> [1] "the failures"
#> [1] "arg_4.rds"

Along with the ‘out’, itersave() has an ‘in’ companion.

enter, zipline from balcony
“meet iterload()

audience faints

#> $arg_4
#> $arg_4$ind_fail
#> [1] 4
#> $arg_4$input_bad
#> [1] "kobe"
#> $arg_4$result_bad
#> <simpleError in (function (meatbag) {    if (meatbag %in% "scrub") {        return("dunk on em")    }    if (meatbag %in% "sabonis") {        return("elbow his face")    }    if (!(meatbag %in% c("scrub", "sabonis"))) {        stop("shaq is confused")    }})("kobe"): shaq is confused>

Ah, it was the 4th argument, Kobe, that boggled Shaq’s mind.

“Jigga man [was] Diesel, when he [used to] lift the 8 Up” – Jay-Z

*Wiping away my sad Laker tear from my face while I type this*

“What could have been man, what could have been.”

R.I.P Frank Hamblen

Anyways, Shaq wised up in Miami. He also fattened up in Phoenix, Cleveland, Boston, Hawaii, Catalina, etc.

shaq_wiser = function(meatbag){
if(meatbag %in% 'scrub'){return('dunk on em')}
if(meatbag %in% 'sabonis'){return('elbow his face')}
if(meatbag %in% 'kobe'){return('breakup & makeup')}

if(!(meatbag %in% c('scrub','sabonis','kobe'))){
stop('shaq is confused')}
}

#> [1] "1 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_1"
#> [1] "2 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_2"
#> [1] "3 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_3"
#> [1] "4 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_4"

So, give me the whole shebang. What was the whole story of Shaq’s road trip?

out_il = iterload(paste0(mainDir,subDir))
#>       meatbags  out_il            
#> arg_1 "scrub"   "dunk on em"      
#> arg_2 "sabonis" "elbow his face"  
#> arg_3 "scrub"   "dunk on em"      
#> arg_4 "kobe"    "breakup & makeup"

So, if you use bare bones for loops or lapply you’ll crap out immediately when you hit an error.

On the other hand, even using purrr::map with purrr::safely, by design, it’ll do everything in one shot (eg batch results). This is not ideal when working with stuff online. When you backtrack to resolve unforeseen edge-cases, it’ll feel like a Cantor set.
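To see that batch behavior concretely, here is purrr::safely wrapped around the shaq() function and meatbags vector from above. Every error is caught, but you only get to inspect anything after the whole batch finishes:

```r
library(purrr)

# safely() returns a function that never errors: each call yields
# a list with $result (on success) and $error (on failure)
safe_shaq = safely(shaq)
res = map(meatbags, safe_shaq)

res$arg_1$result
#> [1] "dunk on em"
res$arg_4$error$message
#> [1] "shaq is confused"
```

Nothing is persisted along the way, which is exactly the gap itersave() fills for long-running web jobs.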

For web data in the wild, expect the unexpected. That’s why I baked up itersave. You have non-homogeneous edge cases aplenty.

These Chris Dudley looking edge cases are just waiting in the bushes for you.

Dunk thru them.

Nasty Nas’ Nasty Rubdown via `magick`


We have 2 legends, Biggie Smalls and Nas. At the 1:00 mark, Nasty Nas receives a Nasty Rubdown. Pretty sure this was the inspiration for Boosie’s Wipe Me Down.

I made a .gif version using a pen, a tablet, and command line ‘ImageMagick‘.

But the resulting FPS was slow, so I decided to try out Jeroen’s R package, magick, to tune settings for the sped-up version below.



I could have totally ‘tuned’ these settings in standalone ‘ImageMagick’, but I like the comforting caress of R’s function syntax.

Some of magick’s R bindings can immediately accept a ‘.gif’, so you can do things like

nas_gif_original %>% image_chop(.,'0x10') %>% image_animate(.,fps=10)
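A slightly fuller pipeline might look like the sketch below. The file names are hypothetical, but the verbs are real magick bindings: read the gif, trim a strip off each frame, re-animate at a faster frame rate, and write it back out:

```r
library(magick)

# file names are made up for illustration
nas_gif_original = image_read('nas_rubdown.gif')

nas_gif_fast = nas_gif_original %>%
  image_chop('0x10') %>%      # chop 10px off the bottom of every frame
  image_animate(fps = 20)     # re-animate faster than the sluggish original

image_write(nas_gif_fast, 'nas_rubdown_fast.gif')
```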

There you have it. Biggie, Nas, an enthusiastic head caresser, pngs, and gifs. Brought to you by R and magick.

A gist to the R script is below.

Chop It: Look up the Generating Data Frame Columns of a Formula Term

We the moody Gucci, Louis and Pucci men
Escada, Prada
The chopper it got the Uzi lens
Bird’s-eye view
The birds I knew, flip birds
Bird gangs, it was birds I flew

Say you use the base #rstats lm() command

lm(data=dat_foo[,c('y','x1','x2')],y ~ x1 + x2 + x1:x2)

I want to be able to map the single formula term x1:x2 to the two
‘generating’ columns  dat_foo[,c('x1','x2')]

In words, for a term in a ?formula, lookup the involved ‘root’ columns of the data frame inside the formula’s associated environment.

I feel like this mapping must exist under the lm() hood somewhere. Various Stack Overflow Q&As about formulas never directly talk about this lookup. This RViews blog post sums up the formula landscape pretty well. But there does not seem to be a conveniently exposed lookup/hash table of the df-column-to-term mapping.

I had to hand-roll the few lines of code to implement the hash/lookup table myself. My solution is ‘loose’ since it chops up the terms in the formula, then creates a sub-formula for each chopped term.
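Here is the loose approach in a few base-R lines: chop the formula into its term labels with terms(), then recover each term’s generating variables with all.vars():

```r
# chop the formula into term labels, then look up each term's root columns
f = y ~ x1 + x2 + x1:x2
labs = attr(terms(f), "term.labels")
labs
#> [1] "x1"    "x2"    "x1:x2"

# parse each label back into a call and extract its variables
lookup = lapply(labs, function(l) all.vars(parse(text = l)[[1]]))
names(lookup) = labs
lookup$`x1:x2`
#> [1] "x1" "x2"

# so the generating columns of x1:x2 are dat_foo[, lookup$`x1:x2`]
```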

Is there a better / preferred way?

Bill and Ted Make the best out of a Shi… Stata situation: RStudio + RStata + Stata

After rewatching the Thanksgiving classic, Bill and Ted’s Excellent Adventure, I was reminded of the history of #Rstats and its current status as the de facto software for general data programming.

The most excellent thing about R is the literate programming options you have. As a data analyst, you are Bill S. Preston Esquire (or Ted “Theodore” Logan; they are interchangeable). RStudio is the time traveling phone booth. Since its conception, Rstats had Sweave’s phone number on speed dial. Now, RStudio has Rmarkdown. Compare this situation with… Stata. Stata is Genghis Khan.

Seeing Mine Çetinkaya-Rundel’s post about the joys of Stata,

During these discussions a package called RStata also came up. This package is [a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R.” Looks promising as it should allow running Stata commands from an R Markdown chunk. But it’s really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can’t imagine teaching Stata and saying to students “first download R”. Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…

I decided to see for myself how (un)approachable writing narratives for literate programming in Stata really is.


If Plato pitched his ideal to So-crates, he would claim:

Integrating Rstudio + Rmarkdown + R + RStata, should give you the best of 3 worlds

1) Write narratives that are human-readable

2) Manipulate data with human-readable R code

3) Have ‘paid-for-assurance’ of Stata analysis commands

But!  Bill and Ted would probably get bogged down during the setup. The key overhead step is to make sure Bill’s RStata package plays nicely with his local copy of Stata.

This is like chaperoning Genghis Khan in a shopping mall by letting him run loose without an adult-sized child leash. He might be enjoying a delicious Cinnabon all by his lonesome, or he might be playing home run derby with a mannequin’s head.

It depends on Genghis’ mood, aka the disgruntled software versions in his computing environment.

The setup overhead is definitely an obstacle to adoption. You also need to version-control RStudio (undergoing rapid development) for its notebook feature, and you need to align the Stata version (with its yearly business-as-usual updates).

I can only see this being useful if Ted is a Stata user with a near ‘final’ Stata .do file that he wants to share in a reproducible manner. During his presentation to his high school history class, Ted would just narrate his analysis center stage via markdown and whenever a result needs to be piped in, he could just chunk-source the .do file in Rstudio (like pulling Ghengis Khan out of the phone booth). Most Excellent.
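Ted’s setup might look like the sketch below. The Stata path, version, and .do file name are examples only; yours will differ, and this is the overhead step where RStata has to find Bill’s local copy of Stata:

```r
library(RStata)

# point RStata at the local Stata install (path and version are examples)
options("RStata.StataPath" = "/Applications/Stata/StataSE.app/Contents/MacOS/stata-se")
options("RStata.StataVersion" = 14)

# chunk-source a near-final .do file from an R Markdown chunk
# (file name is hypothetical)
stata("ted_history_report.do")

# or run inline Stata commands, shipping an R data frame in
stata("summarize mpg", data.in = mtcars)
```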


The gist below is my standalone R Notebook demo that should work if you have RStudio 1.0.44 and Stata 14. Your Mileage May Vary, with or without a time traveling phone booth.



Use Rstats to Share Google Map Stars with Friends

On my trip to Japan, I took this photo of the stairs leading to the “Rucker Park of Tokyo.” I crossed up some Tokyo cats, they were garbage. That one girl behind the blue pillar was practicing her hip hop power moves. She thought no one could see, but I saw.


I’ve been traveling. I’ve been starring places on Google Maps. I want to share my recs with friends. I also want to receive recs from friends. See this Wired article that came out today!


“Google Maps” (what you use on the phone) exports ‘.json’ data


“Google My Maps” (what you share with friends) CANNOT import ‘.json’ data


For something like this, Gavin Belson would rip a new hole in some unsuspecting Hooli employee. I really hope the engineers of “Google (My) Maps” eventually roll out a backend feature that would make this post obsolete.


This is why you’re here, we’re going to fill the middle gap with a very easy #rstats script.

Step 1) Google Takeout > Google Maps (your places) > export json

Step 2) Use R to manipulate json then export a csv spreadsheet




Step 3) Google My Maps > Upload csv spreadsheet via Drag + Drop

Step 4) Share the url link of your new map with friends

Here’s my Google My Maps of Japan


Spread the word, use this method, play a game of around the world… around the world 😉 , and share your recs.

PS, Shoutouts to seeing Slow Magic at O-nest in Shibuya

PPS, Sending good vibes for the recovery from the recent earthquake.



# read in .json data
# Google -> Takeout -> Google Maps (My Places) -> Saved Places.json

library(jsonlite)
library(dplyr)

txt = '~/projects/Saved Places.json'
dat = fromJSON(txt, flatten = TRUE)

# keep the useful parts
df_feat = flatten(dat$features)
df_dat = df_feat %>%
select(`properties.Location.Business Name`,
`properties.Location.Geo Coordinates.Latitude`,
`properties.Location.Geo Coordinates.Longitude`,
`properties.Location.Address`)

# subset to specific geographies
# method 1, grep for the country/state in the address (easier)
# (address field name assumed from Takeout's flattened json)
dat_jap = df_dat %>%
filter(grepl('Japan', `properties.Location.Address`))

# export to a csv spreadsheet
write.csv(dat_jap, '~/projects/saved_places_japan.csv', row.names = FALSE)

# upload csv into Google My Maps to share

Lakers Lent: Chuck should have fasted sooner and Historical Win Trajectories

For the 2015 NBA season, the only exciting Lakers news is the return of the Kobe show and Charles Barkley’s Lakers Lent. The Lakers started the season with 0 wins and 5 losses, amazingly bad. The round mound of rebound started Lakers Lent, fasting until the Lakers won. This week, Chuck finally ate and the Lakers finally got a win, advancing to 1 and 5 against the new look Charlotte Hornets. The following game, the Lakers lost to the Grizzlies.

What did Charles eat, is the question? Easy, I say organic foie gras milk shakes. The interesting question is: what other times in history have teams started 1 and 5? Starting under those conditions, where did they end up and what win trajectory paths did they follow? For all historical 82 game seasons (thus excluding pre-1968 and the two lockout seasons) there have been 121 times where teams started 1 and 5, highlighted in cyan. Following these win paths, things look pretty grim. In general, teams end up in the tail of the pack, scum eaters, cockroaches.

However, we notice a difference between seasons. In the current era, the final location at game 82 is more spread out (more variation), bottom feeder teams have more hope for positive win mobility, whereas teams in the older eras were stagnant (less variation), more likely remained near the bottom.

So, Chuck should have had many Lents. Mavs Lent, Rockets Lent, and Knicks Lent. Before Chuck and Angelinos say “those aren’t the Lakers, they’re just wannabes that look like them,” there’s some hope. That is, if the Lakers do not purposely go all out tank mode like the doormat 76ers. Up next, an interactive version that lets you choose the initial conditions.


As my favorite statistician, Nas, said, “no idea’s original under the sun.” Substituting professions for ideas, Hadley Wickham is a modern day blacksmith forging open access [R] weapons. All of this analysis is possible because of open access statistics. Specifically, a combination of rvest for web scraping data from basketball-reference, dplyr for shaping the data, and ggplot2 for graphics.
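The scraping step might look like the sketch below. The basketball-reference URL and table position are assumptions (the site’s layout changes over the years), and the dplyr step is only roughed in as a comment:

```r
library(rvest)
library(dplyr)

# URL and table index are assumptions, for illustration only
url = 'https://www.basketball-reference.com/leagues/NBA_2015_games.html'
games = read_html(url) %>%
  html_table() %>%   # all tables on the page, as a list
  .[[1]]             # grab the schedule/results table

# then dplyr shapes it into per-team cumulative win trajectories, roughly:
# games %>% group_by(team) %>% arrange(date) %>% mutate(cum_wins = cumsum(won))
```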

SAS,PHO,LAL : “Untangibles” Not Captured by the Box Score

Our goal here is to structure and quantify the ‘Untangible’ attributes that do not show up in the end game boxscores. We overcome this boxscore ‘low-resolution’ measurement issue with modern statistical techniques, namely a Bayesian model. For each game, we structure the relationship between the teams’ points, their boxscores, and the teams’ additional untangible (latent) component of variation.

As a proxy for a team’s average ‘untangible’ effects, we incorporate latent variation due to unmeasured, important, micro-scale events. Deep playoff teams like OKC, MIA, and SAS appear at the top of the ‘untangibles’ chart. Surprisingly, the Toronto Raptors and the Phoenix Suns are right behind the Spurs. The Lakers and their historical rivals, the Celtics, appear near the bottom. Both teams were in rebuilding mode in 2013.


The championship Spurs had extremely great chemistry. Coach Popovich knows a good thing when he sees it. For nearly a decade, he has steered the Spurs ship with Tim Duncan, Manu Ginobiflop, and Tony Parker as the core. ‘Eyeballing’ a Spurs game, the dazzling buffet of ball movement was easy on the eyes, like a fast paced soccer match. Further, running more plays for the Riverside, CA native, Kawhi Leonard, was like Vin Diesel hitting the NOS button. Shoutouts to the 909.


Phoenix had pocket rockets. The dynamic duo point guard combo of Goran Dragic and Eric Bledsoe was a refreshing yet effective approach for the Suns. This system went against the traditional cookie cutter lineup you see across the league.


The rehabbing Lakers were vomiting in the boxscore on a nightly basis. It was obvious the Lakers were going to be painfully bad. With the lame duck coaching situation of pringles D’Antoni and the spread of contagious injuries, fans knew to expect poor performance. Thus, the Lakers are near the bottom of the untangibles chart. The small ‘untangible’ effect tells us the variation in the Lakers point rates were already well accounted for in their disgusting boxscore measures.


Boxscores are useful as descriptive end game summaries. Although rife with information, you always hear the criticism, for good reason, that important ‘untangible’ attributes do not show up in the boxscore. As the name suggests, it all boils down to a purely ‘measurement’ issue; the boxscores lack the high resolution necessary for capturing dynamical point increasing game effects like: hustle, hands in the face, box outs, helping the help defender, spacing configurations, player interactions, etc.

Using box scores, we look at a season’s worth of (30 x 82 / 2) = 1,230 match-ups of pairwise (home team I versus away team J) combinations. To build the model (figure below), we make the following contextual assumptions:

1) We care about Wins, but we really care about Points (most points wins the game): For a game, each team’s Point Rate is the bivariate Poisson outcome.
2) Team I and team J’s point rates depend on boxscores: Via principal components of two way interactions of the raw boxscores (Field Goals, Rebounds, Turnovers, etc).
3) After accounting for boxscores, we structure what’s leftover: Use team level random effects.
4) Team I just ‘matches better’ against team J: The random effects are allowed to be pairwise correlated.
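In symbols, the four assumptions above can be sketched as follows (notation is mine and simplified; the actual model likely differs in details):

```latex
% 1) a game's points: bivariate Poisson point rates for teams I and J
(Y_I, Y_J) \sim \text{BivariatePoisson}(\lambda_I, \lambda_J)

% 2) + 3) log point rates: boxscore principal components plus leftover
%         team-level random effects
\log \lambda_I = \mathbf{pc}_I^{\top}\boldsymbol\beta + u_{IJ}, \qquad
\log \lambda_J = \mathbf{pc}_J^{\top}\boldsymbol\beta + u_{JI}

% 4) pairwise-correlated random effects: "team I matches better against J"
(u_{IJ}, u_{JI})^{\top} \sim \mathcal{N}(\mathbf{0}, \Sigma)
```

The off-diagonal of Σ is what lets the model say team I simply ‘matches better’ against team J, beyond what the boxscores explain.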


We demonstrated that although boxscores are limited, they are still helpful when used with appropriate methods. An obvious alternative for studying boxscore untangibles is to approach the problem with ‘high-resolution’ data, like SportVU, which directly lets you define and measure what the boxscore untangibles are. After talking to some NBA franchises, teams are barely scratching the surface of SportVU: setting up databases, defining the measures, and doing basic visualizations. To truly harness this extra information, NBA franchises need to start shifting towards model based analysis like these guys