The Hawkes Hand

Thanks to Luke Bornn & Tim Swartz, I had the opportunity to present the Hawkes Hand at CASSIS18. I learned a lot, had a hell of a time & met some great people #RuffRuff. My Hawkes Hand slides are here.

 

 

"The Hot Hand is real. If not, why do players even warm up before the game starts?" – Me

 

In the 80s, Gilovich, Vallone, and Tversky (GVT) looked at the hot hand through a difference of conditional probabilities that depends on the analyst explicitly conditioning on shooting-streak sequences. When the sequence is a string of 3 makes, this is like "NBA Jam," a video game from the 90s.
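For intuition, here is a toy version of that conditional comparison in R, using simulated 0/1 shots (an illustration of the idea, not GVT's exact estimator):

# toy GVT-style check: compare P(make | previous 3 makes) to the overall make rate
set.seed(1)
shots <- rbinom(500, 1, 0.46)   # 1 = make (H), 0 = miss (T)
streak3 <- sapply(4:length(shots), function(i) all(shots[(i - 3):(i - 1)] == 1))
p_after_streak <- mean(shots[4:length(shots)][streak3])
p_overall <- mean(shots)
c(p_after_streak = p_after_streak, p_overall = p_overall)

Note that the comparison ignores when the shots happened, which is exactly the gap discussed next.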

 


Are there alternative ways to frame the hot hand other than conditioning on sequential coin flip orderings (H is a make and T is a miss)? If you observed an HHH sequence, you might naturally say that the player was hot. But what if I told you that the made shots were evenly spread across 48 minutes? If the player only attempted and made all 3 shots at the cumulative minute marks of 11:30, 23:30, and 35:30, would you still say that the player was hot in the 4th quarter? Alternatively, even if you observed a miss in the sequence HHT within a short time frame (3 minutes), this could very well be evidence of a player heating up. Temporal information is important when contextualizing shooting streaks of makes or misses, information NOT used by the GVT approach.

 

A professional NBA player is expected to have world-class ability in judging their own latent psycho-bio-mechanical state, then acting upon it. Seeing a player attempt many shots in a short cluster of time would suggest that the player is in a self-excited shooting state, perhaps self-confident in his shot-making ability. Instead of the GVT approach, we use a Hawkes model, a way to model temporal clustering in point processes. Specifically, we use the Epidemic Type Aftershock Sequences (ETAS) model.

 

ETAS is a special case of the Hawkes model, where the ETAS model allows earthquakes of different magnitudes to have different contributions to productivity. A large-magnitude earthquake might generate more aftershocks than a smaller-magnitude earthquake. In the Hawkes Hand context, we let the 'local field goal percentage' be the 'magnitude' of a made shot. A made shot with a higher localized field goal percentage may generate more aftershock-makes than a made shot with a lower field goal percentage.
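To make the analogy concrete, here is a minimal sketch of an ETAS-style conditional intensity in R. The parameter names (mu, k0, alpha, beta) and the exponential decay kernel are my illustrative choices, not the fitted Hawkes Hand model:

# ETAS-style conditional intensity: background rate mu, plus a contribution
# from each past make that grows with its 'magnitude' and decays with time
etas_intensity <- function(t, event_times, magnitudes,
                           mu = 0.1, k0 = 0.5, alpha = 1, beta = 1) {
  past <- event_times < t
  mu + sum(k0 * exp(alpha * magnitudes[past]) *
             beta * exp(-beta * (t - event_times[past])))
}

# example: makes at minutes 2, 3, and 10, with 'local FG%' playing the magnitude role
etas_intensity(t = 11, event_times = c(2, 3, 10), magnitudes = c(0.4, 0.6, 0.8))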

 

Our ETAS approach gives us 2 dimensions of hotness (hawkesness):

 

1. The productivity parameter in the ETAS model summarizes the average number of aftershock-makes generated by a parent make.
2. The logistic regression parameter links the historical intensity of makes to the local field goal percentage (using a geometric random variable to preserve information about the misses between two makes); a toy stand-in for this link is sketched after this list.
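The toy stand-in below replaces the geometric construction with a per-shot logistic regression; the decaying sum hist_intensity and the decay rate 0.5 are my simplifications, not the paper's specification:

# toy link between 'historical intensity' and make probability
set.seed(2)
n <- 300
shot_times <- sort(runif(n, 0, 48))
made <- rbinom(n, 1, 0.45)
hist_intensity <- sapply(seq_len(n), function(i) {
  prev <- which(made[seq_len(i - 1)] == 1)
  sum(exp(-0.5 * (shot_times[i] - shot_times[prev])))
})
fit <- glm(made ~ hist_intensity, family = binomial())
coef(fit)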

 

For those reading at home, try to match the 3 games of varying hotness to:

 

Shuffle {Klay Thompson, Tracy McGrady, Kobe Bryant}

 

 

Further, these 2 dimensions of hawkesness suggest that there are a handful of games in the 2005-2006 season where Kobe was hotter than he was for his 81 point game on January 22, 2006 (in red below).

 

[Figure: scatter_hawkes — scatter plot of the two hawkesness dimensions across games, with the 81-point game in red]

 

If you were an NBA player, how would you react if the league office got rid of pre-game warmups?

 

PS, we had another presentation at #CASSIS18 where Aaron talked about relational networks in basketball. Find out more about CoordiNet and Dr. Aaron at https://www.aaronjdanielson.com/

 

PPS, we used #rstats for all of our analysis and graphics. From the presentations to the posters, many of the presenters used R. Talking with analytics staff (from the NBA, MLB, and the NFL), I learned a lot of them use R in their internal systems.



The 10 Data Science Crack Commandments

¬†It’s the ten¬†crack¬†commandments, what? homie can’t tell me nothing about this code Can’t tell me nothing about these #rstats

Number 1, make a function from a script. Everyone knows we're too busy to be copy/pasting shit

http://adv-r.had.co.nz/Functions.html

Number 2, never let 'em know your data manipulation moves. Don't you know Bad Boys move in silence and violence?

Number 3: never trust point-o-five p’s, your moms’ll set that ass up, properly gassed up, hoodie to mask up, for that fast buck

https://www.nature.com/articles/s41562-017-0189-z

Number 4: I know you heard this before “Never compute high on your own CPU supply”

Number 5: never store PII where you rest at

https://www2.census.gov/foia/ds_policies/ds007.pdf

Number 6: that goddamn STATA*? Dead it You think a crackhead paying you back, shit forget it! (*STATA/SAS/SPSS)

https://thomaswdinsmore.com/2018/03/07/sas-is-on-the-brink-of-something/#comment-10243

Numero Siete: this rule is so underrated Keep your training and test set completely separated Money and blood don't mix like two…

https://statistics.stanford.edu/research/estimating-error-rate-prediction-rule-improvements-cross-validation

Number 8, always keep survey weights on you. Them cats that squeeze your guns can ask what population your stats generalize to

https://www.statschat.org.nz/2016/10/25/oversampling/

Number 9 shoulda been Number 1 to me: If you ain’t gettin’ representative samples stay the fuck from police data

https://www.vox.com/2016/7/11/12148452/police-shootings-racism-study

Number 10, a strong word called Bayes-i-an Strictly for live men, not for freshmen

https://projecteuclid.org/euclid.aos/1176346785

 

#RIPBIGGIE

Be a BigBallR in #rstats: Stayeth or Switcheth

If you’re a Big Baller, you know when to stayeth in your lane but also when to switcheth lanes.

The Big Baller Brand brothers just inked a deal to play professional basketball in Europe. They switched into a different lane to achieve their original goal.

The knee-jerk reaction from the non-globally minded is that this spells doom for any NBA hoop dreams. Not so.

Four score and ten years ago, your only shot of making it to the NBA was usually through the NCAA. The NBA has since adopted a global perspective. You see professionals from different continental leagues finding a way into the NBA. The 2016 to 2017 diagram is interesting: we're seeing a darker edge from the NBA to the G-League. The NBA Players Association worked hard on installing '2-way contracts' letting NBA players play in (and get paid in) both the NBA and the G-League; this was a great move, a no-brainer.

Options are great, but keep in mind that the NBA is the crème de la crème. You see the lightest edges enter the NBA node while you see many dark edges enter the 'none' node.

When I started making the above network plot of basketball league-to-league migrations, I looked into off-the-shelf R packages. There's a fragmented handful of R packages to do this. Kicker: none of them do directed edges the way I wanted.

If you’re an #rstats user and want to adopt the BigBallR philosophy, find multiple ways to achieve your goal.

I went back to the basics and hand-rolled my own network plot with vanilla ggplot2. At the end of the day, a network plot needs two things: nodes and edges, i.e., points (stored in dat_nodes) and lines (stored in dat_edges). Once you have your data in that format, you can make the network migration plot above with the code snippet below.

library(ggplot2)
library(grid)  # for arrow() and unit()

# nodes: one row per league, with layout coordinates X1, X2 and a league label
# edges: one row per migration, with sender (X1_send, X2_send) and
#        receiver (X1_rec, X2_rec) coordinates
ggplot(data = dat_nodes, aes(X1, X2)) +
  geom_point(size = 10, show.legend = FALSE) +
  geom_text(aes(label = league), vjust = -1.5) +
  geom_curve(data = dat_edges,
             aes(x = X1_send, y = X2_send,
                 xend = X1_rec, yend = X2_rec),
             arrow = arrow(angle = 30, length = unit(0.2, "inches")),
             alpha = 0.03)
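For concreteness, here is a tiny hypothetical dat_nodes / dat_edges pair in the shape the snippet above expects (the coordinates and the league list are made up):

# hypothetical node layout: each league gets a point
dat_nodes <- data.frame(
  league = c("NBA", "G-League", "Euro", "none"),
  X1 = c(0, 1, 2, 3),
  X2 = c(0, 1, 0, 1)
)

# hypothetical edges: each migration gets sender and receiver coordinates
dat_edges <- data.frame(
  X1_send = c(0, 1, 2), X2_send = c(0, 1, 0),
  X1_rec  = c(1, 0, 3), X2_rec  = c(1, 0, 1)
)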

I hand-rolled the network diagram cuz the other packages didn't have the custom features I needed. In my hand-rolled plot there is still one thing missing. I want to place the 'arrow head' at some other point along the curve (say the mid-point) rather than at the end-point. This is probably hard to do, since arrow() looks like it just needs the receiving coordinate and plops the arrow head there. For what I want, ggplot2 would need to expose the placement coordinates computed by geom_curve(). So somehow, the internal curve calculations need to be returned in order to pass into arrow().
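One workaround I have not fully vetted: approximate the mid-point arrow with a short geom_segment drawn around the straight-line midpoint of each edge. It sits slightly off a curved edge, but it conveys direction:

# approximate arrow heads near the straight-line midpoint of each edge;
# 'frac' controls how far along the edge the short arrow segment sits
frac <- 0.5
dat_mid <- transform(dat_edges,
                     x_mid  = X1_send + frac * (X1_rec - X1_send),
                     y_mid  = X2_send + frac * (X2_rec - X2_send),
                     x_mid2 = X1_send + (frac + 0.05) * (X1_rec - X1_send),
                     y_mid2 = X2_send + (frac + 0.05) * (X2_rec - X2_send))

# add to the earlier plot:
# + geom_segment(data = dat_mid,
#                aes(x = x_mid, y = y_mid, xend = x_mid2, yend = y_mid2),
#                arrow = arrow(length = unit(0.15, "inches")), alpha = 0.3)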

 

Weird Al Yankovise() a 2 Chainz Lyric

I have a character string snipped from a lyric by Daniel Son the Necklace Don.

Let’s get all Weird Al Yankovic with it.

I’ll revise some words with the datzen::yankovise() function. You can swap out words using a dictionary of name-value pairs.

paste0(snip <<- "Suede sun roof, hanging out the big top We leave the dealership, head to the rim shop", " - @2chainz aka the Hair Weave Killer") 
#> [1] "Suede sun roof, hanging out the big top We leave the dealership, head to the rim shop - @2chainz aka the Hair Weave Killer"

Above, we have a bar from 2 Chainz aka the Hair Weave Killer.

Below, we have the yankovised bar from @2Chainz_PhD aka the CPU Core Killer aka ProbaBittyBoi aka Daniel Son the Data Don aka El Efron Jr.

# user supplied dictionary
dict_outin = c(
datzen::dictate_outin('dealership','SERVER ROOM'),
datzen::dictate_outin('big','LAP'),
datzen::dictate_outin('rim','RAM')
)

yankovise(snip,suffix="- @2Chainz_PhD",dict_outin = dict_outin)
#> [1] "suede sun roof hanging out da LAP top we leave da SERVER ROOM head 2 da RAM shop - @2Chainz_PhD"

You might ask, isn’t this just a wrapper to gsub() or stringr::str_replace_all() with some added flavor? I might say, yes, yes it is… only with a narrower scope and whose output is streamlined as tweet-ready text.

Get out of my way! Dunk thru #rstats errors like the Big Shaq-istician

Ahh, leaves falling, parents crying, collegians biking uphill with a bag of In-N-Out between their teeth. Must be the new academic school year!

I figured it’s a good time to introduce my work-in-progress datzen¬†package of miscellaneous #rstats functions.¬† You can bee-line straight to the github readme¬†with more examples.

Or stick around and I’ll highlight the Shaq example showcasing datzen::itersave()

In #rstats, if you want to iterate, you can go about it in many different ways. These work pretty well for "homogeneous" iterations.

As good as they are, the standard approaches hit snags for "non-homogeneous" iterations, e.g., data from the web.

Go ahead, try them. I dare you.

You in 5 hours

“Aw shit, my brute force for loop crapped the bed during iteration 69. Now I have to manually restart it. I hope it doesn’t do it again. I’m running out of patience, and linen.”

Let’s take a look. The Big Aristotle, Dr. Shaq, was a notorious brute on the hardwood. Here he is, contemplating how he should score in the paint:

shaq = function(meatbag){
  if(meatbag %in% 'scrub'){return('dunk on em')}
  if(meatbag %in% 'sabonis'){return('elbow his face')}
  if(!(meatbag %in% c('scrub','sabonis'))){
    stop('shaq is confused')
  }
}

meatbags = c('scrub','sabonis','scrub','kobe')
names(meatbags) = paste0('arg_',seq_along(meatbags))

lapply(meatbags, FUN = shaq)
#> Error in FUN(X[[i]], ...): shaq is confused

Uh, some error confused Shaq.

enter, stage trap door
“Meet itersave()”

front row faints
“It’s… hideously beautiful”

In a nutshell, itersave works like lapply, but when it meets an ugly, unskilled, unqualified, and ungraceful error, it will keep trucking along like Shaquille "The Diesel" O'Neal hitchhiking a ride on Chris Dudley's back.
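The underlying idea, sketched roughly below (my reconstruction of the pattern, not the actual itersave() source), is to wrap the user function with purrr::safely(), save each result to disk as soon as it finishes, and route failures to a separate folder:

# rough sketch of the save-as-you-go pattern
iter_sketch <- function(func_user, vec_arg_func, out_dir) {
  dir.create(file.path(out_dir, "failed"), recursive = TRUE, showWarnings = FALSE)
  safe_func <- purrr::safely(func_user)
  for (nm in names(vec_arg_func)) {
    res <- safe_func(vec_arg_func[[nm]])
    if (is.null(res$error)) {
      saveRDS(res$result, file.path(out_dir, paste0(nm, ".rds")))
    } else {
      saveRDS(list(input_bad = vec_arg_func[[nm]], result_bad = res$error),
              file.path(out_dir, "failed", paste0(nm, ".rds")))
    }
  }
  invisible(NULL)
}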

mainDir=paste0(getwd(),'/tests/proto/')
subDir='/temp/'

itersave(func_user=shaq,
         vec_arg_func=meatbags,
         mainDir,subDir)
#> [1] "1 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_1"
#> [1] "2 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_2"
#> [1] "3 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_3"
#> [1] "4 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_4"

The meatbags that Shaq successfully put into body bags.

print('the successes')
#> [1] "the successes"
list.files(paste0(mainDir,subDir))
#> [1] "arg_1.rds" "arg_2.rds" "arg_3.rds" "failed"

It’ll also book keep any errors along the way via purrr::safely() and R.utils::withTimeout().

print('the failures')
#> [1] "the failures"
list.files(paste0(mainDir,subDir,'/failed/'))
#> [1] "arg_4.rds"

To go along with the 'out', itersave has an 'in' companion.

enter, zipline from balcony
“Meet iterload()”

audience faints

iterload(paste0(mainDir,subDir,'/failed'))
#> $arg_4
#> $arg_4$ind_fail
#> [1] 4
#> 
#> $arg_4$input_bad
#> [1] "kobe"
#> 
#> $arg_4$result_bad
#> <simpleError in (function (meatbag) {    if (meatbag %in% "scrub") {        return("dunk on em")    }    if (meatbag %in% "sabonis") {        return("elbow his face")    }    if (!(meatbag %in% c("scrub", "sabonis"))) {        stop("shaq is confused")    }})("kobe"): shaq is confused>

Ah, it was the 4th argument, Kobe, that boggled Shaq’s mind.

“Jigga man [was] Diesel, when he [used to] lift the 8 Up” – Jay-Z

*Wiping away my sad Laker tear from my face while I type this*

“What could have been man, what could have been.”

R.I.P Frank Hamblen

Anyways, Shaq wised up in Miami. He also fattened up in Phoenix, Cleveland, Boston, Hawaii, Catalina, etc.

shaq_wiser = function(meatbag){
  if(meatbag %in% 'scrub'){return('dunk on em')}
  if(meatbag %in% 'sabonis'){return('elbow his face')}
  if(meatbag %in% 'kobe'){return('breakup & makeup')}

  if(!(meatbag %in% c('scrub','sabonis','kobe'))){
    stop('shaq is confused')
  }
}

itersave(func_user=shaq_wiser,
         vec_arg_func=meatbags,
         mainDir,subDir)
#> [1] "1 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_1"
#> [1] "2 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_2"
#> [1] "3 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_3"
#> [1] "4 of 4"
#> [1] "2017-10-01 12:35:14 PDT"
#> [1] "arg_4"

So, give me the whole shebang. What was the whole story of Shaq's road trip?

out_il = iterload(paste0(mainDir,subDir))
cbind(meatbags,out_il)
#>       meatbags  out_il            
#> arg_1 "scrub"   "dunk on em"      
#> arg_2 "sabonis" "elbow his face"  
#> arg_3 "scrub"   "dunk on em"      
#> arg_4 "kobe"    "breakup & makeup"

So, if you use bare-bones for loops or lapply, you'll crap out immediately when you hit an error.

On the other hand, even using purrr::map with purrr::safely will, by design, do everything in one shot (i.e., batch results). This is not ideal when working with stuff online. When you backtrack to resolve unforeseen edge cases, it'll feel like a Cantor set.

For web data in the wild, expect the unexpected. That's why I baked up itersave. You have non-homogeneous edge cases aplenty.

These Chris Dudley looking edge cases are just waiting in the bushes for you.

Dunk thru them.

Nasty Nas’ Nasty Rubdown via `magick`

 

We have 2 legends, Biggie Smalls and Nas. At the 1:00 mark, Nasty Nas receives a Nasty Rubdown. Pretty sure this was the inspiration for Boosie's Wipe Me Down.

I made a .gif version using a pen, a tablet, and command line 'ImageMagick'.

But the resulting FPS was slow, so I decided to try out Jeroen's R package, magick, to tune settings for the sped-up version below.

[GIF: nas_faster — the sped-up rubdown, re-animated with magick]

 

I could have totally 'tuned' these settings in standalone 'ImageMagick', but I like the comforting caress of R's function syntax.

Some of magick’s R¬†bindings can immediately accept a ‘.gif’, so you can do things like

nas_gif_original %>% image_chop('0x10') %>% image_animate(fps = 10)
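A self-contained version of that pipeline might look like the sketch below; the file names are placeholders, and the relevant magick calls are image_read(), image_chop(), image_animate(), and image_write():

library(magick)

# read an existing gif, trim 10 pixels off one edge, re-animate at a faster
# frame rate, then write the sped-up gif back out
nas_gif_original <- image_read("nas_rubdown.gif")   # placeholder file name

nas_faster <- nas_gif_original %>%
  image_chop("0x10") %>%
  image_animate(fps = 10)

image_write(nas_faster, path = "nas_faster.gif")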

There you have it. Biggie, Nas, an enthusiastic head caresser, pngs, and gifs. Brought to you by R and magick.

A gist to the R script is below.

Chop It: Look up the Generating Data Frame Columns of a Formula Term

We the moody Gucci, Louis and Pucci men
Escada, Prada
The chopper it got the Uzi lens
Bird’s-eye view
The birds I knew, flip birds
Bird gangs, it was birds I flew

Say you use the base #rstats lm() command

lm(data=dat_foo[,c('y','x1','x2')],y ~ x1 + x2 + x1:x2)

I want to be able to map the single formula term x1:x2 to the two 'generating' columns dat_foo[,c('x1','x2')].

In words, for a term in a ?formula, lookup the involved ‘root’ columns of the data frame inside the formula’s associated environment.

I feel like this mapping must exist under the lm() hood somewhere. Various stackoverflow Q+A's about formulas never directly talk about this lookup. This RViews blog post sums up the formula landscape pretty well. But there does not seem to be a conveniently exposed lookup/hash table of the data-frame-column-to-term mapping.

I had to hand-roll a few lines of code to implement the hash/lookup table myself. My solution is 'loose' since it chops up the terms in the formula, then creates a sub-formula for each chopped term.

Is there a better / preferred way?
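One candidate I have not battle-tested: the 'factors' attribute of terms() is a term-by-variable incidence matrix, so the generating columns of a term can be read off its column:

# map each formula term to the variables (data frame columns) that generate it
form <- y ~ x1 + x2 + x1:x2
fac <- attr(terms(form), "factors")   # rows = variables, cols = terms

# columns of dat_foo generating the 'x1:x2' term
rownames(fac)[fac[, "x1:x2"] > 0]
#> [1] "x1" "x2"

# the full lookup table, term -> generating columns
apply(fac > 0, 2, function(is_in) rownames(fac)[is_in])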

Bill and Ted Make the best out of a Shi… Stata situation: Rstudio + Rstata + Stata

Rewatching the Thanksgiving classic, Bill and Ted's Excellent Adventure, reminded me of the history of #Rstats and its current status as the de facto software for general data programming.

The most excellent thing about R is the literate programming options you have. As a data analyst, you are Bill S. Preston Esquire (or Ted "Theodore" Logan; they are interchangeable). Rstudio is the time-traveling phone booth. Since its conception, Rstats had Sweave's phone number on speed dial. Now, Rstudio has Rmarkdown. Compare this situation with… Stata. Stata is Genghis Khan.

Seeing Mine Çetinkaya-Rundel post about the joys of Stata,

During these discussions a package called RStata also came up. This package is "[a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R." Looks promising as it should allow running Stata commands from an R Markdown chunk. But it's really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can't imagine teaching Stata and saying to students "first download R". Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…

I decided to see for myself how (un)approachable writing narratives for literate programming in Stata really is.


If Plato pitched his ideal to So-crates, he would claim:

Integrating Rstudio + Rmarkdown + R + RStata should give you the best of 3 worlds:

1) Write narratives that are human-readable

2) Manipulate data with human-readable R code

3) Have ‘paid-for-assurance’ of Stata analysis commands

But! Bill and Ted would probably get bogged down during the setup. The key overhead step is to make sure Bill's RStata package plays nicely with his local copy of Stata.

This is like chaperoning Genghis Khan in a shopping mall by letting him run loose without an adult-sized child leash. He might be enjoying a delicious Cinnabon all by his lonesome, or he might be playing home run derby with a mannequin's head.

It depends on Ghengis’ mood aka the disgruntled software versions in his computing environment.

The setup overhead is definitely an obstacle to adoption. You also need to pin the Rstudio version (undergoing rapid development) for its notebook feature, and you need to align the Stata version (with its yearly business-as-usual updates).
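Concretely, that path/version alignment looks roughly like the sketch below; the Stata path is a placeholder for wherever Bill's copy lives, and RStata is pointed at it through the RStata.StataPath and RStata.StataVersion options:

library(RStata)

# point RStata at the local Stata binary and version (placeholder path)
options("RStata.StataPath" = "/usr/local/stata14/stata-se")
options("RStata.StataVersion" = 14)

# then, from an R Markdown chunk, source a .do file and pull results back into R
stata("my_analysis.do", data.in = mtcars, data.out = TRUE)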

I can only see this being useful if Ted is a Stata user with a near-'final' Stata .do file that he wants to share in a reproducible manner. During his presentation to his high school history class, Ted could just narrate his analysis center stage via markdown, and whenever a result needs to be piped in, he could chunk-source the .do file in Rstudio (like pulling Genghis Khan out of the phone booth). Most Excellent.


The gist below is my standalone R Notebook demo that should work if you have Rstudio 1.0.44 and Stata 14. Your Mileage May Vary, with or without a time-traveling phone booth.

https://gist.github.com/statsccpr/5f4bb658c15ff2a31b3ba0c0afae228d#file-rstata_boomerang_rnotebook-rmd


 

Use Rstats to Share Google Map Stars with Friends

On my trip to Japan, I took this photo of the stairs leading to the "Rucker Park of Tokyo." I crossed up some Tokyo cats, they were garbage. That one girl behind the blue pillar was practicing her hip hop power moves. She thought no one could see, but I saw.

[Photo: IMG_20160610_214219 — the stairs leading to the court]

I’ve¬†been traveling. I’ve been starring places on google maps. I want to share my recs¬†with friends. I also want to receive recs¬†from friends. See this wired article that came out today!

ONE DUMB SNAG: YOU CANNOT DO THIS SMOOTHLY USING PURE GOOGLE TOOLS

"Google Maps" (what you use on the phone) exports '.json' data

yet

"Google My Maps" (what you share with friends) CANNOT import '.json' data


https://productforums.google.com/forum/#!topic/maps/Ms6ouQuA4qI

For something like this, Gavin Belson would rip a new hole in some unsuspecting Hooli employee. I really hope the engineers of "Google (My) Maps" eventually roll out a backend feature that would make this post obsolete.

DUMB SNAG ELIMINATOR:  #RSTATS IS AWESOME

This is why you’re here, we’re going to fill the middle gap with a very easy #rstats script.

Step 1) Google Takeout > Google Maps (your places) > export json

Step 2) Use R to manipulate json then export a csv spreadsheet

R:::jsonlite::fromJSON()

R:::dplyr::filter()

R:::base::write.csv()

Step 3) Google My Maps > Upload csv spreadsheet via Drag + Drop

https://support.google.com/mymaps/answer/3024925?rd=1

Step 4) Share the url link of your new map with friends

Here’s my¬†Google My Maps of Japan


Spread the word, use this method, play a game of around the world… around the world 😉, and share your recs.

PS, Shoutouts to seeing Slow Magic at O-nest in Shibuya

PPS, Sending good vibes for the recovery from the recent earthquake.

HERE IS THE MEAT OF THE #RSTATS CODE FOR STEP 2 (ABOVE). LOOK AT HOW SHORT AND 'HUMAN-READABLE' THE SYNTAX IS.


library(jsonlite)
library(dplyr)

# read in .json data
# Google -> Takeout -> Google Maps (My Places) -> Saved Places.json
# https://en.wikipedia.org/wiki/Google_Takeout

txt = '~/projects/Saved Places.json'
dat = fromJSON(txt, flatten = TRUE)

# keep the useful parts
df_feat = flatten(dat$features)
df_dat = df_feat %>%
  select(`properties.Location.Business Name`,
         `properties.Location.Address`,
         `properties.Location.Geo Coordinates.Latitude`,
         `properties.Location.Geo Coordinates.Longitude`)

# subset to specific geographies
# method 1, grep for state in address (easier)

dat_jap = df_dat %>%
 filter(grepl(pattern='Japan',x=properties.Location.Address))

# export to a csv spreadsheet
write.csv(dat_jap,file='~/projects//dat_jap.csv',row.names=FALSE)

# upload csv into Google My Maps to share

Lakers Lent: Chuck should have fasted sooner and Historical Win Trajectories

For the 2015 NBA season, the only exciting Lakers news is the return of the Kobe show and Charles Barkley's Lakers Lent. The Lakers started the season with 0 wins and 5 losses, amazingly bad. The Round Mound of Rebound started Lakers Lent, fasting until the Lakers won. This week, Chuck finally ate and the Lakers finally got a win, advancing to 1 and 5 against the new-look Charlotte Hornets. The following game, the Lakers lost to the Grizzlies.

What did Charles eat, you ask? Easy, I say: organic foie gras milkshakes. The more interesting question is, what other times in history have teams started 1 and 5? Starting under those conditions, where did they end up and what win-trajectory paths did they follow?

[Figure: win_traj_w1l5 — historical win trajectories for teams starting 1 and 5, highlighted in cyan]

For all historical 82-game seasons (thus excluding pre-1968 and the two lockout seasons) there have been 121 times where teams started 1 and 5, highlighted in cyan. Following these win paths, things look pretty grim. In general, teams end up in the tail of the pack: scum eaters, cockroaches.

However, we notice a difference between seasons. In the current era, the final location at game 82 is more spread out (more variation); bottom-feeder teams have more hope for positive win mobility. Teams in the older eras were more stagnant (less variation) and more likely to remain near the bottom.

So, Chuck should have had many Lents: Mavs Lent, Rockets Lent, and Knicks Lent. Before Chuck and Angelenos say "those aren't the Lakers, they're just wannabes that look like them," there's some hope. That is, if the Lakers do not purposely go all-out tank mode like the doormat 76ers. Up next, an interactive version that lets you choose the initial conditions.

PS

As my favorite statistician, Nas, said, "no ideas original under the sun." Substituting professions for ideas, Hadley Wickham is a modern-day blacksmith who is forging open-access [R] weapons. All of this analysis is made possible by open-access statistics. Specifically, I used a combination of rvest for web scraping data from basketball-reference, dplyr for shaping the data, and ggplot2 for graphics.
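As a rough sketch of that stack (the basketball-reference URL, the table structure, and the result column below are my assumptions, not the actual scraping code):

library(rvest)
library(dplyr)
library(ggplot2)

# hypothetical: scrape one season's game log table, then plot cumulative wins
url <- "https://www.basketball-reference.com/teams/LAL/2015_games.html"  # assumed URL
games <- read_html(url) %>%
  html_nodes("table") %>%
  html_table() %>%
  .[[1]]

# 'result' is a placeholder column name; the real table needs inspection
# (repeated header rows, blank columns) before this step
win_traj <- games %>%
  mutate(game_num = row_number(),
         cum_wins = cumsum(result == "W"))

ggplot(win_traj, aes(game_num, cum_wins)) +
  geom_step() +
  labs(x = "Game", y = "Cumulative wins")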