# Probabilistic interpretation of AUC

Unfortunately, this was not taught in any of my statistics or data analysis classes at university (wtf, it so needs to be). So it took me some time until I learned that the AUC has a nice probabilistic meaning.
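The probabilistic meaning usually given for the AUC is that it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as one half. Here is a minimal Python sketch of that reading. The scores, labels, and function names are made up for illustration; it only assumes NumPy is available.

```python
import numpy as np

def auc_pairwise(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    # All pairwise differences between positive and negative scores.
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

def auc_trapezoid(scores, labels):
    """Area under the empirical ROC curve, using the trapezoidal rule."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    tpr, fpr = [0.0], [0.0]
    # Sweep the decision threshold from the highest score to the lowest.
    for t in np.unique(scores)[::-1]:
        pred = scores >= t
        tpr.append(np.sum(pred & labels) / labels.sum())
        fpr.append(np.sum(pred & ~labels) / (~labels).sum())
    tpr, fpr = np.array(tpr), np.array(fpr)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

# Toy data (hypothetical): 4 positives and 4 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(auc_pairwise(scores, labels))   # 0.8125 (13 of 16 pairs ranked correctly)
print(auc_trapezoid(scores, labels))  # same value, computed from the ROC curve
```

Both functions agree on this toy data, which is the point: counting correctly ordered pairs and integrating the ROC curve are two views of the same number.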


- a tensor of order one is a vector, which simply is a column of numbers,
- a tensor of order two is a matrix, which is basically numbers arranged in a rectangle,
- a tensor of order three looks like numbers arranged in a rectangular box (or a cube, if all modes have the same dimension),
- an *n*th-order (or *n*-way) tensor looks like numbers arranged in an *n*-hyperrectangle… you get the idea…
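As a quick illustration of the list above, here is a small Python sketch using NumPy arrays as stand-ins for tensors. The shapes and variable names are arbitrary choices for the example; the order of a tensor corresponds to the array's number of dimensions.

```python
import numpy as np

# Arbitrary example shapes; only the number of axes matters here.
vector = np.arange(3)                  # order 1: a run of 3 numbers
matrix = np.arange(6).reshape(2, 3)    # order 2: a 2 x 3 rectangle
box = np.arange(24).reshape(2, 3, 4)   # order 3: a 2 x 3 x 4 box
four_way = np.zeros((2, 3, 4, 5))      # order 4: a 4-hyperrectangle

# The order (number of modes) is the number of array dimensions.
orders = [np.ndim(t) for t in (vector, matrix, box, four_way)]
print(orders)  # [1, 2, 3, 4]
```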

The United States Patent and Trademark Office (USPTO) provides *immense* amounts of data (the data I used are in the form of XML files). After coming across these datasets, I thought it would be a good idea to explore where and how my areas of interest fall into the intellectual property space; my areas of interest being machine learning (ML), data science, and artificial intelligence (AI).

Recently I came across the classical 1983 paper *A note on screening regression equations* by David Freedman. Freedman shows in an impressive way the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into another procedure performed on the same data. As a concrete example, Freedman considers the practice of performing variable selection first, and then fitting another model, using only the selected variables, on the same data that was used to select them in the first place. Because of the unexpectedly high severity of the problem, this phenomenon became known as “Freedman’s paradox”. Moreover, in his paper Freedman derives asymptotic estimates for the resulting errors.
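The screening effect described above is easy to reproduce in a small simulation. The following Python sketch is my own illustration, not Freedman's code or exact setup: the sample size, number of predictors, and the two cutoffs (normal approximations to the two-sided 25% and 5% levels) are assumptions chosen for the example. The response is pure noise, so any "significant" predictor after the refit is a false discovery.

```python
import numpy as np

def t_statistics(X, y):
    """OLS with an intercept; returns t-statistics of the non-intercept coefficients."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])
    beta, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    dof = n - Z.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return beta[1:] / np.sqrt(np.diag(cov)[1:])

def screen_and_refit(rng, n=100, p=50, t_screen=1.15, t_final=1.96):
    """Screen noise predictors, refit on the same data, and return the
    fraction of kept predictors that look significant after the refit."""
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # independent of X: there is no true signal
    # Step 1: keep predictors that pass a lenient screen (roughly the 25% level).
    keep = np.abs(t_statistics(X, y)) > t_screen
    if not keep.any():
        return 0.0
    # Step 2: refit using only the kept predictors, on the same data.
    t_refit = t_statistics(X[:, keep], y)
    # Fraction that clears the usual cutoff (roughly the 5% level).
    return float(np.mean(np.abs(t_refit) > t_final))

rng = np.random.default_rng(0)
fractions = [screen_and_refit(rng) for _ in range(200)]
mean_fraction = float(np.mean(fractions))
print(mean_fraction)  # well above the nominal 0.05, although y is pure noise
```

If the second regression were run on fresh data, about 5% of the kept predictors would clear the final cutoff. Reusing the data pushes that share far higher, which is the heart of the paradox.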

A reviewer asked me to report detailed running times for all (so many) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available.

In many applications, data naturally form an *n*-way tensor with *n > 2*, rather than a “tidy” table.
As mentioned in the beginning of my last blog post, a tensor is essentially a multi-dimensional array:

A tensor is essentially a multi-dimensional array:

Many statistical modeling problems reduce to a minimization problem of the general form:

Lately I have noticed a sharp increase in my coffee consumption (reading Howard Schultz’s Starbucks book, which is actually quite good by the way, does not help either). Having recently transitioned into a new PhD program, I started wondering whether my increased coffee consumption has something to do with my higher stress levels in the last few weeks, and how that conjecture generalizes to the rest of my grad school experience. To answer that question I decided to take a look at how much money I have spent at coffee houses over the last few years. …Also, I’m over-caffeinated right now at 1:40am and I have nothing better to do anyway.

Lately I was getting a little bored with genomic data (and then TCGA2STAT started to give me a segfault on my university’s high performance computing facility too). So I decided to analyze some brain imaging data that I had lying around instead. The first step is to do some visual data exploration. In this blog post I present some functions for MRI visualization in R which I found to be very useful. All functions presented below take an image in the NIfTI data format as input, and are very user-friendly.

A couple of weeks ago I released my first R package on CRAN. For me it turned out to be a far less painful process than many people on the internet portray it to be (even though the package uses quite a lot of C++ code via Rcpp and RcppEigen, and even though R CMD check returns two NOTEs). Some of the most helpful resources for publishing the package were: