From conditional probability to conditional distribution to conditional expectation, and back

[Redirects to the 0-fold Cross Validation blog – my new math/stats/ML focused blog.] I can’t count how many times I have looked up the formal (measure-theoretic) definitions of conditional probability distribution or conditional expectation (even though it’s not that hard). Another such occasion was yesterday. This time I took some notes.

Neural networks and deep learning — self-study and 2 presentations

Last month, after mentioning “deep learning” a few times to some professors, I suddenly found myself in a position where I had to prepare three talks about “deep learning” within just one month… This is not to complain. I actually strongly enjoy studying the relevant theory, applying it to interesting datasets, and presenting what I have learned. Besides, teaching may be the best way to learn. However, it is quite funny. The deep learning hype is too real.

Probabilistic interpretation of AUC

Unfortunately this was not taught in any of my statistics or data analysis classes at university (wtf, it so needs to be). So it took me some time until I learned that the AUC has a nice probabilistic meaning.
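The excerpt does not spell the interpretation out, so here is the standard statement as a small sketch: the AUC equals the probability that a randomly drawn positive example gets a higher score than a randomly drawn negative one, with ties counted as one half. The scores and function names below are made up for illustration.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; a tie counts 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_trapezoid(pos_scores, neg_scores):
    """Area under the empirical ROC curve, computed with the trapezoidal rule."""
    n_pos, n_neg = len(pos_scores), len(neg_scores)
    # Sweep the decision threshold from the highest score to the lowest.
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    area = 0.0
    prev_fpr, prev_tpr = 0.0, 0.0
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / n_pos
        fpr = sum(s >= t for s in neg_scores) / n_neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_fpr, prev_tpr = fpr, tpr
    return area


# Hypothetical classifier scores, including one tie at 0.4.
pos = [0.9, 0.8, 0.4, 0.6]
neg = [0.7, 0.3, 0.4, 0.1]

print(auc_pairwise(pos, neg))   # 0.84375
print(auc_trapezoid(pos, neg))  # 0.84375
```

Both routes give the same number, which is the point: the geometric area and the pairwise ranking probability are two views of one quantity.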

Recently I came across the classical 1983 paper A note on screening regression equations by David Freedman. Freedman shows in an impressive way the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into another procedure performed on the same data. As a concrete example Freedman considers the practice of performing variable selection first, and then fitting another model using only the identified variables on the same data that was used to identify them in the first place. Because of the unexpectedly high severity of the problem this phenomenon became known as “Freedman’s paradox”. Moreover, in his paper Freedman derives asymptotic estimates for the resulting errors.
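A rough way to get a feeling for the effect is a deliberately simplified simulation. This is not Freedman's exact procedure (he screens within a multiple regression and then refits); instead it picks the single best of many pure-noise predictors and then "tests" it on the same data. The sample sizes, the critical value and all function names are assumptions chosen for illustration.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_noise_correlation(n, p, rng):
    """Largest |correlation| between a noise response and p noise predictors."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    best = 0.0
    for _ in range(p):
        x = [rng.gauss(0, 1) for _ in range(n)]
        best = max(best, abs(correlation(x, y)))
    return best


def spurious_hit_rate(n=100, p=50, reps=100, seed=0):
    """Share of runs in which the selected predictor looks 'significant' at 5%."""
    rng = random.Random(seed)
    # Approximate two-sided 5% critical value of t with n - 2 = 98 degrees of
    # freedom; only appropriate for n close to 100 (assumption).
    t_crit = 1.984
    r_crit = t_crit / math.sqrt(n - 2 + t_crit ** 2)
    hits = sum(best_noise_correlation(n, p, rng) > r_crit for _ in range(reps))
    return hits / reps


if __name__ == "__main__":
    print(spurious_hit_rate())
```

Each individual predictor passes the test only about 5% of the time, but the best of 50 passes far more often, because the same data were used both to choose it and to judge it.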

Understanding the Tucker decomposition, and compressing tensor-valued data (with R code)

In many applications, data naturally form an n-way tensor with n > 2, rather than a “tidy” table. As mentioned in the beginning of my last blog post, a tensor is essentially a multi-dimensional array:

• a tensor of order one is a vector, which simply is a column of numbers,
• a tensor of order two is a matrix, which is basically numbers arranged in a rectangle,
• a tensor of order three looks like numbers arranged in a rectangular box (or a cube, if all modes have the same dimension),
• an nth order (or n-way) tensor looks like numbers arranged in an n-hyperrectangle… you get the idea…
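The list above can be made concrete with plain nested lists. The post itself uses R; the sketch below uses Python only for illustration, and `tensor_shape` is a hypothetical helper that assumes non-empty, rectangular nesting.

```python
def tensor_shape(t):
    """Return the dimension of each mode of a rectangular nested list."""
    shape = []
    while isinstance(t, list):
        shape.append(len(t))
        t = t[0]  # assumes every level is non-empty and rectangular
    return tuple(shape)


vector = [1, 2, 3]                                   # order one
matrix = [[1, 2, 3], [4, 5, 6]]                      # order two
box = [[[0] * 4 for _ in range(3)] for _ in range(2)]  # order three

for name, t in [("vector", vector), ("matrix", matrix), ("box", box)]:
    shape = tensor_shape(t)
    print(name, "order:", len(shape), "shape:", shape)
```

The order of the tensor is simply the number of indices needed to address a single entry, for example `box[1][2][3]`.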

Understanding the CANDECOMP/PARAFAC Tensor Decomposition, aka CP; with R code

A tensor is essentially a multi-dimensional array:

Contours of statistical penalty functions as GIF images

Many statistical modeling problems reduce to a minimization problem of the general form:

Generalized inverse of a symmetric matrix

I have always found the common definition of the generalized inverse of a matrix quite unsatisfactory, because it is usually defined by a mere property, $A A^{-} A = A$, which does not really give intuition on when such a matrix exists or on how it can be constructed, etc… But recently, I came across a much more satisfactory definition for the case of symmetric (or more generally, normal) matrices.

Recently I got surprised by the behaviour of #permute_columns in the Ruby gem NMatrix.
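Returning to the defining property $A A^{-} A = A$ of the generalized inverse: a tiny check shows why the property alone feels unsatisfactory. The sketch assumes that the constructive definition alluded to is the spectral one (invert the nonzero eigenvalues, keep the eigenvectors); the excerpt does not confirm this. The matrix and helper below are made up for illustration.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Symmetric, singular matrix: eigenvalue 2 with eigenvector (1, 1),
# eigenvalue 0 with eigenvector (1, -1).
A = [[1.0, 1.0],
     [1.0, 1.0]]

# Spectral construction (assumed): invert only the nonzero eigenvalue.
# A_spec = (1 / 2) * v v^T / (v . v) with v = (1, 1).
A_spec = [[0.25, 0.25],
          [0.25, 0.25]]

# A different matrix that also satisfies the defining property.
A_other = [[1.0, 0.0],
           [0.0, 0.0]]

print(matmul(matmul(A, A_spec), A) == A)   # True
print(matmul(matmul(A, A_other), A) == A)  # True
```

Both candidates satisfy the property, so the property by itself does not single out one matrix, whereas a construction via the eigendecomposition does.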