Jekyll2019-11-15T23:19:44-05:00https://www.alexejgossmann.com/feed.xml0-fold Cross-ValidationBlogging aspirations: machine learning, statistics, deep learning, math, theory, application, coding, creativity.
Alexej GossmannFrom conditional probability to conditional distribution to conditional expectation, and back2018-08-12T00:00:00-04:002018-08-12T00:00:00-04:00https://www.alexejgossmann.com/conditional_distributions<p>I can’t count how many times I have looked up the formal (measure theoretic) definitions of conditional probability distribution or conditional expectation (even though it’s not that hard :weary:). Another such occasion was yesterday. This time I took some notes.</p>
<h2 id="from-conditional-probability--to-conditional-distribution--to-conditional-expectation">From conditional probability → to conditional distribution → to conditional expectation</h2>
<p>Let <script type="math/tex">X</script> and <script type="math/tex">Y</script> be two real-valued random variables.</p>
<h3 id="conditional-probability">Conditional probability</h3>
<p>For a fixed set <script type="math/tex">B</script> <a class="citation" href="#FellerVol2">(Feller, 1966, p. 157)</a> defines conditional probability of an event <script type="math/tex">\{Y \in B\}</script> for given <script type="math/tex">X</script> as follows.</p>
<blockquote>
<p>By <script type="math/tex">\prob(Y \in B \vert X)</script> (in words, “a conditional probability of the event <script type="math/tex">\{Y \in B\}</script> for given <script type="math/tex">X</script>”) is meant a function <script type="math/tex">q(X, B)</script> such that for every set <script type="math/tex">A \subseteq \mathbb{R}</script></p>
<script type="math/tex; mode=display">\prob(X \in A, Y \in B) = \int_A q(x, B) \mu(dx)</script>
<p>where <script type="math/tex">\mu</script> is the marginal distribution of <script type="math/tex">X</script>.</p>
</blockquote>
<p>(where <script type="math/tex">A</script> and <script type="math/tex">B</script> are both <a href="https://en.wikipedia.org/wiki/Borel_set">Borel sets</a> on <script type="math/tex">\R</script>.)</p>
<p>That is, the conditional probability can be defined as something that, when integrated with respect to the marginal distribution of <script type="math/tex">X</script>, results in the joint probability of <script type="math/tex">X</script> and <script type="math/tex">Y</script>.</p>
<p>Moreover, note that if <script type="math/tex">A = \R</script> then the above formula yields <script type="math/tex">\prob(Y \in B)</script>, the marginal probability of the event <script type="math/tex">\{ Y \in B \}</script>.</p>
<h4 id="example">Example</h4>
<p>For example, if the joint distribution of two random variables <script type="math/tex">X</script> and <script type="math/tex">Y</script> is the following <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Bivariate_case">bivariate normal</a> distribution</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
X \\
Y
\end{pmatrix}
\sim \mathcal{N} \left(
\begin{pmatrix}
\mu_X \\
\mu_Y
\end{pmatrix},
\begin{pmatrix}
\sigma^2_X & \rho \sigma_X \sigma_Y \\
\rho \sigma_X \sigma_Y & \sigma^2_Y
\end{pmatrix}
\right), %]]></script>
<p>then by sitting down with a pen and paper for some amount of time, it is not hard to verify that the function</p>
<script type="math/tex; mode=display">q(x, B) = \int_B \frac{1}{\sqrt{2\pi(1-\rho^2)}\sigma_Y} \exp\left(-\frac{\left(y - \mu_Y - \frac{\sigma_Y}{\sigma_X}\rho( x - \mu_X)\right)^2}{2(1-\rho^2)\sigma_Y^2}\right) \mathrm{d}y</script>
<p>in this case satisfies the above definition of <script type="math/tex">\prob(Y \in B \vert X)</script>.</p>
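<p>Instead of pen and paper, the defining identity can also be checked numerically. The following is only a minimal sketch using the Python standard library: the parameter values and the intervals <script type="math/tex">A = [0, 3]</script> and <script type="math/tex">B = [-1, 1]</script> are arbitrary assumptions of mine, and the function names are made up for this illustration. It compares a midpoint-rule approximation of the joint probability with a midpoint-rule approximation of the integral of <script type="math/tex">q(x, B)</script> against the marginal distribution of <script type="math/tex">X</script>, where <script type="math/tex">q(x, B)</script> is computed from a normal distribution with mean <script type="math/tex">\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)</script> and variance <script type="math/tex">(1-\rho^2)\sigma_Y^2</script>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
from statistics import NormalDist

# Illustrative parameter values (assumptions, not taken from the text).
MU_X, MU_Y = 1.0, -0.5
SIGMA_X, SIGMA_Y, RHO = 2.0, 1.5, 0.6


def joint_density(x, y):
    """Density of the bivariate normal distribution at the point (x, y)."""
    zx = (x - MU_X) / SIGMA_X
    zy = (y - MU_Y) / SIGMA_Y
    quad = (zx * zx - 2.0 * RHO * zx * zy + zy * zy) / (1.0 - RHO ** 2)
    norm = 2.0 * math.pi * SIGMA_X * SIGMA_Y * math.sqrt(1.0 - RHO ** 2)
    return math.exp(-0.5 * quad) / norm


def q(x, b_lo, b_hi):
    """q(x, B) for the interval B = [b_lo, b_hi]."""
    cond = NormalDist(
        mu=MU_Y + RHO * SIGMA_Y / SIGMA_X * (x - MU_X),
        sigma=SIGMA_Y * math.sqrt(1.0 - RHO ** 2),
    )
    return cond.cdf(b_hi) - cond.cdf(b_lo)


def midpoints(lo, hi, n):
    """Midpoints of n equal subintervals of [lo, hi], and the step width."""
    h = (hi - lo) / n
    return [lo + (i + 0.5) * h for i in range(n)], h


def joint_probability(a_lo, a_hi, b_lo, b_hi, n=200):
    """Midpoint-rule approximation of P(X in A, Y in B)."""
    xs, hx = midpoints(a_lo, a_hi, n)
    ys, hy = midpoints(b_lo, b_hi, n)
    return sum(joint_density(x, y) for x in xs for y in ys) * hx * hy


def integrated_q(a_lo, a_hi, b_lo, b_hi, n=200):
    """Midpoint-rule approximation of the integral of q(x, B) over A."""
    marginal_x = NormalDist(mu=MU_X, sigma=SIGMA_X)
    xs, hx = midpoints(a_lo, a_hi, n)
    return sum(q(x, b_lo, b_hi) * marginal_x.pdf(x) for x in xs) * hx


lhs = joint_probability(0.0, 3.0, -1.0, 1.0)
rhs = integrated_q(0.0, 3.0, -1.0, 1.0)
print(lhs, rhs)  # the two approximations should agree closely
</code></pre></div></div>
<p>Agreement for one choice of <script type="math/tex">A</script> and <script type="math/tex">B</script> is of course not a proof, but it is a quick way to catch a sign error in the conditional mean.</p>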
<h3 id="conditional-distribution">Conditional distribution</h3>
<p>Later on <a class="citation" href="#FellerVol2">(Feller, 1966, p. 159)</a> follows up with the notion of conditional probability distribution:</p>
<blockquote>
<p>By a conditional probability distribution of <script type="math/tex">Y</script> for given <script type="math/tex">X</script> is meant a function <script type="math/tex">q</script> of two variables, a point <script type="math/tex">x</script> and a set <script type="math/tex">B</script>, such that</p>
<ol>
<li>
<p>for a fixed set <script type="math/tex">B</script></p>
<script type="math/tex; mode=display">q(X, B) = \prob(Y \in B \vert X )</script>
<p>is a conditional probability of the event <script type="math/tex">\{Y \in B\}</script> for given <script type="math/tex">X</script>.</p>
</li>
<li>
<p><script type="math/tex">q</script> is for each <script type="math/tex">x</script> a probability distribution.</p>
</li>
</ol>
</blockquote>
<p>It is also pointed out that</p>
<blockquote><p>In effect a conditional probability distribution is a family of ordinary probability distributions and so the whole theory carries over without change.</p><cite><a class="citation" href="#FellerVol2">(Feller, 1966)</a></cite></blockquote>
<p>When I first came across this viewpoint, I found it incredibly enlightening to regard the conditional probability distribution as a <em>family</em> of ordinary probability distributions. :smile:</p>
<h4 id="example-1">Example</h4>
<p>For example, assume that <script type="math/tex">X</script> is an integer-valued and non-negative random variable, and that the conditional probability distribution of <script type="math/tex">Y</script> for given <script type="math/tex">X</script> is an <a href="https://en.wikipedia.org/wiki/F-distribution">F-distribution</a> (denoted <script type="math/tex">\mathrm{F}(d_1, d_2)</script>) with <script type="math/tex">d_1 = e^X</script> and <script type="math/tex">d_2 = 2^X</script> degrees of freedom.
Then the conditional probability distribution of <script type="math/tex">(Y \vert X)</script> can be regarded as a family of probability distributions <script type="math/tex">\mathrm{F}(e^x, 2^x)</script> for <script type="math/tex">x = 0, 1, 2, \dots</script>, whose probability density functions look like this:</p>
<p><img src="https://www.alexejgossmann.com/assets/img/2018-08-12-conditional_distributions/conditional_densities.png" alt="Probability density functions of (Y|X=x) for different values x" /></p>
<p>In addition, as pointed out above, if we know the marginal distribution of <script type="math/tex">X</script>, then the conditional probability distribution of <script type="math/tex">(Y \vert X)</script> can be used to obtain the marginal probability distribution of <script type="math/tex">Y</script>, or to randomly sample from the marginal distribution.
Practically it means that if we randomly generate a value of <script type="math/tex">X</script> according to its probability distribution, and use this value to randomly generate a value of <script type="math/tex">Y</script> according to the conditional distribution of <script type="math/tex">Y</script> for the given <script type="math/tex">X</script>, then the observations resulting from this procedure follow the marginal distribution of <script type="math/tex">Y</script>.
Continuing the previous example, assume that <script type="math/tex">X</script> follows a <a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a> with parameters <script type="math/tex">n = 5</script> and <script type="math/tex">p = 0.5</script>. Then the described simulation procedure yields the following estimate of the probability density function of the marginal distribution of <script type="math/tex">Y</script>:</p>
<p><img src="https://www.alexejgossmann.com/assets/img/2018-08-12-conditional_distributions/marginal_density.png" alt="Probability density function of Y" /></p>
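<p>The two-stage sampling procedure described above can be sketched in a few lines of Python. This is not the code that produced the figure; it is a minimal standard-library sketch in which the helper names, the seed, and the sample size are my own assumptions. It draws <script type="math/tex">X</script> as a sum of five fair coin flips, and draws an <script type="math/tex">\mathrm{F}(d_1, d_2)</script> variate as a ratio of two independent chi-squared variates, each divided by its degrees of freedom (a chi-squared variate with <script type="math/tex">d</script> degrees of freedom being a gamma variate with shape <script type="math/tex">d/2</script> and scale <script type="math/tex">2</script>).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import random
import statistics

rng = random.Random(2018)  # fixed seed, only for reproducibility


def sample_x():
    """X ~ Binomial(5, 0.5), drawn as a sum of five fair coin flips."""
    return sum(rng.getrandbits(1) for _ in range(5))


def sample_chi2(dof):
    """Chi-squared draw with dof degrees of freedom: Gamma(dof / 2, scale 2)."""
    return rng.gammavariate(dof / 2.0, 2.0)


def sample_y_given_x(x):
    """Y given X = x follows F(e^x, 2^x): ratio of scaled chi-squared draws."""
    d1, d2 = math.exp(x), 2.0 ** x
    return (sample_chi2(d1) / d1) / (sample_chi2(d2) / d2)


# Stage 1 draws X, stage 2 draws Y from the conditional distribution given X.
samples = [sample_y_given_x(sample_x()) for _ in range(10000)]
print(statistics.median(samples))
</code></pre></div></div>
<p>A histogram or kernel density estimate of <code class="highlighter-rouge">samples</code> then approximates the marginal density of <script type="math/tex">Y</script>. I print the median rather than the mean on purpose, because the marginal distribution in this example does not have a finite mean.</p>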
<h3 id="conditional-expectation">Conditional expectation</h3>
<p>Finally, <a class="citation" href="#FellerVol2">(Feller, 1966, p. 159)</a> introduces the notion of conditional expectation.
By the above, for a given value <script type="math/tex">x</script> we have that</p>
<script type="math/tex; mode=display">q(x, B) = \prob(Y \in B \vert X = x), \quad\forall B\in\mathcal{B}</script>
<p>(here <script type="math/tex">\mathcal{B}</script> denotes the <a href="https://en.wikipedia.org/wiki/Borel_set">Borel <script type="math/tex">\sigma</script>-algebra</a> on <script type="math/tex">\R</script>), and therefore, a conditional probability distribution can be viewed as a family of ordinary probability distributions (represented by <script type="math/tex">q</script> for different <script type="math/tex">x</script>s).
Thus, as <a class="citation" href="#FellerVol2">(Feller, 1966, p. 159)</a> points out, if <script type="math/tex">q</script> is given then the conditional expectation <em>“introduces a new notation rather than a new concept.”</em></p>
<blockquote>
<p>A conditional expectation <script type="math/tex">\E(Y \vert X)</script> is a function of <script type="math/tex">X</script> assuming at <script type="math/tex">x</script> the value</p>
<script type="math/tex; mode=display">\E(Y \vert X = x) = \int_{-\infty}^{\infty} y q(x, dy)</script>
<p>provided the integral converges.</p>
</blockquote>
<p>Note that, because <script type="math/tex">\E(Y \vert X)</script> is a function of <script type="math/tex">X</script>, it is a random variable, whose value at an individual point <script type="math/tex">x</script> is given by the above definition.
Moreover, from the above definitions of conditional probability and conditional expectation it follows that</p>
<script type="math/tex; mode=display">\E(Y) = \E(\E(Y \vert X)).</script>
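<p>A sketch of why this holds, assuming that <script type="math/tex">\E\vert Y \vert</script> is finite so that the order of integration may be exchanged: the definition of conditional probability with <script type="math/tex">A = \R</script> gives <script type="math/tex">\prob(Y \in B) = \int_{\R} q(x, B) \mu(dx)</script>, and therefore</p>
<script type="math/tex; mode=display">\E(\E(Y \vert X)) = \int_{\R} \left( \int_{-\infty}^{\infty} y \, q(x, dy) \right) \mu(dx) = \int_{-\infty}^{\infty} y \, \prob(Y \in dy) = \E(Y).</script>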
<h4 id="example-cont">Example [cont.]</h4>
<p>We continue with the last example.
From the properties of the <a href="https://en.wikipedia.org/wiki/F-distribution">F-distribution</a> we know that under this example’s assumptions on the conditional distribution, it holds that</p>
<script type="math/tex; mode=display">\E(Y \vert X = x) =
\begin{cases}
\frac{d_2}{d_2 - 2} = \frac{2^x}{2^x - 2}, \quad x > 1,\\
\infty, \quad x \leq 1.
\end{cases}</script>
<p>For <script type="math/tex">x > 1</script> this is a rather boring strictly decreasing function of <script type="math/tex">x</script>, converging to <script type="math/tex">1</script> as <script type="math/tex">x\to\infty</script>.</p>
<p>Thus, under the example’s assumption on the distribution of <script type="math/tex">X</script>, the conditional expectation <script type="math/tex">\E(Y \vert X)</script> is a discrete random variable, which has non-zero probability mass at the values <script type="math/tex">2, 4/3, 8/7, 16/15,</script> and <script type="math/tex">\infty</script>.</p>
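<p>The probability masses themselves are easy to tabulate. Here is a minimal sketch with exact rational arithmetic; the function and variable names are my own, and <script type="math/tex">X</script> is assumed to be binomial with <script type="math/tex">n = 5</script> and <script type="math/tex">p = 0.5</script> as in the example above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
from fractions import Fraction

N, P = 5, Fraction(1, 2)  # X ~ Binomial(5, 0.5)


def cond_mean(x):
    """E(Y | X = x) when Y given X = x follows F(e^x, 2^x)."""
    d2 = 2 ** x
    if d2 in (1, 2):  # x = 0 or x = 1: the mean is not finite
        return math.inf
    return Fraction(d2, d2 - 2)


# Collect the probability mass that X puts on each value of E(Y | X).
pmf = {}
for x in range(N + 1):
    weight = math.comb(N, x) * P ** x * (1 - P) ** (N - x)
    value = cond_mean(x)
    pmf[value] = pmf.get(value, 0) + weight

for value, weight in pmf.items():
    print(value, weight)
</code></pre></div></div>
<p>The values <script type="math/tex">x = 0</script> and <script type="math/tex">x = 1</script> are merged into the single value <script type="math/tex">\infty</script>, which therefore carries probability <script type="math/tex">6/32 = 3/16</script>.</p>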
<h2 id="from-conditional-expectation--to-conditional-probability">From conditional expectation → to conditional probability</h2>
<p>An alternative approach is to define the conditional expectation first, and then to define conditional probability as the conditional expectation of <a href="https://en.wikipedia.org/wiki/Indicator_function">the indicator function</a>.
This approach seems less intuitive to me. However, it is more flexible and more general, as we see below.</p>
<h3 id="conditional-expectation-1">Conditional expectation</h3>
<h4 id="a-definition-in-2d">A definition in 2D</h4>
<p>Let <script type="math/tex">X</script> and <script type="math/tex">Y</script> be two real-valued random variables, and let <script type="math/tex">\mathcal{B}</script> denote the <a href="https://en.wikipedia.org/wiki/Borel_set">Borel <script type="math/tex">\sigma</script>-algebra</a> on <script type="math/tex">\R</script>.
Recall that <script type="math/tex">X</script> and <script type="math/tex">Y</script> can be represented as mappings <script type="math/tex">X: \Omega \to \R</script> and <script type="math/tex">Y: \Omega \to \R</script> over some <a href="https://en.wikipedia.org/wiki/Measure_space">measure space</a> <script type="math/tex">(\Omega, \mathcal{A}, \prob)</script> with <script type="math/tex">\prob(\Omega) = 1</script>, i.e., a probability space.
We can define <script type="math/tex">\mathrm{E}(Y \vert X=x)</script>, the conditional expectation of <script type="math/tex">Y</script> given <script type="math/tex">X=x</script>, as follows.</p>
<p>A <script type="math/tex">\mathcal{B}</script>-measurable function <script type="math/tex">g(x)</script> is the conditional expectation of <script type="math/tex">Y</script> for given <script type="math/tex">x</script>, i.e.,</p>
<script type="math/tex; mode=display">\mathrm{E}(Y \vert X=x) = g(x),</script>
<p>if for all sets <script type="math/tex">B\in\mathcal{B}</script> it holds that</p>
<script type="math/tex; mode=display">\int_{X^{-1}(B)} Y(\omega) d\prob(\omega) = \int_{B} g(x) d\prob^X(x),</script>
<p>where <script type="math/tex">\prob^X</script> is the marginal probability distribution of <script type="math/tex">X</script>.</p>
<h4 id="interpretation-in-2d">Interpretation in 2D</h4>
<p>If <script type="math/tex">X</script> and <script type="math/tex">Y</script> are real-valued one-dimensional, then the pair <script type="math/tex">(X,Y)</script> can be viewed as a random vector in the plane.
Each set <script type="math/tex">\{X \in A\}</script> consists of parallels to the <script type="math/tex">y</script>-axis, and we can define a <script type="math/tex">\sigma</script>-algebra induced by <script type="math/tex">X</script> as the collection of all sets <script type="math/tex">\{X \in A\}</script> on the plane, where <script type="math/tex">A</script> is a Borel set on the line.
The collection of all such sets forms a <script type="math/tex">\sigma</script>-algebra <script type="math/tex">\mathcal{A}</script> on the plane, which is contained in the <script type="math/tex">\sigma</script>-algebra of all Borel sets in <script type="math/tex">\R^2</script>.
<script type="math/tex">\mathcal{A}</script> is called the <script type="math/tex">\sigma</script>-algebra generated by the random variable <script type="math/tex">X</script>.</p>
<p>Then <script type="math/tex">\mathrm{E}(Y \vert X)</script> can be equivalently defined as a random variable such that</p>
<script type="math/tex; mode=display">\mathrm{E}(Y\cdot I_{A}) = \mathrm{E}(\mathrm{E}(Y \vert X) \cdot I_{A}), \quad \forall A\in\mathcal{A},</script>
<p>where <script type="math/tex">I_{A}</script> denotes the indicator function of the set <script type="math/tex">A</script>.</p>
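<p>On a finite probability space this defining property can be verified by brute force. The toy space below (four outcomes with made-up probabilities and values of <script type="math/tex">X</script> and <script type="math/tex">Y</script>) is purely hypothetical and only meant as a sketch: it computes <script type="math/tex">g(x) = \mathrm{E}(Y \vert X = x)</script> as a weighted average, and then checks the identity for every set in the <script type="math/tex">\sigma</script>-algebra generated by <script type="math/tex">X</script>, which here is just the collection of all sets of the form <script type="math/tex">\{X \in S\}</script> for subsets <script type="math/tex">S</script> of the values of <script type="math/tex">X</script>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from fractions import Fraction
from itertools import combinations

# Hypothetical toy space: outcome name mapped to (probability, X, Y).
space = {
    "w1": (Fraction(1, 4), 0, 1),
    "w2": (Fraction(1, 4), 0, 3),
    "w3": (Fraction(1, 3), 1, 2),
    "w4": (Fraction(1, 6), 1, 8),
}


def expectation(f):
    """Expectation of f(X, Y) over the finite probability space."""
    return sum(p * f(x, y) for p, x, y in space.values())


x_values = sorted({x for _, x, _ in space.values()})

# g(v) = E(Y | X = v), computed as E(Y * 1{X = v}) / P(X = v).
g = {}
for v in x_values:
    mass = expectation(lambda x, y: int(x == v))
    g[v] = expectation(lambda x, y: y * int(x == v)) / mass

# Check E(Y * I_A) = E(g(X) * I_A) for every set A = {X in S}.
checks = []
for r in range(len(x_values) + 1):
    for subset in combinations(x_values, r):
        lhs = expectation(lambda x, y: y * int(x in subset))
        rhs = expectation(lambda x, y: g[x] * int(x in subset))
        checks.append(lhs == rhs)

print(g, all(checks))
</code></pre></div></div>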
<h4 id="a-more-general-definition-of-conditional-expectation">A more general definition of conditional expectation</h4>
<p>The last paragraph illustrates that one could generalize the definition of the conditional expectation of <script type="math/tex">Y</script> given <script type="math/tex">X</script> to the conditional expectation of <script type="math/tex">Y</script> given an arbitrary <script type="math/tex">\sigma</script>-algebra <script type="math/tex">\mathcal{B}</script> (not necessarily the <script type="math/tex">\sigma</script>-algebra generated by <script type="math/tex">X</script>).
This leads to the following general definition, which is stated in <a class="citation" href="#FellerVol2">(Feller, 1966, pp. 160-161)</a> in a slightly different notation.</p>
<p>Let <script type="math/tex">Y</script> be a random variable, and let <script type="math/tex">\mathcal{B}</script> be a <script type="math/tex">\sigma</script>-algebra of sets.</p>
<ol>
<li>
<p>A random variable <script type="math/tex">U</script> is called a conditional expectation of <script type="math/tex">Y</script> relative to <script type="math/tex">\mathcal{B}</script>, or <script type="math/tex">U = \E(Y \vert \mathcal{B})</script>, if it is <script type="math/tex">\mathcal{B}</script>-measurable and</p>
<script type="math/tex; mode=display">\E(Y\cdot I_{B}) = \E(U \cdot I_{B}), \quad \forall B\in\mathcal{B}.</script>
</li>
<li>
<p>If <script type="math/tex">\mathcal{B}</script> is the <script type="math/tex">\sigma</script>-algebra generated by a random variable <script type="math/tex">X</script>, then <script type="math/tex">\E(Y \vert X) = \E(Y \vert \mathcal{B})</script>.</p>
</li>
</ol>
<h3 id="back-to-conditional-probability-and-conditional-distributions">Back to conditional probability and conditional distributions</h3>
<p>Let <script type="math/tex">I_{\{Y \in A\}}</script> be the random variable that is equal to one if <script type="math/tex">Y\in A</script> and equal to zero otherwise. The conditional probability of <script type="math/tex">\{Y \in A\}</script> given <script type="math/tex">X = x</script> can be defined in terms of a conditional expectation as</p>
<script type="math/tex; mode=display">\prob(Y \in A \vert X = x) = \E(I_{\{Y \in A\}} \vert X = x).</script>
<p>Under certain regularity conditions the above defines the conditional probability distribution of <script type="math/tex">(Y \vert X)</script>.</p>
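<p>As a consistency check (my own sketch, not a statement from Feller), one can plug the indicator <script type="math/tex">I_{\{Y \in A\}}</script> in place of <script type="math/tex">Y</script> into the 2D definition of conditional expectation given earlier. This recovers the defining identity of conditional probability from the first part of this post, with the roles of the letters <script type="math/tex">A</script> and <script type="math/tex">B</script> swapped:</p>
<script type="math/tex; mode=display">\prob(X \in B, Y \in A) = \int_{X^{-1}(B)} I_{\{Y \in A\}}(\omega) \, d\prob(\omega) = \int_{B} \prob(Y \in A \vert X = x) \, d\prob^X(x), \quad \forall B \in \mathcal{B}.</script>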
<h2 id="references">References</h2>
<ol class="bibliography"><li><span id="FellerVol2">Feller, W. (1966). <i>An introduction to probability theory and its applications</i> (Vol. 2). John Wiley & Sons.</span>
</li></ol>Alexej GossmannSetting up an HTTPS static site using AWS S3 and Cloudfront (and also Jekyll and s3_website)2018-07-27T00:00:00-04:002018-07-27T00:00:00-04:00https://www.alexejgossmann.com/AWS_S3_and_CloudFront<p>For a while now I wanted to migrate my websites away from Github pages.
While Github provides an excellent free service, there are some limitations to its capabilities, and the longer I wait the harder (or the more inconvenient) it becomes to migrate away from gh-pages.
AWS S3 + CloudFront is a widely-used alternative that has been around for a long time.
Moreover, I was planning to get more familiar with AWS at all levels anyway.
So, it’s a great learning opportunity too.</p>
<p>There are a <a href="https://medium.com/@esfoobar/setting-up-an-https-static-site-using-amazon-aws-7ab270c4e277">number</a> <a href="https://www.david-merrick.com/2017/05/24/moving-my-jekyll-website-from-github-pages-to-s3/">of</a> <a href="https://vickylai.com/verbose/hosting-your-static-site-with-aws-s3-route-53-and-cloudfront/">very</a> <a href="https://medium.com/@jameshamann/migrating-your-jekyll-website-to-aws-bc9582b3fbb2">helpful</a> <a href="https://blog.jpterry.com/howto/2016/02/02/secure-static-hosting-w-s3-cloudfront-acm.html">tutorials</a> online on how to set up an HTTPS static site using AWS S3 and CloudFront.
Of course, as is always the case with blog articles, they may be outdated, incomplete, and generally not as trustworthy as <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html">the official AWS documentation on the topic</a>, which is pretty good too; but it is also somewhat fragmented and inconvenient to follow.
So I wrote my own summary to refer to in the future.</p>
<p><strong>Relevant AWS docs:</strong> <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/HostingWebsiteOnS3Setup.html">How to create a static website on AWS S3</a>; <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html">How to use a custom domain with AWS S3</a>; <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-cloudfront-walkthrough.html">Setting up Amazon CloudFront</a>; <a href="https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html">SSL certificate instructions</a>.</p>
<h3 id="1-set-up-a-static-site-yet-without-cloudfront-and-without-https">1 Set up a static site, yet without CloudFront and without HTTPS</h3>
<p>First, we set up a static HTTP site without a custom domain on AWS S3:</p>
<ul>
<li>Create a bucket named <code class="highlighter-rouge">example.com</code> (obviously replace <code class="highlighter-rouge">example.com</code> with your own domain).</li>
<li>Follow the procedure given at <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/HostingWebsiteOnS3Setup.html">https://docs.aws.amazon.com/AmazonS3/latest/dev/HostingWebsiteOnS3Setup.html</a> to <em>enable website hosting</em> for the bucket, and to make it <em>publicly readable</em>; (optionally) if you want to understand the AWS bucket access policy language see <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev//access-policy-language-overview.html">https://docs.aws.amazon.com/AmazonS3/latest/dev//access-policy-language-overview.html</a>, and follow the links from there.</li>
<li>Test the S3 website: Upload an <code class="highlighter-rouge">index.html</code> to the bucket (you can keep all options for the upload at their default values). Then go to <code class="highlighter-rouge">http://example.com.s3-website-us-east-1.amazonaws.com/</code> (where you need to replace <code class="highlighter-rouge">example.com</code> with the bucket name, and <code class="highlighter-rouge">us-east-1</code> with your bucket’s region), and see if the contents of <code class="highlighter-rouge">index.html</code> show up.</li>
</ul>
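<p>For reference, the bucket policy that makes a bucket publicly readable is a small JSON document. The sketch below builds it in Python; the helper name <code class="highlighter-rouge">public_read_policy</code> is made up for this illustration, the statement ID is just a label, and the linked AWS docs remain the authoritative source for the exact policy to use.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json


def public_read_policy(bucket_name):
    """Bucket policy allowing anyone to read (GET) objects in the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",  # a free-form label
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject"],
                "Resource": ["arn:aws:s3:::" + bucket_name + "/*"],
            }
        ],
    }


# Replace example.com with your own bucket name.
policy_json = json.dumps(public_read_policy("example.com"), indent=2)
print(policy_json)
</code></pre></div></div>
<p>The printed JSON can be pasted into the bucket policy editor in the S3 console.</p>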
<p>Yay :laughing: we have a working website!! …without a custom domain or https yet :sweat_smile:</p>
<p><strong>The www subdomain:</strong> Now prepare another S3 bucket for the subdomain “www.example.com” to be later redirected to the root domain “example.com” (btw, if you so wish, <code class="highlighter-rouge">www.example.com</code> can be the main S3 bucket and the <code class="highlighter-rouge">example.com</code> bucket can be configured to redirect — just swap their roles in this entire writeup):</p>
<ul>
<li>Create a bucket named <code class="highlighter-rouge">www.example.com</code> (all options can be left at their defaults; this bucket doesn’t need to be publicly readable).</li>
<li>Configure <code class="highlighter-rouge">www.example.com</code> to redirect all requests to <code class="highlighter-rouge">example.com</code> following Step 2.3 from the AWS docs at <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html">https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html</a>.</li>
<li>Test the endpoints for the redirect by going to <code class="highlighter-rouge">http://www.example.com.s3-website-us-east-1.amazonaws.com/</code> (as before replace the bucket name and region accordingly).</li>
</ul>
<p><strong>Map the domain and subdomain to their S3 buckets:</strong></p>
<p>Amazon Route 53 is a service that maintains a mapping between the alias records and the IP of the bucket.
You need to follow Step 3 from the AWS docs at <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html">https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-custom-domain-walkthrough.html</a>.</p>
<p><strong>Configuration with your domain name registrar:</strong></p>
<ul>
<li>In AWS go to Route 53 -> Hosted zones -> example.com</li>
<li>The NS (name servers) records that you see are what needs to be provided to the domain name registrar. For example, for GoDaddy I have to choose to use “custom nameservers” under the DNS settings for the domain, and then to input all (four in my case) of the URLs provided as values under the NS record.</li>
<li>Your website should now appear under http://example.com (and http://www.example.com).</li>
</ul>
<p>:smile: So we have a website with a custom domain!! …though without CloudFront (so loading may be rather slow) and without HTTPS.</p>
<h4 id="optional-configure-an-iam-role-with-limited-access-permissions">Optional: Configure an IAM role with limited access permissions</h4>
<p>Now it seems a good idea to create a new user that has full read-write permission to the <code class="highlighter-rouge">example.com</code> bucket and full permission to CloudFront, but does not have any further AWS permissions.
A suitable IAM policy document can be found at: <a href="https://github.com/laurilehmijoki/s3_website/blob/master/additional-docs/setting-up-aws-credentials.md">https://github.com/laurilehmijoki/s3_website/blob/master/additional-docs/setting-up-aws-credentials.md</a>
Make sure to save the new user’s access key ID and secret access key somewhere in a private place.</p>
<h4 id="optional-use-jekyll-and-s3_website-to-generate-a-static-site-and-to-push-it-to-the-s3-bucket">Optional: Use Jekyll and s3_website to generate a static site and to push it to the S3 bucket</h4>
<p>Well, I typically use <a href="https://jekyllrb.com/">Jekyll</a> to make my static sites (because it’s awesome!).
The Ruby gem <a href="https://github.com/laurilehmijoki/s3_website"><code class="highlighter-rouge">s3_website</code></a> can be used to push the website to, or to synchronize it with, the S3 bucket.
The <a href="https://github.com/laurilehmijoki/s3_website"><code class="highlighter-rouge">s3_website</code> documentation</a> is easy to follow.
I have found it convenient to use the <a href="https://github.com/bkeepers/dotenv"><code class="highlighter-rouge">dotenv</code> gem</a> to keep the access key ID and the secret access key of the user (that was just created) locally in a <code class="highlighter-rouge">.env</code> file (don’t commit/push it to GitHub!!!).
At this point you may also choose to allow <code class="highlighter-rouge">s3_website</code> to set up CloudFront for the website to save some time later (though without the SSL certificate, which will still have to be added manually, see below).</p>
<h3 id="2-request-an-ssl-certificate">2 Request an SSL certificate</h3>
<p>We need an SSL certificate to enable HTTPS for the custom domain when it is accessed through CloudFront.</p>
<p>Follow the AWS docs at <a href="https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html">https://docs.aws.amazon.com/acm/latest/userguide/gs-acm-request-public.html</a> to request a public certificate for your domain. Some important points:</p>
<ul>
<li>Add <code class="highlighter-rouge">example.com</code> and <code class="highlighter-rouge">*.example.com</code> to the certificate.</li>
<li>Use DNS validation (rather than email validation), whereby in the “pending validation” stage you can choose “Create record in Route 53”, which saves time (since we have already configured Route 53 for this domain).</li>
</ul>
<p>I encountered one caveat in this process:</p>
<blockquote>
<p>To use an ACM Certificate with CloudFront, you must request or import the certificate in the US East (N. Virginia) region.</p>
</blockquote>
<p>(from <a href="http://docs.aws.amazon.com/acm/latest/userguide/acm-services.html">http://docs.aws.amazon.com/acm/latest/userguide/acm-services.html</a>); i.e., change region to US East N. Virginia if needed (top right corner within the AWS interface).</p>
<h3 id="3-create-a-cloudfront-distribution">3 Create a CloudFront distribution</h3>
<p>Follow these AWS docs to create a CloudFront distribution: <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-cloudfront-walkthrough.html">https://docs.aws.amazon.com/AmazonS3/latest/dev/website-hosting-cloudfront-walkthrough.html</a>; unless a CloudFront distribution was already created by <code class="highlighter-rouge">s3_website</code> (see one of the previous optional steps), in which case it merely needs to be edited (add the SSL certificate to it, and update “Alternate Domain Names” with <code class="highlighter-rouge">example.com</code> and <code class="highlighter-rouge">www.example.com</code> if necessary).</p>
<p>Notice the designated CloudFront distribution domain, which should look similar to vtrlj8ubh2k69.cloudfront.net. Once set up the website should appear under it.</p>
<p><strong>A few points I found noteworthy:</strong></p>
<ul>
<li>One can choose to set HTTP to always redirect to HTTPS.</li>
<li>Once issued, the SSL certificate can be selected from the drop down menu under “Custom SSL Certificates”.</li>
<li>As pointed out in <a href="https://vickylai.com/verbose/hosting-your-static-site-with-aws-s3-route-53-and-cloudfront/">Vicky Lai’s blog post</a> the “Origin” column in the CloudFront Console should show the S3 bucket’s website endpoint <code class="highlighter-rouge">example.com.s3-website-us-east-1.amazonaws.com</code>, and not the bucket name <code class="highlighter-rouge">example.com.s3.amazonaws.com</code> (btw <code class="highlighter-rouge">s3_website</code> does this correctly). Note that when setting up, the drop down menu offers only the bucket name to be picked rather than the correct endpoint; so, don’t use the drop down menu; type it in yourself.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></li>
</ul>
<p><strong>Update A records in Route 53, and update the</strong> <code class="highlighter-rouge">s3_website</code> <strong>configs:</strong></p>
<ul>
<li>In AWS go to Route 53 -> Hosted zones -> example.com</li>
<li>For both A records, change the “Alias Target” from the S3 endpoint to the CloudFront distribution domain (i.e., something like <code class="highlighter-rouge">vtrlj8ubh2k69.cloudfront.net</code>).</li>
<li>If you use <code class="highlighter-rouge">s3_website</code> check or set the <code class="highlighter-rouge">cloudfront_distribution_id</code> property in <code class="highlighter-rouge">s3_website.yml</code> to the correct distribution ID (something like <code class="highlighter-rouge">SY9Q4DHIOUG7A</code>)</li>
</ul>
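<p>If you prefer to script the A record update instead of clicking through the console, the change can be described as a JSON “change batch”. The following is only a sketch: the helper name <code class="highlighter-rouge">alias_change_batch</code> is made up, and the distribution domain is the placeholder used above. The one non-obvious piece is the alias target’s hosted zone ID, which for CloudFront distributions is the fixed value <code class="highlighter-rouge">Z2FDTNDATAQYW2</code> (not the ID of your own hosted zone).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

# Fixed hosted zone ID that Route 53 uses for all CloudFront alias targets.
CLOUDFRONT_HOSTED_ZONE_ID = "Z2FDTNDATAQYW2"


def alias_change_batch(record_names, distribution_domain):
    """Route 53 change batch pointing the A records at a CloudFront domain."""
    changes = []
    for name in record_names:
        changes.append({
            "Action": "UPSERT",  # create the record, or overwrite it
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "AliasTarget": {
                    "HostedZoneId": CLOUDFRONT_HOSTED_ZONE_ID,
                    "DNSName": distribution_domain,
                    "EvaluateTargetHealth": False,
                },
            },
        })
    return {"Comment": "Point A records at CloudFront", "Changes": changes}


# Placeholder names; replace them with your own domain and distribution.
batch = alias_change_batch(
    ["example.com.", "www.example.com."],
    "vtrlj8ubh2k69.cloudfront.net.",
)
print(json.dumps(batch, indent=2))
</code></pre></div></div>
<p>The printed JSON can be saved to a file and passed to <code class="highlighter-rouge">aws route53 change-resource-record-sets</code> via its <code class="highlighter-rouge">--change-batch</code> option, together with the <code class="highlighter-rouge">--hosted-zone-id</code> of your own hosted zone. I did this step in the console myself, so treat the scripted route as untested on my side.</p>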
<p>That’s it — the site should now be accessible under <code class="highlighter-rouge">https://example.com</code> and <code class="highlighter-rouge">https://www.example.com</code>. :tada: :tada: :tada:</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>It is not exactly clear to me what difference it makes to set the “Origin” to <code class="highlighter-rouge">example.com.s3.amazonaws.com</code> vs <code class="highlighter-rouge">example.com.s3-website-us-east-1.amazonaws.com</code>. However, it solved one of my issues. At first I set the “Origin” value to the bucket name, similar to <code class="highlighter-rouge">example.com.s3.amazonaws.com</code>, since that is what was offered by the drop down menu in CloudFront. The landing page of the website was working just fine under the custom domain. However, when I navigated to subdirectories in my domain, similar to <code class="highlighter-rouge">example.com/about/</code>, the server did not seem to understand that it needed to look for the <code class="highlighter-rouge">index.html</code> within the <code class="highlighter-rouge">about</code> directory, and produced an error. Once I edited the “Origin” record to the S3 bucket endpoint, similar to <code class="highlighter-rouge">example.com.s3-website-us-east-1.amazonaws.com</code>, all pages of the website started to display perfectly fine. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Alexej GossmannNeural networks and deep learning - self-study and 2 presentations2018-04-29T00:00:00-04:002018-04-29T00:00:00-04:00https://www.alexejgossmann.com/deep-learning-self-study<p>Last month, after mentioning “deep learning” a few times to some professors, I suddenly found myself in a position where I had to prepare three talks about “deep learning” within just one month… :sweat_smile:
This is not to complain. I actually strongly enjoy studying the relevant theory, applying it to interesting datasets, and presenting what I have learned.
Besides, teaching may be the best way to learn.
However, it is quite funny. :laughing:
The deep learning hype is too real. :trollface:</p>
<p>In this post I want to share my presentation slides (see below), some other resources, and some thoughts, in case any of that can be helpful to other deep learning beginners.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>Neural networks (NNs) and deep learning (DL, also deep NNs, or DNNs) are not my research area, but currently it is one of my main side-interests.
(D)NNs are truly fascinating to somebody with substantial experience in statistics or the more conventional machine learning (like myself). Initially it seems counterintuitive how these extremely overparametrized models are even supposed to work, but then you fit those models, and their performance is so good that it seems to border on magic. :crystal_ball:</p>
<h2 id="slides">Slides</h2>
<p>These <code class="highlighter-rouge">html</code> slides were created with the excellent <a href="https://github.com/hakimel/reveal.js/"><code class="highlighter-rouge">reveal.js</code></a>.</p>
<ul>
<li><a href="https://agisga.github.io/reveal.js/20180426-Math-Modeling-Guest-Lecture.html"><strong>An introduction to neural networks and deep learning</strong></a> — guest lecture for a mathematical modeling class in the department of biomedical engineering at Tulane (with live-demos using Google Cloud, see slides).</li>
<li><a href="https://agisga.github.io/reveal.js/20180411-Journal-Club.html"><strong>A survey on Deep Learning in Medical Image Analysis</strong></a> — journal club presentation based on a paper by Litjens, Kooi, Bejnordi, et al. This was presented as a 2-part talk.</li>
</ul>
<h2 id="my-favorite-learning-resources">My favorite learning resources</h2>
<p>I was able to give the above presentations because I did a good amount of self-study on NNs and DL in my free time. Here are some of the resources that I have used and that I highly recommend:</p>
<ul>
<li>I worked through the <a href="http://www.fast.ai/">fast.ai</a> MOOC <a href="http://course.fast.ai/">“Practical Deep Learning For Coders, Part 1”</a> by Jeremy Howard and Rachel Thomas. It does not spoon-feed you (if you want to actually understand what’s going on), but it is highly recommended as a starting point. <a href="https://en.wikipedia.org/wiki/Jeremy_Howard_(entrepreneur)">Jeremy Howard</a> is fantastic at giving clear and simple explanations of complex concepts, and the provided Jupyter Notebooks are an excellent way to get started with the practical application of DL.</li>
<li>At the same time I swallowed <a href="https://en.wikipedia.org/wiki/Michael_Nielsen">Michael Nielsen</a>’s <a href="http://neuralnetworksanddeeplearning.com/">“Neural Networks and Deep Learning”</a> book, which was a pleasure to read.</li>
<li>Then I participated in the IPAM/UCLA workshop <a href="https://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques/?tab=schedule">“New Deep Learning Techniques”</a> in February (videos and slides available on the linked site), which blew my mind by covering so many different perspectives which I was not aware of.</li>
<li>Currently I am working through the <a href="https://www.youtube.com/playlist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv">lectures</a> and <a href="https://cs231n.github.io/">assignments</a> from Stanford’s <a href="https://cs231n.github.io/">CS231n</a> together with the <a href="http://ods.ai/">ods.ai</a> community (see <a href="https://github.com/Yorko/mlcourse_open/wiki/Passing-cs231n-together">passing cs231n together</a>). Feel free to contact me if you want to discuss the CS231n assignments in the near future.</li>
<li>During the entire time, I was also (slowly) working on my Python skills, as well as figuring out how to set up AWS, and Google Cloud GPU instances. Unfortunately, figuring out and setting up the required drivers, libraries, etc. is still very non-trivial. For many people I meet in academia this may even be the greatest bottleneck towards deep learning. <a href="https://github.com/agisga/coding_notes/blob/master/google_cloud.md">This is the setup I am currently using</a>.</li>
</ul>
<p>These resources have worked very well for me. My background is mostly academic, and includes experience in statistical modeling, (non-deep) machine learning, an all-but-dissertation status in a math PhD program, and some domain knowledge in medical imaging.
While this background helps with some of the above, none of it is strictly necessary.
Though some math is definitely needed, it does not need to be at a PhD level.
Medical or biological knowledge helps only if those are the applications of DL that you seek out (which I do).
Understanding some basic machine learning and data science practices certainly helps, but the relevant material is covered in all DL courses that I have tried.
However, what helps immensely in any case is proficiency with git, Github, Linux, as well as general programming and data processing skills.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I hope that still being close to the beginning of my DL journey makes me in some way more helpful to the absolute beginner (which I too was just a few months ago)… Maybe right now I have some perspective that may get lost should I become a DL expert… <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Alexej Gossmann
Probabilistic interpretation of AUC
2018-01-25T01:00:00-05:00
https://www.alexejgossmann.com/auc
<p>Unfortunately this was not taught in any of my statistics or data analysis classes at university (wtf it so needs to be :scream_cat:).
So it took me some time until I learned that the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve">AUC</a> has a nice probabilistic meaning.</p>
<h2 id="whats-auc-anyway">What’s AUC anyway?</h2>
<p>AUC is the <strong>a</strong>rea <strong>u</strong>nder the ROC <strong>c</strong>urve. The ROC curve is the <strong>r</strong>eceiver <strong>o</strong>perating <strong>c</strong>haracteristic curve. AUC is simply the area between that curve and the x-axis. So, to understand AUC we need to look at the concept of an ROC curve.</p>
<p>Consider:</p>
<ol>
<li>A dataset <script type="math/tex">S</script> : <script type="math/tex">(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n) \in \mathbb{R}^p \times \{0, 1\}</script>, where
<ul>
<li><script type="math/tex">\mathbf{x}_i</script> is a vector of <script type="math/tex">p</script> features collected for the <script type="math/tex">i</script>th subject,</li>
<li><script type="math/tex">y_i</script> is the <script type="math/tex">i</script>th subject’s label (binary outcome variable of interest, like a disease status, class membership, or whatever binary label).</li>
</ul>
</li>
<li>A classification algorithm (such as logistic regression, SVM, deep neural net, or whatever you like), trained on <script type="math/tex">S</script>, that assigns a score (or a “probability”) <script type="math/tex">\hat{p}(\mathbf{x}_{\ast})</script> to any new observation <script type="math/tex">\mathbf{x}_{\ast} \in \mathbb{R}^p</script> signifying the algorithm’s confidence that the label (or class) of <script type="math/tex">\mathbf{x}_{\ast}</script> is <script type="math/tex">y_{\ast} = 1</script>.</li>
</ol>
<p>Then:</p>
<ol>
<li>A <em>decision threshold</em> (or <em>operating point</em>) can be chosen to assign a class label (<script type="math/tex">y_{\ast} = 0</script> or <script type="math/tex">1</script>) to <script type="math/tex">\mathbf{x}_{\ast}</script> based on the value of <script type="math/tex">\hat{p}(\mathbf{x}_{\ast})</script>.
The chosen threshold determines the balance between how many <em>false positives</em> and <em>false negatives</em> will result from this classification.</li>
<li>Plotting the <em>true positive rate</em> (TPR) against the <em>false positive rate</em> (FPR) <em>as the operating point changes from its minimum to its maximum value</em> yields the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic"><em>receiver operating characteristic (ROC) curve</em></a>. Check the <a href="https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Confusion_matrix">confusion matrix</a> if you are not sure what TPR and FPR refer to.</li>
<li>The area under the ROC curve, or AUC, is used as a measure of classifier performance.</li>
</ol>
<p>Here is some R code for clarification:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># load some data, fit a logistic regression classifier</span><span class="w">
</span><span class="n">data</span><span class="p">(</span><span class="n">iris</span><span class="p">)</span><span class="w">
</span><span class="n">versicolor_virginica</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">iris</span><span class="p">[</span><span class="n">iris</span><span class="o">$</span><span class="n">Species</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="s2">"setosa"</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="n">logistic_reg_fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glm</span><span class="p">(</span><span class="n">Species</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">Sepal.Width</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">Sepal.Length</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">versicolor_virginica</span><span class="p">,</span><span class="w">
</span><span class="n">family</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"binomial"</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">versicolor_virginica</span><span class="o">$</span><span class="n">Species</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s2">"versicolor"</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y_pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">logistic_reg_fit</span><span class="o">$</span><span class="n">fitted.values</span><span class="w">
</span><span class="c1"># get TPR and FPR at different values of the decision threshold</span><span class="w">
</span><span class="n">threshold</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">FPR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">threshold</span><span class="p">,</span><span class="w">
</span><span class="k">function</span><span class="p">(</span><span class="n">thresh</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">y_pred</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">thresh</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="n">TPR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">threshold</span><span class="p">,</span><span class="w">
</span><span class="k">function</span><span class="p">(</span><span class="n">thresh</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">y_pred</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="n">thresh</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="c1"># plot an ROC curve</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">FPR</span><span class="p">,</span><span class="w"> </span><span class="n">TPR</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">FPR</span><span class="p">,</span><span class="w"> </span><span class="n">TPR</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>A rather ugly ROC curve emerges:</p>
<p><img src="/images/20180124-AUC/ROC.png?raw=true" alt="ROC curve R example" title="An ugly ROC curve" /></p>
<p>The area under the ROC curve, or AUC, seems like a nice heuristic to evaluate and compare the overall performance of classification models independent of the exact decision threshold chosen. <script type="math/tex">\mathrm{AUC} = 1.0</script> signifies perfect separation of the two classes, and <script type="math/tex">\mathrm{AUC} = 0.5</script> is what you get from making classification decisions via coin toss (or rather a continuous coin that outputs values in <script type="math/tex">[0,1]</script>…).
Most classification algorithms will result in an AUC in that range.
But there’s more to it.</p>
<h2 id="probabilistic-interpretation">Probabilistic interpretation</h2>
<p>As above, assume that we are looking at a dataset where we want to distinguish data points of <em>type 0</em> from those of <em>type 1</em>. Consider a classification algorithm that assigns to a random observation <script type="math/tex">\mathbf{x}\in\mathbb{R}^p</script> a score (or probability) <script type="math/tex">\hat{p}(\mathbf{x}) \in [0,1]</script> signifying membership in <em>class 1</em>. If the final classification between <em>class 1</em> and <em>class 0</em> is determined by a decision threshold <script type="math/tex">t\in[0, 1]</script>, then the <em>true positive rate</em> (a.k.a. <em>sensitivity</em> or <em>recall</em>) can be written as a conditional probability</p>
<script type="math/tex; mode=display">T(t) := P[\hat{p}(\mathbf{x}) > t \,|\, \mathbf{x}\,\text{belongs to class 1}],</script>
<p>and the <em>false positive rate</em> (or <em>1 - specificity</em>) can be written as</p>
<script type="math/tex; mode=display">F(t) := P[\hat{p}(\mathbf{x}) > t \,|\, \mathbf{x}\,\text{does not belong to class 1}].</script>
<p>For brevity of notation let’s say <script type="math/tex">y(\mathbf{x}) = 1</script> instead of “<script type="math/tex">\mathbf{x}</script> belongs to class 1”, and <script type="math/tex">y(\mathbf{x})=0</script> instead of “<script type="math/tex">\mathbf{x}</script> doesn’t belong to class 1”.</p>
<p>The ROC curve simply plots <script type="math/tex">T(t)</script> against <script type="math/tex">F(t)</script> while varying <script type="math/tex">t</script> from 0 to 1.
Thus, <strong>if we view <script type="math/tex">T</script> as a function of <script type="math/tex">F</script></strong>, the AUC can be rewritten as follows.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\mathrm{AUC} &=& \int_0^1 T(F_0) \,\mathrm{d}F_0 \nonumber \\
&=& \int_0^1 P[\hat{p}(\mathbf{x}) > F^{-1}(F_0) \,|\, y(\mathbf{x}) = 1] \,\mathrm{d}F_0 \nonumber \\
&=& \int_1^0 P[\hat{p}(\mathbf{x}) > F^{-1}(F(t)) \,|\, y(\mathbf{x}) = 1] \cdot \frac{\partial F(t)}{\partial t} \,\mathrm{d}t \nonumber \\
&=& \int_0^1 P[\hat{p}(\mathbf{x}) > t \,|\, y(\mathbf{x}) = 1] \cdot P[\hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x^{\prime}}) = 0] \,\mathrm{d}t \nonumber \\
&=& \int_0^1 P[\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}}) \,\&\, \hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x}) = 1 \,\&\, y(\mathbf{x^{\prime}}) = 0] \,\mathrm{d}t \nonumber \\
&=& P[\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}}) \,|\, y(\mathbf{x}) = 1 \,\&\, y(\mathbf{x^{\prime}}) = 0], \nonumber
\end{eqnarray} %]]></script>
<p>where we used the fact that the probability density function</p>
<script type="math/tex; mode=display">P[\hat{p}(\mathbf{x^{\prime}}) = t \,|\, y(\mathbf{x^{\prime}}) = 0] =: f(t)</script>
<p>is the derivative with respect to <script type="math/tex">t</script> of the cumulative distribution function</p>
<script type="math/tex; mode=display">P[\hat{p}(\mathbf{x^{\prime}}) \leq t \,|\, y(\mathbf{x^{\prime}}) = 0] = 1-F(t).</script>
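<p>To spell out that step (assuming the class 0 scores have a density <script type="math/tex">f</script> and that <script type="math/tex">F</script> is invertible, so that <script type="math/tex">F^{-1}(F(t)) = t</script>): differentiating <script type="math/tex">1 - F(t)</script> gives <script type="math/tex">F^{\prime}(t) = -f(t)</script>, and this minus sign is what swaps the integration limits between the third and the fourth line of the derivation,</p>
<script type="math/tex; mode=display">\int_1^0 P[\hat{p}(\mathbf{x}) > t \,|\, y(\mathbf{x}) = 1] \cdot \frac{\partial F(t)}{\partial t} \,\mathrm{d}t = \int_0^1 P[\hat{p}(\mathbf{x}) > t \,|\, y(\mathbf{x}) = 1] \cdot f(t) \,\mathrm{d}t.</script>
<p>The fifth line additionally treats <script type="math/tex">\mathbf{x}</script> and <script type="math/tex">\mathbf{x^{\prime}}</script> as independent draws, which is what allows the product of the two conditional terms to be written as a single joint term.</p>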
<p>So, given a randomly chosen observation <script type="math/tex">\mathbf{x}</script> belonging to <em>class 1</em>, and a randomly chosen observation <script type="math/tex">\mathbf{x^{\prime}}</script> belonging to <em>class 0</em>, the AUC is the probability that the evaluated classification algorithm will assign a higher score to <script type="math/tex">\mathbf{x}</script> than to <script type="math/tex">\mathbf{x^{\prime}}</script>, i.e., the conditional probability of <script type="math/tex">\hat{p}(\mathbf{x}) > \hat{p}(\mathbf{x^{\prime}})</script>.</p>
<p>An alternative <em>purely geometric</em> proof can be found in the <a href="https://madrury.github.io/jekyll/update/statistics/2017/06/21/auc-proof.html">Scatterplot Smoothers blog</a>.</p>
<p>In other words, if the classification algorithm distinguishes “positive” and “negative” examples (e.g., disease status), then</p>
<blockquote>
<p>AUC is the probability of correct ranking of a random “positive”-“negative” pair.</p>
</blockquote>
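<p>As a quick sanity check of this interpretation, here is a minimal simulation sketch. The scores are made up (normally distributed, purely an assumption for illustration, so they are not confined to <script type="math/tex">[0,1]</script>), and the names <code class="highlighter-rouge">pos</code>, <code class="highlighter-rouge">neg</code>, <code class="highlighter-rouge">exact</code>, and <code class="highlighter-rouge">mc</code> are hypothetical. It compares the proportion of correctly ranked pairs with the frequency of correct rankings among randomly drawn “positive”-“negative” pairs.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
# simulated scores (an assumption for illustration): positives tend to score higher
pos <- rnorm(200, mean = 1)
neg <- rnorm(300, mean = 0)
# proportion of correctly ranked pairs among all 200 * 300 pairs
exact <- mean(outer(pos, neg, ">"))
# Monte Carlo: draw random "positive"-"negative" pairs and check their ranking
i <- sample(length(pos), 1e5, replace = TRUE)
j <- sample(length(neg), 1e5, replace = TRUE)
mc <- mean(pos[i] > neg[j])
# the two numbers should be close to each other, and near the
# population value pnorm(1 / sqrt(2)) (about 0.76) for these two normals
c(exact = exact, monte_carlo = mc)
</code></pre></div></div>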
<h2 id="computing-auc">Computing AUC</h2>
<p>The above probabilistic interpretation suggests a simple formula to compute AUC on a finite sample:</p>
<blockquote>
<p>Among all “positive”-“negative” pairs in the dataset compute the proportion of those which are ranked correctly by the evaluated classification algorithm.</p>
</blockquote>
<p>Here is an inefficient implementation using results from the above logistic regression example:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">y_pred</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">y_pred</span><span class="p">[</span><span class="n">j</span><span class="p">])</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">y_pred</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">y_pred</span><span class="p">[</span><span class="n">j</span><span class="p">])</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0.5</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">))</span><span class="w">
</span><span class="n">s</span><span class="w">
</span><span class="c1"># [1] 0.7918</span><span class="w">
</span></code></pre></div></div>
<p>The <strong>proportion of correctly ranked “positive”-“negative” pairs</strong> yields an estimated <script type="math/tex">\mathrm{AUC} = 0.7918</script>.</p>
<p>We can compare this value to the area under the ROC curve computed with the <a href="https://en.wikipedia.org/wiki/Trapezoidal_rule">trapezoidal rule</a>.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">FPR</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dFPR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">FPR</span><span class="p">[</span><span class="n">i</span><span class="m">+1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">FPR</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="n">s</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dFPR</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="n">TPR</span><span class="p">[</span><span class="n">i</span><span class="m">+1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">TPR</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">s</span><span class="w">
</span><span class="c1"># [1] 0.7922</span><span class="w">
</span></code></pre></div></div>
<p>The <strong>trapezoidal rule</strong> yields an estimated <script type="math/tex">\mathrm{AUC} = 0.7922</script>. The difference of <script type="math/tex">0.0004</script> can be explained by the fact that we evaluated the ROC curve at only 100 points.</p>
<p>Since there is a minor disagreement, let’s use some standard R package to compute AUC.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ROCR</span><span class="p">)</span><span class="w">
</span><span class="n">pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">prediction</span><span class="p">(</span><span class="n">y_pred</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">auc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">performance</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span><span class="w"> </span><span class="n">measure</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"auc"</span><span class="p">)</span><span class="o">@</span><span class="n">y.values</span><span class="p">)</span><span class="w">
</span><span class="n">auc</span><span class="w">
</span><span class="c1"># [1] 0.7918</span><span class="w">
</span></code></pre></div></div>
<p>Same as the proportion of correctly ranked pairs! :grin:</p>
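<p>The small gap between the trapezoidal rule and the pair counting above was attributed to the 100-point threshold grid. The following sketch illustrates that explanation on a tiny made-up dataset (the vectors <code class="highlighter-rouge">scores</code> and <code class="highlighter-rouge">labels</code> are hypothetical, not the iris example): if the ROC curve is evaluated at every distinct score instead of on a fixed grid, the trapezoidal rule should reproduce the proportion of correctly ranked pairs (with ties counted as one half). The pair counting is also written without loops, using <code class="highlighter-rouge">outer</code>.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># toy scores and labels, made up for illustration (not the iris example)
scores <- c(0.9, 0.7, 0.35, 0.8, 0.4, 0.35, 0.1)
labels <- c(1, 1, 1, 0, 0, 0, 0)
pos <- scores[labels == 1]
neg <- scores[labels == 0]
# proportion of correctly ranked pairs, ties counted as one half
auc_pairs <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
# ROC points at every distinct score; Inf makes the curve start at (0, 0)
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
TPR <- sapply(thresholds, function(thresh) mean(pos >= thresh))
FPR <- sapply(thresholds, function(thresh) mean(neg >= thresh))
# trapezoidal rule over consecutive ROC points
auc_trapz <- sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)
# both equal 17/24, about 0.7083
c(pairs = auc_pairs, trapezoid = auc_trapz)
</code></pre></div></div>
<p>Note that <code class="highlighter-rouge">outer</code> builds a full matrix of all pairs, so this is compact rather than memory-efficient.</p>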
<h4 id="wilcoxon-mann-whitney-test">Wilcoxon-Mann-Whitney test</h4>
<p>By analysing the probabilistic meaning of AUC, we not only got a practically relevant interpretation of this classification performance metric, but we also obtained a simple formula to estimate the AUC of a trained classification algorithm.
Well, it turns out that the proportion of correctly ranked “positive”-“negative” pairs <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Area-under-curve_(AUC)_statistic_for_ROC_curves">is equivalent to the test statistic of the Wilcoxon-Mann-Whitney statistical test</a>, divided by the number of pairs.
This fact can also be easily demonstrated in a couple of lines of R code.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y_is_1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y_is_0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">n_pairs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y_is_1</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y_is_0</span><span class="p">)</span><span class="w">
</span><span class="n">WMW_test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">wilcox.test</span><span class="p">(</span><span class="n">y_pred</span><span class="p">[</span><span class="n">y_is_1</span><span class="p">],</span><span class="w"> </span><span class="n">y_pred</span><span class="p">[</span><span class="n">y_is_0</span><span class="p">])</span><span class="w">
</span><span class="n">WMW_test</span><span class="o">$</span><span class="n">statistic</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">n_pairs</span><span class="w">
</span><span class="c1"># W</span><span class="w">
</span><span class="c1"># 0.7918</span><span class="w">
</span></code></pre></div></div>
<p>Same answer!</p>
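<p>The connection to the Wilcoxon-Mann-Whitney statistic also suggests a way to avoid the double loop: the statistic can be obtained from the ranks of the scores. Below is a minimal sketch on made-up toy data (the vectors <code class="highlighter-rouge">scores</code> and <code class="highlighter-rouge">labels</code> are hypothetical), assuming the usual rank-sum formula with average ranks for ties.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># toy scores and labels, made up for illustration (not the iris example)
scores <- c(0.9, 0.7, 0.35, 0.8, 0.4, 0.35, 0.1)
labels <- c(1, 1, 1, 0, 0, 0, 0)
n1 <- sum(labels == 1)
n0 <- sum(labels == 0)
# rank() assigns average ranks to tied scores by default
r <- rank(scores)
# rank sum of the positives minus its smallest possible value
U <- sum(r[labels == 1]) - n1 * (n1 + 1) / 2
auc_rank <- U / (n1 * n0)
# 8.5 / 12, about 0.7083
auc_rank
</code></pre></div></div>
<p>Since this only requires ranking the scores once, it scales better to large datasets than comparing every “positive”-“negative” pair explicitly.</p>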
<h2 id="so-what-why-care-about-auc-anyway">So what? Why care about AUC anyway?</h2>
<ul>
<li>It has a really nice probabilistic meaning! :wink:</li>
</ul>
<p>Besides (arguably more importantly), as a measure of classification performance AUC has many advantages compared to other “single number” performance measures:</p>
<ul>
<li>Independence of the decision threshold.</li>
<li>Invariance to prior class probabilities or class prevalence in the data.</li>
<li>Freedom to choose or change the decision threshold after model training, for example based on a cost-benefit analysis.</li>
<li>Extensively used in machine learning, and in medical research – and that for good reasons, as for example explained in an <a href="https://lukeoakdenrayner.wordpress.com/2017/12/06/do-machines-actually-beat-doctors-roc-curves-and-performance-metrics/">excellent blog post on deep learning research in medicine by Luke Oakden-Rayner</a>.</li>
</ul>
Alexej Gossmann
Mining USPTO full text patent data - An exploratory analysis of machine learning and AI related patents granted in 2017 so far
2017-09-22T00:00:00-04:00
https://www.alexejgossmann.com/patents_part_1
<p><a href="https://www.uspto.gov">The United States Patent and Trademark Office (USPTO)</a> provides <em>immense</em> amounts of data (the data I used are in the form of XML files). After coming across these datasets, I thought that it would be a good idea to explore where and how my areas of interest fall into the intellectual property space; my areas of interest being machine learning (ML), data science, and artificial intelligence (AI).</p>
<p>I started this exploration by downloading the full text data (excluding images) for all patents that were granted by the USPTO within the year 2017 up to the time of writing (<a href="https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/">Patent Grant Full Text Data/XML for the year 2017 through the week of Sept 12 from the USPTO Bulk Data Storage System</a>).</p>
<p>In this blog post I address questions such as: How many ML and AI related patents were granted? Who are the most prolific inventors? The most frequent patent assignees? Where are inventions made? And when? Is the number of ML and AI related patents increasing over time? How long does it take to obtain a patent for a ML or AI related invention? Is the patent examination time shorter for big tech companies? Etc.</p>
<h2 id="identifying-patents-related-to-machine-learning-and-ai">Identifying patents related to machine learning and AI</h2>
<p>First, I curated a patent full text dataset consisting of “machine learning and AI related” patents.
I am not just looking for instances where actual machine learning or AI algorithms were patented; I am looking for inventions which are <em>related to ML or AI in any/some capacity</em>. That is, I am interested in patents where machine learning, data mining, predictive modeling, or AI is <em>utilized as a part of the invention in any way whatsoever</em>. The subset of relevant patents was determined by a keyword search as specified by the following definition.</p>
<p><strong>Definition:</strong> For the purposes of this blog post, <em>a machine learning or AI related patent</em> is a patent that contains at least one of the keywords
<em>“machine learning”, “deep learning”, “neural network”, “artificial intelligence”, “statistical learning”, “data mining”, or “predictive model”</em>
in its invention title, description, or claims text (while of course accounting for capitalization, pluralization, etc.).<sup id="fnref:keywords"><a href="#fn:keywords" class="footnote">1</a></sup></p>
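<p>As a rough illustration of the definition above, here is a minimal base R sketch of such a keyword match. It is not the code that was used for this post: the helper name <code class="highlighter-rouge">is_ml_ai_related</code>, the regular expression, and the example strings are all assumptions made for illustration, and the optional trailing “s” is only a crude way of accounting for pluralization.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Keywords from the definition above
keywords <- c("machine learning", "deep learning", "neural network",
              "artificial intelligence", "statistical learning",
              "data mining", "predictive model")

# One case-insensitive pattern; "s?" crudely allows for plural forms
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")s?\\b")

# Hypothetical helper: TRUE if any keyword occurs in the text
is_ml_ai_related <- function(text) {
  grepl(pattern, text, ignore.case = TRUE, perl = TRUE)
}

# Made-up example strings (not taken from the USPTO data)
texts <- c("A system using Neural Networks for image processing",
           "Fuselage manufacturing system",
           "Code 123456789 AI N1")
is_ml_ai_related(texts)  # TRUE FALSE FALSE

# How often each keyword is matched across the example texts
sapply(keywords, function(k) sum(grepl(k, texts, ignore.case = TRUE)))
</code></pre></div></div>
<p>Note that the bare string “AI” is deliberately not part of the pattern, so the third example is not matched.</p>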
<p>With this keyword matching approach a total of 6665 patents have been selected. The bar graph below shows how many times each keyword got matched.</p>
<p><img src="/images/2017-9-12-patents_part_1/keyword_match_freq.png" alt="plot of keyword match frequencies" /></p>
<p>Interestingly the term “neural network” is even more common than the more general terms “machine learning” and “artificial intelligence”.</p>
<h3 id="some-example-patents">Some example patents</h3>
<p>Here are three (randomly chosen) patents from the resulting dataset. For each, the invention title, the patent assignee, and one instance of the keyword match within the patent text are printed.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## $`2867`
## [1] "Fuselage manufacturing system"
## [2] "THE BOEING COMPANY"
## [3] "... using various techniques. For example, at least
## one of an artificial intelligence program, a
## knowledgebase, an expert ..."
##
## $`1618`
## [1] "Systems and methods for validating wind farm
## performance measurements"
## [2] "General Electric Company"
## [3] "... present disclosure leverages and fuses
## accurate available sensor data using machine
## learning algorithms. That is, the more ..."
##
## $`5441`
## [1] "Trigger repeat order notifications"
## [2] "Accenture Global Services Limited"
## [3] "... including user data obtained from a user
## device; obtaining a predictive model that
## estimates a likelihood of ..."</code></pre></figure>
<p>And here are three examples of (randomly picked) patents that contain the relevant keywords directly in their invention title.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## $`5742`
## [1] "Adaptive demodulation method and apparatus using an
## artificial neural network to improve data recovery
## in high speed channels"
## [2] "Kelquan Holdings Ltd."
## [3] "... THE INVENTION\nh-0002\n1 The present invention
## relates to a neural network based integrated
## demodulator that mitigates ..."
##
## $`3488`
## [1] "Cluster-trained machine learning for image processing"
## [2] "Amazon Technologies, Inc."
## [3] "... BACKGROUND\nh-0001\n1 Artificial neural networks,
## especially deep neural network ..."
##
## $`3103`
## [1] "Methods and apparatus for machine learning based
## malware detection"
## [2] "Invincea, Inc."
## [3] "... a need exists for methods and apparatus that can
## use machine learning techniques to reduce the amount ..."</code></pre></figure>
<h2 id="who-holds-these-patents-inventors-and-assignees">Who holds these patents (inventors and assignees)?</h2>
<p>The first question I would like to address is who files most of the machine learning and AI related patents.</p>
<p>Each patent specifies one or several <em>inventors</em>, i.e., the individuals who made the patented invention, and a patent <em>assignee</em> which is typically the inventors’ employer company that holds the rights to the patent. The following bar graph visualizes the top 20 most prolific inventors and the top 20 most frequent patent assignees among the analyzed ML and AI related patents.</p>
<p><img src="/images/2017-9-12-patents_part_1/unnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>It isn’t surprising to see this list of companies. The likes of IBM, Google, Amazon, Microsoft, Samsung, and AT&T rule the machine learning and AI patent space. I have to admit that I don’t recognize any of the inventors’ names (but it might just be me not being familiar enough with the ML and AI community).</p>
<p>There are a number of interesting follow-up questions which for now I leave unanswered (hard to answer without additional data):</p>
<ul>
<li>What is the count of ML and AI related patents by industry or type of company (e.g., big tech companies vs. startups vs. research universities vs. government)?</li>
<li>Who is deriving the most financial benefit by holding ML or AI related patents (either through licensing or by driving out the competition)?</li>
</ul>
<h2 id="where-do-these-inventions-come-from-geographically">Where do these inventions come from (geographically)?</h2>
<p>Even though the examined patents were filed in the US, some of the inventions may have been made outside of the US.
In fact, the data includes specific geographic locations for each patent, so I can map in which cities within the US and the world inventors are most active.
The following figure is based on where the inventors are from, and shows the most active spots. Each point corresponds to the total number of inventions made at that location (though note that the color axis is a log10 scale, and so is the point size).</p>
<p><img src="/images/2017-9-12-patents_part_1/locations.png" alt="plot of chunk unnamed-chunk-16" /></p>
<p>The results aren’t that surprising: most (ML and AI related) inventions patented with the USPTO were made in the US.
I wonder if inventors in other countries prefer to file patents in their home countries’ patent offices rather than in the US.</p>
<p>Alternatively, we can also map the number of patents per inventors’ origin countries.</p>
<p><img src="/images/2017-9-12-patents_part_1/countries.png" alt="plot of chunk unnamed-chunk-17" /></p>
<p>Sadly, entire groups of countries (e.g., almost the entire African continent) appear to be excluded from the USPTO’s patent game, at least with respect to machine learning and AI related inventions.
Whether it is a lack of access, infrastructure, education, political treaties or something else is an intriguing question.</p>
<h2 id="patent-application-and-publication-dates-and-duration-of-patent-examination-process">Patent application and publication dates, and duration of patent examination process</h2>
<p>Each patent has a <em>date of filing</em> and an <em>assignment date</em> attached to it. Based on the provided dates one can try to address questions such as:
When were these patents filed? Is the number of ML and AI related patents increasing over time? How long did it usually take from patent filing to assignment? And so on.</p>
<p>For the set of ML and AI related patents <em>that were granted between Jan 3 and Sept 12 2017</em> the following figure depicts…</p>
<ul>
<li>…in the top panel: number of patents (y-axis) per their original <em>month of filing</em> (x-axis),</li>
<li>…in the bottom panel: the number of patents (y-axis) that were <em>assigned</em> (approved) per week (x-axis) in 2017 so far.</li>
</ul>
<p><img src="/images/2017-9-12-patents_part_1/unnamed-chunk-7-1.png" alt="plot of chunk unnamed-chunk-7" /></p>
<p>The patent publication dates plot (bottom panel) suggests that the number of ML and AI related patents is increasing slightly throughout the year 2017.
The patent application dates plot (top panel) suggests that the patent examination phase for the considered patents takes about 2.5 years. In fact, the average time from patent filing to approval is 2.83 years with a standard deviation of 1.72 years in this dataset (that is, among the considered ML and AI related patents in 2017). However, the range is quite wide, spanning 0.24 to 12.57 years.</p>
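<p>For illustration, this is roughly how such durations can be computed from a filing date and a grant date in base R. It is only a sketch under assumptions: the dates below are made up (they are not taken from the USPTO data), the variable names are hypothetical, and a year is taken to be 365.25 days.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Made-up filing and grant dates (not taken from the USPTO data)
filed   <- as.Date(c("2014-03-10", "2016-11-01", "2005-01-20"))
granted <- as.Date(c("2017-01-03", "2017-02-14", "2017-08-15"))

# Subtracting Date objects gives a difference in days;
# convert it to years assuming 365.25 days per year
years_to_approval <- as.numeric(granted - filed) / 365.25

# Summary statistics of the kind quoted above
c(mean = mean(years_to_approval), sd = sd(years_to_approval),
  min = min(years_to_approval), max = max(years_to_approval))
</code></pre></div></div>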
<p>The distribution of the duration from patent filing date to approval is depicted by the following figure.</p>
<p><img src="/images/2017-9-12-patents_part_1/unnamed-chunk-9-1.png" alt="plot of chunk unnamed-chunk-9" /></p>
<p>So, what are some of the inventions that took longest to get approved? Here are the five patents with the longest examination periods:</p>
<ul>
<li>“Printing and dispensing system for an electronic gaming device that provides an undisplayed outcome” (~12.57 years to approval; assignee: Aim Management, Inc.)</li>
<li>“Apparatus for biological sensing and alerting of pharmaco-genomic mutation” (~12.24 years to approval; assignee: NA)</li>
<li>“System for tracking a player of gaming devices” (~12.06 years to approval; assignee: Aim Management, Inc.)</li>
<li>“Device, method, and computer program product for customizing game functionality using images” (~11.72 years to approval; assignee: NOKIA TECHNOLOGIES OY)</li>
<li>“Method for the spectral identification of microorganisms” (~11.57 years to approval; assignee: MCGILL UNIVERSITY, and HEALTH CANADA)</li>
</ul>
<p>Each of these patents is related to either gaming or biotech. I wonder if that’s a coincidence…</p>
<p>We can also look at the five patents with the shortest approval time:</p>
<ul>
<li>“Home device application programming interface” (~91 days to approval; assignee: ESSENTIAL PRODUCTS, INC.)</li>
<li>“Avoiding dazzle from lights affixed to an intraoral mirror, and applications thereof” (~104 days to approval; assignee: DENTAL SMARTMIRROR, INC.)</li>
<li>“Media processing methods and arrangements” (~106 days to approval; assignee: Digimarc Corporation)</li>
<li>“Machine learning classifier that compares price risk score, supplier risk score, and item risk score to a threshold” (~111 days to approval; assignee: ACCENTURE GLOBAL SOLUTIONS LIMITED)</li>
<li>“Generating accurate reason codes with complex non-linear modeling and neural networks” (~111 days to approval; assignee: SAS INSTITUTE INC.)</li>
</ul>
<p>Interestingly, the patent approved in the shortest amount of time among all 6665 analyzed (ML and AI related) patents is some smart home thingy from <a href="https://en.wikipedia.org/wiki/Andy_Rubin">Andy Rubin’s</a> hyped up company Essential.</p>
<h3 id="do-big-tech-companies-get-their-patents-approved-faster-than-other-companies-eg-startups">Do big tech companies get their patents approved faster than other companies (e.g., startups)?</h3>
<p>The following figure separates the patent approval times according to the respective assignee company, considering several of the most well known tech giants.</p>
<p><img src="/images/2017-9-12-patents_part_1/unnamed-chunk-15-1.png" alt="plot of chunk unnamed-chunk-15" /></p>
<p>Indeed, some big tech companies, such as AT&T or Samsung, manage to push their patent applications through the USPTO process much faster than most other companies. However, there are other tech giants, such as Microsoft, which on average take longer to get their patent applications approved than even the companies in category “Other”. Also noteworthy is the fact that big tech companies tend to have fewer outliers regarding the patent examination process duration than companies in the category “Other”.</p>
<p>Of course it would also be interesting to categorize all patent assignees into categories like “Startup”, “Big Tech”, “University”, or “Government”, and compare the typical duration of the patent examination process between such groups. However, it’s not clear to me how to establish such categories without collecting additional data on each patent assignee, which at this point I don’t have time for :stuck_out_tongue:.</p>
<h2 id="conclusion">Conclusion</h2>
<p>There is definitely a lot of promise in the USPTO full text patent data.
Here I have barely scratched the surface, and I hope that I will find the time to play around with these data some more.
The end goal is, of course, to replace the patent examiner with an AI trained on historical patent data. :stuck_out_tongue_closed_eyes:</p>
<hr />
<p><a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work (blog post and included figures) is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.</p>
<div class="footnotes">
<ol>
<li id="fn:keywords">
<p>There are two main aspects to my reasoning as to this particular choice of keywords. (1) I wanted to keep the list relatively short in order to have a more specific search, and (2) I tried to avoid keywords which may generate false positives (e.g., the term “AI” would match all sorts of codes present in the patent text, such as “123456789 AI N1”). In no way am I claiming that this is a perfect list of keywords to identify ML and AI related patents, but I think that it’s definitely a good start. <a href="#fnref:keywords" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Alexej GossmannFreedman’s paradox2017-06-05T20:00:00-04:002017-06-05T20:00:00-04:00https://www.alexejgossmann.com/Freedmans_paradox<p>Recently I came across the classical 1983 paper <a href="http://www.public.asu.edu/~gasweete/crj604/readings/1983-Freedman%20(Screening%20Regression%20Equations).pdf"><em>A note on screening regression equations</em></a> by <a href="https://en.wikipedia.org/wiki/David_A._Freedman">David Freedman</a>. Freedman shows in an impressive way the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into another procedure performed on the same data. As a concrete example Freedman considers the practice of performing variable selection first, and then fitting another model using only the identified variables on the same data that was used to identify them in the first place. Because of the unexpectedly high severity of the problem this phenomenon became known as <a href="https://en.wikipedia.org/wiki/Freedman%27s_paradox">“Freedman’s paradox”</a>. Moreover, in his paper Freedman derives asymptotic estimates for the resulting errors.</p>
<p>The 1983 paper presents a simulation with only 10 repetitions. But in the present day it is very easy (both in terms of computational time and implementation difficulty) to reproduce the simulation with many more repetitions (even my phone’s computational power is probably higher than that of the high performance computer that Freedman used in the 80’s). We also have more convenient ways to visualize the results than in the 80’s. So let’s do it.</p>
<p>I am going to use a few R packages (most notably the package <code class="highlighter-rouge">broom</code> to fit and analyze many many linear models in a single step).</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">broom</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyr</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">20170605</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The considered data structure is the following:</p>
<ul>
<li>A matrix of predictors with 100 rows and 50 columns is generated with independent standard normal entries.</li>
<li>The response variable is generated independently of the model matrix (also from the standard normal distribution), i.e., the true answer is that there is no relationship between predictors and response.</li>
</ul>
<p>Instead of Freedman’s 10 repetitions we perform 1000. So let’s generate all 1000 datasets at once, stacked in one large data frame:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">n_row</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="c1"># n_col is set to 51 because the 51st column will serve as y</span><span class="w">
</span><span class="n">n_col</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">51</span><span class="w">
</span><span class="n">n_rep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="c1"># a stack of matrices for all n_rep repetitions is generated...</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="n">n_rep</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n_row</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n_col</span><span class="p">),</span><span class="w"> </span><span class="n">n_rep</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">n_row</span><span class="p">,</span><span class="w"> </span><span class="n">n_col</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="s2">"X"</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n_col</span><span class="p">)</span><span class="w">
</span><span class="c1"># ...and then transformed to a data frame with a repetition number column</span><span class="w">
</span><span class="n">X_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as_data_frame</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">repetition</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">n_rep</span><span class="p">,</span><span class="w"> </span><span class="n">each</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n_row</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p><strong>The data are analyzed in two successive linear models, the second (illegally) reusing the results of the first.</strong></p>
<p><em>The first model fit.</em>
After the 1000 ordinary linear models are fit to the data, we record for each of them the R squared, the F test statistic with corresponding p-value, and the t test statistics with p-values for the individual regression coefficients.</p>
<p>Using functions from the <code class="highlighter-rouge">broom</code> package we can fit and extract information from all 1000 models at once.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># all models can be fit at once...</span><span class="w">
</span><span class="n">models_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">group_by</span><span class="p">(</span><span class="n">repetition</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">do</span><span class="p">(</span><span class="n">full_model</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">X51</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">.</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">repetition</span><span class="p">)))</span><span class="w">
</span><span class="c1"># ...then the results are extracted</span><span class="w">
</span><span class="n">model_coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tidy</span><span class="p">(</span><span class="n">models_df</span><span class="p">,</span><span class="w"> </span><span class="n">full_model</span><span class="p">)</span><span class="w">
</span><span class="n">model_statistics</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glance</span><span class="p">(</span><span class="n">models_df</span><span class="p">,</span><span class="w"> </span><span class="n">full_model</span><span class="p">)</span><span class="w">
</span><span class="n">model_statistics</span><span class="o">$</span><span class="n">data_reuse</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">model_statistics</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p><em>The second model fit.</em>
For each one of the first 1000 models, the corresponding second linear model is fit using only those variables which have p-values significant at the 25% level in the first model.
That is, the second model uses the first model for variable selection.</p>
<p>This gives us 1000 reduced re-fitted linear models. We record the same model statistics (R squared, F, and t tests) as for the first group of models.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">reduced_models</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="n">n_rep</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">full_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X_df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">repetition</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">i</span><span class="p">)</span><span class="w">
</span><span class="n">significant_coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">model_coefs</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">repetition</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">p.value</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">0.25</span><span class="p">)</span><span class="w">
</span><span class="n">reduced_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">select</span><span class="p">(</span><span class="n">full_data</span><span class="p">,</span><span class="w">
</span><span class="n">one_of</span><span class="p">(</span><span class="n">unlist</span><span class="p">(</span><span class="n">significant_coefs</span><span class="p">[</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="s2">"term"</span><span class="p">])),</span><span class="w"> </span><span class="n">X51</span><span class="p">)</span><span class="w">
</span><span class="n">reduced_models</span><span class="p">[[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">X51</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">reduced_data</span><span class="p">)</span><span class="w">
</span><span class="n">tmp_df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">glance</span><span class="p">(</span><span class="n">reduced_models</span><span class="p">[[</span><span class="n">i</span><span class="p">]])</span><span class="w">
</span><span class="n">tmp_df</span><span class="o">$</span><span class="n">repetition</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">i</span><span class="w">
</span><span class="n">tmp_df</span><span class="o">$</span><span class="n">data_reuse</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
</span><span class="n">model_statistics</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bind_rows</span><span class="p">(</span><span class="n">model_statistics</span><span class="p">,</span><span class="w"> </span><span class="n">tmp_df</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
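<p>As a sanity check on the screening step: with 50 pure-noise predictors and a 25% significance level, about 0.25 × 50 = 12.5 predictors should survive the screening on average, because under the null hypothesis each coefficient’s p-value is uniformly distributed. The following is a minimal base R sketch of that check. It is separate from the simulation code above, and the seed, the 200 repetitions, and the variable names are arbitrary assumptions.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(2)  # arbitrary seed
n <- 100     # rows per dataset
p <- 50      # pure-noise predictors

# Number of predictors passing the 25% screening in each repetition
n_selected <- replicate(200, {
  X <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)
  pvals <- summary(lm(y ~ X + 0))$coefficients[, "Pr(>|t|)"]
  sum(pvals < 0.25)
})

# The average should be close to 0.25 * p = 12.5
mean(n_selected)
</code></pre></div></div>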
<p>Finally let’s look at the results. The figure shows the distributions of the considered model statistics across the 1000 repetitions for model fits with and without data reuse (the code producing this figure is given at the bottom of this post):</p>
<p><img src="/images/2017-6-5-Freedmans_paradox/unnamed-chunk-5-1.png" alt="plot of chunk unnamed-chunk-5" /></p>
<p>Well, the R squared statistic shows a moderate change between models with and without data reuse (average of 0.3093018 with vs. 0.5001641 without). The F test statistic, however, grows immensely to an average of 3.2806118 (from 1.0480097), and the p-values fall after data reuse to an average of 0.0112216 (from 0.5017696), below the widely used (but arbitrary) 5% significance level.</p>
<p>Obviously the model with data reuse is highly misleading here, because in fact there are absolutely no relationships between the predictor variables and the response (as per the data generation procedure).</p>
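<p>For the full models without data reuse, the averages can be compared with simple reference values that hold when the response is pure noise. With 100 rows, 50 predictors, and no intercept, the R squared has mean 50/100 = 0.5, the F statistic follows an F distribution with 50 and 50 degrees of freedom and has mean 50/48 ≈ 1.04, and the p-value is uniform with mean 0.5. The base R sketch below computes these values and checks them with a small Monte Carlo run. It is independent of the simulation code in this post, and the seed and the 200 repetitions are arbitrary assumptions.</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code>n <- 100  # rows per dataset
p <- 50   # predictors (no intercept)

# Theoretical means when the response is independent of the predictors
c(r.squared = p / n,                  # mean of Beta(p/2, (n - p)/2)
  statistic = (n - p) / (n - p - 2),  # mean of F(p, n - p)
  p.value   = 0.5)                    # p-values are uniform on [0, 1]

# Small Monte Carlo check (arbitrary seed, 200 repetitions)
set.seed(1)
sim <- replicate(200, {
  X <- matrix(rnorm(n * p), n, p)
  y <- rnorm(n)
  s <- summary(lm(y ~ X + 0))
  f <- unname(s$fstatistic)  # value, numerator df, denominator df
  c(r.squared = s$r.squared,
    statistic = f[1],
    p.value   = pf(f[1], f[2], f[3], lower.tail = FALSE))
})
rowMeans(sim)
</code></pre></div></div>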
<p>In fact, Freedman derived asymptotic estimates for the magnitudes of change in the considered model statistics, and they indeed match the above observations. However I’m too lazy to summarize them here. So I refer to <a href="http://www.public.asu.edu/~gasweete/crj604/readings/1983-Freedman%20(Screening%20Regression%20Equations).pdf">the primary source</a>.</p>
<hr />
<p>This code generates the above figure:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model_statistics</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">select</span><span class="p">(</span><span class="n">r.squared</span><span class="p">,</span><span class="w"> </span><span class="n">p.value</span><span class="p">,</span><span class="w"> </span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">repetition</span><span class="p">,</span><span class="w"> </span><span class="n">data_reuse</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">data_reuse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">data_reuse</span><span class="p">,</span><span class="w"> </span><span class="s2">"With Data Reuse"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Without Data Reuse"</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">data_reuse</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">data_reuse</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"Without Data Reuse"</span><span class="p">,</span><span class="w"> </span><span class="s2">"With Data Reuse"</span><span class="p">),</span><span class="w">
</span><span class="n">ordered</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">rename</span><span class="p">(</span><span class="s2">"F-statistic"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="s2">"p-value"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">p.value</span><span class="p">,</span><span class="w"> </span><span class="s2">"R squared"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r.squared</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">gather</span><span class="p">(</span><span class="n">stat</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">repetition</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="n">data_reuse</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stat</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">value</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_violin</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stat</span><span class="p">),</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"width"</span><span class="p">,</span><span class="w"> </span><span class="n">draw_quantiles</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.25</span><span class="p">,</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">0.75</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_hline</span><span class="p">(</span><span class="n">yintercept</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="n">linetype</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">data_reuse</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_linedraw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0.05</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="n">paste</span><span class="p">(</span><span class="n">n_rep</span><span class="p">,</span><span class="w"> </span><span class="s2">"repetitions of an LM fit with"</span><span class="p">,</span><span class="w"> </span><span class="n">n_row</span><span class="p">,</span><span class="w"> </span><span class="s2">"rows,"</span><span class="p">,</span><span class="w"> </span><span class="n">n_col</span><span class="p">,</span><span class="w"> </span><span class="s2">"columns"</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>Alexej GossmannRecently I came across the classical 1983 paper A note on screening regression equations by David Freedman. Freedman shows in an impressive way the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into another procedure performed on the same data. As a concrete example Freedman considers the practice of performing variable selection first, and then fitting another model using only the identified variables on the same data that was used to identify them in the first place. Because of the unexpectedly high severity of the problem this phenomenon became known as “Freedman’s paradox”. Moreover, in his paper Freedman derives asymptotic estimates for the resulting errors.5 ways to measure running time of R code2017-05-28T00:09:00-04:002017-05-28T00:09:00-04:00https://www.alexejgossmann.com/benchmarking_r<p>A reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available.</p>
<p>A quick online search revealed at least three R packages for benchmarking R code (<code class="highlighter-rouge">rbenchmark</code>, <code class="highlighter-rouge">microbenchmark</code>, and <code class="highlighter-rouge">tictoc</code>). Additionally, base R provides at least two methods to measure the running time of R code (<code class="highlighter-rouge">Sys.time</code> and <code class="highlighter-rouge">system.time</code>). In the following I briefly go through the syntax of each of the five options, and present my conclusions at the end.</p>
<h3 id="1-using-systime">1. Using “Sys.time”</h3>
<p>The run time of a chunk of code can be measured by taking the difference between the time at the start and at the end of the code chunk. Simple yet flexible :sunglasses:.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sleep_for_a_minute</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">()</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">Sys.sleep</span><span class="p">(</span><span class="m">60</span><span class="p">)</span><span class="w"> </span><span class="p">}</span><span class="w">
</span><span class="n">start_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
</span><span class="n">sleep_for_a_minute</span><span class="p">()</span><span class="w">
</span><span class="n">end_time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Sys.time</span><span class="p">()</span><span class="w">
</span><span class="n">end_time</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">start_time</span><span class="w">
</span><span class="c1"># Time difference of 1.000327 mins</span><span class="w">
</span></code></pre></div></div>
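<p>Note that the printed difference picks its own unit (minutes in the output above), which is awkward when the timing is to be stored or compared. Here is a minimal sketch that fixes the unit to seconds with base R’s <code class="highlighter-rouge">difftime</code>; the function <code class="highlighter-rouge">short_task</code> is a made-up stand-in for the code chunk being timed.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical stand-in for the code chunk to be timed.
short_task <- function() { Sys.sleep(0.2) }

start_time <- Sys.time()
short_task()
end_time <- Sys.time()

# difftime() lets us choose the unit explicitly instead of relying on
# the automatic unit selection of `end_time - start_time`.
elapsed_secs <- as.numeric(difftime(end_time, start_time, units = "secs"))
elapsed_secs
</code></pre></div></div>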
<h3 id="2-library-tictoc">2. Library “tictoc”</h3>
<p>The functions <code class="highlighter-rouge">tic</code> and <code class="highlighter-rouge">toc</code> are used for benchmarking in the same manner as the <code class="highlighter-rouge">Sys.time</code> approach just demonstrated. However, <code class="highlighter-rouge">tictoc</code> makes the whole process considerably more convenient.</p>
<p>The most recent development version of <code class="highlighter-rouge">tictoc</code> can be installed from github:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"jabiru/tictoc"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>One can time a single code chunk:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tictoc</span><span class="p">)</span><span class="w">
</span><span class="n">tic</span><span class="p">(</span><span class="s2">"sleeping"</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"falling asleep..."</span><span class="p">)</span><span class="w">
</span><span class="n">sleep_for_a_minute</span><span class="p">()</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="s2">"...waking up"</span><span class="p">)</span><span class="w">
</span><span class="n">toc</span><span class="p">()</span><span class="w">
</span><span class="c1"># [1] "falling asleep..."</span><span class="w">
</span><span class="c1"># [1] "...waking up"</span><span class="w">
</span><span class="c1"># sleeping: 60.026 sec elapsed</span><span class="w">
</span></code></pre></div></div>
<p>Or nest multiple timers:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tic</span><span class="p">(</span><span class="s2">"total"</span><span class="p">)</span><span class="w">
</span><span class="n">tic</span><span class="p">(</span><span class="s2">"data generation"</span><span class="p">)</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="m">50000</span><span class="o">*</span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="m">50000</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">50000</span><span class="p">)</span><span class="w">
</span><span class="n">toc</span><span class="p">()</span><span class="w">
</span><span class="n">tic</span><span class="p">(</span><span class="s2">"model fitting"</span><span class="p">)</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="n">toc</span><span class="p">()</span><span class="w">
</span><span class="n">toc</span><span class="p">()</span><span class="w">
</span><span class="c1"># data generation: 3.792 sec elapsed</span><span class="w">
</span><span class="c1"># model fitting: 39.278 sec elapsed</span><span class="w">
</span><span class="c1"># total: 43.071 sec elapsed</span><span class="w">
</span></code></pre></div></div>
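<p>When many steps are timed, it can be handy to collect the timings instead of printing them one by one. The following is only a sketch: it assumes that the installed version of <code class="highlighter-rouge">tictoc</code> provides <code class="highlighter-rouge">tic.clearlog</code>, <code class="highlighter-rouge">tic.log</code>, and the <code class="highlighter-rouge">log</code> and <code class="highlighter-rouge">quiet</code> arguments of <code class="highlighter-rouge">toc</code>, and the loop body is a made-up workload.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(tictoc)

tic.clearlog()                    # start from an empty log
for (i in 1:3) {
  tic(paste("step", i))
  Sys.sleep(0.1)                  # hypothetical workload
  toc(log = TRUE, quiet = TRUE)   # record the timing without printing it
}

# One formatted message per timed step.
timings <- unlist(tic.log(format = TRUE))
timings
</code></pre></div></div>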
<h3 id="3-using-systemtime">3. Using “system.time”</h3>
<p>One can time the evaluation of an R expression using <code class="highlighter-rouge">system.time</code>. For example, we can use it to measure the execution time of the function <code class="highlighter-rouge">sleep_for_a_minute</code> (defined above) as follows.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">system.time</span><span class="p">({</span><span class="w"> </span><span class="n">sleep_for_a_minute</span><span class="p">()</span><span class="w"> </span><span class="p">})</span><span class="w">
</span><span class="c1"># user system elapsed</span><span class="w">
</span><span class="c1"># 0.004 0.000 60.051</span><span class="w">
</span></code></pre></div></div>
<p><strong>But what exactly are the reported times <code class="highlighter-rouge">user</code>, <code class="highlighter-rouge">system</code>, and <code class="highlighter-rouge">elapsed</code>?</strong> :confused:</p>
<p>Well, clearly <code class="highlighter-rouge">elapsed</code> is the wall clock time taken to execute the function <code class="highlighter-rouge">sleep_for_a_minute</code>, plus the overhead of the benchmarking code wrapping it (which is presumably why it took slightly more than a minute to run).</p>
<p>As for <code class="highlighter-rouge">user</code> and <code class="highlighter-rouge">system</code> times, <a href="http://r.789695.n4.nabble.com/Meaning-of-proc-time-td2303263.html#a2306691">William Dunlap has posted a great explanation to the r-help mailing list</a>:</p>
<blockquote>
<p>“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.</p>
</blockquote>
<p>:grinning:</p>
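<p>To see the difference between CPU time and wall clock time in practice, one can compare a chunk that merely waits with one that computes the whole time. This is a small sketch with made-up workloads; the exact numbers will differ from machine to machine, but the sleeping chunk should show almost no <code class="highlighter-rouge">user</code> time, while for the loop <code class="highlighter-rouge">user</code> and <code class="highlighter-rouge">elapsed</code> should be close.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sleeping: the process waits, so almost no CPU time is used.
t_sleep <- system.time(Sys.sleep(0.5))

# Hypothetical CPU-bound workload: the process computes the whole time.
t_cpu <- system.time({
  s <- 0
  for (i in 1:5e6) s <- s + sqrt(i)
})

# Put the three reported times side by side.
rbind(sleep = t_sleep[c("user.self", "sys.self", "elapsed")],
      cpu   = t_cpu[c("user.self", "sys.self", "elapsed")])
</code></pre></div></div>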
<h3 id="4-library-rbenchmark">4. Library “rbenchmark”</h3>
<p>The documentation to the function <code class="highlighter-rouge">benchmark</code> from the <code class="highlighter-rouge">rbenchmark</code> R package describes it as “a simple wrapper around <code class="highlighter-rouge">system.time</code>”. However it adds a lot of convenience compared to bare <code class="highlighter-rouge">system.time</code> calls. For example it requires just one <code class="highlighter-rouge">benchmark</code> call to time multiple replications of multiple expressions. Additionally the returned results are conveniently organized in a data frame.</p>
<p>I installed the development version of the <code class="highlighter-rouge">rbenchmark</code> package from github:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"eddelbuettel/rbenchmark"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>For example purposes, let’s compare the time required to compute linear regression coefficients using three alternative computational procedures:</p>
<ol>
<li><code class="highlighter-rouge">lm</code>,</li>
<li>the Moore-Penrose pseudoinverse,</li>
<li>the Moore-Penrose pseudoinverse but without explicit matrix inverses.</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">rbenchmark</span><span class="p">)</span><span class="w">
</span><span class="n">benchmark</span><span class="p">(</span><span class="s2">"lm"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="o">$</span><span class="n">coef</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"pseudoinverse"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">y</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"linear system"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="m">1000</span><span class="p">),</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="n">replications</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w">
</span><span class="n">columns</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"test"</span><span class="p">,</span><span class="w"> </span><span class="s2">"replications"</span><span class="p">,</span><span class="w"> </span><span class="s2">"elapsed"</span><span class="p">,</span><span class="w">
</span><span class="s2">"relative"</span><span class="p">,</span><span class="w"> </span><span class="s2">"user.self"</span><span class="p">,</span><span class="w"> </span><span class="s2">"sys.self"</span><span class="p">))</span><span class="w">
</span><span class="c1"># test replications elapsed relative user.self sys.self</span><span class="w">
</span><span class="c1"># 3 linear system 1000 0.167 1.000 0.208 0.240</span><span class="w">
</span><span class="c1"># 1 lm 1000 0.930 5.569 0.952 0.212</span><span class="w">
</span><span class="c1"># 2 pseudoinverse 1000 0.240 1.437 0.332 0.612</span><span class="w">
</span></code></pre></div></div>
<p>Here, the meaning of <code class="highlighter-rouge">elapsed</code>, <code class="highlighter-rouge">user.self</code>, and <code class="highlighter-rouge">sys.self</code> is the same as described above in the section about <code class="highlighter-rouge">system.time</code> (with the times summed over all 1000 replications), and <code class="highlighter-rouge">relative</code> is simply the ratio of each test’s elapsed time to that of the fastest test. Interestingly, <code class="highlighter-rouge">lm</code> is by far the slowest here.</p>
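<p>The <code class="highlighter-rouge">relative</code> column can be reproduced by hand from the <code class="highlighter-rouge">elapsed</code> column. The snippet below simply copies the elapsed times from the output above and divides them by the smallest one.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Elapsed times (in seconds) copied from the benchmark output above.
elapsed <- c("linear system" = 0.167, "lm" = 0.930, "pseudoinverse" = 0.240)

# 'relative' is each elapsed time divided by the smallest elapsed time.
relative <- round(elapsed / min(elapsed), 3)
relative
# linear system            lm pseudoinverse
#         1.000         5.569         1.437
</code></pre></div></div>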
<h3 id="5-library-microbenchmark">5. Library “microbenchmark”</h3>
<p>The most recent development version of <code class="highlighter-rouge">microbenchmark</code> can be installed from github:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"olafmersmann/microbenchmarkCore"</span><span class="p">)</span><span class="w">
</span><span class="n">devtools</span><span class="o">::</span><span class="n">install_github</span><span class="p">(</span><span class="s2">"olafmersmann/microbenchmark"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Much like <code class="highlighter-rouge">benchmark</code> from the package <code class="highlighter-rouge">rbenchmark</code>, the function <code class="highlighter-rouge">microbenchmark</code> can be used to compare running times of multiple R code chunks. However, it offers considerably more convenience and additional functionality.</p>
<p>I find that one particularly nice feature of <code class="highlighter-rouge">microbenchmark</code> is the ability to automatically check the results of the benchmarked expressions with a user-specified function. This is demonstrated below, where we again compare three methods computing the coefficient vector of a linear model.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">microbenchmark</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">2017</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="o">*</span><span class="n">p</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">p</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">check_for_equal_coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">values</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">tol</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1e-12</span><span class="w">
</span><span class="n">max_error</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">max</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">values</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">values</span><span class="p">[[</span><span class="m">2</span><span class="p">]]),</span><span class="w">
</span><span class="nf">abs</span><span class="p">(</span><span class="n">values</span><span class="p">[[</span><span class="m">2</span><span class="p">]]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">values</span><span class="p">[[</span><span class="m">3</span><span class="p">]]),</span><span class="w">
</span><span class="nf">abs</span><span class="p">(</span><span class="n">values</span><span class="p">[[</span><span class="m">1</span><span class="p">]]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">values</span><span class="p">[[</span><span class="m">3</span><span class="p">]])))</span><span class="w">
</span><span class="n">max_error</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">tol</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">mbm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">microbenchmark</span><span class="p">(</span><span class="s2">"lm"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="o">$</span><span class="n">coef</span><span class="w"> </span><span class="p">},</span><span class="w">
</span><span class="s2">"pseudoinverse"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">y</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="s2">"linear system"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="n">check</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">check_for_equal_coefs</span><span class="p">)</span><span class="w">
</span><span class="n">mbm</span><span class="w">
</span><span class="c1"># Unit: milliseconds</span><span class="w">
</span><span class="c1"># expr min lq mean median uq max neval cld</span><span class="w">
</span><span class="c1"># lm 96.12717 124.43298 150.72674 135.12729 188.32154 236.4910 100 c</span><span class="w">
</span><span class="c1"># pseudoinverse 26.61816 28.81151 53.32246 30.69587 80.61303 145.0489 100 b</span><span class="w">
</span><span class="c1"># linear system 16.70331 18.58778 35.14599 19.48467 22.69537 138.6660 100 a</span><span class="w">
</span></code></pre></div></div>
<p>We used the function argument <code class="highlighter-rouge">check</code> to check for equality (up to a maximal absolute error of <code class="highlighter-rouge">1e-12</code>) of the results returned by the three methods. If the results weren’t equal, <code class="highlighter-rouge">microbenchmark</code> would stop with an error.</p>
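<p>The check function above is written for exactly three expressions. As a possible variation (my own sketch, not something provided by the package), the following hypothetical helper <code class="highlighter-rouge">check_all_close</code> compares every result with the first one, for any number of expressions. Note that it relies on base R’s <code class="highlighter-rouge">all.equal</code>, which uses a relative rather than an absolute tolerance, and that the default tolerance of <code class="highlighter-rouge">1e-8</code> is an arbitrary choice.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Generic check: every result must match the first one up to a tolerance.
# `values` is the list of results, one per benchmarked expression.
check_all_close <- function(values, tol = 1e-8) {
  # Drop names and dimensions so that vectors and one-column matrices compare.
  ref <- unname(as.vector(values[[1]]))
  all(vapply(values[-1], function(v) {
    isTRUE(all.equal(ref, unname(as.vector(v)), tolerance = tol))
  }, logical(1)))
}

# Quick self-test on made-up values (a named vector and a one-column matrix).
ok  <- check_all_close(list(c(a = 1, b = 2), matrix(c(1, 2), ncol = 1)))
bad <- check_all_close(list(c(a = 1, b = 2), matrix(c(1, 2.1), ncol = 1)))
c(ok = ok, bad = bad)
# ok   bad
# TRUE FALSE
</code></pre></div></div>
<p>It would be passed to <code class="highlighter-rouge">microbenchmark</code> in the same way, via <code class="highlighter-rouge">check = check_all_close</code>.</p>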
<p>Another great feature is the integration with <code class="highlighter-rouge">ggplot2</code> for plotting <code class="highlighter-rouge">microbenchmark</code> results.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">autoplot</span><span class="p">(</span><span class="n">mbm</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><img src="/images/mbm.png" alt="Microbenchmark results plot" /></p>
<h3 id="conclusion">Conclusion</h3>
<p>This demonstration of the different benchmarking functions is surely not exhaustive. Nevertheless, I drew some conclusions for my personal benchmarking needs:</p>
<ul>
<li>The <code class="highlighter-rouge">Sys.time</code> approach as well as the <code class="highlighter-rouge">tictoc</code> package can be used for timing (potentially nested) steps of a complicated algorithm (that’s often my use case). However, <code class="highlighter-rouge">tictoc</code> is more convenient, and (most importantly) foolproof.</li>
<li>We saw that <code class="highlighter-rouge">microbenchmark</code> returns other types of measurements than <code class="highlighter-rouge">benchmark</code>, and I think that in most situations the <code class="highlighter-rouge">microbenchmark</code> measurements are of a higher practical significance :stuck_out_tongue:.</li>
<li>To my knowledge <code class="highlighter-rouge">microbenchmark</code> is the only benchmarking package that has visualizations built in :+1:.</li>
</ul>
<p>For these reasons I will go with <code class="highlighter-rouge">microbenchmark</code> and <code class="highlighter-rouge">tictoc</code>. :bowtie:</p>Alexej GossmannA reviewer asked me to report detailed running times for all (so many :scream:) performed computations in one of my papers, and so I spent a Saturday morning figuring out my favorite way to benchmark R code. This is a quick summary of the options I found to be available.The Lean PhD Student - Can The Lean Startup principles be applied to personal productivity in graduate school?2017-05-09T21:40:00-04:002017-05-09T21:40:00-04:00https://www.alexejgossmann.com/the-lean-phd-student<p>The lean startup methodology consists of a set of principles that were proposed and popularized by Eric Ries in the book <em>The Lean Startup</em> (and elsewhere). He believes that startup success can be engineered by following the lean startup methodology. Eric Ries defines a startup as <em>“a human institution designed to deliver a new product or service under conditions of extreme uncertainty”</em>. If we replace “product or service” by “research result”, that sounds awfully similar to what a PhD student has to do. Indeed, the similarities between being a junior researcher, such as a PhD student, and running a startup have been often pointed out (for example: <a href="http://pgbovine.net/research-and-startup.htm">[1]</a>, <a href="http://www.bench2business.com/5-ways-start-up-life-is-similar-to-academia/">[2]</a>, <a href="https://pragmaticstartup.wordpress.com/2011/02/21/similarities-between-an-entrepreneur-and-an-academic/">[3]</a>). In light of this, I propose that <em>the lean startup methodology</em> can also be applied to academic pursuits of a PhD student. Below, I adapt some of the most important lean startup concepts for application to a junior researcher’s personal productivity and academic success.<sup id="fnref:nonexperimental"><a href="#fn:nonexperimental" class="footnote">1</a></sup></p>
<p>If we want to carry over startup concepts to academic research, then the first (and most obvious) question is: what would be the “product”, and who would be the “customer”, of the PhD student? I think the analogy here is quite straightforward. The “products” of a PhD student clearly are the student’s peer-reviewed publications, conference presentations, the dissertation, software releases, etc.; and the “customers” are other researchers and, to a much smaller extent, the general public. An especially important set of (quite often tough) “customers” includes the journal or conference paper reviewers and editors, and the student’s committee members.</p>
<h3 id="build---measure---learn">Build - Measure - Learn</h3>
<p>At the center of the lean startup methodology is the so-called <em>build-measure-learn</em> feedback loop. One of the main goals of the lean startup methodology is to minimize the time (and other resources) required to pass through the build-measure-learn feedback loop, and to maximize the number of times that the build-measure-learn loop is completed. Its adaptation to academic research would be something like the following.</p>
<h4 id="1-build">1. Build</h4>
<p>:hammer: Start with a novel idea, whose good execution you assume to be valuable to your scientific audience, and then share a <em>minimally viable</em> execution of the idea with members of your audience.</p>
<p>The concept of a <em>minimum viable product</em> (or MVP) is especially important during this stage of the lean startup trajectory, because it minimizes the time spent here. <em>The Lean Startup</em> defines the MVP as the “version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort”. Analogously, I think that a <em>minimum viable research result</em> can, for example, consist of an exploration of the main idea on small samples, toy problems, and special cases, designed so that the researcher obtains a sufficient amount of feedback on their idea with the least effort.</p>
<h4 id="2-measure">2. Measure</h4>
<p>:triangular_ruler: Observe how other researchers react to your idea and its minimally viable execution.</p>
<p>In this step it is important to use so-called <em>actionable metrics</em>, as opposed to <em>vanity metrics</em>. Actionable metrics accurately reflect the key success factors of the project, while vanity metrics are measurements that give “the rosiest picture possible”. With regard to academic research, actionable metrics may include (this is not an exhaustive list):</p>
<ul>
<li>Direct feedback from researchers who you trust.</li>
<li>Others applying your work to their own research.</li>
<li>Major contributions to high-quality peer-reviewed publications.</li>
</ul>
<p>And (academic) vanity metrics may include:</p>
<ul>
<li>Co-authorships on other people’s papers, while having contributed only slightly more than nothing.</li>
<li>Association or acquaintance with a “big name” scientist.</li>
<li>Number of views of a researcher’s homepage, or paper view count on some online platform.</li>
<li>Appearance in mainstream media.</li>
</ul>
<p>Measuring the right metrics is a big part of what Eric Ries calls <em>innovation accounting</em>.</p>
<h4 id="3-learn">3. Learn</h4>
<p>:bulb: Learn how valuable your audience actually considers your idea to be based on the received feedback and your actionable metrics of choice. Utilize that new knowledge to improve the initial idea in order to make it more valuable to a targeted scientific audience, and adjust your assumptions about what your audience needs.</p>
<p>In <em>The Lean Startup</em>, this kind of modification of your initial idea is called a <em>pivot</em>, or in Eric Ries’ words: “A pivot is a structured course correction designed to test a new fundamental hypothesis about the product, strategy, and engine of growth.” With the corrected research idea in hand, you go back to step 1 and repeat the whole process.</p>
<h3 id="conclusion">Conclusion</h3>
<p>So, what do we get out of all of this? I think that a clear strategy emerges here.</p>
<p>The strategy is to push results out fast in order to receive feedback fast, and to evaluate that feedback according to a suitable set of actionable metrics chosen in advance. That is, one needs to write papers fast, initially without worrying about things outside the scope of an MVP (such as the perfect word choice, the optimal formatting, or coverage of all corner cases), in order to quickly obtain and measure feedback from members of the target scientific audience. Then the ideas are improved according to what was learned, and the process is repeated. One of the major goals of the PhD student should be to minimize the time required to pass through this loop.</p>
<p>So is this a good strategy for a PhD student? Well, I can’t say before I try it out :stuck_out_tongue:. One crucial factor not mentioned here though is the PhD advisor. In my case I have a lot of freedom to come up with my own projects and pursue my own ideas as long as they are within a specific (but somewhat loosely-defined) area, and I could totally incorporate this lean startup inspired research strategy into my work. On the other extreme, there are professors who micromanage their PhD student’s every step, in which case the PhD student will find it much harder to experiment with their research strategy.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:nonexperimental">
<p>Please note that I’m writing from the point of view of mathematical, statistical, and computational sciences, rather than from the viewpoint of experimental sciences. <a href="#fnref:nonexperimental" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Alexej GossmannThe lean startup methodology consists of a set of principles that were proposed and popularized by Eric Ries in the book The Lean Startup (and elsewhere). He believes that startup success can be engineered by following the lean startup methodology. Eric Ries defines a startup as “a human institution designed to deliver a new product or service under conditions of extreme uncertainty”. If we replace “product or service” by “research result”, that sounds awfully similar to what a PhD student has to do. Indeed, the similarities between being a junior researcher, such as a PhD student, and running a startup have been often pointed out (for example: [1], [2], [3]). In light of this, I propose that the lean startup methodology can also be applied to academic pursuits of a PhD student. Below, I adapt some of the most important lean startup concepts for application to a junior researcher’s personal productivity and academic success.1 Please note that I’m writing from the point of view of mathematical, statistical, and computational sciences, rather than from the viewpoint of experimental sciences. ↩Salaries by alma mater - an interactive visualization with R and plotly2017-04-28T00:08:00-04:002017-04-28T00:08:00-04:00https://www.alexejgossmann.com/salaries_by_school_plotly_viz<p><img src="/images/salaries_by_college/starting_salary.png" alt="Visualization of starting salaries by college" /></p>
<p>Based on <a href="http://online.wsj.com/public/resources/documents/info-Salaries_for_Colleges_by_Region-sort.html">an interesting dataset from the Wall Street Journal</a> I made the above visualization of the median starting salary for US college graduates from different undergraduate institutions (I have also looked at the mid-career salaries, and the salary increase, but more on that later). However, I thought that it would be a lot more informative if it were <em>interactive</em>. At the very least, I wanted to be able to see the school names when hovering over or clicking on the points with the mouse.</p>
<p>Luckily, this kind of interactivity can be easily achieved in R with the library <a href="https://cran.r-project.org/package=plotly"><code class="highlighter-rouge">plotly</code></a>, especially due to its excellent integration with <a href="https://cran.r-project.org/package=ggplot2"><code class="highlighter-rouge">ggplot2</code></a>, which I used to produce the above figure. In the following I describe how exactly this can be done.</p>
<p>Before I show you the interactive visualizations, a few words on the data preprocessing, and on how the map and the points are plotted with <code class="highlighter-rouge">ggplot2</code>:</p>
<ul>
<li>I generally use functions from the <a href="http://tidyverse.org/">tidyverse</a> R packages.</li>
<li>I save the data in the data frame <code class="highlighter-rouge">salaries</code>, and transform the given amounts to proper floating point numbers, stripping the dollar signs and extra whitespace.</li>
<li>The data provide school names. However, I need to find out the exact geographical coordinates of each school to put it on the map. This can be done in a very convenient way, by using the <code class="highlighter-rouge">geocode</code> function from the <a href="https://cran.r-project.org/package=ggmap"><code class="highlighter-rouge">ggmap</code></a> R package:
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">school_longlat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">geocode</span><span class="p">(</span><span class="n">salaries</span><span class="o">$</span><span class="n">school</span><span class="p">)</span><span class="w">
</span><span class="n">school_longlat</span><span class="o">$</span><span class="n">school</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">salaries</span><span class="o">$</span><span class="n">school</span><span class="w">
</span><span class="n">salaries</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">left_join</span><span class="p">(</span><span class="n">salaries</span><span class="p">,</span><span class="w"> </span><span class="n">school_longlat</span><span class="p">)</span><span class="w">
</span></code></pre></div> </div>
</li>
<li>For the visualization I want to disregard the colleges in Alaska and Hawaii to avoid shrinking the rest of the map. The respective rows of <code class="highlighter-rouge">salaries</code> can be easily determined with a <code class="highlighter-rouge">grep</code> search:
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grep</span><span class="p">(</span><span class="s2">"alaska"</span><span class="p">,</span><span class="w"> </span><span class="n">salaries</span><span class="o">$</span><span class="n">school</span><span class="p">,</span><span class="w"> </span><span class="n">ignore.case</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] 206</span><span class="w">
</span><span class="n">grep</span><span class="p">(</span><span class="s2">"hawaii"</span><span class="p">,</span><span class="w"> </span><span class="n">salaries</span><span class="o">$</span><span class="n">school</span><span class="p">,</span><span class="w"> </span><span class="n">ignore.case</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] 226</span><span class="w">
</span></code></pre></div> </div>
</li>
<li>A data frame containing geographical data that can be used to plot the outline of all US states can be loaded using the function <code class="highlighter-rouge">map_data</code> from the <code class="highlighter-rouge">ggplot2</code> package:
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">states</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">map_data</span><span class="p">(</span><span class="s2">"state"</span><span class="p">)</span><span class="w">
</span></code></pre></div> </div>
</li>
<li>And I load a yellow-orange-red palette with the function <code class="highlighter-rouge">brewer.pal</code> from the <a href="http://colorbrewer2.org"><code class="highlighter-rouge">RColorBrewer</code> library</a>, to use as a scale for the salary amounts:
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">yor_col</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="s2">"YlOrRd"</span><span class="p">)</span><span class="w">
</span></code></pre></div> </div>
</li>
<li>Finally, the (not yet interactive) visualization is created with <code class="highlighter-rouge">ggplot2</code>:
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">salaries</span><span class="p">[</span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">206</span><span class="p">,</span><span class="w"> </span><span class="m">226</span><span class="p">),</span><span class="w"> </span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_polygon</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">long</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">group</span><span class="p">),</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">states</span><span class="p">,</span><span class="w"> </span><span class="n">fill</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"black"</span><span class="p">,</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"white"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lon</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lat</span><span class="p">,</span><span class="w">
</span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">starting</span><span class="p">,</span><span class="w"> </span><span class="n">text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">school</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">coord_fixed</span><span class="p">(</span><span class="m">1.3</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_color_gradientn</span><span class="p">(</span><span class="n">name</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Starting\nSalary"</span><span class="p">,</span><span class="w">
</span><span class="n">colors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rev</span><span class="p">(</span><span class="n">yor_col</span><span class="p">),</span><span class="w">
</span><span class="n">labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">comma</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">guides</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme_bw</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">axis.text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.line</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.ticks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">panel.border</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">panel.grid</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">(),</span><span class="w">
</span><span class="n">axis.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_blank</span><span class="p">())</span><span class="w">
</span></code></pre></div> </div>
</li>
</ul>
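<p>As an aside, the dollar-amount cleanup and the Alaska/Hawaii filtering described in the list above could be sketched in base R roughly as follows. This is only a minimal illustration under assumptions, not the script used for the figure: the toy data frame <code class="highlighter-rouge">salaries_raw</code>, its made-up school names and values, and the helper <code class="highlighter-rouge">parse_dollars</code> are all hypothetical.</p>

```r
# Hypothetical toy data (made-up schools and amounts), mimicking
# salary strings with dollar signs, commas, and stray whitespace.
salaries_raw <- data.frame(
  school   = c("Example College of Alaska", "Example State University",
               "Example University of Hawaii"),
  starting = c("$45,400.00 ", " $47,100.00", "$40,600.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop "$", "," and whitespace, then convert to numeric.
parse_dollars <- function(x) as.numeric(gsub("[$,[:space:]]", "", x))

salaries_toy <- transform(salaries_raw, starting = parse_dollars(starting))

# Logical filter instead of hard-coded row numbers:
# keep only schools whose name mentions neither Alaska nor Hawaii.
keep <- !grepl("alaska|hawaii", salaries_toy$school, ignore.case = TRUE)
salaries_contiguous <- salaries_toy[keep, ]
```

<p>A logical filter of this kind stays valid if the row order of the data changes, which is not the case for fixed row indices. It does assume that the state name appears in the school name, just like the <code class="highlighter-rouge">grep</code> search above.</p>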
<p>Now, entering <code class="highlighter-rouge">p</code> into the R console will generate the figure shown at the top of this post.</p>
<p>However, we want to…</p>
<h2 id="make-it-interactive">…make it interactive</h2>
<p>The function <code class="highlighter-rouge">ggplotly</code> immediately generates a <a href="https://plot.ly/">plotly</a> interactive visualization from a <code class="highlighter-rouge">ggplot</code> object. It’s that simple! :smiley: (Though I must admit that, more often than I would be okay with, some elements of the ggplot visualization disappear or don’t look as expected. :fearful:)</p>
<p>The function argument <code class="highlighter-rouge">tooltip</code> can be used to specify which aesthetic mappings from the <code class="highlighter-rouge">ggplot</code> call should be shown in the tooltip. So, the code</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ggplotly</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">tooltip</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"text"</span><span class="p">,</span><span class="w"> </span><span class="s2">"starting"</span><span class="p">),</span><span class="w">
</span><span class="n">width</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">800</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>generates <a href="https://plot.ly/~agisga/13">the following interactive visualization</a>.</p>
<iframe width="800" height="500" frameborder="0" scrolling="no" src="//plot.ly/~agisga/13.embed"></iframe>
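<p>If you want more control over what the tooltip says, one option is to build the whole label in the <code class="highlighter-rouge">text</code> aesthetic and show only that. The following is a small self-contained sketch on made-up data (the data frame <code class="highlighter-rouge">toy</code> and its values are hypothetical, not taken from the salary dataset). Note that <code class="highlighter-rouge">ggplot2</code> will warn about the unknown <code class="highlighter-rouge">text</code> aesthetic, which is expected here.</p>

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data: made-up schools, coordinates, and salaries.
toy <- data.frame(
  school   = c("School A", "School B", "School C"),
  lon      = c(-90.1, -71.1, -122.3),
  lat      = c(29.9, 42.4, 47.6),
  starting = c(45000, 60000, 52000)
)

# Build the full tooltip label in the `text` aesthetic;
# "<br>" produces a line break in the plotly hover box.
p_toy <- ggplot(toy) +
  geom_point(aes(x = lon, y = lat, color = starting,
                 text = paste0(school, "<br>Starting salary: $",
                               format(starting, big.mark = ","))))

# Show only the custom label when hovering over a point.
w <- ggplotly(p_toy, tooltip = "text")
w
```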
<p>Now, if you want to publish a plotly visualization to <a href="https://plot.ly/">https://plot.ly/</a>, you first need to communicate your account info to the plotly R package:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Sys.setenv</span><span class="p">(</span><span class="s2">"plotly_username"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"??????"</span><span class="p">)</span><span class="w">
</span><span class="n">Sys.setenv</span><span class="p">(</span><span class="s2">"plotly_api_key"</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"????????????"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>and after that, posting the visualization to your account at <a href="https://plot.ly/">https://plot.ly/</a> is as simple as:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plotly_POST</span><span class="p">(</span><span class="n">filename</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Starting"</span><span class="p">,</span><span class="w"> </span><span class="n">sharing</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"public"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<h2 id="more-visualizations">More visualizations</h2>
<p>Finally, based on <a href="http://online.wsj.com/public/resources/documents/info-Salaries_for_Colleges_by_Region-sort.html">the same dataset</a> I have generated an interactive visualization of the median mid-career salaries by undergraduate alma mater (the R script is almost identical to the one described above).
<a href="https://plot.ly/~agisga/15">The resulting interactive visualization</a> is embedded below.</p>
<iframe width="800" height="500" frameborder="0" scrolling="no" src="//plot.ly/~agisga/15.embed"></iframe>
<p>Additionally, it is quite informative to look at a <a href="https://plot.ly/~agisga/11">visualization of the salary increase from starting to mid-career</a>.</p>
<iframe width="800" height="500" frameborder="0" scrolling="no" src="//plot.ly/~agisga/11.embed"></iframe>Alexej GossmannUnderstanding the Tucker decomposition, and compressing tensor-valued data (with R code)2017-04-05T01:50:00-04:002017-04-05T01:50:00-04:00https://www.alexejgossmann.com/tensor_decomposition_tucker<p>In many applications, data naturally form an <em>n</em>-way tensor with <em>n > 2</em>, rather than a “tidy” table.
As mentioned in the beginning of <a href="/tensor_decomposition_CP/">my last blog post</a>, a tensor is essentially a multi-dimensional array:</p>
<ul>
<li>a tensor of order one is a vector, which simply is a column of numbers,</li>
<li>a tensor of order two is a matrix, which is basically numbers arranged in a rectangle,</li>
<li>a tensor of order three looks like numbers arranged in a rectangular box (or a cube, if all modes have the same dimension),</li>
<li>an <em>n</em>th order (or <em>n</em>-way) tensor looks like numbers arranged in an <em>n</em>-hyperrectangle… you get the idea…</li>
</ul>
<p>In this post I introduce the Tucker decomposition (<a href="https://link.springer.com/article/10.1007%2FBF02289464">Tucker (1966) “Some mathematical notes on three-mode factor analysis”</a>). The Tucker decomposition family includes methods such as</p>
<ol>
<li>the <em>higher-order SVD</em>, or HOSVD, which is a generalization of the matrix SVD to tensors (<a href="http://www.sandia.gov/~tgkolda/tdw2004/ldl-94-31.pdf">De Lathauwer, De Moor, and Vandewalle (2000) “A multilinear singular value decomposition”</a>),</li>
<li>the <em>higher-order orthogonal iteration</em>, or HOOI, which delivers the best approximation to a given tensor by another tensor with prescribed mode-1 rank, mode-2 rank, etc. (<a href="http://epubs.siam.org/doi/abs/10.1137/S0895479898346995?journalCode=sjmael">De Lathauwer, De Moor, and Vandewalle (2000) “On the Best Rank-1 and Rank-(R1,R2,…,RN) Approximation of Higher-Order Tensors”</a>).</li>
</ol>
<p>I introduce both approaches, and in order to demonstrate the usefulness of these concepts, I present a simple data compression example using <a href="http://data.worldbank.org/data-catalog/world-development-indicators">The World Bank’s World Development Indicators dataset</a> (though I use the <a href="https://www.kaggle.com/worldbank/world-development-indicators">version available on Kaggle</a>).</p>
<p>However, before we can get started with the decompositions, we need to look at and understand the <em>k</em>-mode tensor product.</p>
<p>Throughout this post, I will also introduce the R functions from the package <a href="https://cran.r-project.org/package=rTensor"><code class="highlighter-rouge">rTensor</code></a>, which can be used to perform all of the presented computations.</p>
<h2 id="tensor-times-matrix-the-k-mode-product">Tensor times matrix: the <em>k</em>-mode product</h2>
<p>The <script type="math/tex">k</script>-mode product of a tensor <script type="math/tex">X \in \mathbb{R}^{I\subscript{1} \times I\subscript{2} \times \ldots \times I\subscript{N}}</script> with a matrix <script type="math/tex">A \in \mathbb{R}^{J \times I\subscript{k}}</script> is written as</p>
<script type="math/tex; mode=display">Y = X \times\subscript{k} A.</script>
<p>The resulting tensor <script type="math/tex">Y</script> is of size <script type="math/tex">I\subscript{1} \times \ldots \times I\subscript{k-1} \times J \times I\subscript{k+1} \times \ldots \times I\subscript{N}</script>, and contains the elements</p>
<script type="math/tex; mode=display">y\subscript{i\subscript{1} \cdots i\subscript{k-1} j i\subscript{k+1} \cdots i\subscript{N}} = \sum\subscript{i\subscript{k} = 1}^{I\subscript{k}} x\subscript{i\subscript{1} i\subscript{2} \cdots i\subscript{N}} a\subscript{ji\subscript{k}}.</script>
<p>It can be hard, at first, to understand what that definition really means, or to visualize it in your mind. I find that it becomes easier once you realize that the <em>k</em>-mode product amounts to multiplying each mode-<em>k</em> fiber of <script type="math/tex">X</script> by the matrix <script type="math/tex">A</script>.</p>
<p>We can demonstrate that in R:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">rTensor</span><span class="p">)</span><span class="w">
</span><span class="n">tnsr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.tensor</span><span class="p">(</span><span class="n">array</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">dim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)))</span><span class="w">
</span><span class="n">mat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># 1-mode product performed via the function ttm in rTensor</span><span class="w">
</span><span class="n">tnsr_times_mat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ttm</span><span class="p">(</span><span class="n">tnsr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tnsr</span><span class="p">,</span><span class="w"> </span><span class="n">mat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mat</span><span class="p">,</span><span class="w"> </span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Now, for example, the first slice of <code class="highlighter-rouge">tnsr_times_mat</code> is the same as the matrix product of <code class="highlighter-rouge">mat</code> with the first slice of <code class="highlighter-rouge">tnsr</code>:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tnsr_times_mat</span><span class="o">@</span><span class="n">data</span><span class="p">[</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="c1"># [,1] [,2]</span><span class="w">
</span><span class="c1"># [1,] 9 19</span><span class="w">
</span><span class="c1"># [2,] 12 26</span><span class="w">
</span><span class="c1"># [3,] 15 33</span><span class="w">
</span><span class="n">mat</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">tnsr</span><span class="o">@</span><span class="n">data</span><span class="p">[</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="c1"># [,1] [,2]</span><span class="w">
</span><span class="c1"># [1,] 9 19</span><span class="w">
</span><span class="c1"># [2,] 12 26</span><span class="w">
</span><span class="c1"># [3,] 15 33</span><span class="w">
</span></code></pre></div></div>
<p>You might want to play around some more with the function <code class="highlighter-rouge">ttm</code> in R to get a better understanding of the <em>k</em>-mode product.</p>
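<p>As one possible starting point, here is a small sketch of a 3-mode product. The tensor <code class="highlighter-rouge">X</code>, the matrix <code class="highlighter-rouge">M</code>, and their sizes are made up for illustration. It checks that every mode-3 fiber of the product equals <code class="highlighter-rouge">M</code> times the matching fiber of <code class="highlighter-rouge">X</code>:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rTensor)
set.seed(1)
# hypothetical 2 x 3 x 4 tensor and a 5 x 4 matrix acting on mode 3
X <- as.tensor(array(rnorm(2 * 3 * 4), dim = c(2, 3, 4)))
M <- matrix(rnorm(5 * 4), 5, 4)
P <- ttm(X, M, m = 3)      # 3-mode product
P@modes                    # mode sizes are now 2, 3, 5
# largest deviation between each fiber P[i, j, ] and M %*% X[i, j, ]
diffs <- sapply(1:2, function(i) sapply(1:3, function(j)
  max(abs(P@data[i, j, ] - as.vector(M %*% X@data[i, j, ])))))
max(diffs) < 1e-12
</code></pre></div></div>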
<p>A few important facts about the <em>k</em>-mode product:</p>
<ul>
<li><script type="math/tex">X \times\subscript{m} A \times\subscript{n} B = X \times\subscript{n} B \times\subscript{m} A</script> if <script type="math/tex">n \neq m</script>,</li>
<li>but <script type="math/tex">X \times\subscript{n} A \times\subscript{n} B = X \times\subscript{n} (BA)</script> (in general <script type="math/tex">\neq X \times\subscript{n} B \times\subscript{n} A</script>).</li>
</ul>
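<p>Both facts are easy to check numerically with <code class="highlighter-rouge">ttm</code>. The following is only a sketch on a random tensor; the matrices <code class="highlighter-rouge">A</code>, <code class="highlighter-rouge">B</code>, <code class="highlighter-rouge">C</code> and their sizes are arbitrary choices of mine:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rTensor)
set.seed(2)
X <- rand_tensor(modes = c(2, 3, 4))
A <- matrix(rnorm(5 * 2), 5, 2)   # acts on mode 1
B <- matrix(rnorm(6 * 3), 6, 3)   # acts on mode 2
C <- matrix(rnorm(4 * 5), 4, 5)   # acts on mode 1, after A

# different modes: the order of the two products does not matter
lhs1 <- ttm(ttm(X, A, m = 1), B, m = 2)
rhs1 <- ttm(ttm(X, B, m = 2), A, m = 1)
commute_ok <- isTRUE(all.equal(lhs1@data, rhs1@data))

# same mode: the two matrices collapse into the single matrix C %*% A
lhs2 <- ttm(ttm(X, A, m = 1), C, m = 1)
rhs2 <- ttm(X, C %*% A, m = 1)
collapse_ok <- isTRUE(all.equal(lhs2@data, rhs2@data))

c(commute_ok, collapse_ok)        # both TRUE, up to numerical error
</code></pre></div></div>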
<h2 id="tucker-decomposition">Tucker decomposition</h2>
<p>The Tucker decomposition (<a href="https://link.springer.com/article/10.1007%2FBF02289464">Tucker (1966)</a>) decomposes a tensor into a core tensor multiplied by a matrix along each mode (i.e., transformed via a <script type="math/tex">k</script>-mode product for every <script type="math/tex">k = 1, 2, \ldots, N</script>):</p>
<script type="math/tex; mode=display">X = G \times\subscript{1} A^{(1)} \times\subscript{2} A^{(2)} \times\subscript{3} \ldots \times\subscript{N} A^{(N)}.</script>
<p>Note that <script type="math/tex">G</script> might be much smaller than the original tensor <script type="math/tex">X</script> if we accept an approximation instead of an exact equality.</p>
<p>In the case of three-way tensors, we can hold on to the following mental image:</p>
<p><img src="/images/Tucker_decomposition/3-modal_Tucker_decomposition.jpg" alt="Figure showing a Tucker decomposition of a 3-way tensor" /></p>
<p>It is interesting to note that the CP decomposition, which I introduced in <a href="/tensor_decomposition_CP/">a previous blog post</a>, is a special case of the Tucker decomposition, where the core tensor <script type="math/tex">G</script> is constrained to be superdiagonal.</p>
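<p>To make the connection concrete, here is a minimal sketch with made-up factor matrices and weights (<code class="highlighter-rouge">R</code>, <code class="highlighter-rouge">lambda</code>, <code class="highlighter-rouge">A</code>, <code class="highlighter-rouge">B</code>, <code class="highlighter-rouge">C</code> are all hypothetical). It builds a superdiagonal core and compares the resulting Tucker product with the CP-style weighted sum of rank-one tensors:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rTensor)
set.seed(3)
R <- 2                      # number of CP components (hypothetical)
lambda <- c(2, 0.5)         # CP weights (hypothetical)
A <- matrix(rnorm(4 * R), 4, R)
B <- matrix(rnorm(5 * R), 5, R)
C <- matrix(rnorm(6 * R), 6, R)

# superdiagonal core: G[r, r, r] = lambda[r], zero elsewhere
G <- array(0, dim = c(R, R, R))
for (r in 1:R) G[r, r, r] <- lambda[r]

# Tucker form: core multiplied by A, B, C along modes 1, 2, 3
X_tucker <- ttl(as.tensor(G), list(A, B, C), ms = 1:3)

# CP form: weighted sum of R rank-one tensors (outer products)
X_cp <- array(0, dim = c(4, 5, 6))
for (r in 1:R) X_cp <- X_cp + lambda[r] * (A[, r] %o% B[, r] %o% C[, r])

max(abs(X_tucker@data - X_cp))   # zero up to rounding error
</code></pre></div></div>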
<h3 id="higher-order-svd-hosvd">Higher-order SVD (HOSVD)</h3>
<p>So, how do you compute the Tucker decomposition?</p>
<p>Many algorithms rely on the following fundamental equivalence:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
Y &=& X \times\subscript{1} A^{(1)} \times\subscript{2} A^{(2)} \times\subscript{3} \ldots \times\subscript{N} A^{(N)} \nonumber\\
\Leftrightarrow\quad Y\subscript{(k)} &=& A^{(k)} X\subscript{(k)} \left(A^{(N)} \otimes \cdots \otimes A^{(k+1)} \otimes A^{(k-1)} \otimes \cdots \otimes A^{(1)} \right)^T. \nonumber
\end{eqnarray} %]]></script>
<p>The above equation uses some notation that has not been introduced yet:</p>
<ul>
<li><script type="math/tex">\otimes</script> denotes the <a href="https://en.wikipedia.org/wiki/Kronecker_product">Kronecker product</a>.</li>
<li>
<p><script type="math/tex">X\subscript{(k)}</script> is the mode-<script type="math/tex">k</script> unfolding (or mode-<script type="math/tex">k</script> matricization) of the tensor <script type="math/tex">X</script>. The mode-<script type="math/tex">k</script> unfolding arranges the mode-<script type="math/tex">k</script> fibers (a <em>fiber</em> is a generalization of <em>column</em> to tensors) of <script type="math/tex">X</script> as columns into a matrix. The concept may be easiest to understand by looking at an example. The following R code shows a 3-way tensor and all three of its mode-<script type="math/tex">k</script> unfoldings (using the <code class="highlighter-rouge">k_unfold</code> function from the <code class="highlighter-rouge">rTensor</code> package):</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tnsr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.tensor</span><span class="p">(</span><span class="n">array</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">12</span><span class="p">,</span><span class="w"> </span><span class="n">dim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)))</span><span class="w">
</span><span class="n">tnsr</span><span class="o">@</span><span class="n">data</span><span class="w">
</span><span class="c1"># , , 1</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># [,1] [,2] [,3]</span><span class="w">
</span><span class="c1"># [1,] 1 3 5</span><span class="w">
</span><span class="c1"># [2,] 2 4 6</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># , , 2</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># [,1] [,2] [,3]</span><span class="w">
</span><span class="c1"># [1,] 7 9 11</span><span class="w">
</span><span class="c1"># [2,] 8 10 12</span><span class="w">
</span><span class="c1"># mode-1 unfolding:</span><span class="w">
</span><span class="n">k_unfold</span><span class="p">(</span><span class="n">tnsr</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">@</span><span class="n">data</span><span class="w">
</span><span class="c1"># [,1] [,2] [,3] [,4] [,5] [,6]</span><span class="w">
</span><span class="c1"># [1,] 1 3 5 7 9 11</span><span class="w">
</span><span class="c1"># [2,] 2 4 6 8 10 12</span><span class="w">
</span><span class="c1"># mode-2 unfolding:</span><span class="w">
</span><span class="n">k_unfold</span><span class="p">(</span><span class="n">tnsr</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="o">@</span><span class="n">data</span><span class="w">
</span><span class="c1"># [,1] [,2] [,3] [,4]</span><span class="w">
</span><span class="c1"># [1,] 1 2 7 8</span><span class="w">
</span><span class="c1"># [2,] 3 4 9 10</span><span class="w">
</span><span class="c1"># [3,] 5 6 11 12</span><span class="w">
</span><span class="c1"># mode-3 unfolding:</span><span class="w">
</span><span class="n">k_unfold</span><span class="p">(</span><span class="n">tnsr</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="o">@</span><span class="n">data</span><span class="w">
</span><span class="c1"># [,1] [,2] [,3] [,4] [,5] [,6]</span><span class="w">
</span><span class="c1"># [1,] 1 2 3 4 5 6</span><span class="w">
</span><span class="c1"># [2,] 7 8 9 10 11 12</span><span class="w">
</span></code></pre></div> </div>
</li>
</ul>
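<p>With this notation in place, the equivalence can be checked numerically. The sketch below does so for mode 2 of a random 3-way tensor, where the right-hand side reduces to <code class="highlighter-rouge">A2 %*% X_(2) %*% t(A3 %x% A1)</code>. The matrices <code class="highlighter-rouge">A1</code>, <code class="highlighter-rouge">A2</code>, <code class="highlighter-rouge">A3</code> are arbitrary, and I am relying on <code class="highlighter-rouge">k_unfold</code> ordering the columns as in the example above (earlier modes vary fastest):</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rTensor)
set.seed(4)
X  <- rand_tensor(modes = c(2, 3, 4))
A1 <- matrix(rnorm(5 * 2), 5, 2)
A2 <- matrix(rnorm(6 * 3), 6, 3)
A3 <- matrix(rnorm(7 * 4), 7, 4)

# left-hand side: multiply along every mode, then unfold along mode 2
Y   <- ttl(X, list(A1, A2, A3), ms = 1:3)
lhs <- k_unfold(Y, 2)@data

# right-hand side: matricized form with the Kronecker product of the other factors
rhs <- A2 %*% k_unfold(X, 2)@data %*% t(kronecker(A3, A1))

dim(lhs)                        # 6 rows, 5 * 7 = 35 columns
isTRUE(all.equal(lhs, rhs))
</code></pre></div></div>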
<p>A straightforward approach to computing the Tucker decomposition is to solve each mode-<script type="math/tex">k</script> matricized form of the Tucker decomposition (shown in the equivalence above) for <script type="math/tex">A^{(k)}</script>. This approach is known as <em>higher order SVD</em>, or HOSVD. It can be regarded as a generalization of the matrix SVD, because the matrices <script type="math/tex">A^{(k)}</script> are orthogonal, while the tensor <script type="math/tex">G</script> is “ordered” and “all-orthogonal” (see <a href="http://www.sandia.gov/~tgkolda/tdw2004/ldl-94-31.pdf">De Lathauwer et al. (2000)</a> for details). The resulting algorithm is shown below.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
&\,&\mathrm{for}\, k = 1, 2, ..., N \,\mathrm{do} \nonumber\\
&\,&\quad A^{(k)} \leftarrow \,\mathrm{left\,orthogonal\,matrix\,of\,SVD\,of}\,X\subscript{(k)} \nonumber\\
&\,&\mathrm{end\,for} \nonumber\\
&\,&G \leftarrow X \times\subscript{1} (A^{(1)})^T \times\subscript{2} (A^{(2)})^T \times\subscript{3} \cdots \times\subscript{N} (A^{(N)})^T \nonumber
\end{eqnarray} %]]></script>
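<p>As a rough sketch, these steps translate almost line by line into R using base <code class="highlighter-rouge">svd</code> together with <code class="highlighter-rouge">k_unfold</code> and <code class="highlighter-rouge">ttl</code>. This is my own hand-rolled version on a small random tensor, meant only to mirror the pseudocode:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rTensor)
set.seed(5)
X <- rand_tensor(modes = c(3, 4, 5))
N <- X@num_modes

# A^(k): left singular vectors of the mode-k unfolding of X
A <- lapply(1:N, function(k) svd(k_unfold(X, k)@data)$u)

# core tensor: multiply X by the transposed factors along every mode
G <- ttl(X, lapply(A, t), ms = 1:N)

# without truncation, the decomposition reproduces X
X_rec   <- ttl(G, A, ms = 1:N)
rel_err <- fnorm(X - X_rec) / fnorm(X)
rel_err                     # of the order of machine precision
</code></pre></div></div>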
<p>In R we can perform HOSVD using the function <code class="highlighter-rouge">hosvd</code> from <code class="highlighter-rouge">rTensor</code>:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tnsr</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rand_tensor</span><span class="p">(</span><span class="n">modes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="m">40</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">))</span><span class="w">
</span><span class="n">hosv_decomp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">hosvd</span><span class="p">(</span><span class="n">tnsr</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Now <code class="highlighter-rouge">hosv_decomp$Z</code> is our core tensor <script type="math/tex">G</script>, and <code class="highlighter-rouge">hosv_decomp$U</code> is a list containing all the matrices <script type="math/tex">A^{(k)}</script>. We can use the function <code class="highlighter-rouge">ttl</code>, which, given a tensor and a list of matrices, performs <em>k</em>-mode products along multiple modes successively, to check that up to numerical error the equation
<script type="math/tex">X = G \times\subscript{1} A^{(1)} \times\subscript{2} A^{(2)} \times\subscript{3} \ldots \times\subscript{N} A^{(N)}</script>
is satisfied:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">HOSVD_prod</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ttl</span><span class="p">(</span><span class="n">hosv_decomp</span><span class="o">$</span><span class="n">Z</span><span class="p">,</span><span class="w"> </span><span class="n">hosv_decomp</span><span class="o">$</span><span class="n">U</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">error</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tnsr</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">HOSVD_prod</span><span class="w">
</span><span class="n">table</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">error</span><span class="o">@</span><span class="n">data</span><span class="p">)</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">1e-12</span><span class="p">)</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># TRUE</span><span class="w">
</span><span class="c1"># 60000</span><span class="w">
</span></code></pre></div></div>
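<p>The orthogonality properties mentioned earlier can be inspected in the same way. The sketch below uses a smaller random tensor and checks that each factor matrix has orthonormal columns, and that the rows of the mode-1 unfolding of the core are mutually orthogonal with non-increasing norms (which is how I read “all-orthogonal” and “ordered” for mode 1):</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rTensor)
set.seed(6)
X   <- rand_tensor(modes = c(3, 4, 5))
dec <- hosvd(X)

# factor matrices: t(U) %*% U should be the identity
orth_err <- sapply(dec$U, function(U) max(abs(crossprod(U) - diag(ncol(U)))))

# core tensor, mode 1: the Gram matrix of the unfolding should be diagonal
G1        <- k_unfold(dec$Z, 1)@data
gram      <- G1 %*% t(G1)
offdiag   <- max(abs(gram - diag(diag(gram))))
row_norms <- sqrt(diag(gram))

orth_err     # all close to zero
offdiag      # close to zero
row_norms    # non-increasing
</code></pre></div></div>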
<h3 id="higher-order-orthogonal-iteration-hooi">Higher order orthogonal iteration (HOOI)</h3>
<p>Note that we can also use HOSVD to compress <script type="math/tex">X</script> by truncating the matrices <script type="math/tex">A^{(k)}</script>. The truncated HOSVD, however, is known to not give the best fit, as measured by the norm of the difference</p>
<script type="math/tex; mode=display">\lVert X - G \times\subscript{1} A^{(1)} \times\subscript{2} A^{(2)} \times\subscript{3} \ldots \times\subscript{N} A^{(N)} \rVert.</script>
<p>The <em>higher order orthogonal iteration</em>, or HOOI, algorithm seeks a better approximation <script type="math/tex">\widehat{X}</script> (with respect to the Frobenius norm loss) by, essentially, alternating truncation and SVD until convergence. It typically improves on the truncated HOSVD, although convergence to the globally optimal approximation is not guaranteed. If we truncate <script type="math/tex">A^{(k)}</script> to have <script type="math/tex">r\subscript{k}</script> columns, then the HOOI solution can be obtained by the following algorithm.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
&\,&\mathrm{initialize\,via\,HOSVD} \nonumber\\
&\,&\mathrm{while\,not\,converged\,do} \nonumber\\
&\,&\quad \mathrm{for}\, k = 1, 2, ..., N \nonumber\\
&\,&\quad\quad Y \leftarrow X \times\subscript{1} (A^{(1)})^T \times\subscript{2} \cdots \times\subscript{k-1} (A^{(k-1)})^T \times\subscript{k+1} (A^{(k+1)})^T \times\subscript{k+2} \cdots \times\subscript{N} (A^{(N)})^T \nonumber\\
&\,&\quad\quad A^{(k)} \leftarrow r\subscript{k}\,\mathrm{leading\,left\,singular\,vectors\,of}\,Y\subscript{(k)} \nonumber\\
&\,&\quad \mathrm{end\,for} \nonumber\\
&\,&\mathrm{end\,while} \nonumber\\
&\,&G \leftarrow X \times\subscript{1} (A^{(1)})^T \times\subscript{2} (A^{(2)})^T \times\subscript{3} \cdots \times\subscript{N} (A^{(N)})^T \nonumber\\
\end{eqnarray} %]]></script>
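<p>For illustration, here is a minimal hand-written sketch of the loop above. The function name <code class="highlighter-rouge">hooi_sketch</code>, its arguments, and the stopping rule (stop when the norm of the core no longer grows) are my own assumptions, not the implementation used by <code class="highlighter-rouge">rTensor</code>:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rTensor)

# hypothetical helper: HOOI for a Tensor X and target ranks r_1, ..., r_N
hooi_sketch <- function(X, ranks, max_iter = 50, tol = 1e-8) {
  N <- X@num_modes
  x_norm <- fnorm(X)
  # initialize via truncated HOSVD
  A <- lapply(1:N, function(k) svd(k_unfold(X, k)@data, nu = ranks[k])$u)
  core_norm_old <- 0
  for (iter in seq_len(max_iter)) {
    for (k in 1:N) {
      others <- setdiff(1:N, k)
      # project X onto the current factors of all modes except k
      Y <- ttl(X, lapply(A[others], t), ms = others)
      # r_k leading left singular vectors of the mode-k unfolding
      A[[k]] <- svd(k_unfold(Y, k)@data, nu = ranks[k])$u
    }
    G <- ttl(X, lapply(A, t), ms = 1:N)
    # with orthonormal factors, a larger core norm means a smaller residual
    core_norm <- fnorm(G)
    if (abs(core_norm - core_norm_old) < tol * x_norm) break
    core_norm_old <- core_norm
  }
  list(Z = G, U = A, est = ttl(G, A, ms = 1:N))
}

set.seed(7)
X   <- rand_tensor(modes = c(6, 7, 8))
fit <- hooi_sketch(X, ranks = c(2, 3, 4))
fnorm(X - fit$est) / fnorm(X)   # relative error of the rank-(2, 3, 4) fit
</code></pre></div></div>
<p>Because the iteration starts from the truncated HOSVD and no update can shrink the norm of the core, the residual of this sketch should never be worse than that of the truncated HOSVD.</p>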
<h3 id="application-of-hooi-to-data-compression">Application of HOOI to data compression</h3>
<p>The example considered below is somewhat silly, given that the tensor I’m compressing isn’t very big, and thus there isn’t much of a point in compressing it. However, I think that the example still shows how useful the algorithm can be when the data size is much bigger (or the available storage much smaller).</p>
<p>I have downloaded from Kaggle the World Development Indicators <a href="https://www.kaggle.com/worldbank/world-development-indicators">dataset</a>, originally collected and published by The World Bank (the original dataset is available <a href="http://data.worldbank.org/data-catalog/world-development-indicators">here</a>).</p>
<p>The data can be arranged into a three-way tensor with the three modes corresponding to <em>country</em> (<a href="https://www.kaggle.com/benhamner/d/worldbank/world-development-indicators/countries-in-the-wdi-data">list of available countries</a>), <em>indicator</em> (<a href="https://www.kaggle.com/benhamner/d/worldbank/world-development-indicators/indicators-in-data">list of available indicators</a>), and <em>year</em> (1960-2014). Since I didn’t have any time to deal with NA values in any creative way, I have kept only three indicators in the dataset. And I have replaced the remaining NAs with a country-wise average value for each particular indicator. Also, I have forgotten to normalize the data :disappointed:. The preprocessing resulted in a tensor of size 247-countries-by-3-indicators-by-55-years, that looks sort of like this:</p>
<p><img src="/images/Tucker_decomposition/3-way_tensor_small.png" alt="Figure: 3-way tensor obtained from the WDI data" /></p>
<p>In particular, large stretches of the data within a given country tend to be nearly constant, or nearly piece-wise constant.</p>
<p>We use the function <code class="highlighter-rouge">tucker</code> from <code class="highlighter-rouge">rTensor</code> to obtain a Tucker decomposition via HOOI, where we set the ranks to the value 3 at each mode.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">dim</span><span class="p">(</span><span class="n">wdi_tnsr</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] 247 3 55</span><span class="w">
</span><span class="n">tucker_decomp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tucker</span><span class="p">(</span><span class="n">wdi_tnsr</span><span class="p">,</span><span class="w"> </span><span class="n">ranks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="n">str</span><span class="p">(</span><span class="n">tucker_decomp</span><span class="p">)</span><span class="w">
</span><span class="c1"># List of 7</span><span class="w">
</span><span class="c1"># $ Z :Formal class 'Tensor' [package "rTensor"] with 3 slots</span><span class="w">
</span><span class="c1"># .. ..@ num_modes: int 3</span><span class="w">
</span><span class="c1"># .. ..@ modes : int [1:3] 3 3 3</span><span class="w">
</span><span class="c1"># .. ..@ data : num [1:3, 1:3, 1:3] -6.60e+10 -1.13e+05 6.24e+05 -7.76e+05 -1.93e+08 ...</span><span class="w">
</span><span class="c1"># $ U :List of 3</span><span class="w">
</span><span class="c1"># ..$ : num [1:247, 1:3] -0.02577 -0.00065 -0.01146 -0.19637 -0.17317 ...</span><span class="w">
</span><span class="c1"># ..$ : num [1:3, 1:3] -1.00 -6.97e-10 -2.08e-02 2.08e-02 -4.70e-08 ...</span><span class="w">
</span><span class="c1"># ..$ : num [1:55, 1:3] -0.0762 -0.0772 -0.0785 -0.0802 -0.082 ...</span><span class="w">
</span><span class="c1"># $ conv : logi TRUE</span><span class="w">
</span><span class="c1"># $ est :Formal class 'Tensor' [package "rTensor"] with 3 slots</span><span class="w">
</span><span class="c1"># .. ..@ num_modes: int 3</span><span class="w">
</span><span class="c1"># .. ..@ modes : int [1:3] 247 3 55</span><span class="w">
</span><span class="c1"># .. ..@ data : num [1:247, 1:3, 1:55] 9.83e+07 4.44e+06 8.81e+07 1.05e+09 8.97e+08 ...</span><span class="w">
</span><span class="c1"># $ norm_percent: num 99.4</span><span class="w">
</span><span class="c1"># $ fnorm_resid : num 3.9e+08</span><span class="w">
</span><span class="c1"># $ all_resids : num [1:2] 3.9e+08 3.9e+08</span><span class="w">
</span><span class="c1"># NULL</span><span class="w">
</span></code></pre></div></div>
<p>To see how well the tensor decomposition approximates the original tensor, we can look at the relative error</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wdi_tnsr_approx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ttl</span><span class="p">(</span><span class="n">tucker_decomp</span><span class="o">$</span><span class="n">Z</span><span class="p">,</span><span class="w"> </span><span class="n">tucker_decomp</span><span class="o">$</span><span class="n">U</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">fnorm</span><span class="p">(</span><span class="n">wdi_tnsr</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">wdi_tnsr_approx</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">fnorm</span><span class="p">(</span><span class="n">wdi_tnsr</span><span class="p">)</span><span class="w">
</span><span class="c1"># [1] 0.005908934</span><span class="w">
</span></code></pre></div></div>
<p>and at the percentage of the norm of the original tensor explained by the Tucker decomposition</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tucker_decomp</span><span class="o">$</span><span class="n">norm_percent</span><span class="w">
</span><span class="c1"># [1] 99.40911</span><span class="w">
</span></code></pre></div></div>
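<p>The two numbers appear to be two views of the same quantity, since <code class="highlighter-rouge">100 * (1 - 0.005908934)</code> gives 99.40911. Since <code class="highlighter-rouge">wdi_tnsr</code> is not reproduced here, the sketch below checks this relation on a synthetic stand-in (a low-rank tensor plus noise, entirely made up):</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rTensor)
set.seed(8)
# synthetic stand-in for wdi_tnsr: rank-(3, 3, 3) signal plus small noise
core <- rand_tensor(modes = c(3, 3, 3))
U <- list(matrix(rnorm(20 * 3), 20, 3),
          matrix(rnorm(6 * 3), 6, 3),
          matrix(rnorm(15 * 3), 15, 3))
signal <- ttl(core, U, ms = 1:3)@data
noise  <- 0.01 * array(rnorm(20 * 6 * 15), dim = c(20, 6, 15))
X <- as.tensor(signal + noise)

dec     <- tucker(X, ranks = c(3, 3, 3))
rel_err <- fnorm(X - dec$est) / fnorm(X)
# the relative error and norm_percent should carry the same information
c(rel_err = rel_err, from_norm_percent = 1 - dec$norm_percent / 100)
</code></pre></div></div>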
<p>We observe that we indeed achieve a recovery with an accuracy of over 99%. For comparison, the original tensor contains <code class="highlighter-rouge">247 * 3 * 55 = 40755</code> entries, while the computed Tucker decomposition consists of only <code class="highlighter-rouge">247 * 3 + 3 * 3 + 55 * 3 + 3 * 3 * 3 = 942</code> numbers. That’s a reduction in size by a factor of about 43.</p>
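<p>The storage count can be double-checked from the mode sizes reported by <code class="highlighter-rouge">dim(wdi_tnsr)</code> and the ranks passed to <code class="highlighter-rouge">tucker</code>: each factor matrix contributes (mode size) × (rank) numbers, and the core contributes the product of the ranks. The variable names below are mine:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dims  <- c(247, 3, 55)   # countries, indicators, years
ranks <- c(3, 3, 3)      # ranks passed to tucker()

n_original <- prod(dims)                        # 40755
n_tucker   <- sum(dims * ranks) + prod(ranks)   # 741 + 9 + 165 + 27 = 942
n_original / n_tucker                           # about 43.3
</code></pre></div></div>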
<p>Even though data compression does not make much sense for the size of the dataset considered here, it clearly shows potential to be very useful for purposes of data distribution and data storage, when the data size far exceeds the terabyte range.</p>