Sadly, this will be my last blog post before exams :( . I have 5 exams coming up in the next 2 weeks, so I'll be prioritising my time for the month. My next post will likely be about the PhD topic I've chosen (we find out roughly after exams).
Usually I try to avoid writing about events I've attended, as I have a feeling I'd make them sound worse than they were. »
I've done a couple of blog posts in the past on statistical learning; see here if you haven't read them yet. In this blog post I'll explain the most popular way to compare models and decide which one is best: the test-train split. This is really only useful for supervised problems.
The test set approach
The test-set approach is quite intuitive once you hear about it. You have \(n\) observed data points, each with an explanatory variable \(x_i\) and a response \(y_i\), and the end goal is to predict the \(y_i\). »
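Not part of the original excerpt, but to make the idea concrete, here's a minimal Python sketch of the test-set approach on synthetic data. The 30% split fraction and the least-squares model are illustrative assumptions, not the post's actual example.

```python
# A minimal sketch of the test-set approach on synthetic data.
import numpy as np

rng = np.random.default_rng(seed=42)

def train_test_split(X, y, test_frac=0.3):
    """Randomly hold out a fraction of the data as a test set."""
    n = len(y)
    idx = rng.permutation(n)
    n_test = int(n * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Synthetic supervised data: explanatory X, response y.
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)
X_train, y_train, X_test, y_test = train_test_split(X, y)

# Fit on the training set only, then judge the model by its
# error on the held-out test set.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
test_mse = np.mean((X_test @ beta - y_test) ** 2)
print(f"held-out test MSE: {test_mse:.4f}")
```

Comparing models then amounts to fitting each on the same training set and preferring the one with the lower test-set error.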
This is a follow-on from the Supervised or not blog post, where I looked at how to decide whether a problem is supervised or unsupervised and worked through a simple example on the iris dataset. As in that post, I'll look at classification again, but this time we'll go more in-depth into some of its issues.
Linear Discriminant Analysis
In the previous post we used k-NN; here we'll use linear discriminant analysis (LDA), which is slightly more complicated. »
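As a taste of what the full post covers, here's a hedged sketch of LDA classification on the iris dataset using scikit-learn. The original posts may well use R, so treat this as an illustrative Python stand-in rather than the post's own code.

```python
# A minimal LDA classifier on the iris dataset, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit LDA on the training set and score it on the held-out test set.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print(f"test accuracy: {lda.score(X_test, y_test):.3f}")
```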
This blog post is about a talk given by Prof. Michael Jordan at the SysML conference on the 15th of February. He's a professor in the Department of Electrical Engineering and Computer Science and the Department of Statistics at the University of California, Berkeley. He's extremely well known, with over 130,000 citations on Google Scholar. This is very much a follow-on from my previous blog post, The Two Cultures of Data Analysis. »
Here's a nice little mind blower: it's actually incredibly hard to find and measure something that is truly random. This is a bit of a problem, as there are so many places where random numbers are needed. In this blog post I'll use a few running examples where random number generation is needed:
The gambling sector: pretty much everywhere, from shuffling cards in online poker (and in casinos, for the particularly high-tech ones) to slot machines, for obvious reasons; see the sketch below. »
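To make the card-shuffling example concrete, here's a hypothetical Python sketch that shuffles a deck with a cryptographically secure RNG drawn from the OS entropy pool, rather than a predictable default PRNG. The deck representation is my own illustration, not from the post.

```python
# Hypothetical illustration: shuffling a deck for online poker with a
# cryptographically secure RNG instead of a predictable default PRNG.
import random

SUITS = "♠♥♦♣"
RANKS = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
deck = [rank + suit for suit in SUITS for rank in RANKS]

secure = random.SystemRandom()  # draws from the OS entropy pool
secure.shuffle(deck)            # Fisher-Yates shuffle with secure randomness
print(deck[:5])                 # the top of the shuffled deck
```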
Much of the reading I do ends up leading me to papers from the machine learning field rather than statistics, and I always ask myself: what's the difference? There's so much blurring between the two areas. Take topic modelling, which is based on Bayesian statistics but is still worked on primarily by people in the machine learning community. Then there are areas that fall more in the statistics field, like Expectation-Maximization, and others in the computer science field, like neural networks. »