NYU Large Scale Machine Learning (Big Data) Lecture One: Online Linear Classification

Thursday January 31, 2013

I'm very fortunate to be auditing Yann LeCun and John Langford's new big data course at NYU. Their personal web sites might not win any design awards, but they certainly have no shortage of expertise in machine learning - and people know. It was standing-room only in lecture hall 101 at Courant. I think a number of students accepted Professor LeCun's offer to sit on the ground in the front. LeCun's introduction to the course included a variation on Drew Conway's "Data Science Venn Diagram" - Drew Conway was in the back of the room. The professors are planning to bring in some other guest lecturers, and the class is also going to get to use a hundred-node Hadoop cluster donated to NYU by Yahoo!.

It was an exciting atmosphere! What follows are my possibly opinionated, possibly incorrect, definitely selective notes from the class.

We're using N for the dimension of vectors and T for the number of training samples, which is not the notation I generally think of. I'm used to using N (or n) for the number of observations. Maybe this other convention is a standard among machine learning people? Anyway, imagine a matrix with N dimension (feature) labels across the top and T training example (observation) labels down the left side. LeCun didn't draw this out, but he did specify the contents of these cells in his presentation:

	small N	large N
small T		hell!
large T	great!
infinite T	on-line/streaming

I might fill in those empty cells as follows:

	small N	large N
small T	statistics	hell!
large T	great!	also hell?
infinite T	on-line/streaming	also hell?

And I guess we never go on systematically adding more features forever in the same way that we systematically go on adding more data forever.

After the introduction by LeCun, Langford took over for the main lecture, about on-line learning for classification. (I was going to specify supervised, but apparently this is redundant.) As they were switching computers, we could see that Langford was running Outlook in a Windows VM on his laptop - maybe that's the price you pay when you work at Microsoft Research? The real demonstration tool of choice is Vowpal Wabbit, which is the creation of Langford himself. Maybe I should call it the reference implementation. As you can see on the published slides for the lecture, the first directions were to clone vw from GitHub and run make. Of course make just failed on my Mac and I didn't have time to figure out why. It wasn't really necessary to have vw to follow along, but I should really get that to work... I did feel, throughout the lecture, that I was much better off for having previously seen Langford give a talk about vw at the NYC Machine Learning meetup back in September of 2011. Of course it also helped that he was being much more didactic in his presentation.

It wasn't long before the mathematics was enough to keep me busy and I was typing a good deal less. There were also lighter moments, as for example when Langford enjoyed getting in a little rip against Mahout, which it seems is often much slower than his vw. If it's really true that the biggest data set they even try to crunch is processed in about a second on his laptop in vw, I guess that's fair!

Langford spent most of his time talking about choosing a loss function and online learning generally. I hope we get to hear more in the future about how vw uses hashing to achieve its speed. I kind of think this must not be that complex a topic, but I'd like to understand it better.
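For my own reference, here is a minimal sketch of the general feature hashing idea as I understand it: instead of building a dictionary of feature names, hash each name straight into a fixed-size table of indices. This is not vw's actual code. The function name `hash_features`, the `num_bits` parameter, and the use of CRC32 as the hash are all my own choices for illustration.

```python
import zlib


def hash_features(tokens, num_bits=18):
    """Map string features into a sparse vector with 2**num_bits slots.

    Illustrative only: the hash function and table size are assumptions,
    not what any particular library uses.
    """
    size = 1 << num_bits
    vec = {}
    for tok in tokens:
        # A stable hash of the feature name picks the slot; no dictionary
        # of known features is ever stored.
        idx = zlib.crc32(tok.encode("utf-8")) % size
        # Different names that collide in a slot simply share a weight.
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


x = hash_features(["free", "money", "free"])
```

The appeal, as far as I can tell, is that memory is fixed up front and there is no lookup table to build or ship around; the price is that distinct features can occasionally collide.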

The discussion of what loss function is appropriate was the best didactic component of the lecture, in my opinion, as Langford gave the whole room time to think about and suggest what should be used for a range of scenarios. You can see this in his slides titled "Know your loss function semantics", but I think it was much better experienced in the room, at least for me, since I don't instantly know the answers to that type of question. And as Langford says, "one of the most common ways to mess up in machine learning is to optimize the wrong thing." I thought the example of house sales was especially interesting, because the solution involved a consideration of the weirdness in the data coming from non-market home sales between family members and such. I hadn't considered that.
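To convince myself that the choice of loss really changes the answer, here is a small sketch with made-up numbers (the prices below are invented, with one "family" sale at a token price). It finds the best single constant prediction under squared loss and under absolute loss by brute force. The helper name `best_constant` is mine, and I'm not claiming this is the answer given in class, only that the two losses disagree on data like this.

```python
def squared_loss(p, y):
    return (p - y) ** 2


def absolute_loss(p, y):
    return abs(p - y)


def best_constant(loss, ys, candidates):
    """Return the candidate prediction with the lowest total loss."""
    return min(candidates, key=lambda c: sum(loss(c, y) for y in ys))


# Invented sale prices in thousands; the last one is a non-market sale.
prices = [200.0, 210.0, 190.0, 205.0, 1.0]
candidates = [c / 10 for c in range(0, 3001)]  # 0.0, 0.1, ..., 300.0

best_sq = best_constant(squared_loss, prices, candidates)
best_abs = best_constant(absolute_loss, prices, candidates)
```

Squared loss settles on the mean (161.2 here), which the one weird sale drags far below any real market price, while absolute loss settles on the median (200.0) and mostly ignores it.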

There was also some discussion of how to best deal with the often varying importance of different types of failures. (It is much worse, for example, to mark non-spam as spam than to mark spam as non-spam.) I asked a question here that I think, in retrospect, must have been already understood by many people in the room. Langford's response was clear and helpful. He's clearly both intelligent and kind, and sometimes even funny. His response to another question: "Recall and precision... Those always confuse me. Yes, certainly you want something that is good." He did go on to resolve the issue, which I think may have been that recall and precision weren't particularly relevant to the discussion at hand at that time. He also artfully addressed a student question that suggested a method that would break information causality, so to speak. Really good stuff.
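One simple way to express "this kind of mistake is worse" is to give each example an importance weight and scale its gradient step by that weight. The sketch below does this naively for logistic loss with labels in {-1, +1}. The function name `weighted_sgd_step` and the weight of 5 are my own assumptions, and I believe vw handles importance weights more carefully than plain multiplication like this.

```python
import math


def weighted_sgd_step(w, x, y, importance, lr=0.1):
    """One SGD step on logistic loss, scaled by an importance weight.

    w and x are sparse dicts {feature_index: value}; y is -1 or +1.
    """
    score = sum(w.get(i, 0.0) * v for i, v in x.items())
    # Derivative of log(1 + exp(-y * score)) with respect to score.
    grad_scale = -y / (1.0 + math.exp(y * score))
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - lr * importance * grad_scale * v
    return w


# A non-spam example (y = -1) that we really don't want marked as spam
# gets a larger weight than an ordinary example.
w_plain = weighted_sgd_step({}, {0: 1.0}, y=-1, importance=1.0)
w_heavy = weighted_sgd_step({}, {0: 1.0}, y=-1, importance=5.0)
```

Starting from zero weights, the heavily weighted example moves the weight five times as far in the non-spam direction as the ordinary one does.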

There was some discussion of how best to adjust learning rate, a recommendation for per-feature learning rate decay, and Langford also advocated gently for progressive validation rather than (or maybe just in addition to) train-test validation for on-line learning. I had never seen z-scoring referred to as "Gaussian Sphering" before. Both sound pretty cool. I mean the names. Both names are good. But of course the way Vowpal Wabbit does variable normalization is way smarter.
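Here is how I picture progressive validation together with a per-feature learning rate, as a rough sketch: each example is scored before the model learns from it, so the running average loss is an honest estimate without a separate test set. The per-feature rate below shrinks with the accumulated squared gradient for that feature (an AdaGrad-style rule, which is my choice of example and not necessarily the rule from the lecture or the one vw uses). The name `progressive_validation` and the toy data are mine.

```python
import math


def progressive_validation(stream, lr=0.5):
    """Online squared-loss regression with predict-then-train scoring.

    stream yields (x, y) with x a sparse dict {feature_index: value}.
    Returns the final weights and the average progressive loss.
    """
    w, g2 = {}, {}
    total_loss, n = 0.0, 0
    for x, y in stream:
        pred = sum(w.get(i, 0.0) * v for i, v in x.items())
        # Score first: this example has not influenced w yet.
        total_loss += (pred - y) ** 2
        n += 1
        for i, v in x.items():
            g = 2.0 * (pred - y) * v
            if g == 0.0:
                continue
            # Each feature's step size decays with its own gradient history.
            g2[i] = g2.get(i, 0.0) + g * g
            w[i] = w.get(i, 0.0) - lr * g / math.sqrt(g2[i])
    return w, total_loss / n


# Toy stream: y = 2 * x0, repeated.
stream = [({0: 1.0}, 2.0)] * 200
weights, avg_loss = progressive_validation(stream)
```

On this toy stream the weight creeps up toward 2 and the average progressive loss ends up well below the loss on the very first example, since most predictions are made after the model has already learned something.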

Very good stuff! I'm definitely looking forward to next week.

This post was originally hosted elsewhere.