Command-Line Data Manipulation

You can do quite a lot at the command line, even without sed, awk, or scripts. This vignette will explore some typical preliminary data tasks – many of which might often be done in an environment such as R – without leaving the shell prompt.

Prerequisites: You should know what ls is.

Commands used here: export, cat, wc, time, head, tail, diff, mv, rm, man, tr, cut, sort, uniq, grep, echo, bc, paste, join.

Begin!

Copy these two starting data sets into files called data1.txt and data2.txt in your working directory so that you can display them as follows with cat (concatenate). Commands are shown after the prompt ‘$ ‘. (Use export PS1='$ ' to make your terminal look just like this.)

$ cat data1.txt 
type,name,owner,legs,age
human,Bob,,,42
pet,Basil,Bob,4,
human,Charles,,,27
human,Aaron,,,30
$ cat data2.txt
type,name,owner,legs,age
pet,Billy,Bob,4,
pet,Arty,aaron,4,
pet,Chuck,Charles,2,

Our data will always be text. The command wc (word count) is super useful for checking out text files.

$ wc *
       5       5      93 data1.txt
       4       4      81 data2.txt
       9       9     174 total

The first column is the number of lines. Really this just counts newline (‘\n‘) characters. If your results don’t match here, you’re probably missing the final newline characters in the files. Always include those final newlines! (In an editor it will just look like a blank line at the bottom of the file.) To just get the number of lines, you can use wc -l.

The second column is the number of words. There aren’t any spacing characters in the files, aside from the newlines, so right now every line is one ‘word’. Word counts come from wc -w as you’d expect.

The third column is the number of characters, of any type, whether they print or not. You can get this count with wc -c.

An occasionally interesting thing when working with largish files is to see how long it takes your machine to read through a file, using the time command.

$ time wc data1.txt 
       5       5      93 data1.txt

real	0m0.005s
user	0m0.001s
sys	0m0.003s

With tiny files like this we can do whatever we please. If it took 20 minutes to go through the file, we might think more carefully about techniques to apply.

The commands head and tail show the top and bottom of files, respectively. They take an argument -n X where X is the number of lines to display, which can be shortened to -X. So we can write the headers from these files into new files like this:

$ head -1 data1.txt > header1.txt
$ head -1 data2.txt > header2.txt

The diff command is useful for comparing things.

$ diff header1.txt header2.txt
$

The lack of output here tells us that the two files are identical. We could have used diff on the original files:

$ diff data1.txt data2.txt 
2,5c2,4
< human,Bob,,,42
< pet,Basil,Bob,4,
< human,Charles,,,27
< human,Aaron,,,30
---
> pet,Billy,Bob,4,
> pet,Arty,aaron,4,
> pet,Chuck,Charles,2,

This output shows that the first lines are identical, with lines 2-5 of the first file differing from lines 2-4 of the second file. Since they header rows are the same, we only need one copy.

$ mv header1.txt header.txt
$ rm header2.txt

There’s a handy way to get every line but the first:

$ tail -n +2 data*
==> data1.txt <==
human,Bob,,,42
pet,Basil,Bob,4,
human,Charles,,,27
human,Aaron,,,30

==> data2.txt <==
pet,Billy,Bob,4,
pet,Arty,aaron,4,
pet,Chuck,Charles,2,

Again, the -n can be dropped. Check the help for tail to see how to suppress those per-file headers.

$ man tail

(Press q to get out of the manual.)

Now, let’s combine our two files together, with just one header row.

$ cat header.txt > data.txt
$ tail +2 -q data[12].txt >> data.txt
$ wc data.txt 
       8       8     149 data.txt

Of note: ‘>‘ redirects standard output to a file, wiping out anything that was there before, while ‘>>‘ appends to a file if there’s already one there. Also, data[12].txt behaves similarly to a regular expression, expanding to data1.txt and data2.txt but not data.txt, which is what we wanted. So now all the data is together in data.txt.

$ cat data.txt 
type,name,owner,legs,age
human,Bob,,,42
pet,Basil,Bob,4,
human,Charles,,,27
human,Aaron,,,30
pet,Billy,Bob,4,
pet,Arty,aaron,4,
pet,Chuck,Charles,2,

It’s a little annoying to read comma-separated values (CSV), sometimes easier to read tab-separated values (TSV). We can pipe to the command tr (translate) to aid readability.

$ cat data.txt | tr ',' '\t'
type	name	owner	legs	age
human	Bob			42
pet	Basil	Bob	4	
human	Charles			27
human	Aaron			30
pet	Billy	Bob	4	
pet	Arty	aaron	4	
pet	Chuck	Charles	2

Note also that once we do this, wc will be able to count words in a more meaningful fashion.

$ cat data.txt | tr ',' '\t' | wc
       8      30     149

Sometimes it can be worth thinking about missing values – in this case we have (possibly) an eight-by-five table with only 30 “words” in it. Of course we’re also relying on the file being “nice enough” to be processed without worrying about, for example, commas and spaces within entries. Data will certainly not always be nice enough.

The cut command is useful. For example, cut -c 1-80 will show just the first 80 characters of each line in a file with really long lines. We can also ‘cut’ by delimiters, which is what we’ll do here. The default delimiter is tab, but we can specify commas and then get out the column of names – the second ‘field’.

$ cut -d , -f 2 data.txt 
name
Bob
Basil
Charles
Aaron
Billy
Arty
Chuck

Combining something old with something new, we can get a sorted list of just the name values:

$ tail +2 data.txt | cut -d , -f 2 | sort
Aaron
Arty
Basil
Billy
Bob
Charles
Chuck

Adding one more command, with field three (owners) we see only unique values:

$ tail +2 data.txt | cut -d , -f 3 | sort | uniq

Bob
Charles
aaron

Note that the empty string (the blank) counts as a value. Note also that uniq will not perform as its name suggests if its input is not sorted. It only removes duplicates that are adjacent lines. Finally, note that the alphabetization is not necessarily what you’d expect, but makes sense in its way.

Let’s lose the header row and split to a table of humans and a table of pets. We’re very lucky that we can use these terms to grep for lines. Take the command apart if you aren’t sure what grep is doing here.

$ grep human data.txt | cut -d , -f 2- > human.txt
$ grep pet data.txt | cut -d , -f 2- > pet.txt

There is a command-line math engine, bc, which is very handy. It takes a string representing some mathematics, and does the calculation for you. The syntax is usually what you’d expect, though you do have to write the natural log as just ‘l()‘ and turn on such functionality as logs and floating-point arithmetic with ‘-l‘.

$ echo '2+3' | bc
5
$ echo 'l(2+3)' | bc -l
1.60943791243410037460

There’s also an interactive mode; you probably want to start it as bc -il.

Using the paste command, we can generate a mathematical expression from a data file and pass it to bc. This adds up the total number of pet legs:

$ cut -d , -f 3 pet.txt
4
4
4
2
$ cut -d , -f 3 pet.txt | paste -sd '+' -
4+4+4+2
$ cut -d , -f 3 pet.txt | paste -sd '+' - | bc
14

With -s, paste combines multiple lines onto one. Without -s, it combines corresponding lines from multiple files. This is like using cbind in R, combining columns together. We already saw that we have mis-matching capitalization, with ‘Aaron‘ vs. ‘aaron‘. Let’s add a fully capitalized owner name column to both tables.

$ cut -d , -f 1 human.txt | tr [a-z] [A-Z] | paste -d , - human.txt | sort > Hhuman.txt
$ cut -d , -f 2 pet.txt | tr [a-z] [A-Z] | paste -d , - pet.txt | sort > Ppet.txt

Interesting! I’ve just discovered that HFS+ (the Mac file system) will not let you have different files called human.txt and Human.txt, hence the new filenames have double their initial letters. You learn something new every day.

Finally, we’re going to do an inner join to combine owners and pets together on the appropriate lines. Like in SAS, it’s important at the command line that the data is sorted by the column to merge on, which is the first column by default. (You can have fun keeping track of big-O complexity at every step!) Confusingly, here -t is used to specify the delimiter instead of -d. We’ll finish with some cleaning and labeling, for a nice final data set.

$ echo 'pet_name,owner_name,pet_legs,owner_age' > final.txt
$ cut -d , -f 3,9,2,4 raw_join.txt >> final.txt 
$ cat final.txt 
pet_name,owner_name,pet_legs,owner_age
Arty,aaron,4,30
Basil,Bob,4,42
Billy,Bob,4,42
Chuck,Charles,2,27

(Using cut we don’t have immediate control over the column ordering.)

There is certainly much more that can be done at the command line. Especially if you install one of the cute command-line data visualization projects, you can get a very large share of “statistical” functionality in the traditional unix style of small, focused tools. On the other hand, it can be very nice to step up to more comfortable environments such as R or Python.

Programming is not Mathematics

This is programming.

total = 0
for (i in 1:100) {
  total = total + i
}
total
## [1] 5050

Figuring out that you ought to do this instead is mathematics.

100*101/2
## [1] 5050

There’s a lot of excitement about technology education, which is good. Teach everyone to program! Yes! Program or Be Programmed!

What there isn’t enough of yet is the realization that programming is a basic skill, more akin to multiplication tables than it is to critical thinking, more done by rote than innovative.

Like elementary-school calculating, there is value in learning to program, both for the pure utility and foundational nature of it and for the organized patterns of thinking that it can begin to inculcate.

And programming does provide another avenue for the deepening of thought, just as writing an essay can allow for careful reflection and refinement of thinking. It allows it, but it does not require it, and in this way there is a clear similarity across many fields and all their basic toolkits.

Programming by itself is not enough. Learning the mechanics of one programming language this year does not prepare you for a job a year from now. It may be necessary, but it is not sufficient. I hope educators and learners will reach for deeper learning objectives as they work with technology. It is a means, not an end.

Alternate titles:

  • Programming is not Computer Science
  • Calculating is not Mathematics
  • Calculating is not Computer Science

(Examples above work in R. Apologies to <-.)

Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains

I was recently pointed to the 2010 USDOE/Mathematica paper that shares this post’s name. One has to think that Rockoff and Kane et al. have seen it, but nobody seems to ever mention it. From the abstract:

Simulation results suggest that value-added estimates are likely to be noisy using the amount of data that are typically used in practice. Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data.

Those numbers are awful. What’s interesting to me is that they aren’t even that much better when using three years of data, as compared to one year. I had thought they would probably improve a lot with triple the data. They just stay bad.

This study used a fairly simple VAM, similar to what they do in Tennessee, so that’s one possible critique. But the fact is, this is the only research I’ve seen that seriously attempts to address the trustworthiness of VAM at the teacher and school level. Everybody else seems to be ignoring it, as if there’s no cost to making arbitrarily incorrect judgements about teachers’ work. This paper is worth a look.

Human-Computer Cooperation for Data Merging

This problem comes up all the time, but the instance that got me thinking about it most recently was this: The NYC MTA provides subway station geo-coordinates like this:

127,,Times Sq - 42 St,,40.75529,-73.987495,,,1,
624,,103 St,,40.7906,-73.947478,,,1,
119,,103 St,,40.799446,-73.968379,,,1,

But their subway turnstile data is linked with a table like this:

R032	R145	42 ST-TIMES SQ	1237ACENQRS	IRT
R180	R252	103 ST		6		IRT
R191	R170	103 ST		1		IRT

Notice a couple things. There’s only one subway stop called ‘Times Square’, but the name is different in the two files. That’s one thing, but then there are actually three different stations called ’103 Street’. (Only two are shown here.) To tell the difference between them you have to look at subway lines (1, 6, etc.) in one file, and at longitudes in the other. It’s a huge pain even identifying the contenders to choose between, and certainly no generic matching algorithm will do it correctly for you. It’s such a specialized case that it probably isn’t worth developing an algorithm just for this data – or perhaps it’s that an ‘algorithm’ which would solve this problem would be essentially identical to just doing the matching by hand anyway.

This kind of merge difficulty comes up all the time while getting data munged into a usable state. In this example, I just want to attach the turnstile data to the geo data, and most of the matches aren’t terribly hard, but it’s hard enough that it won’t happen automatically, and it would be a super big hassle to do by hand, grepping or control-f’ing around the files to build a merge table. More than once I’ve wished there was a tool that would help me do this kind of thing quickly.

One way to improve the situation going forward is to have everyone in the world implement unique IDs and/or controlled vocabularies for absolutely everything, and enforce their usage through input validation and strong checking of everything at every point in data processes. That would be nice. I think some messy data situations may persist.

There are a couple special cases of this matching problem that have received a lot of attention:

Geocoding is essentially doing messy matching on strings that represent addresses in order to figure out a unique match with associated geo-coordinates.  These can be found in various GIS software packages, and New York City makes available such a tool specifically for NYC addresses.

Record linkage in health care and education seeks to match up humans records using name, date of birth, and so on. My colleagues at DOE and I worked (and they continue to work) on the education case, for example matching college outcome data to DOE internal student ID numbers that have associated test scores. The Link King is a pretty complete human-matching tool birthed in health care but possibly more broadly applicable. (It is itself free but runs on top of SAS, which isn’t.)

I don’t particularly care about address matching or human matching, for two reasons. First, they won’t help me with the subway stations. I want a general-purpose tool. Domain-specific rules are certainly useful in specific domains, but I don’t like them much in general. Second, in both cases the tools achieve some match percentage which I don’t find acceptable. I want to achieve 100% matched data. I want a tool that will make it easy for me to personally review any case that isn’t 100% certain. (The Link King has some functionality for this, but it’s super specific to human-identifying data. Also, it runs on SAS, which is gross.)

There are more general tools which compare strings for similarity. They all seem to be based on Soundex or Levenshtein edit distance or something like these. Python has difflib, there’s string_score for JavaScript, etc. These are a good start, I think. SeatGeek has their FuzzyWuzzy, which extends basic string comparisons to work better for common cases while still remaining fairly general.

I want to be super OCD with these merges though – I don’t want to just hope that the computer is doing a good job matching. So what I really want is not just the computer to find a likely match, but to let me confirm or correct the matches as it goes. I want not just an algorithm but an interface. In the case of the 103rd St subway stop, it should show me that there are three good possible matches and let me work it out – and it should make the whole operation as frictionless as possible.

I’m imagining this as a JavaScript/web app, probably running entirely client-side. You give it two lists of values that you’d like to be able to merge on. For example,

mergevals1
cow
Ducks

and

mergevals2
duckies
Cow

For each value in the first list, the interface shows you a list of options from the second list, ordered by apparent likeliness of being a match, based on string comparisons of some kind. So you would quickly click on “Cow” to match with “cow” and perhaps agree more slowly that “duckies” is a match for “Ducks” and perhaps specify new common names for both matches (it could just use the first list’s version by default as well). The interface would then produce these:

mergevals1	common
cow		cow
Ducks		duck

and

mergevals2	common
duckies		duck
Cow		cow

Then you can merge common onto your first data set, and also onto your second data set, and then you can merge both your data sets by common.

(In the subway case, you would pass in a combined column with station name and lines on the one side, and station name and geo-coordinates on the other.)

The process would still have a lot of human review, but the annoying parts are sped up by the computer so that the human user is just doing the decision-making. I think this would make a lot of important and useful data merges way more feasible and therefore lead to more cool results.

Maybe I’ll make this.

Value-Added Measures Unstable: Responses from Rockoff and Kane

Columbia‘s Center for Public Research and Leadership hosted a panel last night entitled Evaluating K-12 Teachers: What’s Left to Debate in the Wake of the MET and Chetty-Friedman-Rockoff Studies?.

The panelists were Thomas Kane of the Gates Foundation‘s Measures of Effective Teaching (MET) project and Jonah Rockoff of the “Chetty-Friedman-Rockoff study” (The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood). Panelists Rob Weil of the American Federation of Teachers and Shael Polakow-Suransky of the New York City Department of Education responded after the two researchers presented their executive summaries.

There were three questions from the audience after the panelist’s remarks, of which mine was the last. It ran thus: It’s clear that there is some information in teacher value-added results in the aggregate, and more or less clear correlations can be illustrated if teacher results are heavily binned and averaged before creating, for example, the most popular plots in both the MET and C-F-R reports. However, what matters for K-12 teacher evaluation is the individual teacher level results, so we should be interested in their reliability. A preliminary MET report (page 18, table 5, prior-year correlation) is the only place that give any measure of this reliability for the MET study, despite (or perhaps because of) MET’s unquestioning use of VAM as a known-good model. The highest year-over-year correlation in teacher scores is for math, at 0.404. (I was generous in not mentioning that it’s less than half that for ELA.) If you square the math correlation, you can conclude that teacher value-added scores are essentially four parts noise for every one part signal. They change a lot from year to year – more than we can reasonably expect that teachers actually change. Professor Kane had used the metaphor of a bathroom scale for VAM. If your bathroom scale measures you one year at 240 pounds and the next year at 40 pounds, you start to question the usefulness of your bathroom scale.

After all this setup, my actual question was in two parts. First, how much random noise is an acceptable level for high-stakes teacher evaluations? Second, are the new VAM calculations that the New York State Department of Education (NYSED) commissioned from the American Institute for Research (AIR) any more reliable than the measures of the MET project and other VAM, which have similarly poor year-over-year stability?

The panelists answered neither of my questions, but they did say some things in response.

Rockoff said that he’d thought a lot about the issue. Then he offered that batting averages are similarly unstable for players, year over year. He also suggested that sales figures for salespeople are probably unstable as well.

I don’t know much about baseball or sales, so I don’t know if his claims are true. Even if they are, I’m not particularly moved by an argument of the form “everything is bad, therefore let’s call it good”. Further, I think there’s a big difference between VAM and a statistic like batting average (hits over at-bats, I looked it up) or sales (just add them, I imagine) – these are simple calculations that everyone can follow, and moreover they are quite directly linked to what the person in question actually does. They’re also pretty hard to fake. And importantly, if somebody else has a high or low batting average, it doesn’t affect your batting average. You can still get a high or low batting average regardless of other players. Value-added, on the other hand, is almost entirely a zero-sum game: the reason a teacher has a higher score is because their students’ test scores went up more relative to other teachers’ test score gains – it’s not on some absolute scale of learning. And I don’t know of any laws that evaluate baseball players based on just their batting average and nothing else. I have a hard time getting from Rockoff’s statements to a satisfying argument for VAM as a yearly teacher evaluation method.

Kane also indicated that he’d been anticipating such a question as mine. His response was that we shouldn’t be concerned with year-over-year correlation, we should be concerned with year-to-career stability, which is higher.

Well of course it’s higher. That’s basically the Law of Large Numbers. If you average a bunch of high-noise measurements, the measurements will look more like their average than they look like each other, on average. This reality doesn’t make the individual observations any better.

It’s not that I don’t like the Law of Large Numbers – certainly we can trust VAM more (relatively) when looking at three years of results, compared to just one year’s results. And professor Kane did kind of hint that it would be good for a principal to use three years of data before trying to make a tenure decision. I believe the C-F-R study also used more than one year to estimate teacher effectiveness. But the fact remains that the one-year point estimates are what people have been looking at and will be looking at when they’re evaluating teachers. I don’t think these numbers are good enough. It’s bad science. We can’t trust the measurement.

I didn’t get to ask my follow-up question, which was this: Douglas Harris, author of Value-Added Measures in Education (on Harvard Education Press) advocated in a September 2012 article specifically against including value-added measures with other measures in a composite measure of teacher effectiveness. He argued for using value-added as a screen, in the same way that a doctor gives many people an inexpensive screening test and then follows up more carefully with the patients who are indicated, while knowing that many of those patients may have had false positive results. Could we do something like this?

Notes from “Educational Data Mining: Predict the Future, Change the Future”

Data Mining doesn’t sound as sexy as Data Science these days, but CUNY’s initiative has pulled together a fantastic series of talks focusing on whatever you call it, as applied to education. Ryan Baker opened the series earlier today with an excellent overview of the whole field. CUNY will be posting video of the talk, as well as references to papers mentioned, but there was so much good stuff that I wanted to explore and leave some links to interesting things here.

Professor Baker started with a brief history of big data, which has been used for years in physics, biology, and meteorology. Now it’s popular across the web, largely in very clearly commercial applications, but it’s becoming very relevant to education, largely because educational software can collect so much data as students interact with it.

The applications of EDM described were very diverse, from predicting standardized test scores to automatically detecting student (dis)engagement, “gaming” of educational systems, or emotion broadly. An amusing example is detecting “WTF behaviors,” where “WTF” stands for “Without Thinking Fastidiously” – for example, if students are exploring a largely unsupervised 3D world, how do you know who’s really on task and who’s just carrying bananas to the toilet for kicks? (True story.)

It’s an exciting time! As Professor Baker says, “The data’s all there, you just have to find ways to link it.” And, of course, ways to learn from it. What follows will be a collection of things that you can link and learn from right now:

There are two major societies in this field:

Other interesting things:

School District Demographics System: Probably just the tip of the things-I-didn’t know-were-on-the-NCES-web-site iceberg. Neat interactive mapping and downloadable data sets by school districts.

Zombie Division: A game, apparently popular in the UK, of the ubiquitous “first-person destroy-numbered-skeletons-with-weapons-corresponding-to-their-divisors” genre.

I keep hearing more about Reasoning Mind, an online math ed thing. I haven’t fallen in love with it yet, but who knows.

PSLC DataShop: “The world’s largest repository of learning interaction data.” Sort of like a cross between the UCI Machine Learning Repository and ICPSR, intersect education. It’s at LearnLab, the Pittsburgh Science of Learning Center.

Professor Baker revealed that he’s planning a Coursera (MOOC) course in big data and education for the fall 2013 semester, which should be announced in August. I’m looking forward to it!

Signals” at Purdue: A really neat project that identifies college students at risk of not succeeding, as early as the second week of classes, and automatically recommends interventions as simple as an email lightly customized by a professor. Really shiny web site, too!

It would be nice if projects like Signals published more about their methods. Apparently universities in Europe, particularly Spain, are less competitive and so publish more about the programs they develop to help their students. Of course, it’d be nice if Knewton would publish too. Pearson does, for goodness sake.

Largely, it seems like the best ed tech follows the “all dashboards should be feeds” advice, not just showing data but identifying the actionable bits and even making recommendations for actions to take. Assistments, for example, will send an email summary of online homework completion, letting a teacher know which question students had the most difficulty with, and other highlights.

Speaking of Assistments, apparently Science Assistments has spun off and switched to a much less memorable name.

And finally, I found interesting an off-the-cuff top three of the kinds of things that Educational Data Mining studies. Here it is as I understood it:

  1. Knowledge modeling: “What do students know?”
  2. Knowledge structure modeling: “What’s the deal with these things we know?”
  3. Emotion/engagement modeling: “How are students feeling?”

I’m pretty interested in number two, but they’re all good. Looking forward to more talks at CUNY!

NYU Large Scale Machine Learning (Big Data) Lecture Two: More Methods and also LBFGS

[Slides and video from the lectures are online!]

After being packed into room 101 for week one, the class has been relocated to the expansive 109, which is big enough that we have become “sparse” – and much more comfortable!

My notes have become much longer and less coherent. I almost didn’t put this up; you can, after all, see the complete slides and video online. I’ve decided to just review my notes and leave summarized or interesting tid-bits here, removing most of the detail.

Professor LeCun presided over the first hour, starting from “(Almost) everything you’ve learned about optimization is (almost) irrelevant.” This is kind of funny, but also true in the sense that in machine learning, you aren’t generally in the position of just adding significant digits to a known good solution – you’re trying to maximize some performance which will be hurt by overfitting. So you aren’t really concerned with how fast you can get more digits in your solution. (“asymptotic convergence speed is irrelevant”)

Ah – a name I missed in the first lecture was “Léon Bottou“. He may be coming in to give a guest lecture. I guess he’s the guru of stochastic gradient descent. Professor LeCun very much enjoyed putting up an image from Pink Floyd’s The Wall to dramatize the way learning time goes up seemingly exponentially when you want greater precision from SGD, which is the issue from the previous paragraph.

Another interesting device employed by Professor LeCun was his comparison of online SGD to the Linux (open source, bazaar vs. cathedral) development model. SGD gets a quick imperfect release out quickly, and then improves it with feedback and iterations, as opposed to a batch method that has to look at all the data and really think about it for a while before it makes any model at all. Interesting.

And then we can understand why SGD takes forever to converge to high precision – looking at observations (training examples) one at a time exposes you to the raw noise of those individual observations. Professor LeCun calls this “gradient noise”.

An epoch is a complete pass through your training data. Online algorithms update the model many times per epoch, whereas batch algorithms update the model only once per epoch. Professor LeCun went on to show that batch gradient descent is slow to approach a solution through areas of low loss-function curvature. Online SGD is noisier (jumps around more) but gets closer to the solution faster. (You could omit “online” before “SGD” without changing the meaning, I think.)

What followed was an interesting discussion of optimizing your learning rate and how Newton’s method is not practical (O(n3)) but equivalent to gradient descent in a particularly warped space. Unfortunately the warping is also expensive to compute.

There was also some interesting connection to PCA, and something else, and then the final conclusion seemed to be that averaging your recent models after you’ve done SGD for a while is a good idea because you’re pretty sure you’re near the solution and you can hopefully just average out the remaining noise and get even closer to the solution than just stopping SGD somewhere arbitrarily and taking whatever slightly noisy model it was on.

Then there were some pretty neat demos showing the behavior of various algorithms with neat visualizations, all coded up in Torch. I’m not sure who did them – some student? I also hadn’t realized that Torch was all done with Lua. And the plotting was done in gnuplot, which I hadn’t really seen used since I read Data Analysis with Open Source Tools. Neat!

After a break, the lecture was handed off to Professor Langford, who was actually going to talk about a batch algorithm, which is a little out of character for him (and the class, perhaps). But it’s a particularly neat algorithm! I was dismayed for a while to not know what BFGS stood for, so it was good to find that it’s just a bunch of names. Yes four names of people who published independently but in the same year (1970) papers describing the method. Professor Langford described it as “an algorithm whose time had come.” The L came later (1980) and is just Limited-memory. So there’s the acronym dealt with!

Professor Langford did introduce some of the possible upsides of batch algorithms – namely that they tend to be more mature, perhaps better-implemented, and also possibly better for a parallel computing environment, where online methods can be hard.

The actual method is fairly elegant and I thought Professor Langford’s slides and explanation were quite good. BFGS basically approximates difficult things like the inverse Hessian using just simple things that you have just empirically.

You do have to start BFGS with something – Professor Langford recommended always starting with online learning. Just do some online learning to get more or less close to some sort of solution, and then (L)BFGS is good at getting the higher-precision digits.

It is a little interesting that the lecture started by discounting the importance of higher-precision digits, but came back to a method for finding some. To paraphrase Professor Langford, “If you’re dealing with ad display issues for major companies, it turns out that the fourth digit can actually matter.” (That could be a perfect quote; check it against the video if you want.)

I got to feel slightly smart when I knew the answer to one of the questions Professor Langford offered – but the question and answer have much the same character as the old joke, “Doctor, it hurts when I move my arm like this!” “So don’t move your arm like that!” The question was, “What do you do if you update your model and your loss actually goes up?” And the answer is basically, “Don’t update your model like that!” Really the answer was, only update your model half as far as you were going to and see if that fixes it, and repeat until you reduce loss vs. the old model, but you get the idea. It’s nice when there’s a question that I can handle.

Professor Langford did go on to warn against overfitting, saying that “these things are made to optimize, and that means they will overfit if they can.” He recommended some sort of regularized loss which I don’t have decent notes on, and then did some demos with vw. The immediate connection back to the machine is nice in this class. It isn’t just theory; there are working implementations that show you how it works. There was some discussion of how to stop and restart an algorithm (for example, with new data) by saving just some things about it. I’m really not sure why you can’t just save the entire working state of the algorithm; it may have to do with good ideas regarding the Limiting of memory.

Addressing convergence, Professor Langford took a pretty pragmatic view, saying that there are some theorems, but it’s rarely really quadratic, and you never perform exact line search. Also, it’s an interesting thing that LBFGS is robust enough to optimize even when your function doesn’t have a second derivative, and so in some sense it is (empirically) even better than Newton.