Should everyone be pushed toward college as if it's the only acceptable path? Probably not. More vocational education seems like a fine idea. I agree there's a signaling component (possibly large) to the personal benefit of education, and that many people don't learn as much as it seems like they should.
Caplan turns his back on equality of opportunity when he recommends letting the market decide on education. I'm not entirely comfortable with how casually he cites Jensen (page 60) or ignores that effects vary substantially by race even in work he cites (page 150).
I think it's inconsistent to treat education as essentially daycare but not include the provision of that service in the accounting of economic benefits.
I think the interesting thing is to consider Caplan's critiques and think about what really effective education could look like. What really would build human capital? What would make people wiser, happier, and more productive?
"Bryan Caplan, the foremost whistle-blower in the academy, argues persuasively that learning about completely arbitrary subjects is attractive to employers because it signals students' intelligence, work ethic, desire to please, and conformity—even when such learning conveys no cognitive advantage or increase in human capital." (cover blurb from Stephen J. Ceci)
"Learning doesn't have to be useful. Learning doesn't have to be inspirational. When learning is neither useful nor inspirational, though, how can we call it anything but wasteful." (page 2)
"Few jobs require knowledge of higher mathematics, but over 80% of high school grads suffer through geometry. Students study history for years, but history teachers are almost the only people alive who use history on the job." (pages 6-7)
Geometry is higher mathematics? But more importantly, isn't history important, at least in theory, to having informed voters for democracy? Caplan later points out that years of history don't usually yield informed voters in the current system anyway, but it seems inappropriate to completely neglect civic objectives for education.
"...statistics and econometrics courses at elite colleges emphasize mathematical proofs, not hands-on statistical training." (page 11)
That cites Undergraduate Econometrics Instruction: Through Our Classes, Darkly, which includes the following in its abstract:
"Questions of research design and causality still take a back seat in the classroom, in spite of having risen to the top of the modern empirical agenda. This essay traces the divergent development of econometric teaching and empirical practice, arguing for a pedagogical paradigm shift."
"Higher education is the only product where the consumer tries to get as little out of it as possible." (page 26, quoting Arnold Kling, sort of)
The citation is College Customers vs. Suppliers, in which Kling writes:
"I recall seeing a quote somewhere else to the effect that higher education is the only product where the consumer tries to get as little out of it as possible."
So who knows where the quote comes from!
Kling proposes separating testing from teaching, to align students and professors in the direction of rigor rather than easy A's. Maybe so?
"Basic literacy and numeracy are virtually the only book learning most American adults possess." (page 40)
"Though college grads spend at least seventeen years in school, under a third have the level of literacy and numeracy we assume of every college freshman." (page 43)
"Note: Statisticians routinely rely on the approximate equality between logged variables and percentages. However, when coefficients are large, this approximation breaks down, so I convert results to percentages for clarity." (page 302, endnote 8 from page 73)
I don't think this note is very clear. What it's getting at: if you're fitting a linear model, then a logged outcome variable makes the additive model multiplicative in the original outcome, so a 0.02 change in the regression outcome corresponds (approximately) to a 2% change in the original outcome. Gelman has a decent intro, maybe. Or this, which notes that "small changes in the natural log of a variable are directly interpretable as percentage changes, to a very close approximation" and even includes a nice table showing how the approximation breaks down.
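As a quick numeric illustration (mine, not the book's) of why the approximation works for small changes and breaks down for large ones:

```python
import numpy as np

# The approximation: log(1 + p) is close to p when p is small,
# so a change of p in a logged outcome reads as roughly a 100p% change.
for p in [0.01, 0.02, 0.05, 0.10, 0.25, 0.50]:
    print(f"p = {p:4.2f}   log(1 + p) = {np.log(1 + p):.4f}")
# At p = 0.02 the two agree to about three decimals; by p = 0.50
# the log difference is only about 0.41, so the shorthand misleads.
```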
"Most strikingly, the standard measure of "fatalism," the Rotter Internal-External Locus of Control Scale, is a four-question personality test. (page 303, endnote 24 from page 75)
This is misleading. The Rotter instrument has 29 questions. But Caplan is citing a 2006 paper that uses NLSY79 data, which does indeed include a "four-item abbreviated version of the Rotter Internal External Locus of Control Scale"—sort of.
NLSY79 included Rotter questions 28, 13, 15, and 25 (in that order) and in addition to asking respondents to choose one of two statements for each, also asked whether their choice was "much closer" or "slightly closer" to their view. So it's an eight-question instrument (or at least, it yields eight bits of information).
There are multiple tests for internal-external locus of control, some shorter than 29 questions, some longer. If you do want a four-question instrument, consider Kovaleva's IE-4 (four five-point Likert scale questions; see Table 26 on page 81).
"Full-length intelligence tests have a very high reliability; the reliability of the AFQT, for example, is .94 (ASVAB 2015). Short intelligence tests, in contrast, have markedly lower reliability—.74 in the case of the General Social Survey's ten-word IQ test (Caplan and Miller 2010, p. 645)." (page 307, endnote 98 from page 93)
"You'll never apply most of what you study, but so what? Academic success opens doors. A dysfunctional game, but if you refuse to play, the labor market brands you a loser." (page 108)
"Signals can affect pay even after employers know the truth. Employer learning researchers speak as if the payoff for signaling ends as soon as employers know a worker's true worth. They should be more circumspect. For starters, firms often give new workers valuable on-the-job training. As a result, signaling can indirectly boost your productivity. Step 1: Signal in school. Step 2: Land a good job. Step 3: Learn useful job skills on the job. Step 4: Persistently profit. If your signal modestly overstates your skill, your imployer may soon wish they'd hired someone else. By the time they spot their mistake, however, your new marketable skills permanently justify higher pay." (page 112)
Following the endnote:
"While research on this story is sparse, Heisz and Oreopoulos 2006 report the payoff for high-ranked M.B.A. and law degrees increases with experience, even correcting for individual ability, because high-ranked degrees lead to good first jobs, which lead to even better jobs down the line." (page 312)
"The Higher Education Research Institute has questioned college freshmen about their goals since the 1970s. The vast majority is openly careerist and materialist. In 2012, almost 90% called "being able to get a better job" a "very important" or "essential" reason to go to college. Being "very well-off financially" (over 80%) and "making more money" (about 75%) are almost as popular. Less than half say the same about "developing a meaningful philosophy of life." These results are especially striking because humans exaggerate their idealism and downplay their selfishness. Students probably prize worldly success even more than they admit." (page 126)
"A cynic isn't someone who puts a price on the sacred; a cynic is someone who puts a low price on the sacred." (page 126)
"The High School Survey of Student Engagement, probably the single best study of how high school students feel about school, reports that 66% of high school students say they're bored in class every day. Seventeen percent say they're bored in every class every day. Only 2% claim they're never bored in class. Why so bored? Eighty-two percent say the material isn't interesting; 41% say the material isn't relevant. Another research team gave beepers to middle school students to capture their feelings in real time. During schoolwork, students were bored 36% of the time, versus 17% for all other activities. No wonder a major Gates Foundation study ranked boredom the most important reason why kids drop out of high school." (page 135)
"How much does your alma mater's rank matter? Research is oddly mixed." (page 149)
The endnote points to Hoxby 2009, which is about college selectivity, not how much it matters to student outcomes. It's interesting though; here's the abstract:
"Over the past few decades, the average college has not become more selective: the reverse is true, though not dramatically. People who believe that college selectivity is increasing may be extrapolating from the experience of a small number of colleges such as members of the Ivy League, Stanford, Duke, and so on. These colleges have experienced rising selectivity, but their experience turns out to be the exception rather than the rule. Only the top 10 percent of colleges are substantially more selective now than they were in 1962. Moreover, at least 50 percent of colleges are substantially less selective now than they were in 1962. To understand changing selectivity, we must focus on how the market for college education has re-sorted students among schools as the costs of distance and information have fallen. In the past, students' choices were very sensitive to the distance of a college from their home, but today, students, especially high-aptitude students, are far more sensitive to a college's resources and student body. It is the consequent re-sorting of students among colleges that has, at once, caused selectivity to rise in a small number of colleges while simultaneously causing it to fall in other colleges. This has had profound implications for colleges' resources, tuition, and subsidies for students. I demonstrate that the stakes associated with choosing a college are greater today than they were four decades ago because very selective colleges are offering very large per-student resources and per-student subsidies, enabling admitted students to make massive human capital investments."
"Their [Dale and Krueger's] most amazing discovery is that students who submit lots of applications to high-quality schools enjoy exceptional career success whether or not they attend such schools." (page 150)
The citation is Dale and Krueger 2014, which has this abstract:
"We estimate the labor market effect of attending a highly selective college, using the College and Beyond Survey linked to Social Security Administration data. We extend earlier work by estimating effects for students that entered college in 1976 over a longer time horizon (from 1983 through 2007) and for a more recent cohort (1989). For both cohorts, the effects of college characteristics on earnings are sizeable (and similar in magnitude) in standard regression models. In selection-adjusted models, these effects generally fall to close to zero; however, these effects remain large for certain subgroups, such as for black and Hispanic students."
It's the last sentence that matters. Caplan only read the first half of that one, somehow, which I think is a substantial mistake.
"As long as your state's best public school admits you, there's no solid reason to pay more." (page 153)
If you're white. I mean, Dale and Krueger 2014 is his citation—how can he ignore that?
"American marriage is a diploma-based caste system." (page 156)
"Going to Harvard may not get you a better job but almost certainly puts you in an exclusive dating pool for life." (page 157)
"Most Ph.D. students have spent their entire lives at the top of the class, yet half wander off before they defend their dissertations." (pages 163-164)
"Yet common sense insists the best way to discover useful ideas is to search for useful ideas—not to search for whatever fascinates you and pray it turns out to be useful." (page 175)
He cites Niskanen 1997 in this opposition to basic research. This is the common sense of those who don't make the discoveries that enable the useful ideas of the future.
"The United States—and probably the rest of the world—is overeducated." (page 199)
And yet, still so apparently in need of more/better education.
"There really is no need for K-12 to teach history, social studies, art, music, or foreign languages." (page 205)
"Better retention efforts will not make Poor Students perform like Fair Students, Fair Students like Good Students, or Good Students like Excellent Students." (pages 211-212)
"Instead of treating the human capital model as an accurate description of education, they could treat it as a noble pre-_scription _for education. Let's transform our schools from time sinks to skill factories." (page 225)
"Doing any job teaches you how to do a job. If this seems a low bar, recall that almost half of dropouts and a third of high school graduates these days aren't even looking for work. Acclimating them to a any form of employment would be a step up." (page 232)
On page 239, Caplan quotes at some length from Malcolm X describing how he copied a dictionary in prison. It's a bit Hirsch-like, this advocacy for content, for volume...
"The straightforward story, though, is that high culture requires extra mental effort to appreciate—and most humans resent mental effort." (page 247)
"Measuring effects issue by issue neatly explains education's puzzlingly small impact on ideology and party. Since education simultaneously increases social liberalism and economic conservatism, its effect on "liberalism" is ambiguous. And while their social liberalism makes the well-educated more Democratic, their economic conservatism makes them more Republican, leaving partisanship nearly untouched." (page 333, endnote 36 from page 249)
"If a world of historical ignorance is scary, you should be scared already, because that's where we live." (page 250)
"When you run out of ideas, assign a random Wikipedia article. ... Start with the Bureau of Labor Statistics' figures on "employment by major occupational group" and "occupations with the most job growth." When you run out of ideas, have students check out an unfamiliar job from the Bureau of Labor Statistics' Occupational Outlook Handbook." (page 256)
I do kind of like some of this. In particular, I feel like I didn't learn in school much of anything about what jobs exist or what people really do in them.
"I'm cynical about students. The vast majority are philistines. The best teachers in the universe couldn't inspire them with sincere and lasting love of ideas and culture. I'm cynical about teachers. The vast majority are uninspiring; they can't convince even themselves to love ideas and culture, much less their students. I'm cynical about "deciders"—the school officials who control what students study. The vast majority think they've done their job as long as students obey." (page 259)
"I don't hate education. Rather I love education too much to accept our Orwellian substitute." (page 260)
"As Stanford education professor David Labaree remarks, "Motivating volunteers to engage in human improvement is very difficult, as any psychotherapist can confirm, but motivating conscripts is quite another thing altogether. And it is conscripts that teachers face every day in the classroom."" (page 260)
"Many idealists object that the Internet provides enlightenment only for those who seek it. They're right, but petulant to ask for more." (page 261)
"Most humans intrigued by abstract ideas and high culture are working adults. Instead of lamenting youthful apathy, passionate educators should redirect their energy to humans who are ready for enlightenment. There is little money in blogging, podcasting, or uploading lectures to YouTube. But if, like me, you love education to the depths of your soul, such efforts are their own reward." (page 261)
And that's how the book ends, apart from the five dispensable dialogues that follow.
| XGBoost | LightGBM | CatBoost |
|---|---|---|
| search missing high and low | search, then assign missing | specify missing high or low |
| "normal" balanced trees | leaf-first tree growth | oblivious trees (tables) |
| you handle categories | smart categorical ordering | permuted target coding |
| weighted quantile sketch | sample high-gradient examples | permuted boosting |
| regularized objective | exclusive feature bundling | learns category interactions |
| 2016 paper | 2017 paper | 2017 paper |
This is close to correct, I think. It probably won't help you understand what's going on, but if you already know it might help jog your memory. The models all work pretty well.
]]>"There's a popular vision of heroic leadership that centers on extraordinarily productive individuals whose decisions change their company's future. Most of those narratives are intentionally designed by public relations teams to create a good story. You're far more likely to change your company's long-term trajectory by growing the engineers around you than through personal heroics. The best way to grow those around you is by creating an active practice of mentorship and sponsorship." (page 22)
"It might be addressing the sudden realization that your primary database only has three months of remaining disk space, and you can't upgrade to a larger size (in my experience, a surprisingly frequent problem at fast-growing startups)." (page 24)
"For mentorship and sponsorship, spend some time with Lara Hogan's What Does Sponsorship Look Like?, and for being glue, spend time with Tanya Reilly's piece that bore the phrase, Being Glue." (page 34)
"The first place to look for work that matters is exploring whether your company is experiencing an existential risk." (page 39)
"Foster growth" (page 40)
"Specific statements create alignment; generic statements create the illusion of alignment." (page 47)
"There's no such thing as winning, only learning and earning the chance to keep playing." (page 52)
Reminds me of Finite and Infinite Games.
"Premature processes add more friction than value and are quick to expose themselves as ineffective." (page 43)
"There's the old joke about Sarbannes-Oxley [sic]: it doesn't reduce risk; it just makes it clear to blame when things go wrong." (page 54)
"... adopting the "define errors out of existence" approach described in A Philosophy of Software Design." (page 54)
At a quick look, it seems like this is something like "give your functions defaults rather than exceptions" so if you ask for something out of range, for example, you get the last thing rather than throwing.
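For flavor, Python's slice semantics are a ready-made example of this style (my illustration, not necessarily the book's): out-of-range slices are defined to return whatever is in range, possibly nothing, instead of raising.

```python
s = "abcdef"

# Out-of-range slice bounds are quietly clamped; the "error" never exists.
print(s[4:100])   # "ef"
print(s[50:60])   # "" (empty string, not an IndexError)

# Plain indexing, by contrast, still raises:
try:
    s[50]
except IndexError:
    print("IndexError")
```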
"Genuine best practice has to be supported by research, and the best source of research on this topic is Accelerate." (page 56)
"When it comes to complex systems and interdependencies, moving quickly is just optics. It's methodical movement that gets the job done." (page 69)
"I think this is the most important lesson I've learned over the past few years: the most effective leaders spend more time following than they do leading. This idea also comes up in the idea of the "the first follower creates a leader," but effective leaders don't split the world into a leader and follower dichotomy, rather they move in and out of the leadership and follower roles with the folks around them." (page 76)
"There's a well-worn model of genius encapsulated in the Feynman algorithm: "1) Write down a problem. 2) Think very hard. 3) Write down the solution." This mystical view of genius is both unapproachable and discouraging. It's also unrealistic, but it's hard for folks to know it's unrealistic if we don't write down our thinking process for others to follow. By writing down the process of finding an answer, as well as the rationale for the answer, folks around us can being to learn from our decisions rather than simply being directed by them" (page 85)
"Barbara Minto, whose The Pyramid Principle is the most influential work on effective business communication, is also a big fan of structure: "Controlling the sequence in which you present your ideas is the single most important act necessary to clear writing. The clearest sequence is always to give the summarizing idea before you give the individual ideas being summarized. I cannot emphasize this point too much."" (page 96)
"brag document" (page 106)
"Whether your company does ad-hoc promotions or uses a calibration process, promotions are a team activity and as Julia Grace, then of Slack, advised me once during a job search, "Don't play team games along, you'll lose."" (page 110)
"Share weekly notes of your work to your team and stakeholders in a way that other folks can get access to your notes if they're interested" (page 128)
"The flying wedge pattern of one senior leader joining a company and then bringing on their previous coworkers is a well-known and justifiably-despised pattern that relies on this built-in referrer-as-sponsor, but it doesn't have to be toxic if done sparingly." (page 139)
"There are some wonderful engineering leaders creating pockets of equitable access to Staff-plus roles, but those pockets can quickly turn into a Values Oasis that can't sustain itself once the sponsoring leader departs the company or changes roles." (page 139)
"Back in 2012, Patrick McKenzie wrote Salary Negotiation, which has since become the defacto [sic] guide to negotiating salaries for software engineers." (page 145)
"Staff-plus is all about enabling other people to do better work - to be a force multiplier." (page 184, from Bert Fan)
"This is not a meritocracy and your professional network is important." (page 186, from Bert Fan)
"To reach Staff Engineer, you have to know and do more than what you currently know." (page 209, from Ritu Vincent)
"A quote I love from Seneca is "Luck is what happens when preparation meets opportunity."" (page 267, from Damian Schenkelman)
"In the quest for efficiency over effectiveness, many companies trap their managers in a staggering amount of coordination and bureaucracy." (page 309)
]]>"When we talk about designing a Staff-plus engineer interview loop, the first thing to talk about is that absolutely no one is confident their Staff-plus interview loop works well." (page 312)
Say you have a categorical predictor, or a continuous predictor that you're going to bin into categories in order to make it easier to model nonlinear relationships, and a binary outcome like “defaulted on loan or not”. Then for each category, the WoE score is:
\[ \text{WoE} = \log \left( \frac{\text{count of positives for this category} / \text{count of all positives}} {\text{count of negatives for this category} / \text{count of all negatives}} \right) \]
import numpy as np
import pandas as pd
import category_encoders as ce
X = pd.DataFrame({'x': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']})
y = [ 1 , 1 , 0 , 0 , 1 , 0 , 0 , 0 ]
woe = ce.woe.WOEEncoder(regularization=0)
woe.fit(X, y)
woe.transform(pd.DataFrame({'x': ['a', 'b']}))
## (0.5108256237659906, -0.587786664902119)
np.log((2/3)/(2/5)), np.log((1/3)/(3/5))
## (0.5108256237659906, -0.587786664902119)
It's usually written like that, sometimes with additive smoothing (add some small numbers to the counts) to avoid zeros and reduce variance. To make it even clearer that this is just log odds, rearrange to:
\[ \text{WoE} = \log \left( \frac{\text{count of positives for this category}} {\text{count of negatives for this category}} \right) - \log \left( \frac{\text{count of all positives}} {\text{count of all negatives}} \right) \]
np.log(2/2) - np.log(3/5), np.log(1/3) - np.log(3/5)
## (0.5108256237659907, -0.5877866649021191)
The WoE values are exactly the coefficients you'd get if you made indicator columns for your categorical variable, added an intercept column, and did a logistic regression. That design matrix isn't full rank, so there isn't a unique solution—unless you set the intercept coefficient to represent the overall log odds (as in the subtracted term above).
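Here's a quick numeric check of that claim — a minimal sketch using plain gradient ascent on the log likelihood, with the intercept pinned at the overall log odds:

```python
import numpy as np

# Indicator columns for categories 'a' and 'b', same eight rows as above.
X = np.array([[1, 0]] * 4 + [[0, 1]] * 4, dtype=float)
y = np.array([1, 1, 0, 0, 1, 0, 0, 0], dtype=float)

offset = np.log(3 / 5)  # intercept fixed at the overall log odds (3 pos, 5 neg)
w = np.zeros(2)

# Plain gradient ascent, holding the intercept fixed.
for _ in range(20_000):
    p = 1 / (1 + np.exp(-(offset + X @ w)))
    w += 0.1 * (X.T @ (y - p))

print(w)  # close to (0.5108, -0.5878), the WoE values computed above
```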
WoE scores are monotonic in the category-specific percent positive, so if you're going to use a tree-based model, where only the ordering matters, the more immediately interpretable value might be preferable. If you're doing a simple logistic regression with a WoE predictor, the coefficient will be one. In combination with other features, it isn't obvious to me that WoE will always be an optimal transform, but it seems like a fine choice for multiple logistic regression as well. Replacing a categorical feature with WoE can also prevent a model from learning interactions involving the affected categories, which could be a consideration.
Like other transforms based on target statistics, WoE can leak label information into training data. Even in large datasets, if some categories appear rarely, this may be a problem. Cranking up the additive smoothing ("regularization") a bit might help, or consider alternatives as in the CatBoost paper etc.
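Here's one minimal out-of-fold sketch to reduce that leakage: encode each fold using only the other folds' counts. (The fold assignment and smoothing level here are just illustrative; in practice use a shuffled KFold and tune the smoothing.)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': list('aaaabbbb'),
                   'y': [1, 1, 0, 0, 1, 0, 0, 0]})

df['fold'] = [0, 1, 0, 1, 0, 1, 0, 1]  # illustrative fold assignment
smooth = 0.5  # additive smoothing so rare/unseen categories stay finite

def woe_table(train):
    # Per-category positive/negative counts, smoothed, then the WoE formula.
    counts = train.groupby('x')['y'].agg(['sum', 'count'])
    pos = counts['sum'] + smooth
    neg = counts['count'] - counts['sum'] + smooth
    return np.log((pos / pos.sum()) / (neg / neg.sum()))

# Each row is encoded from statistics that exclude its own fold.
encoded = pd.Series(np.nan, index=df.index)
for f in df['fold'].unique():
    mask = df['fold'] == f
    encoded[mask] = df.loc[mask, 'x'].map(woe_table(df[~mask]))

print(encoded.values)
```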
I think WoE is interesting in part because it's sort of on the threshold between a simple pre-processing step and what might be considered model stacking. It's a reminder that even “fancy” models are just statistics with more steps; the mean is a model too.
If you're working on credit scoring, maybe you'll also do feature selection using “Information Value” (IV). I don't know... It reminds me a little bit of MIC, almost? That's not quite right. I'm not so interested in IV right now.
“Determinants are difficult, non-intuitive, and often defined without motivation.” (Sheldon Axler, 1995)
It's common to talk about properties of the determinant, but then treat its formula as almost coming from nowhere, focusing more on mnemonics than meaning. I think it could be worth seeing that the determinant's formula pops out easily in a common setting.
A matrix is a transformation that maps the origin to the origin. If a matrix happens to map anything else to the origin as well, then that matrix isn't invertible, because no inverse can map the origin back to two different points. So we seek a nonzero solution to a system of equations.
\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]
Eliminating \( x \) and \( y \) (example solution below) yields the familiar \( ad - bc = 0 \). That is, if \( ad - bc = 0 \), there's some \( x \) and \( y \) (not both zero) that the matrix maps to the origin, and so the matrix is not invertible. The determinant is \( ad - bc \).
I worked through this also for the three-by-three case and was satisfied to recover the usual expression. There's probably a deeper proof or other connection to more linear algebra, but I don't recall ever being taught any rationale at all for the determinant taking the form it does, so I was just pleased that it's as straightforward as this to show at this level. It's slower, but if you happen to forget the formula for the determinant, you can always work it out again this way!
The matrix equation above is equivalent to Equations 1 and 2.
\[ ax + by = 0 \tag{1} \]
\[ cx + dy = 0 \tag{2} \]
Solving Equation 1 for \( x \) produces Equation 3.
\[ x = - \frac{b}{a} y \tag{3} \]
Substituting Equation 3 into Equation 2 yields Equation 4.
\[ - \frac{cb}{a}y + dy = 0 \tag{4} \]
Multiplying by \( a \), dividing by \( y \) (taking \( y \neq 0 \); the edge cases \( a = 0 \) or \( y = 0 \) can be checked separately), and reordering then recovers the familiar form of the determinant in Equation 5.
\[ ad - bc = 0 \tag{5} \]
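A quick numeric sanity check of the two-by-two case:

```python
import numpy as np

# A matrix with ad - bc = 0 ...
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0])  # 0.0

# ... maps a nonzero vector to the origin (Equation 3 with y = 1
# gives x = -b/a = -2), so no inverse can exist.
v = np.array([-2.0, 1.0])
print(A @ v)  # [0. 0.]
```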
If I was just sleeping through linear algebra, please let me know what I missed! Also let me know if there are any problems with or alternatives to this method! Thanks!
The quote at top from Axler is from his 1995 paper, Down with Determinants!.
Thanks to Erica for helpful feedback!
“I cannot believe that anything so ugly as multiplication of matrices is an essential part of the scheme of nature.” (Arthur Eddington, 1936)
Even Strang introduces matrix multiplication by saying “there is only one possible rule, and I am not sure who discovered it. It makes everything work.” This is not satisfying or helpful for understanding. The flow metaphor of graphical linear algebra makes matrix multiplication seem natural and helps provide intuition for understanding linear algebra.
The inputs x and y pass from left to right along the arrow paths—getting multiplied by a, b, c, and d, respectively—and then adding up to outputs i and j.
The diagram is equivalent to the usual notation for a simple vector-matrix multiplication.
\[ \begin{bmatrix} i \\ j \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} ax+by \\ cx+dy \end{bmatrix} \]
You can visualize inputs entering at the top and outputs going out the side. For example, the x enters the 2-by-2 matrix over the a, and then exits to the left to contribute ax to i. The y enters over the b and exits to the left to contribute by to i.
To compose two matrix diagrams, follow the paths from inputs to outputs. For example, to go from x to u, there's the ae path that goes via i, and there's the cf path that goes via j, so the simplified single path from x to u is ae + cf.
This is matrix multiplication that makes sense. After the input-output behavior above is established, there's no other way for matrix multiplication to come out.
\[ \begin{bmatrix} e & f \\ g & h \end{bmatrix} \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} ae+cf & be+df \\ ag+ch & bg+dh \end{bmatrix} \]
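A quick check with concrete numbers that the path-sum reading agrees with the usual product:

```python
import numpy as np

a, b, c, d = 2, 3, 5, 7
e, f, g, h = 11, 13, 17, 19

A = np.array([[a, b], [c, d]])
B = np.array([[e, f], [g, h]])

# Sum over paths: e.g. x -> u is the ae path (via i) plus the cf path (via j).
paths = np.array([[a*e + c*f, b*e + d*f],
                  [a*g + c*h, b*g + d*h]])

print(np.array_equal(B @ A, paths))  # True
```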
Why do the inner dimensions need to be the same for two matrices you're multiplying? Because you have to match up outputs from one with inputs for the other.
I find this metaphor really helpful, and I suspect it could help people learning linear algebra for the first time as well. I think it can be complementary with thinking about (and visualizing) vector spaces. Exposition might proceed from introducing the idea of linear combinations, to the diagram form, to the matrix notation, for example.
I think it's a good thing that the diagrams make it obvious that order matters. I'd rather have both the diagrams and matrix notation read naturally from left to right, but I don't think everybody will change how they write their linear algebra.
There's nothing special about the usual orientation. Both directions are “on” by default. You can multiply a row vector and matrix, which is like going right to left along the diagram for the matrix. You can see the result isn't equivalent unless you transpose.
Transpose is the operation that swaps inputs with outputs. For example, \( (A B)^T = B^T A^T \). Visualize rotating two connected pieces of pipe.
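A one-line numeric check of that identity:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

# Swapping inputs and outputs reverses the order of composition.
print(np.allclose((A @ B).T, B.T @ A.T))  # True
```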
For a given order, sometimes it feels more natural to think about going left to right rather than right to left. For example, in the \( y = X \beta \) of linear regression, it feels better to me to think of moving a row of the data matrix \( X \) through the coefficient transformation \( \beta \), rather than thinking of the data \( X \) as transforming the coefficients. Either way.
Note also that a column vector is not the same as a row vector, and neither is the same as a plain vector. We probably shouldn't say a vector is any particular matrix.
I started thinking about this based on §3.2.2.4 in Spivak's Category Theory for the Sciences. A complete, rigorous formulation as Graphical Linear Algebra is nicely introduced in Sobocinski's blog. The best brief introduction I've seen is in a presentation from Paixão. (Thanks to Spivak and Sobocinski for helpful pointers over email as well!)
The quote at top from Sir Eddington about the ugliness of matrix multiplication is from page 36 of his Relativity Theory of Electrons and Protons, published in 1936 by Cambridge University Press, as quoted in Macedo and Oliveira's Typing linear algebra.
“So the way we normally teach linear algebra to students is we write down the formula for multiplying a matrix. Where does this formula come from? I mean everyone learns it in first year, right, so we don't question it, but it's actually very difficult to explain. I mean I have to teach linear algebra to first-year students and, you know, it takes them quite a bit of time to get it. And they learn it, they get it by memorizing, you know, these kinds of algorithms in their heads, but this is not the way we should teach maths. We shouldn't teach maths, you know, by telling people to memorize algorithms.” (Sobocinski)
Didau's titular thesis is that the goal of schools is to make kids cleverer, in the sense of crystallized intelligence, by teaching knowledge, broadly construed. When it comes time to discuss which knowledge, he pivots from arguing in support of cleverness to arguing in support of dead white men on the grounds that this knowledge is culturally valued—implicitly, valued by the culture he values.
I think it's possible to make an argument for some shared knowledge, in the tradition of Hirsch, but I think it's a different argument than arguing for knowledge that best helps students think more effectively in the sense of moving toward a global maximum. Similarly, the Lindy effect is about longevity, but not necessarily quality. I think it would be much more interesting to look at curriculum design by taking seriously the idea of giving students the best mental toolkit possible. This is not what Didau does.
Didau discusses the Flynn effect, and subscribes to the "scientific spectacles" interpretation that general skills with scientific abstraction explain increases in average IQ over time, but at the same time argues one-sidedly for teaching concrete knowledge, not general skills.
I do think that students can and probably should learn and remember much more, generally, than they sometimes do, but Didau is a problematic advocate, and I don't think his obsession with IQ is useful.
While this is always the case, I feel like it's especially important with a book like this for me to point out that selected quotes below do not indicate my agreement with or support for any particular quote.
"Over the course of this book, I will explain that, unlike many other qualities we might value, intelligence has the advantages of being malleable, measurable and meaningful." (page 7)
"By 'making cleverer' what I really mean, of course, is raising intelligence—increasing children's intellectual capacity." (page 7)
"Arthur Scargill, tub-thumping leader of the National Union of Mineworkers, who led the opposition to Margaret Thatcher's struggle to break the power of the trade unions, wrote, "My father still reads the dictionary every day. He says your life depends on your power to master words."" (page 9)
"Trying to develop children's ability by teaching generic skills directly is fundamentally unfair. Children with higher fluid intelligence and those from more advantaged backgrounds will be further privileged." (page 11)
"This is the central thesis of the book: more knowledge equals more intelligence." (page 11)
"It's my view that '21st century skills' depend on knowing things rather than on simply being able to look stuff up on the internet." (page 12)
"Developing children's character depends not on attempting to explicitly teach some ephemeral set of 'non-cognitive' skills but on a combination of high expectations, accountability and modeling. As Kalenze suggests, probably the best way to teach resilience is to give children challenging work to do; the best way to teach respect and politeness is to model it; and the best way to teach children how to be functional, happy citizens is to set up systems which hold them to account for their behaviour." (page 25)
"[Kevin] Laland points out that "Humanity's success is sometimes attributed to our cleverness, but culture is actually what makes us smart. Intelligence is not irrelevant of course, but what singles out our species is an ability to pool our insights and knowledge and build on each other's solutions."" (page 43)
"Some children may be born with a greater capacity for solving problems and thinking critically than others. These children are lucky. At the same time, some children will possess more (and more useful) knowledge of the world on which to apply these skills. These children will tend to be from more privileged backgrounds. What happens in school matters far less to both these groups of children than it does to the less fortunate and the less advantaged. The killer argument against a curriculum that focuses on 21st century skills—or any other kind of generic competencies—is that it is inherently iniquitous." (page 45)
"The purpose of schools, as much as anything else, is to provide an environment where children are made to attend to what they would otherwise prefer to avoid." (page 53)
"... the position I will advance in this book is that intelligence is as much a product of what we know as it is a mechanism for acquiring knowledge." (page 57)
"Although we might perceive some children to be more 'able' than others, this is unimportant because there's not really anything we can do about it. We can, however, do an awful lot about developing the quantity and quality of what children know." (page 60)
"While correlation is not proof that one thing causes another, causation is implied." (page 65)
This follows shortly after a section called "Correlation ≠ causation"...
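A quick toy simulation of why that's a dangerous habit: a hidden common cause produces a solid correlation with zero causation between the measured variables (my invented example, nothing to do with any study cited here):

```python
import random

random.seed(1)

# z causes both x and y; x has no effect on y at all.
z = [random.gauss(0, 1) for _ in range(10_000)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

print(round(corr(x, y), 2))  # about 0.5, purely from the shared cause z
```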
Intelligence and class mobility in the British population by Nettle is cited on page 67.
"Using the environment to increase crystallised intelligence is central to making kids cleverer; fluid intelligence, and its associated individual differences, is largely a distraction." (page 77)
Can IQ change? by Howe is cited on page 77.
"The difference of natural talents in different men is, in reality, much less than we are aware of ... The difference between the most dissimilar characters, between a philosopher and a common street porter, for example, seems to arise not so much from nature as from habit, custom, and education." (page 85, quoting Adam Smith, The Wealth of Nations)
Schooling Makes You Smarter: What Teachers Need to Know about IQ by Nisbett is cited on page 88. Decent?
Socioeconomic status modifies heritability of IQ in young children by Turkheimer et al. is cited on page 89.
Mainstream Science on Intelligence: An Editorial With 52 Signatories, History, and Bibliography by Gottfredson is cited on page 95.
Black Americans reduce the racial IQ gap: evidence from standardization samples by Dickens and Flynn is cited on page 97.
"The IQ score of the same person taking a test on different days would produce a correlation of about 0.87." (pages 100-101)
And this is good reliability? Hmm.
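To put a number on that skepticism: assuming the conventional IQ standard deviation of 15, a reliability of 0.87 implies a sizable error band via the standard error of measurement (my back-of-envelope, not from the book):

```python
import math

sd = 15     # conventional IQ standard deviation (assumed)
r = 0.87    # test-retest reliability quoted above

sem = sd * math.sqrt(1 - r)   # standard error of measurement
ci95 = 1.96 * sem             # half-width of an approximate 95% interval
print(f"SEM = {sem:.1f} points; 95% interval = +/-{ci95:.1f} points")
```

So two administrations of the same test to the same person can easily differ by ten points or so.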
"Whether or not this [set of recommendations for schools] results in a measurable increase in IQ is largely irrelevant. I think we can all agree that intellectual curiosity and a lifelong love of learning are things we want for all children, and these suggestions seem like a reasonable bet for getting what we want." (page 111)
But the title of the book is "Making kids cleverer"... that was the whole goal you were working on!
"The rule seems to be that education raises crystallised intelligence but not fluid intelligence." (page 118)
A Cross-Temporal Meta-Analysis of Raven's Progressive Matrices: Age groups and developing versus developed countries by Wongupparaj et al. is cited on page 128.
Flynn effect and its reversal are both environmentally caused by Bratsberg and Rogeberg is referenced on page 129.
"It's much more likely that a growth mindset follows from experiencing success." (page 131)
"As we've seen, motivation is a product of being successful." (page 267)
The earlier quote has no reference. The later one cites The Relation of Academic Self-Concept to Motivation among University EFL Students, but that paper reports correlation, not causation. The closest it gets to causation is actually in the reverse direction:
"results support numerous research findings that academic self-concept is an important determinant of students’ academic performance"
"Beliefs about the malleability of basic ability appear to be largely irrelevant: achievement is all about work." (page 135)
"However, if you want to, you can take an A level course in thinking skills." (page 140)
"Cognitive Acceleration through Science Education (CASE)" (page 140)
"The main reason children end up not learning what they're taught in school isn't that they're not capable of remembering it; it's that their teachers don't sufficiently value kids knowing stuff and don't use the sorts of consolidation strategies which would help them to remember." (page 157)
"I have to show you how to use a comma in a wide variety of contexts and then get you to practise writing correctly punctuated sentences." (page 179)
So his focus on "knowledge" includes skills and application, not just memorization.
"The general rule is that expert knowledge always trumps raw ability." (page 180)
"Knowledge is most truly flexible when it is automatised." (page 183)
On page 183, Didau references a student's test response as "empty and worthless". His book doesn't include the question prompt visible in the original. With that context, the answer doesn't seem so crazy to me. Doesn't seem like a very good question, really.
"So what does it mean to be skilled at making inferences? Nothing: it is indistinguishable from being knowledgeable." (page 186)
"We are unable to think with anything that we are dependent on looking up." (page 191)
"It's only when people ask us to explain what we think we know that we find out whether we know it." (page 197)
Many references to Michael Young's "powerful knowledge" idea. Here's one applied explainer I found. I also stumbled on this critique, which includes in its abstract: "The first part of the article focuses on the definitional connection that Young makes between 'powerful knowledge' and systematic relationships between concepts. It argues that most of the school subjects that Young sees as providing 'powerful knowledge' fall short on this requirement."
"Being able to quote Shakespeare or knowing Pythagoras' theorem may seem like trivia, but it enables us to access society in a way which would be impossible if we didn't know any of this." (page 209)
"One much chewed bone of contention is who gets to decide what knowledge children should learn. The assumption seems to be that there's some shadowy elite inflicting their preferences on the rest of us. This is nonsense. No one chooses; we all choose. No one person knows enough to make this choice but collectively we have access to the vast accumulation of human culture. The most important things to know are those things that last and which most influence other cultural developments; those things that inspire the most 'conversations' backwards and forwards through time and across space; those things that allow us to trace our cultural inheritance through threads of thought from the discoveries of modern science and the synthesis of modern art back to their ancient origins." (page 210)
This is a mess on multiple levels. Leaving aside the most obvious issues around who he means by "we", he's abandoned his original claim of arguing for making children cleverer: lasting a long time and being culturally popular does not imply being useful for thought. And then we're back to who he means by "our cultural inheritance".
"On the face of it, building a curriculum around the thoughts and deeds of historically marginalised groups looks like a really good idea. Who wouldn't want children to know about the achievements of women and people of colour? The trouble is, this isn't shared knowledge. It doesn't allow access to the 'knowledge of power', and, crucially, it doesn't provide much cultural capital." (page 211)
What happened to making children cleverer? Say, by giving them access to the powerful idea that everyone can contribute to society, not just historically dominant (not to say oppressive) people?
Maybe he's arguing against a curriculum based only on historically marginalised groups, to the exclusion of his buddy Shakespeare, say?
"Does it add to children's knowledge of what others in society consider to be valuable?" (page 218)
This is the first in Didau's list of desiderata for what to teach.
"The epistemology of most sciences, for example, is often based upon experimentation and discovery and, since this is so, experimentation and discovery should be a part of any curriculum aimed at 'producing' future scientists. But this does not mean that experimentation and discovery should also be the basis for curriculum organization and learning-environment designing." (quoting Paul Kirschner, page 223)
I think the opposite (not teaching students about experimentation and discovery, and letting them try it at least a little) is also a mistake.
"We've already seen that the best way to learn the solutions to problems is not by solving problems. Problem solving is the means by which new knowledge might be added to the domain; it is not an effective means of learning the knowledge already within the domain." (page 224)
"Solving problems is an inefficient way to get better at problem solving." (page 236)
This sounds very strange from a math teacher's perspective, where the goal is often to teach students to solve problems, and some even say the only way to learn it is to solve problems. Here's Lockhart: "Mental acuity of any kind comes from solving problems yourself, not from being told how to solve them."
Didau's second requirement for a curriculum is that it is "Culturally rich. (Does the selected content conform to shared cultural agreements of what is considered valuable to know?)" (page 224)
On page 225 he summarizes "Some knowledge is more culturally rich than other knowledge — that is, more valued within society."
There is a separate case to be made, maybe, for knowing what other people know so that you can communicate with them, walk the halls of power, etc. But it is not, in my opinion, the same as a case for some knowledge making you cleverer. The argument that historically dominant knowledge is the best knowledge is problematic.
"As experts, we often assume that others share the same background knowledge as us and so it often goes unsaid. And where expert knowledge is stated, all too often it isn't understood. Experts are unaware of the extent of their knowledge and end up speaking in maxims. As we saw in Chapter 7, such maxims are easily understood by other experts but are meaningless to novices. Where a novice will be confused and frustrated by gaps in explanation, an expert fills them in without even realising they're doing it. Such is the curse of knowledge. This lack of insight into the source of expertise can lead us into neglecting the teaching of the vital nuts and bolts on which our expert performances depend." (page 234)
"It's worth noting that we can't create mental representations just through study — we have to get our hands dirty by trying to do the thing we want to improve at." (page 240)
I happened to look up the expertise reversal effect on Wikipedia. I'll put in bold parts that correspond word for word.
Here's what Didau's book includes on page 245:
"The worked-example effect — worked examples (a problem statement followed by a step-by-step demonstration of how to solve it) are often contrasted with open-ended problem solving in which the learner is responsible for providing the step-by-step solution. Although novices benefit more from studying structured worked examples than from solving problems on their own, as knowledge increases, open-ended problem solving becomes more effective."
Here's Wikipedia:
"Interactions between levels of knowledge and the worked-example effect: Worked examples provide a problem statement followed by a step-by-step demonstration of how to solve it. Worked examples are often contrasted with open-ended problem solving in which the learner is responsible for providing the step-by-step solution. Low-knowledge learners benefit more from studying structured worked-out examples than from solving problems on their own. However, as knowledge increases, open-ended problem solving becomes the more effective learning activity."
I think a teacher grading a paper would have to call this plagiarism.
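The word-for-word overlap is easy to quantify with Python's difflib, using the two passages as transcribed above:

```python
import difflib
import re

didau = ("The worked-example effect — worked examples (a problem statement "
         "followed by a step-by-step demonstration of how to solve it) are often "
         "contrasted with open-ended problem solving in which the learner is "
         "responsible for providing the step-by-step solution. Although novices "
         "benefit more from studying structured worked examples than from solving "
         "problems on their own, as knowledge increases, open-ended problem "
         "solving becomes more effective.")

wiki = ("Interactions between levels of knowledge and the worked-example effect: "
        "Worked examples provide a problem statement followed by a step-by-step "
        "demonstration of how to solve it. Worked examples are often contrasted "
        "with open-ended problem solving in which the learner is responsible for "
        "providing the step-by-step solution. Low-knowledge learners benefit more "
        "from studying structured worked-out examples than from solving problems "
        "on their own. However, as knowledge increases, open-ended problem "
        "solving becomes the more effective learning activity.")

def words(text):
    """Lowercase word tokens, keeping hyphenated terms as one token."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())

matcher = difflib.SequenceMatcher(None, words(didau), words(wiki))
print(f"{matcher.ratio():.0%} of the words match in sequence")
```

The similarity ratio comes out far above what independent paraphrases of the same source would produce.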
Domain-Specific Knowledge and Why Teaching Generic Skills Does Not Work by André Tricot and John Sweller is cited on page 245.
"We should always remember that novices are not less intelligent, they are less knowledgeable. Everyone gets cleverer the more they know and the more they practise." (page 246)
"The point of these desirable difficulties is to confront us with the illusion of knowledge and reveal the true extent of our ignorance." (page 254)
"If students simply struggle they will learn to hate school. If they struggle too much, or too soon, this will also be undesirable. Struggle is only desirable after success has been encoded." (page 254)
"The point is not that children should sink or swim, it's that they should all swim." (page 255)
- "Encode success.
- Promote internalisation.
- Increase challenge.
- Repeat." (page 255)
"Developing good explanations and accurate analogies is probably the key area of subject specialist knowledge teachers most need to develop." (page 257)
"Attempting to follow along in their own copy of a text while simultaneously having to listen as the text is read aloud is impossible. Children are forced to task switch between the printed material and the sound of the teacher's voice, meaning they lose track of what it is they're supposed to be reading and remember far less than if they had either read or listened without trying to do both at once." (page 261)
This is presented without citation. It would be interesting to know whether it's supported by any evidence. I feel like I remember plenty of following along in texts while others read in my early education, at least, and I don't recall it being difficult. References welcome.
"And thinking about what we teach is enhanced by remembering that our aim is to help children become more creative, be better problem solvers, think more critically and be more collaborative." (page 269)
"Some knowledge is both more powerful (allows for thinking more thoughts) and more culturally rich (has a higher cultural value) than other kinds of knowledge; as such, it results in more useful schemas." (page 276)
Here at least he recognizes a distinction.
This harmful propaganda was not only published, but also hasn't been corrected since. The book is still for sale. Didau's site still says "Andrew Sabisky has elegantly debunked a series of the most enduring edu-myths about intelligence" in it. While in a later book Didau describes eugenics as "an unpleasant and inherently racist ideology" (page 96) he hasn't disavowed the contents of his earlier book, including when asked directly (1, 2).
Didau's chapters themselves dog-whistle a little more subtly, as when he smirks at political correctness by referencing an obscure incident (page 76) or (in his later book) disparages a feminist paper without considering its contents (page 204).
The book otherwise advocates for explicit instruction and trying to get kids to remember things via, for example, spaced repetition. Didau supports a knowledge-rich curriculum like Hirsch and wants everybody to stop hassling teachers so much. His thinking here reminds me of a meme from Seven Years of Spaced Repetition Software in the Classroom.
Approach with caution.
While this is always the case, I feel like it's especially important with a book like this for me to point out that selected quotes below do not indicate my agreement with or support for any particular quote.
"To make things even more challenging for us as learners and/or teachers, conditions of instruction or practice that appear to result in rapid progress and learning can fail to produce good long-term retention of skills and knowledge, or transfer of such skills or knowledge to new situations where they are relevant, whereas other conditions that pose challenges for the learner — and appear to slow the learning process — can enhance such long-term retention and transfer. Conditions of the latter type, which I have labelled "desirable difficulties", include spacing, rather than massing, repeated study opportunities; interleaving, rather than blocking, instruction or practice on the separate components of a given task; providing intermittent, rather than continuous, feedback to learners; varying the conditions of learning, rather than keeping them constant and predictable; and using tests, rather than re-presentations, as learning opportunities." (pages iv-v)
"These are the threshold concepts of the book:
- Seeing shouldn't always result in believing (Chapter 1).
- We are all victims of cognitive bias (Chapter 2).
- Compromise doesn't always result in effective solutions (Chapter 4).
- Evidence is not the same as proof (Chapter 5).
- Progress is a gradual, non-linear process (Chapters 6 and 7).
- Learning is invisible (Chapter 8).
- Current performance is not only a poor indication of learning, it actually seems to prevent it (Chapters 8 and 9).
- Forgetting aids learning (Chapter 9).
- Experts and novices learn differently (Chapter 10).
- Making learning more difficult can make it more durable (Chapter 11)." (page 2)
"Because curriculum time is always limited, we need to decide which is more important: teaching or learning." (page 3)
"This leads us to naive realism — the belief that our senses provide us with an objective and reliable awareness of the world." (page 16)
"For the most part 'anecdotal evidence' is an oxymoron." (page 18)
"The fusion of these beliefs is enactivism: there really is an objective reality out there, but we cannot perceive it directly. Instead we share in the generation of meaning; we don't just exchange information and ideas, we change the world by our participation in it. We take the evidence of our senses and construct our own individual models of the world. But because we start with the same objective reality, our individual constructed realities have lots of points of contact." (page 21)
I'm not sure that's quite what others mean by enactivism...
"This is what Michael Shermer calls "patternicity": the tendency to find meaningful patterns in random noise." (page 26)
This is citing "The Believing Brain: From Ghosts and Gods to Politics and Conspiracies—How We Construct Beliefs and Reinforce Them as Truths".
See also: apophenia.
"Our inability to think statistically causes us to routinely misinterpret what data tells us. In a survey of school results in Pennsylvania, many of the top performing schools were very small. Intuitively this makes sense — in a small school teachers will better know their students and will be able to give much more tailored support. This finding encouraged the Bill and Melinda Gates Foundation to make a $1.7 billion investment in founding a string of small schools. Sadly the project was a failure. The finding that smaller schools do better was a confound; the worst schools in the Pennsylvania survey were also small schools. Statistically small schools are not better. In fact, larger schools tend to produce better results due to the diversity of curriculum options they can offer. Our desire to find patterns and explanations trips us up. We ignore the statistical fact that small populations tend to yield more extreme results than larger populations, and we focus instead on causes and narratives." (page 27)
This cites "Evidence That Smaller Schools Do Not Improve Student Achievement" by Wainer and Zwerling.
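The statistical point is easy to reproduce: draw "schools" of different sizes from the same population, and the small ones land at both extremes of the league table (a toy simulation of mine, not the Pennsylvania data):

```python
import random

random.seed(0)
population = [random.gauss(50, 10) for _ in range(100_000)]  # student scores

def school_mean(n_students):
    """Average score of a school drawn at random from the population."""
    return sum(random.sample(population, n_students)) / n_students

small = sorted(school_mean(30) for _ in range(1_000))     # 1,000 small schools
large = sorted(school_mean(1_000) for _ in range(1_000))  # 1,000 large schools

# Small schools dominate both the very top and the very bottom, by chance alone.
print(f"small: best {small[-1]:.1f}, worst {small[0]:.1f}")
print(f"large: best {large[-1]:.1f}, worst {large[0]:.1f}")
```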
On page 55 he talks about the British system of "target grades", which is surprising and weird to me. Every student gets some explicit target grade for their big exams at the end of high school. Wacky. He cites a blog post.
After talking to Jay about this, it's possible having "target grades" isn't so different from how things often play out in the US, when student performance on standardized tests sorts them into tracked classes... Not exactly the same thing, but maybe not so foreign.
"Assigning numerical values to our preferences and biases gives them the power of data, but they're still just made up." (page 58)
"The philosopher Bertrand Russell pointed out, "in the modern world the stupid are cocksure while the intelligent are full of doubt"." (page 62)
The citation is Russell's "The Triumph of Stupidity".
I hadn't heard of "thought-showering" instead of "brain-storming"... The phrase seems to have had a short history. The citation of Productivity Loss in Brainstorming Groups: A Meta-Analytic Integration is more relevant. (page 76)
"Rushing students into situations where they are expected to behave like experts misses the fact that they don't yet know enough to do so. Simply making students work in groups will not create better workers." (page 77)
"The point of collaboration is that it opens us up to the ideas of others. But so does reading books." (page 77)
"If the evidence tells us that teacher led instruction is an effective way to teach and discovery learning is an ineffective way of teaching (and it does), why would you do a bit of both?" (page 101)
That highly opinionated bit contrasts with this nearby selection:
"Teaching cannot be child centered and teacher led at the same time. You have to make a choice. I'm not arguing that one position is better than the other, just that they are mutually exclusive." (page 102)
He's trying to make an argument here, and it might even be right, but he isn't making it very well.
"My own suspicion is that most teachers put on a child-centered show when observed and then revert to teaching from the front when the classroom door is closed. The only real effect that being told teaching should be relevant, active and collaborative has had is to make us feel guilty for teaching." (page 109)
This is quite a claim.
On page 111, I was interested to see the Cone of Experience/"Learning Pyramid" ("people remember X% of what they Y") debunked. I remember seeing a poster version back in middle school, and I never questioned it. Didau cites Multimodal Learning Through Media, which opens with a good critique of the "Learning Pyramid" thing.
"Making it harder to learn is more effective than making it easy." (page 115)
"Instead of endlessly seeking to find out new things, we should think more carefully about the things we've already found out." (page 127)
I like this, but maybe it's not an "instead of" but "in addition to"...
"It's not clear whether 'direct instruction' refers to generic teacher-led whole-class teaching or Siegfried Engelmann's Direct Instruction, in which lessons are scripted and which outperformed all other teaching methods in the largest and most expensive education study ever undertaken, Project Follow Through." (page 131)
If scripted lessons are so great, why is video not so great?
"Education researcher and author Geoff Petty says, "This strategy of top-down diktat does not work, it has been carefully evaluated and it fails. So if you are forced to do “evidence based teaching”, you are not doing Evidence Based Teaching! You are being bullied with an ineffective management strategy!"" (page 133)
"Coe has suggested this axiom: "Learning happens when people have to think hard.""
"Understanding and recognizing the most important conceptual areas of our subjects upon which all else rests might help us to make better decisions about both what and how to teach." (page 164)
"The notion of ability is as much about how we see ourselves and how others see us as it is about intelligence." (page 172)
"We want our students to have an understanding of the deep structure of a domain of knowledge, but we have to be patient. If we want someone to have an insight, simply telling them what the insight is 'meant to be' robs them of seeing it for themselves. Instead, we can tell them as much about the surface features of a problem as we can and wait for them to join our dots. Mimicry is a necessary waiting room in the chaos of liminal space. Feeling frustrated that children know, say, their times tables, but are unable to do long division, is silly. As they learn more facts, see more examples and get more practice, they will slowly but surely move towards an expert's understanding of the subject." (page 201)
For someone who doesn't like discovery learning, this seems similar to discovery learning...
"The path to mastery isn't smooth, but it becomes a lot less bumpy when we accept that it's hard and that we're supposed to struggle." (page 209)
"Testing can (and should) include some of the tricks and techniques we've been misusing and misunderstanding as Formative Assessment. In fact, it doesn't really matter how we test students as long as our emphasis changes. Testing should not be used primarily to assess the efficacy of your teaching and students' learning; it should be used as a powerful tool in your pedagogical armory to help them learn." (page 234)
"Trying harder makes a big difference. Getting students to understand what they should be doing is hard enough, but motivating them to actually do it is the master skill." (page 256)
For something that's "the master skill", this topic isn't given much attention... There is some though...
"The fourth reason [for students to decide to invest effort, "To improve their performance"] is the one we should seek to develop. How can we give feedback which harnesses students' desire to improve their performance? ... What we want is for students to see their success as being directly caused by their effort. ... The point of all this, as Wiliam concludes, is for students to believe that "It's up to me" (internal) and "I can do something about it" (stable)." (pages 259-261)
Quoting Dylan Wiliam:
"So, as a general rule, I advise teachers not to give feedback unless the first 10 to 15 minutes of the next lesson is allocated to students responding to the feedback." (page 267)
"One of my favorite models for classroom observation is the one taken by Doug Lemov and the Uncommon Schools network. The idea is ridiculously simple: you look at the data to find out which teachers have the best results and then you observe them to find out what they're doing. Lemov's teaching manual, Teach Like A Champion, is a compendium of some of the strategies common to these über-teachers which can be practiced and replicated by us mere mortals." (page 302)
"As we know, children are complex and classrooms more complex still. We're probably interested in more than students 'merely' acquiring new skills and knowledge within the domains of the subjects we teach. We may also have an interest in fostering a 'love of learning' and turning students into 'lifelong learners'. Whatever the current trend might be, we want our students to somehow be changed and improved by their experiences in school." (page 309)
Quoting Duckworth et al.:
"Grittier spellers engaged in deliberate practice more so than their less gritty counterparts, and hours of deliberate practice fully mediated the prospective association between grit and spelling performance." (page 310)
Quoting Richard Sennett:
"We share in common and in roughly equal measure the raw abilities that allow us to become good craftsmen: it is the motivation and aspiration for quality that takes people along different paths in their lives. Social conditions shape these motivations." (page 320)
(From here on is from Appendix 2: "Five myths about intelligence" by Andrew Sabisky, who seems to be a monster.)
"Differential psychology — the science that investigates the nature and causes of differences between individuals in their cognition ..." (page 391)
"Throughout the 19th century, most commentators on mental abilities assumed they were independent — a school of thought called 'faculty psychology'. [in the sense of multiple intelligences] Such a model remained untested until an English psychologist, Charles Spearman, found that boys' grades in school subjects were highly correlated; the boys who excelled at math were likely to be better than average in English and Latin. Spearman developed a novel statistical technique called factor analysis to analyze his data and proposed that one common factor, g, explained most of the variance in a battery of mental tests, but that each subtest also had its own specific variance (Spearman called this non-shared variation s factor). Spearman's original finding has since been modified by later analyses and the factors derived from analyses of mental tests are best thought to fit into a pyramidal structure, with g at the top (see Figure A2.1). The finding that all mental tests positively inter-correlate is probably the most replicated result in all of psychology, and factor analysis is now an extremely popular statistical tool used widely across the social and biological sciences." (pages 392-393)
"G does not appear to be a chimerical statistical artefact, as has sometimes been alleged by Stephen J. Gould and others, but a biological reality fundamental to cognition." (page 393)
Sabisky references, as an "excellent and readable introduction to behavior genetics", the article How much can we boost IQ and scholastic achievement? by Arthur Jensen. So while Sabisky is a little racist and eugenicist in his appendix, he's citing (and recommending, even) very racist, very eugenicist sources.
"The assumption is taken for granted that all groups are exactly equal in their ability and that any differences between them result from biased tests." (page 402)
I'm not commenting everywhere, but this is so egregious a straw man I couldn't skip it.
"The tests that best predict job performance also discriminate the most against non-Asian ethnic minorities, and especially blacks. The dilemma this gives rise to is known in industrial-organizational psychology as the diversity-validity trade-off." (page 403)
"Do we want to assess the intellectual ability of our students? Such tests are certainly of great value to universities and even more to many employers. Or do we want to assess the competence of our teachers by making sure that our students have in fact learned, with some reasonable proficiency, a core stock of knowledge that we value as a society?" (page 406)
I didn't quite know what "grammar school" meant in the UK...
"Large, permanent individual differences in talent are a fact of life and are not going to go away for the foreseeable future." (page 408)
1.1 -> ' 1 '
98.2 -> ' 98 '
111.7 -> '112 '
9_876 -> '9.9K'
12_345_678 -> ' 12M'
123_456_789_012 -> '123B'
999_123_456_789_012 -> '999T'
Thousands as "K" might be a little unfamiliar for some audiences, but overall I think this is pretty good, for example for labeling axes. Here's the code:
def tdn(number):
    """Short (Three-Digit Number) string representation"""
    assert 0 <= number < 9.995e14, 'outside range of defined behavior'
    if number < 1000:
        # no suffix; the trailing space keeps the width consistent
        return f'{round(number):3d} '
    number /= 1000
    # the assert above guarantees this loop always returns
    for symbol in 'KMBT':
        if number >= 1000:
            # too large for this suffix; try the next one
            number /= 1000
            continue
        if number >= 10:
            return f'{number:3.0f}{symbol}'
        return f'{number:1.1f}{symbol}'
There's probably a nicer way to do it; feedback (and pull requests on GitHub) welcome!
The major simplifying assumption here is that review delays double, so that after n days, the number of days that are contributing their cards to today's review is log_{2}(n).
Your review delays won't be exactly that, and you won't add cards evenly across days. For me, the estimate is close but slightly high after 336 days of using Anki.
The average number of cards added per day is an important multiplicative factor; my average is currently 11.3 cards per day and trending lower. It might be higher for full-time students, but I'd be surprised if many people stayed much above ten cards per day over multiple years. When you stop adding so many cards, daily reviews decrease rapidly.
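The simplifying model above can be sketched in a few lines. This is a rough estimate only, and the function name is mine:

```python
import math

def estimated_daily_reviews(days_of_use, cards_per_day):
    """Estimate daily Anki reviews, assuming review delays double
    (so about log2(n) past days contribute cards to today's queue)."""
    return cards_per_day * math.log2(max(days_of_use, 2))

# e.g. 11.3 new cards/day after 336 days of use:
# 11.3 * log2(336) ≈ 95 reviews/day
```

As noted, the real number tends to come in slightly under this, since actual delays and card-adding habits aren't so regular.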
I think spaced repetition study could reasonably be a lifelong practice, like daily exercise.
The Pareto chart image at left is the same width as the table image shown at right.
Here's an example I found of a Pareto chart where both the individual bars and cumulative line use all the vertical space, which looks nice visually.
In my opinion the confusion introduced by having two different vertical axes is unacceptable.
This particular example is especially egregious because the right hand cumulative percentage scale starts at 20% rather than 0%, which makes it look like the leftmost category somehow accounts for almost none of the cumulative total.
This example also shows how text on visualizations can become a disaster by getting tiny. Don't make people squint.
Having two vertical scales isn't necessary; a single scale can be used and presented as both raw value and percent of total. But this often introduces a lot of whitespace.
The vertical axis labels could be improved here, but the main visual effect is to foreground the cumulative percentage line, squishing the bars. Is the cumulative percentage line worth it?
Ordering things by size is interesting and useful for understanding what components are largest, but it isn't really a natural ordering. It could change, for example. So while there is a cumulative value in the sense of "the top 5 items account for X% of the total", these values aren't cumulative in the sense of summing over a natural order the way you might to get the probability of rolling a 3 or lower, for example.
There's also no meaning to the cumulative value "between" two categories, generally. Using a line to connect the cumulative sums suggests a smoothness of transition which is misleading.
The main use you might put the cumulative line to is identifying the number of categories that add up to some percentage of the total, like 50% or 80%. This leads immediately to trying to do this using the figure's axes, which is clumsy. Better to include the values, and if you do this via direct labeling, your figure is becoming a table with awkward alignment.
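Since the question is really "how many of the largest categories cover X% of the total", it's easy to compute that directly rather than squint at a cumulative line. A sketch, with made-up numbers:

```python
def categories_to_cover(values, fraction=0.8):
    """How many of the largest values are needed to reach
    the given fraction of the grand total."""
    total = sum(values)
    running = 0.0
    for count, value in enumerate(sorted(values, reverse=True), start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(values)

# hypothetical category sizes summing to 100
print(categories_to_cover([45, 20, 12, 8, 6, 5, 4]))  # → 4
```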
The main thing people love to hate about pie charts is the difficulty we have interpreting relative sizes of pieces of pie, and this is valid. But it's also a huge pain to label those slices.
If you take the easy way out with a separate legend, as in this extreme example, you create a miserable game of hide and seek for the viewer.
This example struggles even to label the percentages of the tiny slices, and matching labels to slices is very difficult.
Attempting direct labeling with a pie chart is better, as in this example comparing military spending.
There are two reasons direct labeling is possible here: country names tend to be short, and most countries get tossed into "Rest of World". In general, pie chart labeling won't work as well as this.
And of course the problems with using a circular representation remain as well.
A stacked bar chart should be better than a pie chart in that it allows viewers to compare linear distances, but it's still hard to label. I tried a couple different ways.
I tried mocking up direct labeling on a stacked bar, but I gave up before I finished adding all the lines because it was just a nightmare. Too many lines, too close together.
The alternative then is to use proximity. But what do you do with the labels that won't fit?
Here's one clumsy answer, just stacking the labels once there isn't space to put them in the right places. It starts to just look like a table with wasted space.
A variation puts all the unlabelable categories into a catch-all, which I don't really love either.
The horizontal text is at least readable, and it isn't a pie chart, but it's also still not really so easy to compare lengths between segments at the top and the bottom of the stacked bar.
With languages that write left-to-right or right-to-left, it's difficult to label a bunch of vertical bars. Especially with longer labels, it can get ridiculous.
Arguably, a big part of the value of a Pareto chart is the labels because they've been put in order so that the biggest factors come first. Readability is important. The Pareto chart generally fails to have highly readable labeling.
A horizontal bar chart with the cumulative line dropped could make for a better Pareto chart. You could adjust orientation and add numeric labeling so that viewers can read off useful values. At this point what you have is really a table with one column of SPARKLINE bars.
Using a table lets you easily align data in columns for easy reference and scanning, increasing information density without making an irregular Where's Waldo of numbers.
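As a sketch of that "table with a column of bars" idea (all labels and numbers here are hypothetical), even plain text can align columns and attach a proportional bar to each row:

```python
def bar_table(rows, width=20):
    """Render (label, value) rows, largest first, as an aligned
    text table with a proportional bar per row."""
    label_w = max(len(label) for label, _ in rows)
    biggest = max(value for _, value in rows)
    lines = []
    for label, value in sorted(rows, key=lambda row: -row[1]):
        bar = '#' * round(width * value / biggest)
        lines.append(f'{label:<{label_w}}  {value:>6}  {bar}')
    return '\n'.join(lines)

print(bar_table([('Health', 1339), ('Defense', 714), ('Interest', 345)]))
```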
Categories are generally epiphenomena of human perception.
In this example, why is "Medicare" separate from "Health"? Why use the divisions here and not others? Sometimes big categories here are omitted by others as non-discretionary; why not do that?
Questions of this kind are probably a bigger deal than the visualization you choose, but are outside scope here.
Data is shown for a single year here, but changes over time are super interesting. "Income security" is the top category for 2020, but that's because of COVID-19: it was fifth for 2019. "National defense" was second in 2019. I'm not sure what the best way is to show these changes over time; the trends page from the data provider isn't bad.
The data is from USAspending.gov's Data Lab Federal Spending by Category and Agency. I dropped "Offsetting Revenue Collected But Not Attributed to Functions" on the grounds that I don't understand how it's spending. Really I think their displays are pretty good.
A brief Python notebook is on GitHub. I mocked some stuff up in a Google Slides deck, and the main table is in a Google Sheet. All of these are fairly clumsy and I'd love to find better ways of doing this kind of thing.
\[ A \cdot B = \cos \left( \gamma \right) a b \tag{1} \]
Equation 1 is invoked, for example, in defining cosine similarity, often without derivation.
If, in \( n \) dimensions, the Cartesian coordinates of a point \( X \) are \( x_1, x_2, \dotso, x_n \), then the dot product \( A \cdot B \) is the sum of element-wise products as in Equation 2.
\[ A \cdot B = \sum_{i=1}^n{a_i b_i} \tag{2} \]
The dot product is so simple, it's a little surprising a nice trigonometric function like cosine should come out of it in arbitrary high-dimensional spaces.
Notation has been chosen so as to write the Law of Cosines as usual in Equation 3.
\[ c^2 = a^2 + b^2 - 2 a b \cos \left( \gamma \right) \tag{3} \]
The \( c^2 \) in Equation 3 is the square of the distance from \( A \) to \( B \) and can also be written, using the Pythagorean theorem, as in Equation 4.
\[ c^2 = \sum_{i=1}^n{ \left( a_i - b_i \right)^2 } = \sum_{i=1}^n{ a_i^2 + b_i^2 - 2 a_i b_i } = a^2 + b^2 - 2 \sum_{i=1}^n{ a_i b_i } \tag{4} \]
Equating the right-hand sides of Equations 3 and 4, and recognizing the dot product as in Equation 2, we've built Equation 1 from scratch. \( \Box \)
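As a numeric sanity check of the derivation, here's a sketch with arbitrary four-dimensional vectors: the cosine computed from the dot product agrees with the cosine computed from the Law of Cosines.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# arbitrary vectors in four dimensions
A = [1.0, 2.0, -1.0, 0.5]
B = [0.5, -1.0, 3.0, 2.0]

# cos(gamma) via the dot product (Equation 1)
cos_from_dot = dot(A, B) / (norm(A) * norm(B))

# cos(gamma) via the Law of Cosines (Equation 3),
# where c is the distance from A to B
c_squared = sum((x - y) ** 2 for x, y in zip(A, B))
cos_from_law = (norm(A)**2 + norm(B)**2 - c_squared) / (2 * norm(A) * norm(B))

assert math.isclose(cos_from_dot, cos_from_law)
```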
It's also possible to use two-dimensional right triangle visualization thinking to see how dot product projects one vector onto another, with scaling by the lengths of the vectors. 3Blue1Brown has a great video. It's easy to connect to the usual definition of cosine.
Since you can always rotate your coordinate frame to get one vector with all zero coordinates except for one, and the other vector with all zero coordinates except for that one and one more, it's clear that dot product should give cosine in that frame, but it isn't necessarily obvious (to me) that everything works out in other coordinate frames. The proof above gives me confidence that it does. (Not that I didn't believe it before, but it's nice to have a reason.)
I wasn't quite satisfied with cosine similarity via dot product for a long time. My satisfaction now and the proof above are based quite directly on the proof Hamming generously gives on pages 117-118 of The Art of Doing Science and Engineering. As he says there (italics in original):
"I have found it very valuable in important situations to review all the basic derivations involved so I have a firm feeling for what is going on."
He uses different (possibly better?) notation, and his figure is certainly nicer than mine.
And then they talk about Kindle, Prime, Prime Video, and AWS in part two.
"In its first shareholder letter back in 1997, Amazon's first year as a public company, you'll find the phrases "Obsess Over Customers," "It's All About the Long Term," and "We will continue to learn from both our successes and our failures." One year later the term "Operational Excellence" entered the discussion, completing the four-faceted description of Amazon's corporate culture that endures today." (page xi)
"Of course, these four cultural touchstones don't quite get at the "how," that is, how people can work, individually and collectively, to ensure that they are maintained. And so Jeff and his leadership team crafted a set of 14 [now 16] Leadership Principles, as well as a broad set of explicit, practical methodologies, that constantly reinforce its cultural goals. These include: the Bar Raiser hiring process that ensures that the company continues to acquire top talent; a bias for separable teams run by leaders with a singular focus that optimizes for speed of delivery and innovation; the use of written narratives instead of slide decks to ensure that deep understanding of complex issues drives well-informed decisions; a relentless focus on input metrics to ensure that teams work on activities that propel the business. And finally there is the product development process that gives this book its name: working backwards from the desired customer experience." (page xi)
Around page 12, they talk about how the Leadership Principles tried to capture the culture, not create it. They were first listed in 2004-2005, ten years after Amazon was founded.
"In my tenure at Amazon I heard him [Jeff Bezos] say many times that if we wanted Amazon to be a place where builders can build, we needed to eliminate communication, not encourage it." (page 61)
"The metrics used to measure progress were agreed upon. For example, In-stock Product Pages Displayed divided by Total Product Pages Displayed, weighted at 60 percent; and Inventory Holding Cost, weighted at 40 percent." (page 70)
Reading this quickly, it seems to make sense. But the values are on different scales: one is between zero and one (I hope) and the other is a cost, presumably measured in dollars. So what is this really?
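The book doesn't say how the two scales were reconciled. One purely hypothetical reading (function, parameter names, and the normalization scheme are all mine) is that the dollar metric would be normalized against a target before weighting:

```python
def weighted_score(in_stock_ratio, holding_cost, cost_target):
    """Hypothetical reading of the 60/40 metric: normalize the dollar
    figure against a target so both terms live on a 0-to-1 scale."""
    cost_score = min(cost_target / holding_cost, 1.0)  # lower cost is better
    return 0.6 * in_stock_ratio + 0.4 * cost_score

# 95% of pages in stock, holding cost exactly at target:
score = weighted_score(0.95, 100.0, 100.0)  # 0.6 * 0.95 + 0.4 * 1.0 = 0.97
```

Without some normalization like this, the raw dollar term would swamp the zero-to-one ratio entirely, which is part of why the mixed-scale formula reads oddly.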
Also, for product pages, an ebook is always in stock, so is there an incentive to show ebooks rather than physical books? Is that desirable?
"If you're good at course correcting, being wrong may be less costly than you think, whereas being slow is going to be expensive for sure." (page 71, quoting the 2016 Bezos shareholder letter)
"Sometimes it's best to start slow in order to move fast." (page 71)
"Fitness Functions Were Actually Worse Than Their Component Metrics
Two-pizza teams had been meant to increase the velocity of product development, with custom-tailored fitness functions serving as the directional component of each team's velocity. By pointing each team in the right direction and alerting them early if they drifted off course, fitness functions were supposed to align the team uniquely to its goals. We tried them out for more than a year, but fitness functions never really delivered on their promise for a couple of important reasons.
First, teams spent an inordinate amount of time struggling with how to construct the most meaningful fitness function. Should the formula be 50 percent for Metric A plus 30 percent for Metric B plus 20 percent for Metric C? Or should it be 45 percent for Metric A plus 40 percent for Metric B plus 15 percent for Metric C? You can imagine how easy it was to get lost in those debates. The discussions became less useful and ultimately distracting—just another argument that people needed to win.
Second, some of these overly complicated functions combined seven or more metrics, a few of which were composite numbers built from their own submetrics. When graphed over time, they might describe a trend line that went up and to the right, but what did that mean? It was often impossible to discern what the team was doing right (or wrong) and how they should respond to the trend. Also, the relative weightings could change over time as business conditions changed, obscuring historic trends altogether.
We eventually reverted to relying directly on the underlying metrics instead of the fitness function. After experimenting over many months across many teams, we realized that as long as we did the up-front work to agree on the specific metrics for a team, and we agreed on specific goals for each input metric, that was sufficient to ensure the team would move in the right direction. Combining them into a single, unifying indicator was a very clever idea that simply didn't work." (pages 73-74)
"What was originally known as a two-pizza team leader (2PTL) evolved into what is now known as a single-threaded leader (STL). The STL extends the basic model of separable teams to deliver their key benefits at any scale the project demands. Today, despite their initial success, few people at Amazon still talk about two-pizza teams." (page 75)
"When the retail, operations, and finance teams began to construct the initial Amazon WBR [Weekly Business Review], they turned to a well-known Six Sigma process improvement method called DMAIC, an acronym for Define-Measure-Analyze-Improve-Control." (page 124)
"Amazon takes this philosophy [of understanding how inputs affect outputs] to heart, focusing most of its effort on leading indicators (we call these "controllable input metrics") rather than lagging indicators ("output metrics")." (page 124)
"When Amazon teams come across a surprise or a perplexing problem with the data, they are relentless until they discover the root cause. Perhaps the most widely used technique at Amazon for these situations is the Correction of Errors (COE) process, based upon the "Five Whys" method developed at Toyota and used by many companies worldwide." (page 132)
"Anecdotes and exception reporting are woven into the [Weekly Business Review] deck." (page 135)
"Data Combined with Anecdote to Tell the Whole Story: Numerical data become more powerful when combined with real-life customer stories." (page 142)
"These stories remind us that the work we do has direct impact on customers' lives." (page 143)
"Data and anecdotes make a powerful combination when they're in sync, and they are a valuable check on one another when they are not." (page 145)
"Even the best process can only improve the quality of your decision-making; no process will make the decision for you." (page 159)
"With each modification [of the org structure], the scope of each leader's responsibilities would become narrower, but the intended scale of each role was greater. At most companies, reducing a leader's scope would be considered a demotion, and in fact there were many VPs and directors who saw each of these changes in that way. At Amazon, it was not a demotion." (page 174)
There's some interesting organizational psychology here; maybe it's obvious, but it seems interesting to me.
"Steve asked Gregg to build out a hardware organization, which he did with the code name Lab126 (the 1 and 26 stood for the letters A and Z) and earmarked a meaningful amount of capital to the effort." (page 181)
"... we learned from studies that the average consumer would only bother to connect their iPod to their PC once a year. That meant most people walked around without the latest music on their devices. It was known as the "stale iPod" syndrome." (page 183)
This kind of thing is fascinating - often I would never guess the real behavior of real people, on average.
"... in Amazonian terms, a "strong general athlete" (SGA)." (page 202)
Seems like this phrase shows up sometimes on Amazon job descriptions...
"Jeff and other Amazon leaders often talk about the "institutional no" and its counterpart, the "institutional yes." The institutional no refers to the tendency for well-meaning people within large organizations to say no to new ideas." (pages 203-204)
"Jeff said he wanted to build a moat around our best customers. Prime would be a premium experience for convenience-oriented customers." (page 208)
Wrapping up with recommendations:
- "Ban PowerPoint
- Establish the Bar Raiser hiring process
- Focus on controllable input metrics
- Move to an organizational structure that accommodates autonomous teams with single-threaded leaders
- Revise the compensation structure for leaders
- Articulate the core elements of the company's culture
- Define a set of leadership principles
- Depict your flywheel" (pulled from pages 261-262)
"Looking back on those days [growing up in Hungary] now, 50 to 60 years later, I appreciate and admire the atmosphere. Culture was not ridiculous, unusual, or sissy—it was taken for granted. Books and music were regarded as a part of everyone's common heritage. At school we discussed the latest Nick Carter, sure, but we talked about d'Artagnan too, and we could tell each other that we had just read something without being considered odd." (page 7)
"I think by writing." (page 8)
"Anyway, back to chemistry: laboratory work seemed to me an uninspiring and messy waste of time. I was never told and I never caught on that a laboratory could teach you new facts and insights; I regarded it as just one of those chores (like irregular verbs in German and finger exercises for the piano) that the world assigns to apprentices before allowing them to become journeymen. I knew in advance what each experiment was intended to prove, and I proved it; I cooked the books mercilessly. Before the year was over I knew that chemistry was not for me, and I arranged to be transferred to the general liberal arts curriculum." (pages 23-24)
"The [calculus] text was the infamous Granville, Smith, and Longley that, according to rumor, brought each of its authors a royalty income of many thousands of dollars for at least 20 years. It was very bad. The explanations were not explanations—they were neither clear nor correct—they were cookbook instructions, no more. The selling virtue of the book was that it had many exercises, almost all of the routine mechanical kind." (pages 26-27)
"[In graduate school] I did not understand (never even dreamt of) the idea of a "structure", in the sense in which, later, Bourbaki used that word, and I was stumped by the infinitesimal subtlety of epsilontic analysis. I could read analytic proofs, remember them if I made an effort, and reproduce them, sort of. But I didn't really know what was going on." (page 47)
This is interesting to me... What about modern mathematicians? Are they working because of or in spite of Bourbaki? I know I took some courses that presented hyper-formal versions of things as if there were no alternative, and certainly no real intuition... I'm not sure it's the right way to teach, at least for a first introduction.
"Music and poetry are more important than carburetors and calculus, because both the bus driver and I would be better human beings if we had more in common and because we could then collaborate better to live in a saner world." (page 30)
"Another way in which I wish I had then followed the advice I give now has to do with the old adage about all work and no play. I don't believe in it; I think all work and no play is the only way to get anything done. Having made my point in as shocking a way as I could think of, I'm ready to take it back and modify it. I am not talking about a lifetime of torture and slavery, and I am not excluding the relaxing tennis game, detective story, dinner in Chinatown with a bunch of friends, or Saturday night movie. What I am saying is that the work of a scholar is not torture that would be insupportable without distraction, and that, for most of us, two consuming passions are one too many." (page 53)
"If I had to describe my conclusion [about how to study] in one word, I'd say examples. They are, to me, of paramount importance. Every time I learn a new concept (free group, or pseudodifferential operator, or paracompact space), I look for examples—and of course, non-examples. The examples should include, whenever possible, the typical ones and the extreme degenerate ones." (pages 61-62)
"I did read the first 10 or 20 pages of all those books, and I dipped into other parts, skipping back and forth among them. (I wish I had read the first 10 pages of many more books—a splendid mathematical education can be acquired that way.)" (page 65)
"It's been said before and often, but it cannot be overemphasized: study actively. Don't just read it; fight it! Ask your own questions, look for your own examples, discover your own proofs. Is the hypothesis necessary? Is the converse true? What happens in the classical special case? What about the degenerate cases? Where does the proof use the hypothesis?" (page 69)
"With no official pre-arrangement, I simply tacked up a card on the bulletin board in Fine Hall saying that I would offer a course called "Elementary theory of matrices", and I proceeded to offer it. I prepared for it carefully, a goodly number of students (something like a dozen) attended regularly, and a couple of them took notes. ... As far as money goes, Princeton was agreeable to accepting my services free; I was slightly surprised when, at the end of the term, those services were officially recognized. I received an official request to assign grades in the course—as far as Princeton University and the students were concerned, the course carried graduate credit. From my point of view, as teacher, the course was splendid. No one came unless he wanted to; there was no nonsense about prerequisites and distribution requirements. There was no syllabus; we talked about what we wanted to talk about. There was no homework, there were no exams. When I had to give grades, I did so on the basis of subjective impressions acquired during class and during between-class discussions." (pages 95-96)
Halmos similarly just showed up at the Institute for Advanced Study with a friend of his who was actually invited, and effectively joined by proximity. Bold!
"To write a book based on the notes of a course is a good way to write a book. The most important single feature of good writing—of clear communication of any kind—is organization. If you know the right order in which a sequence of things should be said, and if you know the extent to which you need to emphasize some parts and play others down, your communication battle is more than half won." (page 96)
"I readily admit—I'd like to be among the first to insist—that expository writing should not deviate from currently accepted standard English; in such writing even puns and other attempts at humor, and, of course, outright vulgarity, are badly out of place. Why? Because they are irrelevant, they are distracting, they interfere with the clear reception of the message. Expository writing must not be sloppy in either content or form and, of course, it must not be misleading; it must be dignified, correct, and clear. Within these guidelines, however, expository writing should be written in a living, colloquial style, it should be evocative in the same sense in which poetry is, and it should not be stuffy, but friendly and informal. The purpose of writing is to communicate, and style is a tool for communication. It should be chosen so as to put the reader at his ease and make the subject seem as easy to him as it already is to the author." (page 113)
"What is important in communication, in lecturing for example, is not what message the speaker sends but what message the listener receives. A part of the art of lecturing is to know when and how to lie. Don't insist on protecting yourself by being cowardly legalistic, but lead the audience to the truth." (page 114)
"..., John Isbell (who, God forgive him, became a categorist), ..." (page 155)
"I give most of the credit [for still remembering Spanish after many years] to the saturation method—do everything, do it all at once, and do it every minute you can possibly spare—the best way of learning there is." (page 172)
"By now I think fondly of it [Uruguay] and wish it well; it gave me something and I left part of me there; I am glad I went, and I am glad I am not there now." (page 199)
I feel similarly about South Korea.
"Both the logician and, say, the harmonic analyst, look for a certain kind of structure, but their kinds of structures are psychologically different. The mathematician wants to know, must know, the connections of his subject with other parts of mathematics. The intuitive similarities between quotient groups and quotient spaces are, at the very least, valuable signposts, the constituents of "mathematical maturity", wisdom, and experience. A microscopic examination of such similarities might lead to category theory, a subject that is viewed by some with the same kind of suspicion as logic, but not to the same extent." (page 205)
"The logician" here is not "the mathematician"...
"He [Carl E. Linderholm] became famous for a brilliant, witty, extended mathematical in-joke, a book called Mathematics Made Difficult [PDF]. The book treats high school trigonometry, for instance, from the point of view of category theory—I recommend it highly." (page 222)
As reviewed on Amazon by Easwaran, with review title "Categories for the Non-Working Mathematician" (itself a reference to Categories for the Working Mathematician):
This book is a wonderfully humorous satire of the project (possibly pushed by Mac Lane, Lawvere, Grothendieck, and others, though never as far as one might think) to reformulate all of mathematics on category-theoretic foundations. As such, many of the jokes will be lost on a reader with no familiarity with the language of category theory. But there are plenty of other jokes that even a high schooler should be able to appreciate. There's also some entertaining national stereotypes of French mathematicians and others that probably date the book a bit.
Highly recommended for a math graduate student who needs distraction from work.
On page 223 I learned that Halmos reviewed Mathematics and Plausible Reasoning, which I haven't finished reading.
"As for quality [of books being higher than that of articles], that's just a feeling I have. I think that, with extremely rare exceptions, even the book of lowest quality is likely to be correct most of the time, and to be well enough organized and expounded that studying it would add to your mathematical wealth. Articles are more often wrong and very often so badly done that reading them is more work than it is worth. They must exist—don't misunderstand me—I am not advocating that we stop publishing current research papers and go back to the days when "publish" was synonymous with "publish a book". No, papers are absolutely necessary, but books are better." (page 234)
"I am convinced (by faith) that if I knew everything about Boolean algebras I would be very close to knowing everything about analysis—or, to be a little more precise about such a vague article of religion, that I would be as close as a person who knows everything about measurable sets is to knowing everything about measurable functions. Close, yes, but not yet there." (page 245)
I'm not sure I really understand what he's getting at here, but it sounds interesting.
Halmos advocates the Moore method (e.g., page 257). I'd be curious to see the documentary about it, "Challenge in the Classroom"... (Did I see it, once?)
"I think an automobile transmission mechanic should try to be the best automobile transmission mechanic he has the talent to be, and butlers, college presidents, shoe salesmen, and hod-carriers should aim for perfection in their professions. Try to rise, improve conditions if you can, and change professions if you must, but as long as you are a hod-carrier, keep carrying those hods. If you set out to be a mathematician, you must learn the profession, every part of it, and then work at it, profess it, live it as best you can. If you keep asking "what's there in it for me?", you're in the wrong business. If you're looking for comfort, money, fame, and glory, you probably won't get them, but if you keep trying to be a mathematician, you might." (pages 264-268)
I like this whole section and put it up as a separate page: How to be a pro.
"On a Ph.D. oral he [Tamarkin] asked the candidate about the convergence properties of a certain hypergeometric series. "I don't remember", said the student, "but I can always look it up if I need it." Tamarkin was not pleased. "That doesn't seem to be true", he said, "because you sure need it now."" (pages 272-273)
"But let's get back to teaching by challenging. An intrinsic aspect of the method at all levels, elementary or advanced, is to concentrate attention on the definite, the concrete, the specific. Once a student understands, really and truly understands, why 3x5 is the same as 5x3, then he quickly gets the automatic and obvious but nevertheless exciting unshakable generalized conviction that "it goes the same way" for all other numbers. We all have an innate ability to generalize; the teacher's function is to call attention to a concrete special case that hides (and, we hope, ultimately reveals) the germ of the conceptual difficulty." (page 272)
"(Do all readers know that I reject "Hal-mush", some people's notion of the "right" way to pronounce me? Please, please, say "Hal-moss".)" (page 292)
"How to do almost everything" (title of chapter 14, page 319)
"Mathematics is not a deductive science—that's a cliché. When you try to prove a theorem, you don't just list the hypotheses, and then start to reason. What you do is trial and error, experimentation, guesswork. You want to find out what the facts are, and what you do is in that respect similar to what a laboratory technician does, but it is different in its degree of precision and information. Possibly philosophers would look on us mathematicians the same way as we look on the technicians, if they dared." (page 321)
"I love to do research, I want to do research, I have to do research, and I hate to sit down and begin to do research—I always try to put it off just as long as I can." (page 321)
"It is important to me to have something big and external, not inside myself, that I can devote my life to. Gauss and Goya and Shakespeare and Paganini are excellent, their excellence gives me pleasure, and I admire and envy them. They were also dedicated human beings. Excellence is for the few but dedication is something everybody can have—should have—and without it life is not worth living." (pages 321-322)
"Despite my great emotional involvement in work, I just hate to start doing it; it's a battle and a wrench every time. Isn't there something I can (must?) do first? Shouldn't I sharpen my pencils perhaps? In fact I never use pencils, but "pencil sharpening" has become the code phrase for anything that helps to postpone the pain of concentrated attention. It stands for reference searching in the library, systematizing old notes, or even preparing tomorrow's class lecture, with the excuse that once those things are out of the way I'll really be able to concentrate without interruption.
"When Carmichael complained that as dean he didn't have more than 20 hours a week for research I marvelled, and I marvel still. During my productive years I probably averaged 20 hours of concentrated mathematical thinking a week, but much more than that was extremely rare. The rare exception came, two or three times in my life, when long ladders of thought were approaching their climax. Even though I never was dean of a graduate school, I seemed to have psychic energy for only three or four hours of work, "real work", each day; the rest of the time I wrote, taught, reviewed, conferred, refereed, lectured, edited, travelled, and generally sharpened pencils all the ways I could think of. Everybody who does research runs into fallow periods. During mine the other professional activities, down to and including teaching trigonometry, served as a sort of excuse for living. Yes, yes, I may not have proved any new theorems today, but at least I explained the law of sines pretty well, and I have earned my keep." (page 322)
There's more good stuff here... Almost worth putting up the whole section ("How to do research") but I'll stop short of that. In particular interesting that he describes his process as largely writing-driven: "I sit down at my desk, pick up a black ball-point pen, and start writing..." (page 323)
"For Dieudonné the important result is, I think, the powerful general theorem, from which it is easy to infer all the special cases you want; for me the greatest kind of step forward is the illuminating central example from which it is easy to get insight into all the surrounding sweeping generalities." (page 325)
"André Weil's logarithmic law (first-rate people choose first-rate people, but second-rate people elect third-rate ones) works the same way whether the vote concerns a minor addition to the teaching staff or the elevation of a colleague to leadership." (page 349)
Ah; there's a similar thing on page 123 too:
"André Weil suggested that there is a logarithmic law at work: first-rate people attract other first-rate people, but second-rate people tend to hire third-raters, and third-rate people hire fifth-raters. If a dean or a president is genuinely interested in building and maintaining a high-quality university (and some of them are), then he must not grant complete self-determination to a second-rate department; he must, instead, use his administrative powers to intervene and set things right. That’s one of the proper functions of deans and presidents, and pity the poor university in which a large proportion of both the faculty and the administration are second-raters; it is doomed to diverge to minus infinity." (page 123)
A similar sentiment is often attributed to Steve Jobs: "A level people hire level A people, B level people hire C level people."
"Napoleon said that the only thing worse than a bad general is two good ones, and I agree: a bad chairman will hurt a department less than a committee can, even if it consists of competent people full of good intentions." (page 351)
"I believe that the work of the world is done by people, not by committees. Socrates, the teacher, was not a committee, nor was Archimedes, the inventor and research mathematician. The great strides forward (in administration and in finance, as well as in science and in the humanities) have always been made by people, not by committees; Lincoln, Rothschild, Newton, and Goethe bear witness." (page 352)
"I read the Times book reviews because I don't have the time to read all the books that come out, and reading reviews keeps me closer in touch with modern culture than not reading anything at all." (page 375)
"The purpose of the review is to provide a first approximation in three or four pages to what the book does (or could have done) in three or four hundred." (page 375)
"In other words, exposition is intended to attract and describe more than to explain and instruct." (page 390)
"The most difficult technical problem of written communication (whether the intended result is a long novel, a short biography, a research paper, or a recipe for cherry pie) is that of linear order. The way we usually receive information about the universe is multidimensional. We learn about something (or somebody) through signals that come to us simultaneously through our many senses. Our sense of balance tells us one thing, and the way our muscles stretch another; we see it, hear it, feel it, smell it, and taste it; we find it warm or cold, wet or dry. A lecturer uses words, but, at the same time, he controls how fast he speaks and how loudly; his facial expressions, his gestures, and his tone of voice are all part of the show. The most highly distilled form of verbal communication is writing. The only raw material a writer has is his vocabulary, and the presentation of his words in a total order is the only way he has to produce an effect." (page 394)
"You can't be perfect, but if you don't try, you won't be good enough." (page 400)
"Archimedes taught us that a small quantity added to itself often enough becomes a large quantity (or, in proverbial terms, that every little bit helps). When it comes to accomplishing the bulk of the world's work, and, in particular, the work of a mathematician, whether it is proving a theorem, writing a book, teaching a course, chairing a department, or editing a journal, I claim credit for the formulation of the converse: Archimedes's way is the only way to get something done. Do a small bit, steadily every day, with no exception, with no holiday. As an example, I mention the first edition of my Hilbert Space Problem Book, which had 199 problems. I wrote most of the first draft during my Miami year, and I forced myself, compulsively, to write a problem a day. That doesn't mean that it took 199 days to write the whole book—the total came to about three times that many." (pages 401-402)
Somewhere, I'm not sure where now, Halmos talks about being a teacher versus being an example... Interesting to think about: How are these different? What's more useful, when? Is this an apprentice model? How can that be done well? It reminds me of when I heard Botstein talk about how he wouldn't hire a professor who wasn't doing research. At the time I was surprised; I knew so little.
Anna, one of the mathematics department secretaries at Chicago, complained to me once when we met at a party and she had already consumed several gin and tonics. She was a good typist, and she didn't have much trouble picking up the art of technical, mathematical, typing—but she hated every minute of it. You have to look at every symbol separately, you have to keep changing "typits" or Selectric balls, you have to shift a half space up or down for exponents and subscripts, and you have no idea what you're doing. "I'm not going to spend my life flipping that damned carriage up and down", she said.
Another time Bruno, a well-known mathematician, asked me: "How did you manage to make your ten lectures at the CBMS conference all the same length? Isn't it true that some things take longer? When I did it, some of my talks were 45 minutes, and some 75."
More recently still I asked Calvin, a colleague, "Can you give a graduate student a ride to this month's Wabash functional analysis seminar?" "No", he said, "you better get someone else. I've been away, giving colloquium talks twice this month, and that's enough travelling."
I didn't make up any of this (except the names). Do you see what these three stories have in common? To me it's obvious, it jumps to the eye, and it horrifies me. What Anna, Bruno, and Calvin share and express is widespread and bad: it's the "me" attitude. It is the attitude that says "I do only what's important to me, and I am more important to me than the profession."
I think an automobile transmission mechanic should try to be the best automobile transmission mechanic he has the talent to be, and butlers, college presidents, shoe salesmen, and hod-carriers should aim for perfection in their professions. Try to rise, improve conditions if you can, and change professions if you must, but as long as you are a hod-carrier, keep carrying those hods. If you set out to be a mathematician, you must learn the profession, every part of it, and then work at it, profess it, live it as best you can. If you keep asking "what's there in it for me?", you're in the wrong business. If you're looking for comfort, money, fame, and glory, you probably won't get them, but if you keep trying to be a mathematician, you might.
I didn't preach to Calvin, but if I had done so, my sermon would have said that you support an activity such as the Wabash seminar (I'll tell more about that seminar later) without even thinking about it—it's an integral part of professional life. You go to such seminars the way you get dressed in the morning, nod amiably to an acquaintance you pass on the street, or brush your teeth before you go to bed at night. Sometimes you feel like doing it and sometimes not, but you do it always—it's a part of life and you have no choice. You don't expect to get rewarded if you do and punished if you don't, you don't think about it—you just do it.
A professional must know every part of his profession and (we all hope) like it, and in the profession of mathematics, as in most others, there are many parts to know. To be a mathematician you have to know how to be a janitor, a secretary, a businessman, a conventioneer, an educational consultant, a visiting lecturer, and, last but not least, in fact above all, a scholar.
As a mathematician you will use blackboards, and you should know which ones are good and what is the best way and the best time to clean them. There are many kinds of chalk and erasers; would you just as soon make do with the worst? At some lecture halls you have no choice—you must use the overhead projector, and if you don't come prepared with transparencies, your audience will be in trouble. Word processors and typewriters, floppy disks and lift-off ribbons—ignorance is never preferable to bliss. Should you ditto your preprints, or use mimeograph, or multilith, or xerox? Who should make decisions about these trivialities—you, or someone who doesn't care about your stuff at all?
From time to time you'll be asked for advice. A manufacturer will consult you about the best shape for a beer bottle, a dean will consult you about the standing of his mathematics department, a publisher will consult you about the probable sales of a proposed textbook on fuzzy cohomology. Possibly they will be genteel and not even mention paying for your service; at other times they will refer delicately to an unspecified honorarium that will be forthcoming. Would they treat a surgeon the same way, or an attorney, or an architect? Would you? Could you?
I am sometimes tempted to tell people that I am a real doctor, not the medical kind; my education lasted a lot longer than their lawyer's and cost at least as much; my time and expertise are worth at least as much as their architect's. In fact, I do not use tough language, but I've long ago decided not to accept "honoraria" but to charge fees, carefully spelled out and agreed on in advance. I set my rates some years back, when I was being asked to review more textbooks than I had time for. I'd tell the inquiring publisher that my fee is $1.00 per typewritten page or $50.00 per hour, whichever comes to less. Sometimes the answer was: "Oh, sorry, we didn't really want to spend that much", and at other times it was, matter-of-factly, "O.K., send us your bill along with your report". The result was that I had less of that sort of work to do and got paid more respectably for what I still did. My doctor, lawyer, and architect friends tell me that prices have changed since I established mine. The time has come, they say, to double the charges. The answers, they predict, will remain the same: half "no" and half "sure, of course".
A professional mathematician talks about mathematics at lunch and at tea. The subject doesn't have to be hot new theorems and proofs (which can make for ulcers or ecstasy, depending)—it can be a new teaching twist, a complaint about a fiendish piece of student skullduggery, or a rumor about an error in the proof of the four color theorem. At many good universities there is a long-standing tradition of brown-bag (or faculty club) mathematics lunches, and they are an important constituent of high quality. They keep the members of the department in touch with one another; they make it easy for each to use the brains and the memory of all. At a few universities I had a hand in establishing the lunch tradition, which then took root and flourished long after I left.
The pros go to colloquia, seminars, meetings, conferences, and international congresses—and they use the right word for each. The pros invite one another to give colloquium talks at their universities, and the visitor knows—should know—that his duty is not discharged by one lecture delivered between 4:10 and 5:00 p.m. on Thursday. The lunch, which gives some of the locals a chance to meet and have a relaxed conversation with the guest, the possible specialists' seminar for the in-group at 2:00, the pre-lecture coffee hour at 3:00, and the post-lecture dinner and evening party are essential parts of the visitor's job description. It makes for an exhausting day, but that's how it goes, that's what it means to be a colloquium lecturer.
Sometimes you are not just a colloquium lecturer, but the "Class of 1909 distinguished Visitor", invited to spend a whole week on the campus, give two or three mathematics talks and one "general" lecture, and, in between, mingle, consult, and interact. Some do it by arriving in time for a Monday afternoon talk, squeezing in a Tuesday colloquium at a sister university 110 miles down the road, reappearing for a Wednesday talk and a half of Thursday (spent mostly with the specialist crony who arranged the visit), and catching the plane home at 6:05 p.m. Bad show—malfeasance and nonfeasance—that's not what the idea is at all. When Bombieri came to Bloomington, I understood much of his first lecture, almost none of the second, and gave up on the third (answered my mail instead)—but I got a lot out of his presence. I heard him hold forth on meromorphic functions at lunch on Monday, explain the Mordell conjecture over a cup of coffee Wednesday afternoon, and at dinner Friday evening guess at the probable standing of Euler and Gauss a hundred years from now. We also talked about clocks and children and sport and wine. I learned some mathematics and I got a little insight into what makes one of the outstanding mathematicians of our time tick. He earned his keep as Distinguished Visitor, not because he is a great mathematician (that helps!) but because he took the job seriously. It didn't take his talent to do that; that's something us lesser people can do too.
Mathematical talent is probably congenital, but aside from that the most important attribute of a genuine professional mathematician is scholarship. The scholar is always studying, always ready and eager to learn. The scholar knows the connections of his specialty with the subject as a whole; he knows not only the technical details of his specialty, but its history and its present standing; he knows about the others who are working on it and how far they have reached. He knows the literature, and he trusts nobody: he himself examines the original paper. He acquires firsthand knowledge not only of its intellectual content, but also of the date of the work, the spelling of the author's name, and the punctuation of the title; he insists on getting every detail of every reference absolutely straight. The scholar tries to be as broad as possible. No one can know all of mathematics, but the scholar can succeed in knowing the outline of it all: what are its parts and what are their places in the whole?
These are the things, some of the things, that go to make up a pro.
"for someone that doesn't have a lot of experience with machine learning, what books would you recommend starting with? machine learning for dummies?"
Intro to Statistical Learning: The friendlier version of The Elements of Statistical Learning has MOOC videos and a 2nd edition coming soon, all free online. Is there a better intro? Maybe one using Python?
Deep Learning with Python: The creator of Keras provides pretty good explanations and code, as I recall. Is there a better intro? Maybe something using PyTorch?
Statistical Rethinking has great explanations, examples, and philosophical commentary for understanding Bayesian approaches.
Mining of Massive Datasets is much better than its cover, with good coverage of both more and less commonly discussed techniques (and a MOOC).
Thanks to a couple of colleagues who inspired this collection of recommended books!
What else would you add/change in this list?
Ways for the universe to end:
"I knew I would only ever really be able to accept the kind of truth I could rederive mathematically." (page 4)
If this is the standard, you will only accept mathematical truths. The book seems to care quite a bit about observation and measurement, so I assume she's being rather loose with language here.
"Translating redshifts to speeds, the pattern Hubble detected meant that the more distant a galaxy, the faster it's receding from us." (page 57)
As explained in the book, it seems like this could be read "the faster it was receding from us," because light from far away is quite old. If there was an explanation of how we work out how fast we think those distant things are moving away now, I didn't catch it.
This is particularly perplexing because the book explains that it looks like (by redshift) very distant things are receding fastest, less distant things look to be receding less fast, and the nearest things look to be coming toward us. So my naive thought is: How do we know that the distant things aren't currently coming toward us now?
There is presumably some clever way to work this out?
"We glossed over this a bit in the previous chapter, but it turns out that even just determining the past expansion rate is far more difficult than it seems like it has any right to be." (page 115)
That introduces the distance ladder, but I think still doesn't address the above.
"Just like a curved fiber-optic cable can make the light inside it turn a corner, a massive object bending space can cause light to curve around it." (page 68)
It isn't quite "just like" that, I think... I mean, the trampoline analogy isn't good either...
"Measuring distance accurately over billions of light-years, however, is a lot harder [than measuring redshift]." (page 69)
This, I think, is very true. What does distance even mean?
The play Arcadia, quoted in the epigraph of Chapter 4, seems neat.
"To throw some more terminology into the mix, an evolving (i.e., nonconstant) dark energy is often called quintessence, after the "fifth element," a mysterious something-or-other that was popular to philosophize about in the Middle Ages and is not really much more precisely specified now." (page 80)
Leeloo Dallas Multipass
"As of this writing, supernova measurements allow us to measure the Hubble Constant to an accuracy of 2.4 percent."
"Which is weird, because the number we get totally disagrees with the value of the exact same number we derive from looking at the cosmic microwave background." (pages 125-126)
How does the data suggest that we're in a local minimum vacuum? How can we tell that?
"This has led physicists to suggest solutions ranging from abstract arguments aimed at narrowing down the total range of possible universes to philosophical debates about how to make advances in areas of theory in which experimental evidence may never appear." (page 161)
The epigraph for Chapter 8 quotes Hozier's "No Plan" but I prefer their "Nobody." The drums are bananas.
""By thinking about the end of the universe, just like with its beginning, you can sharpen your own thinking about what you think is happening now, and how to extrapolate. I feel like extrapolations in fundamental physics are essential," says Hiranya Peiris, a cosmologist at University College London." (page 179)
"But personally, I still feel there's a big difference, in some emotional sense, between "we go on forever" and "we don't." Nima Arkani-Hamed feels the same way. "At the absolute, absolute deepest level ... whether or not people explicitly admit to thinking about it or not (and if they don't they're all the poorer for it) ... If you think there is a purpose to life, then I at least don't know how to find one that doesn't connect to something that transcends our little mortality," he tells me. "I think a lot of people at some level—again, either explicitly or implicitly—will do science or art or something because of the sense that you do get to transcend something. You touch something eternal. That word, eternal: very important. It's very, very, very important."" (page 207)
"Thinking generatively—how the data could arise—solves many problems. Many statistical problems cannot be solved with statistics. All variables are measured with error. Conditioning on variables creates as many problems as it solves. There is no inference without assumption, but do not choose your assumptions for the sake of inference. Build complex models one piece at a time. Be critical. Be kind." (page 553)
Notes by chapter:
statistics, models, science
"What researchers need is some unified theory of golem engineering, a set of principles for designing, building, and refining special-purpose statistical procedures. Every major branch of statistical philosophy possesses such a unified theory. But the theory is never taught in introductory—and often not even in advanced—courses. So there are benefits in rethinking statistical inference as a set of strategies, instead of a set of pre-made tools." (page 4)
I think there's a little bit of linguistic confusion between scientific hypotheses ("I think the world works like...") and statistical hypotheses (also called "Statistical models" in Figure 1.2).
"... deductive falsification never works." (page 4)
I think this is too strong.
"modus tollens, which is Latin shorthand for “the method of destruction.”" (page 7)
also: importance of measurement
He has some neat examples of evidence that's not trivial to infer from - whether the ivory-billed woodpecker was extinct, and faster-than-light neutrinos... To these, also add the Piltdown Man fraud.
Bayes' theorem
Neat counting example in 2.1! I tried to write up a similar example (inspired by Gelman) some years ago...
In note 41, from page 24, McElreath advocates Cox-style probability.
I also wrote up (still years ago) a cute example trying to explain Bayes' rule, but I think it's pretty crummy relative to his development through sections 2.1.2 and 2.1.3.
I kind of miss seeing "evidence" in Bayes' rule... Maybe I like this, with "explanation" for the other term?
P(explanation|evidence) = P(evidence|explanation) * P(explanation)
----------------------------------------
P(evidence)
(The P's everywhere would kind of obfuscate the nice counting development he was using, but still...)
Then, note that P(evidence|explanation) is the "likelihood" of the evidence, and that we're going to talk about that a lot.
Ah, here on page 37 is his version:
Posterior = Probability of the data * Prior
-------------------------------
Average probability of the data
Also nice! He reminds me that I'm using "evidence" (above) in a different way from using it to mean the denominator there...
There's also the way of doing it that's more like this:
            Probability of the data
Posterior = ------------------------------- * Prior
            Average probability of the data
And then we can talk about the first term as the likelihood ratio, which is kind of nice, but makes it less clear that the denominator is a normalizer that can often be mostly ignored...
Likelihood ratio is a nice thing to think about, especially in connection with Polya's "plausible reasoning"... Evidence that is only consistent with the explanation (and no other) increases confidence a lot.
There's also a nice connection to the error mode of getting the denominator wrong and jumping to conclusions when you don't know of another possible explanation. "I didn't think you were planning a surprise party!" etc.
I don't really like "average probability of the data" as a term, I think...
On page 39, he doesn't include Hamiltonian Monte Carlo as one of the "engines"... Is it a type of MCMC? Ah, yes.
Oh no! Very ugly page break from 42 to 43, with the header of a table separated from its contents...
Interesting; really not explaining what's going on with dbeta, conjugate priors, etc... Probably fine?
Wow! I do not understand how this Metropolis algorithm on page 45 works! I guess I can wait until Chapter 9.
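In the meantime, here's my own minimal Python sketch of a random-walk Metropolis sampler, targeting the posterior for 6 successes in 9 trials with a flat prior (the function names and the boundary-reflection trick are mine, not the book's):

```python
import math
import random

random.seed(2)

def binom_pmf(k, n, p):
    """Binomial likelihood, the Python analogue of R's dbinom."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def metropolis(n_samples=10000, k=6, n=9, step=0.1):
    """Random-walk Metropolis on the posterior of p given k successes in
    n trials with a flat prior (so the target is just the likelihood)."""
    samples = []
    p = 0.5
    for _ in range(n_samples):
        # propose a nearby value, reflecting off the [0, 1] boundaries
        p_new = abs(p + random.gauss(0, step))
        if p_new > 1:
            p_new = 2 - p_new
        # accept with probability (target at proposal) / (target at current)
        if random.random() < binom_pmf(k, n, p_new) / binom_pmf(k, n, p):
            p = p_new
        samples.append(p)
    return samples

samples = metropolis()
mean_p = sum(samples) / len(samples)
# the true posterior here is Beta(7, 4), whose mean is 7/11 ≈ 0.64
```

The part that makes it feel like magic: the acceptance ratio only needs the target up to a constant, so the awkward denominator ("average probability of the data") never has to be computed.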
Ooh fun, some people have problem solutions online... Here's one:
priors and posteriors
I'm reading Ellenberg's How Not to Be Wrong, and he says on page 49: "In mathematics, you very seldom get the clearest account of an idea from the person who invented it."
I have that feeling in connection with Gelman and Pearl (not sure they completely invented things they're associated with, but still): I feel like McElreath is doing a better job of explaining things, and it's super nice.
Also Ellenberg:
"If a tiny state like South Dakota experiences a rash of brain cancer, you might presume that the spike is in large measure due to luck, and you might estimate that the rate of brain cancer in the future is likely to be closer to the overall national number. You could accomplish this by taking some kind of weighted average of the South Dakota rate with the national rate. But how to weight the two numbers? That's a bit of an art, involving a fair amount of technical labor I'll spare you here." (pages 70-71)
I think he's referring to multi-level modeling, in the Gelman style.
This common medical testing scenario appeared in a recent LearnedLeague one-day on statistics:
"Suppose that 1% of a population has a particular genetic mutation, and a test for the mutation is 99% accurate for both positive and negative cases. In other words, if someone with the mutation takes the test, there is a 99% chance that the test comes back positive; if someone without the mutation takes the test, there is a 99% chance that the test comes back negative. If a randomly-selected person takes the test and gets a positive result, what is the probability that the person actually has the mutation? (Express your answer as a fraction in lowest terms.)"
I solved it by seeing that 0.01 * 0.99 == 0.99 * 0.01, which is sort of like what McElreath says is called "frequency format" or "natural frequencies." I definitely thought of it in terms of "quantity," but as percentages rather than counts. I was surprised when Erica referred to the problem as "the Bayesian" problem, because I hadn't thought of it that way. So I agree with McElreath that it isn't uniquely Bayesian.
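For concreteness, here's that arithmetic both as Bayes' rule and as natural frequencies (a quick Python check of my own; the numbers come from the quoted problem):

```python
# Posterior probability of having the mutation given a positive test.
prevalence = 0.01    # P(mutation)
sensitivity = 0.99   # P(positive | mutation)
specificity = 0.99   # P(negative | no mutation)

# Bayes' rule: P(mut | pos) = P(pos | mut) P(mut) / P(pos)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive

# Natural frequencies: of 10,000 people, 100 have the mutation and 99 of
# them test positive; of the 9,900 without it, 99 also test positive.
true_positives = 99
false_positives = 99
frequency_answer = true_positives / (true_positives + false_positives)

print(posterior, frequency_answer)  # both are 1/2
```

Either way the answer is 1/2: exactly half of the people who test positive actually have the mutation.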
"Changing the representation of a problem often makes it easier to address or inspires new ideas that were not available in an old representation. In physics, switching between Newtonian and Lagrangian mechanics can make problems much easier. In evolutionary biology, switching between inclusive fitness and multilevel selection sheds new light on old models. And in statistics, switching between Bayesian and non-Bayesian representations often teaches us new things about both approaches." (page 50)
"I avoid discussing the analytical approach [of conjugate priors, etc.] in this book, because very few problems are so simple that they have exact analytical solutions like this [the beta-binomial conjugate prior]." (page 560, note for page 51)
The "Why statistics can't save bad science" box on page 51 is neat.
Just to establish equivalence between R and Python...
dbinom(6, size=9, prob=0.5)
## [1] 0.1640625
import scipy.stats
scipy.stats.binom(n=9, p=0.5).pmf(6)
## 0.16406250000000006
Interesting: using "compatibility interval" rather than "credible interval" (or "confidence interval") in the sense of "compatible with the model and data." (page 54)
"Overall, if the choice of interval type [percent interval or highest posterior density interval] makes a big difference, then you shouldn't be using intervals to summarize the posterior." (page 58)
"There is no way to really be sure that software works correctly." (page 64)
Hmm; his HPDI (Highest Posterior Density Interval) implementation itself relies on the implementation in coda...
How hard is this really to implement? If you have a histogram or just sorted counts, every left point determines one interval, so you could do it in one pass with a little farting around to find the right point each time, and a running smallest interval... Really not so computation-intensive.
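That one-pass idea, sketched in Python (my own throwaway version, not what coda does):

```python
def hpdi(samples, prob=0.89):
    """Narrowest interval containing `prob` of the samples: sort, slide a
    window of fixed size across the sorted values, keep the smallest."""
    s = sorted(samples)
    n = len(s)
    k = max(1, round(prob * n))  # number of samples per candidate interval
    best = (s[0], s[k - 1])
    for left in range(n - k + 1):
        lo, hi = s[left], s[left + k - 1]
        if hi - lo < best[1] - best[0]:
            best = (lo, hi)
    return best

# With a skewed sample, the narrowest 80% interval hugs the bulk:
print(hpdi([1, 2, 3, 4, 100], prob=0.8))  # (1, 4)
```

After sorting, it really is a single pass over the candidate left endpoints, so the cost is dominated by the sort.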
birth1 <- c(1,0,0,0,1,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,1,0,0,0,1,0,
0,0,0,1,1,1,0,1,0,1,1,1,0,1,0,1,1,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,
1,1,0,1,0,0,1,0,0,0,1,0,0,1,1,1,1,0,1,0,1,1,1,1,1,0,0,1,0,1,1,0,
1,0,1,1,1,0,1,1,1,1)
birth2 <- c(0,1,0,1,0,1,1,1,0,0,1,1,1,1,1,0,0,1,1,1,0,0,1,1,1,0,
1,1,1,0,1,1,1,0,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,0,1,1,0,1,1,0,1,1,1,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,0,1,1,
0,0,0,1,1,1,0,0,0,0)
table(birth1, birth2)
## birth2
## birth1 0 1
## 0 10 39
## 1 30 21
So really, what is up with this data?
linear regression
"Linear regression is the geocentric model of applied statistics." (page 71)
Frank's The common patterns of nature seems pretty neat, getting into how common distributions come from processes and information considerations...
"Multiplying small numbers is approximately the same as addition." (page 74)
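As I read it, the point is that a product of numbers near 1 is approximately 1 plus the sum of the small deviations, because the cross terms are tiny. A quick check (mine, not the book's):

```python
import random

random.seed(0)

# twelve small random growth rates, each under 1%
epsilons = [random.uniform(0, 0.01) for _ in range(12)]

product = 1.0
for e in epsilons:
    product *= 1 + e          # multiplying (1 + e1)(1 + e2)...

additive = 1 + sum(epsilons)  # ...is nearly the same as adding the e's

print(product, additive)
```

Which is why products of many small growth effects end up looking Gaussian, just like sums do.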
On page 76 he shows "precision" as τ, meaning 1/σ^2, and it shows up in the equation for the Gaussian with π, which is an example of notation that doesn't work particularly well with the tau manifesto.
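For reference, the precision parameterization in question is the standard one (my transcription, not the book's exact typesetting):

```latex
p(y \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\!\left( -\frac{\tau (y - \mu)^2}{2} \right),
\qquad \tau = \frac{1}{\sigma^2}
```

So τ sits right next to 2π in the normalizing constant, which is where the clash with the tau manifesto's τ = 2π comes from.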
"procrustean" (on page 77): "(especially of a framework or system) enforcing uniformity or conformity without regard to natural variation or individuality."
I like the spark histograms!
Oh, neat, they're even called "histosparks"...
And I might have guessed... they're from Hadley.
So there are unicode characters that do blocks of various sizes, by eighths... It looks like Hadley only uses some of them:
sparks <- c("\u2581", "\u2582", "\u2583", "\u2585", "\u2587")
# 1/8 2/8 3/8 5/8 7/8
Can look these up for example here: https://www.fileformat.info/info/unicode/char/2581/index.htm
" ▁▂▃▄▅▆▇█" has all the heights, with a normal blank at the beginning.
So why does Hadley only use some of the available heights? Not sure.
Oh look at that! In my terminal those all look fine, but in a browser (maybe it depends on font?) the half and full blocks go lower than the others! Still doesn't explain why 6/8 is missing from Hadley's list... Maybe it looks bad in other fonts?
Let's try it fixed-width:
▁▂▃▄▅▆▇█
Yup, looks much nicer in fixed width.
Here's another nice place to see these: https://en.wikipedia.org/wiki/Block_Elements
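For fun, here's a toy spark-histogram that uses the full eighth-block set (my sketch, not histospark's actual code):

```python
BLOCKS = " ▁▂▃▄▅▆▇█"  # blank plus all eight eighth-block heights

def spark(counts):
    # Scale each count against the tallest bar and pick one of nine heights.
    top = max(counts) or 1
    return "".join(BLOCKS[round(8 * c / top)] for c in counts)

spark([0, 1, 2, 4, 8, 4, 2, 1])  # " ▁▂▄█▄▂▁"
```

Looks best in a fixed-width font, of course, for the same reasons as above.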
"E. T. Jaynes (1922-1988) called this the mind projection fallacy, the mistake of confusing epistemological claims with ontological claims." (page 81)
And a fun reference to Monkeys, Kangaroos, and N:
"... I think you will find that 90% of the past confusions and controversies in statistics have been caused, not by mathematical errors or even ideological differences; but by the technical difficulty that the two parties had different problems in mind, and failed to realize this. Thinking along different lines, each failed to perceive at all what the other considered too obvious to mention." (Jaynes)
"There's also a tradition called dimensionless analysis that advocates constructing variables so that they are unit-less ratios." (page 94)
I haven't heard about this as such, I think. Dimension_al_ analysis is more well known, but not quite the same thing...
Interesting to recall that in the first edition, what's now called quap (quadratic approximation posterior / a posteriori?) was called map (maximum a posteriori?).
"My experience is that many natural and social scientists have naturally forgotten whatever they once knew about logarithms." (page 98)
"... most social and natural scientists have never had much training in probability theory and tend to get very nervous around ∫'s." (page 106)
He repeats it in different ways here and there, but I noted it again on page 107: I like his effort at clarity between "small world" and "large world" claims, where small world is "assuming the model" or "in the world of the model."
When doing the quadratic example, he z-scores but does not decorrelate... The default behavior in R (using poly) is to "compute orthogonal polynomials"... I'm not sure how common that is elsewhere.
Okay I'll look at sklearn... Here's somebody with a nice Python implementation: http://davmre.github.io/blog/python/2013/12/15/orthogonal_poly But as far as I can tell there isn't anything "built in" for Python...
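As far as I can tell there's nothing built in, but a rough equivalent of R's poly(x, degree) falls out of a QR decomposition of the raw polynomial basis. This is a sketch; R's poly() uses a different normalization, so columns will only match up to scale and sign.

```python
import numpy as np

def ortho_poly(x, degree):
    # Build the raw (centered) polynomial basis, then orthonormalize its
    # columns with QR, dropping the constant column like poly() does.
    x = np.asarray(x, dtype=float)
    X = np.vander(x - x.mean(), degree + 1, increasing=True)
    Q, _ = np.linalg.qr(X)
    return Q[:, 1:]

P = ortho_poly(np.arange(10), 2)
# The columns are orthonormal: P.T @ P is the 2x2 identity.
```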
page 111: weight.s is used in one listing, while weight_s is used in another, which is a very mild kind of inconsistency. (PR)
"We should feel embarrassed to use [linear models], just so we don't become satisfied with the phenomenological explanations they provide." (page 113)
I really liked section 4.5.2 on splines; I don't think I ever saw a good explanation of splines before.
"Matrix algebra is a stressful topic for many scientists." (page 119)
In both R listings 4.76 and 4.79, it's a little unintuitive to me that w is a vector; it doesn't seem obvious from the model code. In w ~ dnorm(0, 10), that dnorm returns just one number. Somewhere quap is figuring out how many elements it needs, I guess?
For question 4H8 on page 122, it asks what else would have to change if the intercept was removed from the model. I think the answer is just the priors on the other coefficient(s), since they'd have to get the mean all the way to where it needs to be by themselves then. And/or maybe the data couldn't be centered, because making the mean zero would really hurt the ability to have the result be right? It would still be okay if both x and y were centered, at least for simple designs.
"... introduce graphical causal models as a way to design and interpret regression models." (page 124)
"About one thing, however, there is general agreement: Causal inference always depends upon unverifiable assumptions." (page 124)
"Think before you regress" (page 128)
In the first paragraph of 5.1.1, I don't really see how Figure 5.2 tells us that only one of the predictor variables has a causal influence...
I really like dagitty. Learning about it is one of the best things in the book, in that I was wishing to find such software while reading The Book of Why but didn't.
It is a little weird that the web interface uses ⊥ (falsum)... Hmm; looking it up, it seems it's the same symbol as \perp, which is used for independence. Ah! The "double tack up" (⫫) is for conditional independence! The web interface still uses ⊥ for both kinds of independence.
"This is very weird notation and any feelings of annoyance on your part are justified." (page 130)
The coeftab visualization (see page 133) is pretty nice.
It took me a little bit to understand what he was getting at with the "predictor residual plots" (page 135) but I'm glad I did, since it connects to one of his main points about how multiple regression is about how much a variable adds given all the other variables.
"Usually answers to large world questions about truth and causation depend upon information not included in the model." (page 139)
The section 5.2 "Masked relationship" is neat.
"Taking the log of a measure translates the measure into magnitudes." (page 148)
What use of "magnitude" is this? Hmm... Looks like star brightness is done via a log that is called magnitude... Just weird, because in other domains "magnitude" refers to the un-logged value...
Seems like this is less surprising to others, and it makes sense as "order of magnitude."
"A set of DAGs with the same conditional independencies is known as a Markov equivalence set." (page 151)
I was unfamiliar with Melanesia.
On page 155, he makes index variables seem fundamentally different from indicator variables. Their notation in quap is different (and nicer), but fundamentally the only difference is that with index variables you drop the intercept term (or equivalently, you have separate intercept terms for each thing). Just reading through, I initially thought his index variables were a real novelty, but they're not. (I'm still curious about where he says on page 156 "It is also important to get used to index variables, because multilevel models (Chapter 13) depend upon them.")
"The mistake of accepting the null hypothesis." (page 158)
Question 5E3 on page 159 jokes (I think?) about the effects of amount of funding and size of laboratory on time to PhD, but I'm not sure I know what he thinks is funny...
On page 161 he starts with Berkson's Paradox, suggesting "selection-distortion effect" as a better name. Dave suggested a nice example: shorter basketball players in the NBA are better 3 point shooters.
"Let's begin with the least of your worries: multicollinearity." (page 163)
This can "smear out" your estimates, because it isn't clear which variables to put beta weight on.
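A quick simulation shows the smearing (my sketch): with two nearly identical predictors, the individual coefficients are poorly determined, but their sum still recovers the total effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
y = x1 + rng.normal(size=n)               # truth: only x1 matters

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b[1] and b[2] individually can be large and of opposite sign, but
# b[1] + b[2] stays close to the true total effect of 1.
```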
Section 6.2 (page 170) uses "post-treatment bias" to refer to what I might call a mediator, and what he later calls a "pipe" situation (page 185).
"The "d" [in d-separation] stands for directional." (page 174)
"You'll often see the "d" in d-separation defined as "dependency." That would certainly make more sense. But the term d-separation comes from a more general theory of graphs. Directed graphs involve d-separation and undirected graphs involve instead u-separation." (page 563)
"No statistical procedure can substitute for scientific knowledge and attention to it." (page 175)
"If a procedure cannot figure out the truth in a simulated example, we shouldn't trust it in a real one." (page 177)
"So I'm sorry to say that we also have to consider the possibility that our DAG may be haunted." (page 180)
The thing haunting it (as in the title of the chapter) is unmeasured causes that induce collider bias, which means conditioning on things we have can induce bias about effects we're trying to measure.
I like "the four elemental confounds" on page 185. They're pretty similar to the cases I included in What should be in your regression?.
The explanation of shutting the back-door on pages 184-185 is better than others I've seen, I think. And then on page 186 he shows how everybody's favorite dagitty can do it automatically.
Front-door isn't mentioned until page 460.
On page 186, he says "Conditioning on C is the better idea, from the perspective of efficiency, since it could also help with the precision of the estimate of X➔Y." This seems reasonable since C is closer to Y, but I feel like a little more explanation wouldn't have been a bad thing here.
"In fact, domain specific structural causal models can make causal inference possible even when a DAG with the same structure cannot decide how to proceed." (page 188)
Say more? Like, a footnote? An endnote? Some kind of reference to more information? Seems so mysterious!
"Sometimes, in order to avoid multicollinearity, people inspect pairwise correlations among predictors before including them in a model. This is a bad procedure, because what matters is the conditional association, not the association before the variables are included in the model." (page 189)
overfitting
McElreath has slides and video of his lectures online.
"This chapter describes some of the most commonly used tools for coping with this trade-off." (page 191, referring to the trade-off between simplicity and accuracy)
There's some parallel between statistical models and scientific models generally; see Characteristics of good theories.
"... when we design any particular statistical model, we must decide whether we want to understand causes or rather just predict." (page 192)
"Stargazing" is a cute way to criticize fixation on stars that represent statistical significance. (page 193)
On page 194 he uses "hominin" which I wasn't familiar with. Hominins refers to humans and chimps. Add gorillas and you get hominines. Add orangutans to that and you get hominids.
"In fact, Carl Friedrich Gauss originally derived the OLS procedure in a Bayesian framework." (page 196)
He loves pointing out this kind of thing.
"The point of this example is not to praise R^{2} but to bury it." (page 197)
This alludes to Shakespeare's famous Marc Antony speech in Julius Caesar: "I come to bury Caesar, not to praise him."
"This means the actual empirical variance, not the variance that R returns with the var function, which is a frequentist estimator and therefore has the wrong denominator." (page 197)
Saucy!
"... model fitting can be considered a form of data compression. ... This view of model selection is often known as minimum description length (MDL)." (page 201)
Wikipedia says "In its most basic form, MDL is a model selection principle: the shortest description of the data is the best model."
McElreath points to Grünwald's The Minimum Description Length Principle.
He's trying to develop "out-of-sample deviance" in 7.2 "Entropy and accuracy" starting page 202.
"Likelihood" as in the likelihood of the data, given the model, on page 204. And he really likes it:
"If you see an analysis using something else, either it is a special case of the log scoring rule or it is possibly much worse."
Interesting "Rethinking" box on page 204:
"Calibration is overrated. ... The problem is that calibrated predictions do not have to be good."
It has an endnote on page 563 that includes:
"Strictly speaking, there are no "true" probabilities of events, because probability is epistemological and nature is deterministic."
On page 207 he points out that when probability is zero, L'Hopital's rule gives us 0*log(0) = 0.
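That convention is easy to bake into an entropy function directly, so sums don't blow up at p = 0 (a small sketch):

```python
import math

def xlogx(p):
    # By the limit p*log(p) -> 0 as p -> 0, define 0*log(0) = 0.
    return 0.0 if p == 0 else p * math.log(p)

def entropy(ps):
    return -sum(xlogx(p) for p in ps)

entropy([1.0, 0.0])  # 0.0: a certain event carries no entropy
```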
Endnote 110 on page 564 begins:
"I really wish I could say there is an accessible introduction to maximum entropy, at the level of most natural and social scientists' math training. If there is, I haven't found it yet."
On page 207 he just says:
"So Bayesian updating is entropy maximization."
He just says "divergence" to mean Kullback-Leibler divergence, and adds in endnote 111 on page 564:
"For what it's worth, Kullback and Leibler make it clear in their 1951 paper that Harold Jeffreys had used this measure already in the development of Bayesian statistics."
There he goes again!
"In plainer language, the divergence is the average difference in log probability between the target (p) and model (q). This divergence is just the difference between two entropies: The entropy of the target distribution p and the cross entropy arising from using q to predict p."
"At this point in the chapter, dear reader, you may be wondering where the chapter is headed." (page 209)
"It's as if we can't tell how far any particular archer is from hitting the target, but we can tell which archer gets closer and by how much." (page 209)
"To compute this [log-probability] score for a Bayesian model, we have to use the entire posterior distribution. Otherwise, vengeful angels will descend upon you." (page 210)
His package has lppd for "log-pointwise-predictive-density."
"It is also quite common to see something called the deviance, which is like a lppd score, but multiplied by -2 so that smaller values are better. The 2 is there for historical reasons."
There's more explanation in endnote 112 on page 564:
"In non-Bayesian statistics, under somewhat general conditions, a difference between two deviances has a chi-squared distribution. The factor of 2 is there to scale it the proper way."
(Recall we're interested in the difference between these things; they don't have a meaningful scale on their own.)
I was briefly befuddled by the positive log-likelihoods on page 210, but of course it's the point density, not probability, and the density can be greater than one.
On page 211 he talks about log_sum_exp which "takes all the log-probabilities for a given observation, exponentiates each, sums them, then takes the log. But it does this in a way that is numerically stable."
I had cause to do this recently! It comes down to this:
import math

def sum_log_prob(a, b):
    return max(a, b) + math.log1p(math.exp(0 - abs(a - b)))
I based that on a post from Kevin Karplus.
McElreath's is:
log_sum_exp <- function( x ) {
    xmax <- max(x)
    xsum <- sum( exp( x - xmax ) )
    xmax + log(xsum)
}
(Found on rdrr.)
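The reason for the max-shift becomes obvious with very negative log-probabilities, where the naive version underflows (a quick sketch):

```python
import math

def log_sum_exp(xs):
    # Stable log(sum(exp(x))): factor the max out before exponentiating.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

xs = [-1000.0, -1000.0]
# Naively, math.exp(-1000.0) underflows to 0.0, so the sum is 0.0 and
# math.log(0.0) raises a ValueError. The shifted version is fine:
log_sum_exp(xs)  # -1000.0 + log(2)
```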
"That [two-parameter] model does worse in prediction than the model with only 1 parameter, even though the true model does include the additional predictor. This is because with only N=20 cases, the imprecision of the estimate for the first predictor produces more error than just ignoring it." (page 213)
"When you encounter multilevel models in Chapter 13, you'll see that their central device is to learn the strength of the prior from the data itself. So you can think of multilevel models as adaptive regularization, where the model itself tries to learn how skeptical it should be." (page 216)
"Statisticians often make fun of machine learning for reinventing statistics under new names. But regularization is one area where machine learning is more mature. Introductory machine learning courses usually describe regularization. Most introductory statistics courses do not." (page 216)
Section 7.4 (page 217) is "predicting predictive accuracy."
"It is a benign aspect of the universe that this importance [of individual examples] can be estimated without refitting the model." (page 217)
"PSIS" is "Pareto-smoothed importance sampling cross-validation." (page 217)
"For ordinary linear regression with flat priors, the expected overfitting penalty is about twice the number of parameters." (page 219)
"AIC is of mainly historical interest now." (page 219)
It seems like WAIC can only be used when you have a posterior distribution, since it relies on variance of those predictions...
"... in the natural and social sciences the models under consideration are almost never the data-generating models. It makes little sense to attempt to identify a "true" model." (page 221)
"Watanabe recommends computing both WAIC and PSIS and contrasting them. If there are large differences, this implies one or both criteria are unreliable.
"Estimation aside, PSIS has a distinct advantage in warning the user about when it is unreliable." (page 223)
"A very common use of cross-validation and information criteria is to perform model selection, which means choosing the model with the lowest criterion value and then discarding the others. But you should never do this." (page 225)
Endnote 133 references The Decline of Violent Conflicts: What Do The Data Really Say? which is interesting.
"This chapter has been a marathon." (page 235)
And then the chapter summary doesn't even mention cross-validation!
Acronyms:
Practice problem 7E1: State the three motivating criteria that define information entropy.
Practice problem 7E2: Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads 70% of the time. What is the entropy of this coin?
Well, entropy is the negative sum of p*log(p), so:
import math
# Truth (as in Problem 7E1)
p = [0.7, 0.3]
# Entropy, H(p)
H = lambda p: -sum(p_i * math.log(p_i) for p_i in p)
H(p) # 0.6108643020548935
# Candidate "models"
q = [0.5, 0.5]
r = [0.9, 0.1]
# Cross-Entropy, H(p, q), xH here because Python
xH = lambda p, q: -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))
xH(p, q) # 0.6931471805599453
xH(p, r) # 0.764527888858692
# KL Divergence, D(p, q)
D = lambda p, q: sum(p_i * math.log(p_i/q_i) for p_i, q_i in zip(p, q))
D(p, q) # 0.08228287850505178
D(p, r) # 0.15366358680379852
# D(p, q) = H(p, q) - H(p)
D(p, q) == xH(p, q) - H(p) # True
# We wish we could do this (use D) but we can't, because we don't have p.
# Data
d = [0, 0, 1]
# Log probability (likelihood) score
S = lambda d, p: sum(math.log(p[d_i]) for d_i in d)
S(d, q) # -2.0794415416798357
S(d, r) # -2.513306124309698
# True vs. predictive
S(d, p) # -1.917322692203401
S(d, [2/3, 1/3]) # -1.9095425048844388
# Deviance
deviance = lambda d, p: -2 * S(d, p)
# Positive log likelihoods! Gasp!
# Note the log probabilities here are really probabilities, because
# I'm just using point estimates, not real distributions. Really,
# you'll have densities, which can be greater than one.
"Information criteria construct a theoretical estimate of the relative out-of-sample KL divergence." (page 219)
And he really likes them, largely forgetting about cross-validation.
interactions
Propeller marks on manatees are unpleasant, but DID YOU KNOW you see those marks so much because they don't kill the manatees, so they're still around to be seen? Manatees are mostly killed by blunt force thwacking by the hulls of boats, not their propellers.
"Using GDP to measure the health of an economy is like using heat to measure the quality of a chemical reaction." (endnote 138, page 565)
Why not split data to condition on some categorical variable? (page 241)
On page 245, he explains (again?) that using indicator variables is bad in the sense that it implies more uncertainty in the indicated class (uncertainty of baseline, plus uncertainty of indicator's coefficient).
On using fancy Greek letters in your model specification:
"If your reader cannot say the symbol's name, it could make understanding the model harder." (page 249)
Section 8.2 (page 250) on "Symmetry of interactions" is pretty neat.
"There is just no way to specify a simple, linear interaction in which you can say the effect of some variable x depends upon z but the effect of z does not depend upon x." (page 256)
In endnote 142, McElreath recommends Grafen and Hails' Modern Statistics for the Life Sciences, saying "It has a rather unique geometric presentation of some of the standard linear models." The book has the somewhat surprising subtitle of "Learn to analyse your own data".
Main effects vs. interaction effects.
On weakly informative priors:
"If you displayed these priors to your colleagues, a reasonable summary might be, "These priors contain no bias towards positive or negative effects, and at the same time they very weakly bound the effects to realistic ranges."" (page 260)
"While you can't see them in a DAG, interactions can be important for making accurate inferences." (page 260)
"Researchers rely upon random numbers for the proper design of experiments." (page 263)
In an endnote, McElreath recommends Kruschke's Doing Bayesian Data Analysis, and it seems like it might be good.
"the combination of parameter values that maximizes posterior probability, the mode, is not actually in a region of parameter values that are highly plausible." (page 269)
"we need MCMC algorithms that focus on the entire posterior at once, instead of one or a few dimensions at a time like Metropolis and Gibbs. Otherwise we get stuck in a narrow, highly curving region of parameter space." (page 270)
"It appears to be a quite general principle that, whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought." (page 270, quoting E. T. Jaynes)
"[The U-turn problem] just shows that the efficiency of HMC comes with the expense of having to tune the leapfrog steps and step size in each application." (page 274)
"Fancy HMC samplers ... choose the leapfrog steps and step size for you ... by conducting a warmup phase in which they try to figure out which step size explores the posterior efficiently. If you are familiar with older algorithms like Gibbs sampling, which use a burn-in phase, warmup is not like burn-in." (page 274)
"Indeed, it may be that no one fully understands [the principle of maximum entropy]." (page 303)
"[The exponential] distribution is the core of survival and event history analysis, which is not covered in this book." (page 315)
"... no regression coefficient ... from a GLM ever produces a constant change on the outcome scale. ... every predictor essentially interacts with itself, because the impact of a change in a predictor depends upon the value of the predictor before the change. More generally, every predictor variable effectively interacts with every other predictor variable, whether you explicitly model them as interactions or not." (page 318)
"Link functions are assumptions." (page 319)
He suggests sensitivity assumptions, presumably including trying different link functions, which I think is the closest he comes to talking about probit regression.
"... even a variable that isn't technically a confounder can bias inference, once we have a link function." (page 320)
"Parameter estimates do not by themselves tell you the importance of a predictor on the outcome." (page 320)
"... a big beta-coefficient may not correspond to a big effect on the outcome." (page 320)
He also points out on page 320 that with a different likelihood (and so link) function, you can't compare log likelihoods (etc.) any more because there's an (unknown) normalization constant that's different between them.
GLMs for counts
"As described in Chapter 10, the Poisson model is a special case of binomial." (page 323)
This is a little loose, maybe; Poisson is the limit of binomial, which isn't quite "a special case" I think...
Logistic regression as a special case of binomial regression, okay.
"There are many ways to construct new variables like this, including mutant helper functions." (page 327)
Mutant helper functions? Is this a common term?
"Let's look at these on the outcome scale:" (page 330)
He shows a table that includes logistic regression coefficients, but there's really no attempt to interpret them directly, which is different from some presentations of logistic regression. He goes directly to showing things on the probability scale. He does then show some plots on the coefficient scale, and describes them as being "on the logit scale," but still not a lot of effort spent on connecting them to changes in odds etc.
"counting the rows in the data table is no longer a sensible way to assess sample size." (page 340)
(When using data that is counts of outcomes.)
"This isn't to say that over-parameterizing a model is always a good idea. But it isn't a violation of any statistical principle." (page 345)
"Keep in mind that the number of rows is not clearly the same as the "sample size" in a count model. The relationship between parameters and "degrees of freedom" is not simple, outside of simple linear regressions." (page 347)
"Any rules you've been taught about minimum sample sizes for inference are just non-Bayesian superstitions." (page 347)
He really seems to want to make gamma-Poisson happen (replacing negative binomial).
Probit doesn't appear anywhere! (At least, I haven't seen it and it isn't in the index.)
"In general, more than two things can happen." (page 359)
"The conventional and natural link in this context is the multinomial logit, also known as the softmax function." (page 359)
"Another way to fit a multinomial/categorical model is to refactor it into a series of Poisson likelihoods. That should sound a bit crazy." (page 363)
"It is important to never convert counts to proportions before analysis, because doing so destroys information about sample size." (page 365)
over-dispersion, ordered categories
"Just be sure to validate it [your model] by simulating dummy data and then recovering the data-generating process through fitting the model to the dummy data." (page 369)
"continuous mixture models in which a linear model is attached not to the observations themselves but rather to a distribution of observations." (page 370)
"... Poisson distributions are very narrow. The variance must equal the mean, recall." (page 373)
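Quick sanity check of that variance-equals-mean property, simulating Poisson draws with Knuth's uniform-product method (my sketch, nothing from the book):

```python
import math
import random
import statistics

def poisson(lam, rng):
    # Knuth's method: multiply uniforms until the product drops below e^-lam.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

rng = random.Random(0)
draws = [poisson(4.0, rng) for _ in range(20000)]
mean = statistics.fmean(draws)
var = statistics.pvariance(draws)
# Both come out close to 4: the Poisson variance equals its mean.
```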
"You should not use WAIC and PSIS with these [beta-binomial and gamma-Poisson/negative binomial] models, however, unless you are very sure of what you are doing. The reason is that while ordinary binomial and Poisson models can be aggregated and disaggregated across rows in the data, without changing any causal assumptions, the same is not true of beta-binomial and gamma-Poisson models. The reason is that a beta-binomial or gamma-Poisson likelihood applies an unobserved parameter to each row in the data. When we then go to calculate log-likelihoods, how the data are structured will determine how the beta-distributed or gamma-distributed variation enters the model." (page 375)
"In the sciences, there is sometimes a culture of anxiety surrounding statistical inference. It used to be that researchers couldn't easily construct and study their own custom models, because they had to rely upon statisticians to properly study the models first. This led to concerns about unconventional models, concerns about breaking the laws of statistics. But statistical computing is much more capable now. Now you can imagine your own generative process, simulate data from it, write the model, and verify that it recovers the true parameter values. You don't have to wait for a mathematician to legalize the model you need." (page 376)
This could almost be a summary of the book, maybe.
"Just treating ordered categories as continuous measures is not a good idea."
He offers the cumulative link function.
"This kind of vector, in which all the values sum to one (or any other constant), has a special name, a simplex." (page 394)
It's the varying effects ("random effects") chapter! Multilevel models!
"Anterograde amnesia is bad for learning about the world." (page 399)
"this prior is actually learned from the data." (pages 399-400)
"When some individuals, locations, or times are sampled more than others, multilevel models automatically cope with differing uncertainty across these clusters. This prevents over-sampled clusters from unfairly dominating inference." (page 400)
This is a pretty cool property to have. The problem of data imbalance is a challenge for many machine learning algorithms. Considering multilevel models as a kind of solution is interesting. Not obvious that it can be easily applied e.g. to vision models, but still, it's interesting.
"When it comes to regression, multilevel regression deserves to be the default approach. There are certainly contexts in which it would be better to use an old-fashioned single-level model. But the contexts in which multilevel models are superior are much more numerous. It is better to begin to build a multilevel analysis, and then realize it's unnecessary, than to overlook it." (page 400)
Is this really the case? It would be neat to see an example where a multilevel model isn't obviously needed but is better.
Costs of multilevel models (page 400, paraphrase):
Synonyms (page 401):
The parameters of multilevel models are "most commonly known as random effects". An endnote cites section 6 of Gelman's Anova paper but I didn't find it as "entertaining" as promised. It does include the origin of "varying effects" as a proposed better name than "random effects":
"We define effects (or coefficients) in a multilevel model as constant if they are identical for all groups in a population and varying if they are allowed to differ from group to group." (page 20 in Gelman)
(A "group" could be an individual, depending on the nature of the data.)
I don't love that "hyperparameter" is used for parameters that are learned from the data, even if they're a level up, because it conflicts with the usual ML usage of "hyperparameter". It seems fair that their priors are called hyperpriors, though.
Reasons for using a Gaussian prior (page 403):
"Rethinking: Varying intercepts as over-dispersion. ... Compared to a beta-binomial or gamma-Poisson model, a binomial or Poisson model with a varying intercept on every observed outcome will often be easier to estimate and easier to extend." (page 407)
Oh my! A coefficient for every observation! Take that, frequentist statisticians!
It would be interesting to see a direct comparison, using e.g. beta-binomial on the one hand and multilevel on the other...
Page 408 itemizes three perspectives:
This in particular reminds me of How Not To Sort By Average Rating, which inspired in part my How To Sort By Average Rating advocating Laplace smoothing instead of Wilson bounds.
If you use the grand average to determine the Laplace binomial values, this is just like partial pooling via multilevel model, only much less rigorous and less obviously extensible to multivariate settings, but far easier.
I did a version of Laplace smoothing back when I was helping use survey data to determine how well various medical facilities were satisfying their patients. A ranking was desired, but ranking by raw scores ("no pooling") made the most extreme scores nearly always associated with the locations that had the fewest survey responses.
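A hypothetical miniature of that ranking fix (facility names and counts invented): shrink each raw rate toward the grand average by pretending every facility got K extra responses at the grand-average rate, so low-response facilities move most.

```python
# Hypothetical survey counts per facility: (positive responses, total).
facilities = {"A": (9, 10), "B": (60, 100), "C": (1, 2)}

total_pos = sum(p for p, n in facilities.values())
total_n = sum(n for p, n in facilities.values())
grand = total_pos / total_n  # grand-average positive rate

K = 10  # shrinkage strength in pseudo-responses (a choice, not a rule)

def smoothed(pos, n):
    # Laplace-style smoothing toward the grand mean: act as if every
    # facility had K extra responses at the grand-average rate.
    return (pos + K * grand) / (n + K)

ranking = sorted(facilities, key=lambda f: smoothed(*facilities[f]), reverse=True)
# Raw rates would rank A (0.90), B (0.60), C (0.50); shrinkage pulls
# tiny-sample C toward the 0.625 grand mean, just above B.
```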
"Note that the priors are part of the model when we estimate, but not when we simulate. Why? Because priors are epistemology, not ontology. They represent the initial state of information of our robot, not a statement about how nature chooses parameter values." (page 409)
I enjoy that my preferred way of writing the logistic function is used on page 411.
"Partial pooling isn't always better. It's just better on average in the long run." (page 413)
"As soon as you start trusting the machine, the machine will betray your trust." (page 416)
"If the individual units are exchangeable—the index values could be reassigned without changing the meaning of the model—then partial pooling could help." (page 419)
"Recall that HMC simulates the frictionless flow of a particle on a surface." (page 420)
"Algebra makes many things possible." (page 425)
Ah! Here's where he mentions Mister P: Multilevel Regression and Post-stratification. (page 430)
"Selection on the outcome variable is one of the worst things that can happen in statistics." (page 431)
"... the general varying effects strategy: Any batch of parameters with exchangeable index values can and probably should be pooled. Exchangeable just means the index values have no true ordering, because they are arbitrary labels." (page 435)
"a way to pool information across parameter types—intercepts and slopes" (page 436)
"Finally, we'll circle back to causal inference and use our new powers over covariance to go beyond the tools of Chapter 6 [The Haunted DAG & the Causal Terror], introducing Instrumental Variables." (pages 436-437)
That doesn't reflect the actual order, which has IV in the middle of the chapter...
"In conventional multilevel models, the device that makes this [modeling the joint population of intercepts and slopes by modeling their covariance] possible is a joint multivariate Gaussian distribution for all of the varying effects, both intercepts and slopes." (page 437)
"... we are always forced to analyze data with a model that is misspecified: The true data-generating process is different than the model." (page 441)
"how you fit the model is part of the model." (page 447)
"This [fewer effective than actual parameters] is a good example of how varying effects adapt to the data. The overfitting risk is much milder here than it would be with ordinary fixed effects." (page 451)
Estimates are pooled/shrunk, so parameters don't fit "tightly" to the data...
"Our interpretation of this experiment has not changed. These chimpanzees simply did not behave in any consistently different way in the partner treatments." (page 452)
This chimpanzee example continues to be fairly dull, for the level of complexity... I guess it's an example of sensitivity analysis, in a sense, looking at it in multiple different ways? But it would be more interesting if there were sometimes different (or any) results.
"There is an obvious cost to these non-centered forms: They look a lot more confusing. Hard-to-read models and model code limit our ability to share implementations with our colleagues, and sharing is the principal goal of scientific computation." (page 454)
"This last line ["Q cannot influence W except through E"] is sometimes called the exclusion restriction. It cannot be strictly tested, and it is often implausible."
The introduction to instrumental variables is based on the classic Does Compulsory School Attendance Affect Schooling and Earnings?
"Remember: With real data, you never know what the right answer is." (page 456)
"Instrumental variables are hard to understand. But there are some excellent tools to help you. For example, the dagitty package contains a function instrumentalVariables that will find instruments, if they are present in a DAG." (page 459)
"The instrumental variable model is often discussed with an estimation procedure known as two-stage least squares (2SLS). This procedure involves two linear regressions. The predicted values of the first regression are fed into the second as data, with adjustments so that the standard errors make sense. Amazingly, when the weather is nice, this procedure works. ... Some people mistake 2SLS for the model of instrumental variables. They are not the same thing. Any model can be estimated through a number of different procedures, each with its own benefits and costs." (page 460)
"Instrumental variables are natural experiments that impersonate randomized experiments." (page 460)
Discussing the front-door criterion, he points to a blog post and paper.
"First, the correlation changes if we switch the A/B labels." (page 462)
This is a little puzzling. Swapping axes shouldn't change correlation.
Ahhh... It doesn't swap the axes (unless there are only two participants, or an even number that all pair off sufficiently nicely, or the relabeling is otherwise sufficiently "nice").
Why does this happen...
Some labeling is essentially arbitrary, so that "giver" and "receiver" switch.
Consider a three-point graph. Our "point of view" node is attached to two others. Label those two however you want; giving/receiving with respect to us stays the same. But when you swap the two labels, give/receive reverses direction between them, and if those two flows aren't equal, a point crosses the diagonal and the correlation changes.
Cool.
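A tiny numeric check of that swap in Python (the dyad giving amounts are made up): correlate the "A gives" column against the "B gives" column, then relabel who is "A" within just one dyad.

```python
import statistics

def corr(xs, ys):
    # plain Pearson correlation
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# made-up giving amounts per dyad: (A gives to B, B gives to A)
dyads = [(5, 1), (4, 2), (1, 3), (2, 4)]
r1 = corr([d[0] for d in dyads], [d[1] for d in dyads])

# relabel just the first dyad (swap who is "A" and who is "B")
dyads2 = [(1, 5)] + dyads[1:]
r2 = corr([d[0] for d in dyads2], [d[1] for d in dyads2])

print(r1, r2)  # the two correlations differ
```

Since the A/B labels carry no information, any statistic that changes under relabeling (like this correlation) is an artifact of the labeling, which is the book's point.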
"Social Relations Model, or SRM" (page 462)
"The general approach is known as Gaussian Process regression. This name is unfortunately wholly uninformative about what it is for and how it works." (page 468)
I like the phrase the author uses to describe GP regression: "continuous categories".
"phylogenic, or patristic, distance." (page 481)
"Pagel's lambda" (page 482)
"Biologists tend to use phylogenies under a cloud of superstition and fearful button pushing." (page 482)
"Gaussian processes represent a practical method of extending the varying effects strategy to continuous dimensions of similarity, such as spatial, network, phylogenic, or any other abstract distance between entities in the data." (page 485)
The Stan documentation has more on fitting GP regressions.
I think the thing that keeps this kind of GP from fitting the data perfectly, as is often the case with GPs, is the eta term...
But really, why doesn't it fit the data perfectly? In the primates example, there's a correlation matrix that clearly includes ones...
Oh! It's because the kernel matrix doesn't enter into the mean! ... Well, that's the case for the primates example, anyway...
So the effect of just changing the covariance matrix is like this:
install.packages('mvtnorm')  # if not already installed
library(mvtnorm)
data <- c(1, 1, -1, -1)
# the mean here defaults to c(0, 0, 0, 0)
# "standard" 4d normal (identity covariance matrix)
dmvnorm(data)
# 0.003428083
# covariance matrix that expects clustering
sigma <- matrix(c(1,   0.5, 0,   0,
                  0.5, 1,   0,   0,
                  0,   0,   1,   0.5,
                  0,   0,   0.5, 1), nrow=4)
dmvnorm(data, sigma=sigma)
# 0.008902658 (more likely than when assuming independence)
So when expecting clustering, you don't have to explain via the mean as much...
For the primates example, he gets a significant coefficient on group size, then he makes it go away via covariance, and then he uses a different covariance and gets it back...
"This [the second] model annihilates group size—the posterior mean is almost zero and there is a lot of mass on both sides of zero. The big change from the previous model suggests that there is a lot of clustering of brain size in the tree and that this produces a spurious relationship with group size, which also clusters in the tree." (page 482)
This is a little weird, isn't it? Just because the relationship clusters in the tree, that doesn't mean there isn't a relationship, right? There are at least two interpretations: (a) bigger groups and bigger brains co-evolved, in this part of the tree, and (b) this part of the tree just happens to have both bigger groups and bigger brains. I guess it's a potential confound?
In the final example he gets less covariance and the coefficient on brain size comes back. Which model is more right? Doesn't seem very obvious to me.
Ah: the earlier example is a Poisson regression anyway, so it's not obvious it could fit perfectly, because of the link function.
And the multivariate normal bit has mean zero! It can only pull away from the mean-zero distribution with the given covariance (whose variance is constrained by the prior, so it isn't very big). So there's really no chance of fitting perfectly.
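The pull toward zero shows up even in the simplest conjugate case, sketched here in Python (the variances are made up):

```python
# Why a zero-mean Gaussian varying effect can't reproduce the data exactly:
# with prior theta ~ N(0, tau2) and observation y ~ N(theta, sigma2),
# the posterior mean of theta is y * tau2 / (tau2 + sigma2),
# which is always pulled toward zero. Values below are illustrative.
tau2, sigma2 = 1.0, 0.5
y = 2.0
post_mean = y * tau2 / (tau2 + sigma2)
print(post_mean)  # 1.333..., short of the observed 2.0
```

The tighter the prior (smaller tau2), the further the posterior mean falls short of the data, which is exactly the shrinkage at work in the GP's varying effects.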
"A big advantage of Bayesian inference is that it obviates the need to be clever. ... There's no need to be clever when you can be ruthless." (page 489)
(The ruthlessness is ruthlessness in applying rules of conditional probability.)
"And that's the real trick of the Bayesian approach: to apply conditional probability in all places, for data and parameters." (page 490)
"Bayes is an honest partner. It is not afraid to hurt your feelings." (page 491)
"The big take home point for this section is that when you have a distribution of values, don't reduce it down to a single value to use in a regression." (page 497)
"This [considering covariance between errors] is computationally similar to how we did instrumental variable regression in the previous chapter." (page 498)
It sounds like instrumental variables are often (originally?) about measurement error, but I don't completely understand how...
"Use your background knowledge to write down a generative model or models, simulate data from these models in order to understand the inferential risks, and design a statistical approach that can work at least in theory." (page 499)
"So there will be a posterior distribution for each missing value." (page 505)
In the model, when we have data, the distribution we enter is interpreted as a likelihood, but when we don't have data (it's missing), the distribution is interpreted as a prior... Neat!
"Implementing an imputation model can be done several ways. All of the ways are a little awkward, because the locations of missing values have to be respected, and that means plenty of index management." (page 506)
"Doing better is good." (page 511)
"If you aren't comfortable dropping incomplete cases, then you shouldn't be comfortable using multiple imputation either." (page 511)
This is maybe a little strong; he's explaining here that multiple imputation is an approximation of the technique he's advocating, after all.
He refs this paper, which has some missing data: Complex societies precede moralizing gods throughout world history.
"HMC just doesn't do discrete variables." (page 516)
"This all sounds too good to be true. It is all true. But implementing it is not at all obvious." (page 517)
"This chapter highlights the general principles of the book, that effective statistical modeling requires both careful thought about how the data were generated and delicate attention to numerical algorithms. Neither can lift inference alone." (page 521)
beyond GLMs
"GLMs (or GLMMs)" (page 526)
"GLMM" is "Generalized linear mixed model" where "mixed" means adding "random effects" in addition to "fixed effects" which means doing something hierarchical, essentially. Varying effects, per individual, group, etc.
"Useful mathematical modeling typically involves ridiculous assumptions." (page 527)
The 1985 "Consider a Spherical Cow: A Course in Environmental Problem Solving" doesn't seem to be the origin of the spherical cow, but it's still fun.
Three cites here:
"One of the major advantages of having a scientifically inspired model is that the parameters have meanings." (page 528)
"The key, as always is to think generatively." (page 531)
Learning curves and teaching when acquiring nut-cracking in humans and chimpanzees
"no lag beyond one period makes any causal sense." (page 543)
I think this is too strong, and he walks it back a little...
"Sometimes all this nonsense is okay, if all you care about is forecasting. But often these models don't even make good forecasts, because getting the future right often depends upon having a decent causal model." (page 543)
This particular model is a famous one, the Lotka-Volterra Model. It models simple predator-prey interactions and demonstrates several important things about ecological dynamics. Lots can be proved about it without using any data at all. For example, the population tends to be unstable, cycling up and down like in Figure 16.6. This is interesting because it suggests that, while nature is more complicated, all that is necessary to see cyclical population dynamics is captured in a stupidly simple model. (page 544)
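The cycling claim can be checked without data, as in this Python sketch simulating dH/dt = aH − bHL, dL/dt = −cL + dHL with arbitrary made-up parameters and crude Euler steps:

```python
# Lotka-Volterra predator-prey dynamics, simulated with simple Euler steps.
# Parameters are arbitrary; the equilibrium is H* = c/d = 20, L* = a/b = 10.
a, b, c, d = 1.0, 0.1, 1.0, 0.05
H, L = 30.0, 5.0          # start away from equilibrium
dt = 0.001
hist = []
for _ in range(100_000):  # 100 time units
    H, L = H + (a*H - b*H*L) * dt, L + (-c*L + d*H*L) * dt
    hist.append(H)

# count how often prey crosses its equilibrium value: cycling, not settling
crossings = sum(1 for h0, h1 in zip(hist, hist[1:])
                if (h0 - 20) * (h1 - 20) < 0)
print(crossings)
```

Forward Euler slowly inflates the orbit (the true system conserves a quantity), so a proper ODE solver is better for real work, but the repeated up-and-down cycling is plainly visible either way.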
"The hidden states are the causes. The measurements don't cause anything." (page 549)
conclusion
"Thinking generatively—how the data could arise—solves many problems. Many statistical problems cannot be solved with statistics. All variables are measured with error. Conditioning on variables creates as many problems as it solves. There is no inference without assumption, but do not choose your assumptions for the sake of inference. Build complex models one piece at a time. Be critical. Be kind." (page 553)
"Philosophers of science actually have a term, the pessimistic induction, for the observation that because most science has been wrong, most science is wrong." (page 554)
"Even retracted papers continue to be cited." (page 555)
This makes me wonder whether there could be some proactive system to inform authors of such issues... "I see you cited this paper; did you know?"
"The data and its analysis are the scientific product. The paper is just an advertisement." (page 555)
"our judgment isn’t limited by knowledge nearly as much as it’s limited by attitude." (page x)
"In “Persuasion,” we saw that law students who are randomly assigned to one side of a moot court case become confident, after reading the case materials, that their side is morally and legally in the right. But that confidence doesn’t help them persuade the judge. On the contrary, law students who are more confident in the merits of their own side are significantly less likely to win the case—perhaps because they fail to consider and prepare for the rebuttals to their arguments." (page 27)
This cites Eigen and Listokin, and I don’t have access or inclination to read the paper just now, but it seems like there’s a claim here like “confidence causes bad performance” and I wonder whether possible confounds have been considered. To me, “lower quality lawyer causes both confidence and bad performance” seems plausible.
"Having an accurate map doesn’t help you very much when you’re allowed to travel only one path." (page 40)
Page 45 starts an exploration of Kahan’s famous paper on scientific polarization increasing with education. Perhaps in an effort to avoid alienating any readers, the tools of the scout are not applied to settle the question of whether global warming is real. The opportunity to engage one way or another with the idea of naive realism is not taken.
In Harford’s treatment of Kahan, I really appreciated the emphasis on curiosity being essential for “scout-like” thinking.
I think I hadn’t seen the idea of Blind Data Analysis mentioned on page 55, citing Nuzzo. Nice! Will keep this in mind.
"It [being critical of a study with undesirable results] prompted me to go back through the studies I had been planning to cite in my favor, and scrutinize their methodology for flaws, just as I had done with the pro-soldier mindset study. (Sadly, this ended up disqualifying most of them.)" (page 68)
I’m not sure to what extent this is a joke; I thought it was funny as I read it. But seriously, I wish people generally would say more about determinations such as this.
I might be a soldier on this, but I don’t love quantifying uncertainty in the manner of the calibration game introduced starting on page 75. I thought a little bit about why.
I’m not sure I have any really coherent argument here. I agree with the general idea of being aware of how sure you are. Somehow I don’t like the exercise of writing down numbers for it.
There is an interesting topic of decision-making in the face of low confidence. What do you do when you know you’re not sure? (Ramble, seems to be my answer.) Maybe out of scope for the book.
"The reality is that there’s no clear divide between the “decision-making” and “execution” stages of pursuing a goal. Over time, your situation will change, or you’ll learn new information, and you’ll need to revise your estimate of the odds." (page 110)
I really agree with this. Planning can be valuable, but following the plan to the letter is often not.
Galef discusses low (10%, 30%) early estimates of “success” from Musk and Bezos (starting page 111). Exploring why they would take such chances, she mentions both expected value (10% of huge is still big) and the idea that even “failure” would be fairly positive. I think expected value is almost always the wrong way to think about significant choices (especially one-shot choices with unclear odds), and I don’t really believe it’s how people tend to think (or should). The right question is whether something is worth doing even if it fails. So I think the balance of emphasis is off here. Expected value is a simple tool, a hammer that people reach for too often, simplifying problems too far. I wouldn’t even mention it in this setting.
"You might think these principles sound obvious and that you know them already. But “knowing” a principle, in the sense that you read it and say, “Yes, I know that,” is different from having internalized it in a way that actually changes how you think." (page 144)
"In his book Sources of Power, decision researcher Gary Klein cites this [explaining away signs of a problem] as one of the top three causes of bad decisions. He calls it a “de minimus error,” an attempt to minimize the inconsistency between observations and theory. Each new piece of evidence that doesn’t fit a doctor’s medical diagnosis can be explained away or dismissed as a fluke, so the doctor never realizes her initial diagnosis was wrong." (page 165)
Exposure to opposing views on social media can increase political polarization cited in chapter 12.
Keep your identity small cited in chapter 14.
"They [a group of citizen scientists] also dove into the politics of government research, familiarizing themselves with how funding was structured and how the drug trials were conducted. The disorganization they discovered alarmed them. “It sort of felt like reaching the Wizard of Oz,” said one activist named Mark Harrington. “You’ve gotten to the center of the whole system and there’s just this schmuck behind a curtain.”" (page 212)
Cites How to Survive a Plague.