I was preparing for system design interviews for machine learning engineering positions, as well as related ML design questions for data science positions. Xu's book covers more general software engineering, but I thought it was still interesting and valuable. MLE interviews can have "regular" SWE system design components too. For system design content more specific to machine learning, I thought Educative had some decent materials.
I think system design interviews are often trying to evaluate not just abstract or academic knowledge, but experience, thinking, and communication skills. Reading a book like this helps jumpstart and review your knowledge, and is definitely valuable, but is probably not sufficient for a truly outstanding performance. So you could start here, but don't stop!
Your expertise should be clear from what you've done, not from some rating scale that you use on yourself. Please no star ratings or “An asterisk denotes *Expert level” etc.
Don't say “40 Hours/Week” for every job you've worked. Full time employment is assumed. I'm thinking about resumes for salaried positions, but even if it were hourly, I don't think I'd want to see “40 Hours/Week” repeated over and over.
Links and email addresses should not have old-school underlined blue styling. Control the formatting to make it simple and beautiful.
If you have links in your resume, make sure you know where they go. I've seen resumes where, presumably because they used somebody else's resume as a starting point, the link text was updated but the link still took me to somebody else's LinkedIn and Twitter.
Pay special attention with things you're claiming expertise in: It's hard to believe you if you say you know “CentOs”—because it's CentOS.
Please get the names of your prior employers correct. There is no “Red American Cross”—it's “American Red Cross.”
If you are “quick to learn new tools and technics” why have you not learned how to spell “techniques?”
It's easier to not use color at all than to use it well. Particularly bad: “Wor” in blue, regular weight, with “k Experience” in bold black. Why do that?
Almost nobody should include GPA(s) in their resume. Possible exception: It's perfect, and you've never had a job. Particularly bad: Including a mediocre GPA for one degree, with GPA then conspicuously absent for one or more other degrees.
Notice how this actual section from a real resume is self-contradictory:
Communication – Persuasive communicator, comfortable challenging the status quo when appropriate, whether to longstanding processes or to conventional thinking to drive greater efficiencies and outcomes. Able to adjust communications to a diverse audience to ensure understanding and clarity. Recognizes the importance of providing concise, complete visualizations of complex data.
Show, don't tell. Words without evidence are only demonstrating that you can waste time.
Don't use two spaces between sentences. We're using computers.
I didn't make a separate resume for every position I applied for, but I did distribute at least six different versions for different kinds of roles.
Do resumes matter? I'm not sure. You probably still need to have one though. And it may be that the less they matter, the better the signal you provide by still having a good one. People who are great do the little things well.
If nothing else, thinking about your resume can focus your work: How would what I'm doing today look on my resume? If you're doing something that would seem pointless on a resume, maybe you should do something else.
Capitalism can be stupid and mean: Several businesspeople seem Trump-like in their disregard for ethics in the pursuit of profit. It's hard to take a libertarian point of view with externalities like these. And radium dials at least look cool, but many radium businesses sold absolute snake oil. I noticed today that a company will sell you marijuana gummies for weight loss. Maybe not much has changed.
Ignorance and geography: It was much harder, it seemed, to find a competent doctor or lawyer in small-town Illinois, compared to the New York metropolitan area. It seems there was more religiosity in the middle west as well. It also struck me as strange that terminal patients were generally not told that they were terminal. Is that still the case? Is ignorance really better in such cases?
Weaponizing “experts”: As with radium, so with cigarettes and opioids and climate change and on and on... The pursuit of knowledge pushed behind debate-club he-said/she-said and outright lies.
"But if you looked a little closer at all those positive publications, there was a common denominator: the researchers, on the whole, worked for radium firms. As radium was such a rare and mysterious element, its commercial exploiters in fact controlled, to an almost monopolizing extent, its image and most of the knowledge about it. Many firms had their own radium-themed journals, which were distributed free to doctors, all full of optimistic research. The firms that profited from radium medicine were the primary producers and publishers of the positive literature." (page 49)
Dangers with delayed effects: Radium took years, usually, to have negative effects on the radium dial painters. What risks do we currently not know or underestimate?
Gender and class: Moore highlights inequities in societal attention and care. For example, "It seemed wealthy consumers were much more worthy of protection than working-class girls..." (page 273) It reminds me of the 21st-century popularization of Alice Neel, who similarly focused on working people and "the female gaze." How far is there still to go?
The legal system: The tales of slow and capricious legal processes, out-of-court settlements, rhetoric over reality, the influence of wealth, and general nonsense... The 1920s and 30s sound very much like the present.
For each initial word, find the other word which means the same or most nearly the same. In the example at right, “animal” is selected because it has the closest meaning to “beast.” 
beast

space
lift
concern
broaden
blunt
accustom
chirrup
edible
pact
solicitor
allusion
caprice
animosity
emanate
madrigal
cloistered
encomium
pristine
tactility
sedulous
Caveats:
Here's an excerpt on the history of WORDSUM:
In the early 1920s, Edward L. Thorndike developed a lengthy vocabulary test as part of the I.E.R. Intelligence Scale CAVD to measure, in his words, “verbal intelligence.” As in the modern-day Wordsum test, each question asked respondents to identify the word or phrase in a set of five whose meaning was closest to a target word. Robert L. Thorndike (1942) later extracted two subsets of the original test, each containing twenty items of varying difficulty. For each subset, two target words were selected at each of ten difficulty levels. The ten items in Wordsum (labeled with letters A through J) were selected from the first of these two subsets.
It's not perfectly accurate to say Thorndike developed a lengthy vocabulary test. The I.E.R. Intelligence Scale CAVD, copyright 1925 and 1926, is 17 levels (A through Q), and each level has ten vocabulary questions in multiple-choice format, among other question types.
Thorndike 1942 is Two screening tests of verbal intelligence, which describes how two 20-word tests “containing two words from each of the levels of CAVD from H through Q” (the more difficult end of the spectrum; levels A through E use pictures) were constructed.
The first subset then is Form A, and those are the questions included above. I don't know which ten words from Form A's twenty are in WORDSUM, however. As GSS Codebook Appendix D explains:
To minimize the admittedly small possibility that some form of publicity would affect the public's knowledge of the words included in the test, they are not reported here.
I think it's fascinating that a vocab test from the 1920s is part of the GSS, which continues to be administered in the 2020s. I think it's fascinating that WORDSUM has been used for evidence that Intelligence makes people think like economists, among many other things.
I think the benefit of knowing what the WORDSUM questions are far outweighs any risk “that some form of publicity would affect the public's knowledge of the words included in the test.” Frankly, I'm not sure WORDSUM deserves an encomium.
I was a little crazy in 2021. Depending on the level of generosity in interpretation, I was somewhere from 1/4 to 4/4 on one-year predictions. None of my longer-term predictions have materialized. Still no aliens.
From 2020: Hong Kong seems to be pretty well controlled by the mainland now.
"Guiding your thoughts is one of the keys to self-perfection." (August 9)
Tolstoy is quite religious, pacifist, vegetarian, and back-to-the-earth. He mixes quotes and paraphrases with his own commentary and aphorisms. I can't quite recommend it all because I find some things objectionable, but I like the idea of having something to reflect on daily. Some of my most and least favorite selections are below.
"To accept the dignity of another person is an axiom." (April 16)
"Effort is the necessary condition of moral perfection." (July 23)
"When a person tries to apply his intellect to the question “Why do I exist in this world?” he becomes dizzy. The human intellect cannot find the answers to such questions." (July 29)
"Think good thoughts, and your thoughts will be turned into good actions. Everything begins in thought. Guiding your thoughts is one of the keys to self-perfection." (August 9)
"Real goodness is not something that can be acquired in an instant, but only through constant effort, because real goodness lies in constantly striving for perfection." (September 4)
"You should abstain from arguments. They are very illogical ways to convince people. Opinions are like nails: the stronger you hit them, the deeper inside they go." (November 4, quoting Decimus Junius Juvenalis)
"The more urgently you want to speak, the more likely it is that you will say something foolish." (November 4)
"A marriage is a special obligation between two people, of opposite sexes, to have children only with each other. To break this pact is a lie, a deception, and a crime." (March 11)
"We can improve this world only by distributing the true faith among the world's people." (March 17)
"You should never feel depressed.
A man should always feel happy; if he is unhappy, it means he is guilty." (June 29)
"People know little, because they try to understand those things which are not open to them for understanding: God, eternity, spirit; or those which are not worth thinking about: how water becomes frozen, or a new theory of numbers, or how viruses can transmit illnesses." (July 27)
"Only religion destroys egoism and selfishness, so that one starts to live life not only for himself. Only religion destroys the fear of death, only religion gives us the meaning of life, only religion creates equality among people, only religion sets a person free from outer pressures." (August 18)
"It is dangerous to disseminate the idea that our life is purely the product of material forces and that it depends entirely on these forces." (August 22)
"Faith is the foundation on which all else rests; it is the root of all knowledge." (August 28)
"Though the mission of a woman's life is the same as that of a man's life and the service to God is fulfilled by the same means, namely love, for the majority of women the method of this service is more specific than for men. This is the birth and upbringing of new workers for the Lord throughout life." (December 1)
"There is nothing more natural for a woman than self-sacrifice." (December 1)
If you're dividing by something that can be close to zero, the results can get big, which affects variance. Dividing two standard normal distributions gives you a Cauchy distribution, for example, and the variance there is undefined. So if you're near zero, watch out! Variance may not make sense, and can at least be hard to estimate.
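As a quick sanity check of the Cauchy claim (a sketch of my own, not from any of the references below): the ratio of two independent standard normals is a standard Cauchy, so its sample variance never settles down, while its quantiles are perfectly well behaved.

```r
# Ratio of two independent standard normals follows a standard Cauchy
# distribution, whose variance is undefined: the sample variance keeps
# jumping around no matter how much data you collect.
set.seed(1)
z1 <- rnorm(100000)
z2 <- rnorm(100000)
ratio <- z1 / z2

# Running sample variance at increasing sample sizes: no convergence.
sapply(c(1000, 10000, 100000), function(n) var(ratio[1:n]))

# The quartiles, by contrast, are stable and match the Cauchy:
# qcauchy(c(0.25, 0.5, 0.75)) is -1, 0, 1.
quantile(ratio, c(0.25, 0.5, 0.75))
```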
The approximation of the variance of a ratio using the delta method is:
\[ \text{var} \left( \frac{X}{Y} \right) \approx \frac{1}{\overline{Y}^2} \text{var} \left( X \right) + \frac{ \overline{X}^2 }{ \overline{Y}^4 } \text{var} \left( Y \right) - 2 \frac{ \overline{X} }{ \overline{Y}^3 } \text{cov} \left( X, Y \right) \]
See Seltman for a derivation. Also presented in Kohavi et al. and Deng et al.
This is an approximation based on Taylor series expansion, and it isn't always super close, even for seemingly simple examples like this one in R, with uniform distributions:
approx_var_ratio <- function(x, y) {
  (1 / mean(y)^2) * var(x) +
    (mean(x)^2 / mean(y)^4) * var(y) -
    2 * (mean(x) / mean(y)^3) * cov(x, y)
}
x = runif(1000, min=14, max=26)
y = runif(1000, min= 4, max=16)
approx_var_ratio(x, y)
## 0.6286227
var(x/y)
## 1.189362
That's no fluke of sampling. The delta-based formula is consistently under 60% of the empirical value, for these parameters.
It isn't just a weirdness of the uniform distribution either. The approximation is agnostic to distribution—which is another hint that it can't be perfectly right, at least not always. Here's an example with Gaussians, truncated so there's no risk of outliers near zero:
library(truncnorm)  # rtruncnorm comes from the truncnorm package
x = rtruncnorm(1000, a=14, mean=20, b=26, sd=3)
y = rtruncnorm(1000, a= 4, mean=10, b=16, sd=3)
approx_var_ratio(x, y)
## 0.3323209
var(x/y)
## 0.5294384
That's still a substantial underestimate.
The approximation gets better when means are bigger relative to variances. For example:
x = rnorm(1000, mean=200, sd=3)
y = rnorm(1000, mean=100, sd=3)
approx_var_ratio(x, y)
## 0.004369166
var(x/y)
## 0.004381438
Now it's quite close.
A distribution with appreciable mass near zero has a large variance relative to its mean, but you don't have to be near zero for the relative spread of the distribution to affect the quality of the approximation. Especially clear of zero, I think Monte Carlo estimation can be better than using the formula based on the delta method.
In conclusion, approximations are approximations. Do you really need a ratio?
Stay safe out there!
"In statistics, this [Overall Evaluation Criterion (OEC)] is often called the Response or Dependent variable (Mason, Gunst and Hess 1989, Box, Hunter and Hunter 2005); other synonyms are Outcome, Evaluation and Fitness Function (Quarto-vonTivadar 2006). Experiments can have multiple objectives and analysis can use a balanced scorecard approach (Kaplan and Norton 1996), although selecting a single metric, possibly as a weighted combination of such objectives is highly desired and recommended (Roy 2001, 50, 405-429)." (page 7)
The Roy citation is "Design of Experiments Using The Taguchi Approach: 16 Steps to Product and Process Improvement". "Step 16" is "Case studies", so I think Roy has confused "step" with "chapter".
"However, Google's tweaks to the color scheme [the 41 blues test] ended up being substantially positive on user engagement (note that Google does not report on the results of individual changes) and led to a strong partnership between design and experimentation moving forward." (page 16)
This is an interesting take... They say (while saying they can't support it with evidence) that the 41 blues test was good in multiple ways, while the popular lore is mostly about how at least one designer cited that experimentation as the reason they quit Google. Hmm.
"One useful concept to keep in mind is EVI: Expected Value of Information from Douglas Hubbard (2014), which captures how additional information can help you in decision making." (page 24)
"If we use purchase indicator (i.e., did the user purchase yes/no, without regard to the purchase amount) instead of using revenue-per-user as our OEC, the standard error will be smaller, meaning that we will not need to expose the experiment to as many users to achieve the same sensitivity." (page 32)
This strikes me as an oversimplification... The measurements aren't on the same scale, for one, so what does it mean to have a smaller standard error, exactly? The two are testing different things... You could imagine a world where the experimental condition convinces 100% of users to make a $1 purchase, but stops the 5% of users who were previously making $100 purchases. That's not good. I bet the authors meant something more precise, and I wish they would have said what.
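One way to make the sensitivity claim concrete is to compare standard errors relative to each metric's mean. This toy simulation is mine, not the book's, and all the parameters (5% purchase rate, lognormal spend) are made up:

```r
# Toy comparison (numbers are mine, not the book's): relative precision
# of a purchase indicator vs. revenue-per-user on the same users.
set.seed(42)
n <- 10000
purchased <- rbinom(n, 1, 0.05)                            # 5% of users buy
amount <- purchased * rlnorm(n, meanlog = 3, sdlog = 1.5)  # heavy-tailed spend

se_indicator <- sd(purchased) / sqrt(n)
se_revenue   <- sd(amount) / sqrt(n)

# Standard error relative to the mean: smaller means more sensitive.
se_indicator / mean(purchased)
se_revenue / mean(amount)
```

With these made-up parameters the revenue metric's relative standard error comes out several times larger, which I take to be the book's point; it says nothing about which metric is the right OEC, which is the objection above.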
"In the analysis of controlled experiments, it is common to apply the Stable Unit Treatment Value Assumption (SUTVA) (Imbens and Rubin 2015), which states that experiment units (e.g., users) do not interfere with one another." (page 43)
"Sample Ratio Mismatch (SRM)"
"Zhao et al. (2016) describe how Treatment assignment was done at Yahoo! using the Fowler-Noll-Vo hash function, which sufficed for single-layer randomization, but which failed to properly distribute users in multiple concurrent experiments when the system was generalized to overlapping experiments. Cryptographic hash functions like MD5 are good (Kohavi et al. 2009) but slow; a non-cryptographic function used at Microsoft is Jenkins SpookyHash (www.burtleburtle.net/bob/hash/spooky.html)." (page 47)
"For Bing, over 50% of US traffic is from bots, and that number is higher than 90% in China and Russia." (page 48)
"Goal metrics, also called success metrics or true north metrics, reflect what the organization ultimately cares about."
"Driver metrics, also called sign post metrics, surrogate metrics, indirect or predictive metrics, tend to be shorter-term, faster-moving, and more-sensitive metrics than goal metrics." (page 91)
"Guardrail metrics guard against violated assumptions and come in two types: metrics that protect the business and metrics that assess the trustworthiness and internal validity of experiment results." (page 92)
"Between 1945 and 1960, the federal Canadian government paid 70 cents a day per orphan to orphanages, and psychiatric hospitals received $2.25 per day, per patient. Allegedly, up to 20,000 orphaned children were falsely certified as mentally ill so the Catholic Church could get $2.25 per day, per patient (Wikipedia contributors, Data dredging 2019)." (page 101)
"... many unconstrained metrics are gameable. A metric that measures ad revenue constrained to space on the page or to a measure of quality is a much better metric to ensure a highquality user experience." (page 101)
"Generally, we recommend using metrics that measure user value and actions." (page 101)
"Combining Key Metrics into an OEC"
"Given the common situation where you have multiple goal and driver metrics, what do you do? Do you need to choose just one metric, or do you keep more than one? Do you combine them all into single combination metric?"
"While some books advocate focusing on just one metric (Lean Analytics (Croll and Yoskovitz 2013) suggest the One Metric that Matters (OMTM) and The 4 Disciplines of Execution (McChesney, Covey and Huling 2012) suggest focusing on Wildly Important Goal (WIG)), we find that motivating but an oversimplification. Except for trivial scenarios, there is usually no single metric that captures what a business is optimizing for. Kaplan and Norton (1996) give a good example: imagine entering a modern jet airplane. Is there a single metric that you should put on the pilot's dashboard? Airspeed? Altitude? Remaining fuel? You know the pilot must have access to these metrics and more. When you have an online business, you will have several key goal and driver metrics, typically measuring user engagement (e.g., active days, sessions-per-user, clicks-per-user) and monetary value (e.g., revenue-per-user). There is usually no simple single metric to optimize for."
"In practice, many organizations examine multiple key metrics, and have a mental model of the tradeoffs they are willing to accept when they see any particular combination. For example, they may have a good idea about how much they are willing to lose (churn) users if the remaining users increase their engagement and revenue to more than compensate. Other organizations that prioritize growth may not be willing to accept a similar tradeoff."
"Oftentimes, there is a mental model of the tradeoffs, and devising a single metric — an OEC — that is a weighted combination of such objectives (Roy 2001, 50, 405-429) may be the more desired solution. And like metrics overall, ensuring that the metrics and the combination are not gameable is critical (see Sidebar: Gameability in Chapter 6). For example, basketball scoreboards don't keep track of shots beyond the two- and three-point lines, only the combined score for each team, which is the OEC. FICO credit scores combine multiple metrics into a single score ranging from 300 to 850. The ability to have a single summary score is typical in sports and critical for business. A single metric makes the exact definition of success clear and has a similar value to agreeing on metrics in the first place: it aligns people in an organization about the tradeoffs. Moreover, by having the discussion and making the tradeoffs explicit, there is more consistency in decision making and people can better understand the limitations of the combination to determine when the OEC itself needs to evolve. This approach empowers teams to make decisions without having to escalate to management and provides an opportunity for automated searches (parameter sweeps)."
"If you have multiple metrics, one possibility proposed by Roy (2001) is to normalize each metric to a predefined range, say 0-1, and assign each a weight. Your OEC is the weighted sum of the normalized metrics." (pages 104-105)
"GoodUI.org summarizes many UI patterns that win [A/B tests] repeatedly." (page 113)
"Experiment randomization can also act as a great instrumental variable." (page 114)
Hmm! I guess this would be the case if you found the experiment had some effect on X, and then you were interested in further effects of X on other things. See: A simple Instrumental Variable.
The Effect of Providing Peer Information on Retirement Savings Decisions (ref page 119; abstract shown here)
We conducted a field experiment in a 401(k) plan to measure the effect of disseminating information about peer behavior on savings. Low-saving employees received simplified plan enrollment or contribution increase forms. A randomized subset of forms stated the fraction of age-matched coworkers participating in the plan or age-matched participants contributing at least 6% of pay to the plan. We document an oppositional reaction: the presence of peer information decreased the savings of nonparticipants who were ineligible for 401(k) automatic enrollment, and higher observed peer savings rates also decreased savings. Discouragement from upward social comparisons seems to drive this reaction.
Hmm! Usually peer effects are supposed to be so great...
"For example, Bing and Google's scaledout human evaluation programs are fast enough to use alongside the online controlled experiment results to determine whether to launch the change." (page 131)
"What customers say in a focus group setting or a survey may not match their true preferences. A well-known example of this phenomenon occurred when Philips Electronics ran a focus group to gain insight into teenagers' preferences for boom box features. The focus group attendees expressed a strong preference for yellow boom boxes during the focus group, characterizing black boom boxes as “conservative.” Yet when attendees exited the room and were given the chance to take home a boom box as a reward for their participation, most chose black (Cross and Dixit 2005)." (page 132)
"Note that sophisticated modeling may be necessary to infer the impact, with an online example of ITS [Interrupted Time Series] being Bayesian Structural Time Series analysis (Charles and Melvin 2004)." (page 140)
"Interleaved Experiments" (page 141) are when you have two ranking methods and you interleave their results (removing duplicates) and see which ones get more clicks. Seems neat.
"More active users are simply more likely to do a broad range of activities. Using activity as a factor is typically important." (page 148)
In character for a book on RCTs, they point out Refuted Causal Claims from Observational Studies. (pages 147-149)
"Indeed, the most difficult part of instrumentation is getting engineers to instrument in the first place." (page 165)
"The real measure of success is the number of experiments that can be crowded into 24 hours." (quoting Thomas A. Edison, page 171)
"The visualization tool is not just for per-experiment results but is also useful for pivoting to per-metric results across experiments. While innovation tends to be decentralized and evaluated through experimentation, the global health of key metrics is usually closely monitored by stakeholders." (page 181)
"Assuming Treatment and Control are of equal size, the total number of samples you need to achieve 80% power can be derived from the power formula above, and is approximately as shown in Equation 17.8 (van Belle 2008):
\[ n \approx \frac{16 \sigma^2 }{ \delta^2 } \]
where \( \sigma^2 \) is the sample variance, and \( \delta \) is the difference between Treatment and Control." (page 189)
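Plugging in some numbers of my own: with \( \sigma = 5 \) and a minimum detectable difference of \( \delta = 0.5 \), the rule of thumb gives 1,600, which lines up with R's exact power calculation (note that power.t.test reports a per-group n):

```r
# Rule of thumb: n ~ 16 * sigma^2 / delta^2 for 80% power at alpha = 0.05.
# Example numbers are mine: detect a 0.5-unit shift when the sd is 5.
sigma <- 5
delta <- 0.5
n_approx <- 16 * sigma^2 / delta^2
n_approx
## 1600

# R's exact calculation gives a similar per-group sample size.
power.t.test(delta = 0.5, sd = 5, power = 0.8, sig.level = 0.05)$n
```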
"How can we ensure that Type I and Type II errors are still reasonably controlled under multiple testing? There are many well studied approaches; however, most approaches are either simple but too conservative, or complex and hence less accessible. For example, the popular Bonferroni correction, which uses a consistent but much smaller p-value threshold (0.05 divided by the number of tests), falls into the former category. The Benjamini-Hochberg procedure (Hochberg and Benjamini 1995) uses varying p-value thresholds for different tests and it falls into the latter category." (page 191)
Benjamini–Hochberg (wiki, how-to) doesn't seem so bad, either. Sort of the flavor of a Q-Q plot, almost?
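For what it's worth, both corrections are one-liners with base R's p.adjust. A sketch with made-up p-values:

```r
# Made-up p-values for six tests.
p <- c(0.001, 0.004, 0.012, 0.03, 0.2, 0.8)

# Bonferroni: multiply each p-value by the number of tests (capped at 1).
p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg: rank-based adjustment controlling the false
# discovery rate; less conservative than Bonferroni.
p.adjust(p, method = "BH")

# At a 0.05 threshold, Bonferroni rejects only the first two nulls,
# while Benjamini-Hochberg rejects the first four.
```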
Page 192 (section on "Fisher's Meta-analysis") has a bunch on how to combine p-values from multiple experiments.
Pages 194-195 discuss ratio metrics and figuring out the variance of a ratio using the delta method, as referenced in Deng et al. §4.2. See also Seltman's note deriving the result. It is a little bit of a weird formula, but the book makes it seem rather fancier than it really is, I think, and their motivation doesn't seem to be strictly relevant.
On page 197 they mention CUPED.
"While you can always resort to bootstrap for conducting the statistical test by finding the tail probabilities, it gets expensive computationally as data size grows. On the other hand, if the statistic follows a normal distribution asymptotically, you can estimate variance cheaply. For example, the asymptotic variance for quantile metrics is a function of the density (Lehmann and Romano 2005). By estimating density, you can estimate variance." (page 199)
The citation is Testing Statistical Hypotheses.
"When conducting t-tests to compute p-values, the distribution of p-values from repeated trials [of A/A tests] should be close to a uniform distribution." (page 200)
"Bing uses continuous A/A testing to identify a carryover effect (or residual effect), where previous experiments would impact subsequent experiments run on the same users." (page 201)
"We highly recommend running continuous A/A tests in parallel with other experiments to uncover problems, including distribution mismatches and platform anomalies." (page 201)
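The uniform-p-value claim is easy to verify by simulation (a sketch with plain normal data, not a real experiment):

```r
# Simulate 1,000 A/A tests: both groups are drawn from the same
# distribution, so the null hypothesis is true by construction.
set.seed(7)
p_values <- replicate(1000, {
  a <- rnorm(500)
  b <- rnorm(500)
  t.test(a, b)$p.value
})

# Under the null the p-values should be roughly uniform on [0, 1]:
# about 5% below 0.05 and about half below 0.5.
mean(p_values < 0.05)
mean(p_values < 0.50)
hist(p_values)  # should look flat
```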
This is wild:
"The book A/B Testing: The Most Powerful Way to Turn Clicks into Customers (Siroker and Koomen 2013) suggests an incorrect procedure for ending experiments: “Once the test reaches statistical significance, you'll have your answer,” and “When the test has reached a statistically significance conclusion ...” (Kohavi 2014). The statistics commonly used assume that a single test will be made at the end of the experiment and “peeking” violates that assumption, leading to many more false positives than expected using classical hypothesis testing.
Early versions of Optimizely encouraged peeking and thus early stopping, leading to many false successes. When some experimenters started to run A/A tests, they realized this, leading to articles such as “How Optimizely (Almost) Got Me Fired” (Borden 2014). To their credit, Optimizely worked with experts in the field, such as Ramesh Johari, Leo Pekelis, and David Walsh, and updated their evaluations, dubbing it “Optimizely's New Stats Engine” (Pekelis 2015, Pekelis, Walsh and Johari 2015). They address A/A testing in their glossary (Optimizely 2018a)." (page 203)
Their whole job!
"Always run a series of A/A tests before utilizing an A/B testing system. Ideally, simulate a thousand A/A tests and plot the distribution of p-values. If the distribution is far from uniform, you have a problem. Do not trust your A/B testing system before resolving the issue." (page 205)
Rubin causal model, page 226
If you want to see what other kinds of things I'm up to, check out the blog! 화이팅! 😄
"The biggest contribution of genetics to the social sciences is to give researchers an additional set of tools to do basic research by measuring and statistically controlling for a variable—DNA—that has previously been very difficult to measure and statistically control for." (page 192)
The Genetic Lottery has received criticism. Here's Henn et al.:
"Ultimately, [Harden's] focus on genetics as a fundamental cause of social inequality reduces her version of social justice to benevolent paternalism."
I think Harden does a fair job of being clear that genetics can be a cause, but certainly not the only cause, and not a cause that can't be redressed. On “benevolent paternalism,” I think Henn et al. intend the phrase to have negative connotation, but couldn't any attempt at social justice (or social safety nets) be referred to in this way?
As in the “equality vs. equity” cartoon on page 162, I take it as a given that everyone should be able to see over the fence—the hard question is how high the fence that we're trying to get everybody over is. What's the level that should be guaranteed, and what are exceptions for extreme cases?
Bird has issues with the presentation of the science, and I agree I would have liked more detail and precision in the presentation, but Harden is also trying to reach a broad audience and cover a lot of material. I think Bird is incorrect in his accusation that Harden doesn't engage with the history of eugenics.
Harden's point that genetic confounds can affect optimal policy recommendations seems meaningful to me. If parents with lots of books in their home have kids who learn to read, does that imply giving everyone a stack of books is all we need to do?
I think Harden is right that if the well-intentioned don't engage with genetics, their impact is muted by confounding while the ill-intentioned advertise “forbidden knowledge” to the benefit of none. But I don't expect a major shift in Harden's lifetime.
"Building a commitment to egalitarianism on our genetic uniformity is building a house on sand." (page 19)
"A study of what is correlated with succeeding in an education system doesn't tell you whether that system is good, or fair, or just." (page 60)
Quoting the organizers of the Fragile Families Challenge:
“If one measures our degree of understanding by our ability to predict, then results ... suggest that our understanding of child development and the life course is actually quite poor.” (page 70)
"I think we must dismantle the false distinction between “inequalities that society is responsible for addressing” and “inequalities that are caused by differences in biology.”" (page 91)
Understanding and Misunderstanding Randomized Controlled Trials
Interesting unit: centimorgan
Heritability in the genomics era — concepts and misconceptions
"half of the additive genetic variance is between families and half is within families"
Personal genomes: The case of the missing heritability
Dang, I would like to see some worked examples for how heritabilities are calculated...
Individual Differences in Executive Functions Are Almost Entirely Genetic in Origin
"As Dostoevsky reminded us, “It takes something more than intelligence to act intelligently.”" (page 141, referring to Crime and Punishment)
Genetic analysis of social-class mobility in five longitudinal studies
In its conclusion, a sentiment shared by Harden:
"A long-term goal of our sociogenomic research is to use genetics to reveal novel environmental intervention approaches to mitigating socioeconomic disadvantage."
Genetically-mediated associations between measures of childhood character and academic achievement
In Figure 7.3 of the book, a list based on that ref:
"The SNPs correlated with noncognitive skills were correlated with higher risk for several mental disorders, including schizophrenia, bipolar disorder, anorexia nervosa, and obsessive-compulsive disorder. This result warns us against viewing the genetic variants that are associated with going further in current systems of formal education as being inherently “good” things. A single genetic variant might make it a tiny bit more likely that someone will go further in school, but that same variant might also elevate their risk of developing schizophrenia or another serious mental disorder." (page 144)
"Unfortunately, the mistaken idea that genetic influences are an impermeable barrier to social change is also widely endorsed not just by those who are trying to naturalize inequality, but also by their ideological and political opponents." (page 155)
Strong genetic overlap between executive functions and intelligence
"I could quote the Bible verse from Thessalonians that was quoted to me as a child: “The one who is unwilling to work shall not eat.”" (page 212)
"There is no measure of so-called “merit” that is somehow free of genetic influence or untethered from biology." (page 247)
The case of \( B = A + \delta \) (so that \( \delta = B - A \)) is central to the paired t-test, for example. The variance of \( A \) and \( B \) could each be large, but the variance of \( \delta \) can still be small, making it easier to reject the null for \( \textbf{δ} \).
Be careful: We're trying to estimate a variance, so we're interested in the variance of the estimate of the variance, which can be confusing. Hopefully the language is clear enough here.
The obvious thing to do is to subtract and calculate the variance of the differences: \( \text{var}( \delta ) = \text{var}( B - A ) \). This is a good idea and what you should do. It's unbiased and has low variance (for the estimate of the true variance of \( \textbf{δ} \)).
Why not do it like this? Honestly I'm not sure. Maybe you don't have complete data, and you're looking for a \( \delta_1 - \delta_2 \) where the means of \( A_1 \) and \( A_2 \) are assumed equal, so you're using \( B_1 - B_2 \), but you want the (smaller) variance of \( \delta_1 - \delta_2 \)? So you'll estimate some things with complete cases even though the main effect is estimated with all \( B \) data? Why not still use this method with the complete cases? Maybe you're not using simple subtraction, but some more complex regression with multiple variables? In that case, why not use the variance of the residuals directly? Do you just want a more complicated method?
By the Variance Sum Law, \( \text{var}( \textbf{B} ) = \text{var}( \textbf{A} ) + \text{var}( \textbf{δ} ) \) if \( \textbf{A} \) and \( \textbf{δ} \) are uncorrelated, so \( \text{var}( \textbf{δ} ) = \text{var}( \textbf{B} ) - \text{var}( \textbf{A} ) \). Real data is not generally perfectly uncorrelated, however: \( \text{var}(\delta) = \text{var}(B) - \text{var}(A) - 2\text{cov}(A, \delta) \). The covariance term is zero in expectation, so estimating \( \text{var}(\delta) \) as \( \text{var}(B) - \text{var}(A) \) is unbiased. But the \( 2\text{cov}(A, \delta) \) term is a kind of noise, and adds variance to the estimate of \( \text{var}( \textbf{δ} ) \).
It isn't obvious, but using \( \text{var}(B)(1 - \text{corr}(A, B)^2) \) to estimate \( \text{var}( \textbf{δ} ) \) is much like using the Difference of Variances, but with a generally smaller (but always positive) error coming from \( \text{cov}(A, \delta) \). Proceeding in small steps:
\[ \text{var}(\delta) = \text{var}(B) - \text{var}(A) - 2\text{cov}(A, \delta) \]
\[ \text{var}(\delta) = \text{var}(B) - \left( \text{var}(A) + 2\text{cov}(A, \delta) \right) \]
\[ \text{var}(\delta) = \text{var}(B) \left( 1 - \frac{\text{var}(A) + 2\text{cov}(A, \delta)}{ \text{var}(B) } \right) \]
\[ \text{var}(\delta) = \text{var}(B) \left( 1 - \frac{\text{var}(A)^2 + 2\text{var}(A)\text{cov}(A, \delta)}{ \text{var}(A) \text{var}(B) } \right) \]
Everything to this point has been exact algebra. Now add \( \text{cov}(A, \delta)^2 \) to the fraction's numerator. (This introduces the error in the estimate; it's the step that gets us to the destination.)
\[ \text{var}(\delta) \approx \text{var}(B) \left( 1 - \frac{\text{var}(A)^2 + 2\text{var}(A)\text{cov}(A, \delta) + \text{cov}(A, \delta)^2 }{ \text{var}(A) \text{var}(B) } \right) \]
\[ \text{var}(\delta) \approx \text{var}(B) \left( 1 - \frac{ ( \text{var}(A) + \text{cov}(A, \delta))^2 }{ \text{var}(A) \text{var}(B) } \right) \]
The bit squared in the fraction's numerator is (dropping the common normalization factors, which cancel in the ratio) \( \sum{(A_i - \bar{A} )^2} + \sum{(A_i - \bar{A} )(\delta_i - \bar{\delta})} \), which is \( \sum{(A_i - \bar{A})(A_i - \bar{A} + \delta_i - \bar{\delta})} \). Since \( B_i = A_i + \delta_i \) and \( \bar{B} = \bar{A} + \bar{\delta} \), that's \( \sum{(A_i - \bar{A})(B_i - \bar{B})} \), which is \( \text{cov}(A, B) \).
\[ \text{var}(\delta) \approx \text{var}(B) \left( 1 - \frac{ \text{cov}(A, B)^2 }{ \text{var}(A) \text{var}(B) } \right) \]
So at last, by definition:
\[ \text{var}(\delta) \approx \text{var}(B) \left( 1 - \text{corr}(A, B)^2 \right) \]
QED. The error introduced is \( \text{cov}(A, \delta)^2 / \text{var}(A) \), which will tend to be considerably smaller in absolute value than the \( 2\text{cov}(A, \delta) \) error in the Difference of Variances method, because generally \( |\text{cov}(A, \delta)| / \text{var}(A) \ll 2 \). This error is always nonnegative, however, so the estimate is biased: it will be slightly too small.
Generating 1,000 datasets where each of \( A \) and \( \delta \) has 100 values drawn at random from Gaussians with variances 9 and 16, respectively, the three methods above estimate \( \text{var}( \textbf{δ} ) \) as follows:
| Method                  | Mean var(δ) estimate | var(estimate) |
|-------------------------|----------------------|---------------|
| Variance of Differences | 15.98                | 5.10          |
| Difference of Variances | 15.95                | 10.69         |
| Using Correlation       | 15.82                | 5.12          |
As expected, the Using Correlation method comes out a bit below the true value of 16 on average, but its variance is comparable to that of the Variance of Differences.
The bias is already small with just 100 samples, and gets smaller still as sample sizes grow, since the sample correlation between \( A \) and \( \delta \) tends toward zero.
Here's clumsy R code to do the experiment:
```r
trials = 1000
samples = 100
set.seed(0)
est1 = c()  # Variance of Differences
est2 = c()  # Difference of Variances
est3 = c()  # Using Correlation
for (i in 1:trials) {
  A = rnorm(samples, sd=3)  # var = 9
  d = rnorm(samples, sd=4)  # var = 16
  B = A + d                 # var = 25, in the population sense
  est1 = c(est1, var(B - A))  # same as var(d)
  est2 = c(est2, var(B) - var(A))
  est3 = c(est3, var(B) * (1 - cor(A, B)^2))
}
mean(est1)
## [1] 15.98799
mean(est2)
## [1] 15.9451
mean(est3)
## [1] 15.82234
var(est1)
## [1] 5.09602
var(est2)
## [1] 10.69393
var(est3)
## [1] 5.117983
```
George has pulled together a lot of material, some of it good. He includes introductory Python and command line, enough SQL to be confused about SQL, examples with Bitcoin prices, an idiosyncratic survey of visualization, web scraping, statistics, and the big machine learning models, including the big three boosted tree algorithms, which I appreciate. He includes some NLP, and even some on ethics.
George's own list of omissions (page 571) illustrates what he thinks is almost in scope:
Maybe the moral is that “data science” is too big a topic for one book. Trying to pack so much in has a cost. Here's the complete section on “Paired t and z-tests”:
One last type of t- or z-test is the paired test. This is for paired samples, like before-and-after treatments. For example, we could measure the blood pressure of people before and after taking a medication to see if there is an effect. A function that can be used for this is `scipy.stats.ttest_rel`, which can be used like this: `scipy.stats.ttest_rel(before, after)`. This will return a t-statistic and p-value like with other `scipy` t-test functions.
If you've never heard of a paired t-test before, it's great this book tells you about it. You can start to ask questions like: Why is this a separate test? Does it have some advantage over a regular t-test? Hopefully you also question some parts of the book, as when Bayesian methods are dismissed as “much more complex to implement than a t-test.”
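Worth knowing: the paired test is just a one-sample t-test on the within-pair differences, which is where its extra power comes from when pairs are correlated. A minimal sketch with made-up blood pressure numbers (no scipy needed; if I recall scipy's convention correctly, `ttest_rel(before, after)` works on before minus after):

```python
import math
import statistics as st

# Hypothetical before/after readings (made-up numbers, not from the book).
before = [120, 122, 118, 130, 125]
after = [118, 120, 119, 126, 124]

# The paired t-statistic is a one-sample t-test on the differences.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
t = st.mean(diffs) / (st.stdev(diffs) / math.sqrt(n))
# Compare t to a t distribution with n - 1 degrees of freedom for a p-value.
```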
This is a map that can point you in a lot of interesting directions, which is valuable!
It should be clear that 4.3 with an uncertainty \( \sigma \times 0.1 \) is not the same as 4.34 with the same uncertainty. It's because Significant Digits can't say 4.34 without meaning the uncertainty is \( \sigma \times 0.01 \) that we round to 4.3 when the uncertainty is \( \sigma \times 0.1 \). But this has a cost: good information is dropped.
Consider adding 1 + 1.4 + 1.4. If we do it in one go, we get 3.8 and then round to 4 for significance. But what if this is done in two steps? Maybe one team does 1 + 1.4, correctly reports their result as 2, and then a second team builds on that, adding 1.4 to get 3. The rounding that Significant Figures requires degrades the quality of results.
At some point it isn't worth tracking every digit in a result, but Significant Figures often encourages dropping too many. It may even give people the incorrect idea that if our uncertainty suggests two significant figures, we can't have three figures in our best guess at what the value is. We absolutely can, but this requires a system more expressive than Significant Figures to report.
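The 1 + 1.4 + 1.4 example is easy to play with in code; rounding to the ones place stands in for keeping the appropriate significance:

```python
# Round only at the end:
one_go = round(1 + 1.4 + 1.4)    # 3.8 rounds to 4

# Round at each step, as with two teams passing intermediate results:
step1 = round(1 + 1.4)           # 2.4 rounds to 2
two_steps = round(step1 + 1.4)   # 3.4 rounds to 3

# Intermediate rounding lost a whole unit of the answer.
assert one_go == 4 and two_steps == 3
```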
Many school math exercises can be answered with calculators. But how do we know the calculator is correct? The difference between inductive (“the calculator has always been right”) and deductive (“this answer is proven correct”) reasoning is important. The answer itself isn't important; it's the demonstration that the answer follows from the question.
High school geometry is sometimes pointed to as the first place students encounter the idea of proofs. This need not be. Arithmetic is a process of proof, using theorems of singledigit operations and placebased algorithms to build new results. In this way, 2+2=4 is used to demonstrate that 22+22=44, and so on. These could be called constructive proofs.
“Work” as “proof” opens up the world of math. “Checking your work” is finding a second proof. For the advanced student who might have been bored with “showing work,” there is an invitation to find further proofs, understand algorithms more deeply, and be creative, rather than being limited to mechanical processes.
Computational fluency is not without value, but math is about much more. We waste opportunities to deliver on teaching mathematical ways of thinking if students don't realize that they're constructing logical arguments in all their math classes.
"...statistical methods are a means of accounting for the epistemic role of measurement error and uncertainty..." (page x)
"...an effect—if it is found—is likely overstated and unlikely to be replicable, a paradox known as the “winner's curse.”" (page xi)
The common theme is that there would be no need to continually treat the symptoms of statistical misuse if the underlying disease were addressed.
I offer the following for consideration:
Hypothesizing after the results of an experiment are known does not necessarily present a problem and in fact is the way that most hypotheses are ever constructed.
No penalty need be paid, or correction made, for testing multiple hypotheses at once using the same data.
The conditions causing an experiment to be terminated are largely immaterial to the inferences drawn from it. In particular, an experimenter is free to keep conducting trials until achieving a desired result, with no harm to the resulting inferences.
No special care is required to avoid “overfitting” a model to the data, and validating the model against a separate set of test data is generally a waste.
No corrections need to be made to statistical estimators (such as the sample variance as an estimate of population variance) to ensure they are “unbiased.” In fact, by doing so the quality of those estimators may be made worse.
It is impossible to “measure” a probability by experimentation. Furthermore, all statements that begin “The probability is ...” commit a category mistake. There is no such thing as “objective” probability.
Extremely improbable events are not necessarily noteworthy or reason to call into question whatever assumed hypotheses implied they were improbable in the first place.
Statistical methods requiring an assumption of a particular distribution (for example, the normal distribution) for the error in measurement are perfectly valid whether or not the data “actually is” normally distributed.
It makes no sense to talk about whether data “actually is” normally distributed or could have been sampled from a normally distributed population, or any other such consideration.
There is no need to memorize a complex menagerie of different tests or estimators to apply to different kinds of problems with different distributional assumptions. Fundamentally, all statistical problems are the same.
“Rejecting” or “accepting” a hypothesis is not the proper function of statistics and is, in fact, dangerously misleading and destructive.
The point of statistical inference is not to produce the right answers with high frequency, but rather to always produce the inferences best supported by the data at hand when combined with existing background knowledge and assumptions.
Science is largely not a process of falsifying claims definitively, but rather assigning them probabilities and updating those probabilities in light of observation. This process is endless. No proposition apart from a logical contradiction should ever get assigned probability 0, and nothing short of a logical tautology should get probability 1.
The more unexpected, surprising, or contrary to established theory a proposition seems, the more impressive the evidence must be before that proposition is taken seriously.
Heavily influenced by Probability Theory: The Logic of Science by Edwin Jaynes.
"We can, for example, set s = 0.01 and be 99 percent sure. Bernoulli called this “moral certainty,” as distinct from absolute certainty of the kind only logical deduction can provide." (page 8)
"statistics is both much easier and much harder than we have been led to believe." (page 17, italics in original)
"Aristotle's Rhetoric described “the Probably” (in Greek, eikos, from eoika meaning “to seem”) as “that which happens generally but not invariably.” The context for this was his classification of the arguments one could use in a courtroom or legislative debate, where perfect logical deductions may not be available. He called this form of argument an enthymeme, to be distinguished from the purely logical form of argument known as the syllogism, which links together a set of assumed premises to reach deductive conclusions..." (page 22)
"Hume's general point [in An Enquiry Concerning Human Understanding], later referred to as the problem of induction, was that we have no way of knowing experience is a guide for valid conclusions about the future because if we did, that claim could be based only on past experience." (page 35)
I kind of like the names Clayton uses in his tables of calculations: "Prior probability" is normal, then "Sampling probability" is used for the likelihood of the data, and then he multiplies them together to get a "Pathway probability."
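To make those names concrete, here's a toy calculation in that style. The numbers are mine, not Clayton's: a hypothetical condition with 1 percent prevalence, and a test that is positive 99 percent of the time with the condition and 5 percent of the time without.

```python
# Two hypotheses, with Clayton-style names (toy numbers, not from the book).
prior = {"condition": 0.01, "no condition": 0.99}
sampling = {"condition": 0.99, "no condition": 0.05}  # P(positive test | hypothesis)

# Pathway probability = prior probability * sampling probability.
pathway = {h: prior[h] * sampling[h] for h in prior}

# Posterior for each hypothesis: its pathway's share of the total.
total = sum(pathway.values())
posterior = {h: p / total for h, p in pathway.items()}
# posterior["condition"] comes out to about 1/6: a positive result
# is mostly false positives, which is base rate neglect in a nutshell.
```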
"Whether Bayes himself believed he had disproved Hume we have no way of knowing. Some historians such as Stephen Stigler at the University of Chicago have suggested that since Bayes did not find the counterexample sufficiently convincing because it relied on some assumptions he could not justify, he delayed publishing his results. When presenting Bayes's results to the world, Price did not shy away from emphasizing their philosophical and religious significance. Contemporary reprints of the essay show Price intended the title to be “A Method of Calculating the Exact Probability of All Conclusions founded on Induction.” In his publication, he added this preamble: “The purpose I mean is, to shew what reason we have for believing that there are in the constitution of things fixt laws according to which things happen, and that, therefore, the frame of the world must be the effect of the wisdom and power of an intelligent cause; and thus to confirm the argument taken from final causes for the existence of the Deity.” That is, somewhere in the calculation of probabilities for Bayes's rule, Price thought he saw evidence for God." (page 41)
"... logical deduction is just a special case of reasoning with probabilities, in which all the probability values are zeros or ones." (page 53)
"Jaynes's essential point bears repeating: probability is about information." (page 68, italics in original)
"Base rate neglect and the prosecutor's fallacy are the same thing, and both are examples of Bernoulli's Fallacy." (page 103)
"... a new general trend of collecting data in service to the social good. John Graunt, haberdasher by day and demographer by night, had made a breakthrough in London in 1662 when he used weekly mortality records to design an early warning system to detect outbreaks of bubonic plague in the city. Even though the system was never actually deployed, it opened people's eyes to the rich possibilities of data gathering and its usefulness to the state. By the 1740s, prominent thinkers such as the German philosopher Gottfried Achenwall had taken to calling this kind of data statistics (statistik in German), the root of which is the Latin word statisticum meaning “of the state.”" (page 109)
"His [Quetelet's] goal, perhaps antagonized by the Baron de Keverberg's skepticism, was to investigate analytically all the ways people were the same or different and to create a theory of social physics, a set of laws governing society that could be an equivalent of Kepler's laws of planetary motion and other immutable principles of the hard sciences." (page 113)
This reminds me of psychohistory.
"George Pólya gave it the lofty name the central limit theorem" (page 120)
Huh!
"He [Quetelet] would later be harshly ridiculed for his love of the normal distribution by statisticians like Francis Edgeworth, who wrote in 1922: “The theory [of errors] is to be distinguished from the doctrine, the false doctrine, that generally, wherever there is a curve with a single apex representing a group of statistics ... that the curve must be of the ‘normal’ species. The doctrine has been nicknamed ‘Quetelismus,’ on the ground that Quetelet exaggerated the prevalence of the normal law.”" (page 122)
Interesting/weird idea from Galton: "statistics by intercomparison." If you can only order people on some characteristic (say intelligence), then do that and then assume it's quantitatively normal. Sort of like QQ plots. Sort of. (page 136)
On page 142 it seems to be saying that Pearson's chi-squared is for general testing of whether data comes from a certain distribution... Is that right? Does this just mean binning out data and comparing counts to expected? Maybe that's it?
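That's my understanding too: for the goodness-of-fit use, you bin the data, get expected counts from the hypothesized distribution, and sum (observed minus expected) squared over expected. A hand-rolled sketch with made-up counts:

```python
# Made-up binned counts; "expected" would come from the hypothesized distribution.
observed = [18, 55, 27]
expected = [25.0, 50.0, 25.0]

# Pearson's chi-squared statistic over the bins.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# chi2 is then compared against a chi-squared distribution whose degrees of
# freedom are (number of bins - 1), minus any parameters fit from the data.
```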
"For an experimental scientist without advanced mathematical training, the book [Fisher's Statistical Methods for Research Workers] was a godsend. All such a person had to do was find the procedure corresponding to their problem and follow the instructions." (page 153)
"He [Fisher] proved what he called the fundamental theorem of natural selection: “The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.”"
Is this in conflict with Fisher as eugenicist? It seems to be pro-diversity? At least some kinds of diversity...
Interesting comparison between choosing one- or two-sided testing, and Bayesian priors: you're not really bringing zero information to the problem.
Huh! There really is a Social Science Statistics online wizard.
"There is no coherent theory to orthodox statistics, only a loose amalgam of half-baked ideas held together by suggestive naming, catchy slogans, and folk superstition." (page 196)
"As Fisher wrote in Statistical Methods for Research Workers, “No human mind is capable of grasping in its entirety the meaning of any considerable quantity of numerical data. We want to be able to express all the relevant information contained in the mass by means of comparatively few numerical values. This is a purely practical need which the science of statistics is able to some extent to meet." (page 233)
The Fallacy of the Null-Hypothesis Significance Test
The Cult of Statistical Significance
"Harold Jeffreys first proposed the idea of Bayes factors in his Theory of Probability." (page 262)
Daryl Bem (who published in support of psi) amusingly wrote (quoted page 264):
To compensate for this remoteness from our participants, let us at least become intimately familiar with the record of their behavior: the data. Examine them from every angle. Analyze the sexes separately. Make up new composite indexes. If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don't like, or trials, observers, or interviewers who gave you anomalous results, place them aside temporarily and see if any coherent patterns emerge. Go on a fishing expedition for something—anything—interesting.
That's from Writing the Empirical Journal Article.
The ASA Statement on pValues: Context, Process, and Purpose
Statistical Inference in the 21st Century: A World Beyond p < 0.05
"Significance testing was always based on a classification of results into significant/insignificant without regard to effect size or importance; no attempts to rehabilitate it now can change that fundamental aspect nor repair the damage significance testing has already caused. This yes/no binary has well and truly mixed things up." (page 275)
"The better, more complete interpretation of probability is that it measures the plausibility of a proposition given some assumed information. This extends the notion of deductive reasoning—in which a proposition is derivable as a logical consequence of a set of premises—to situations of incomplete information, where the proposition is made more or less plausible, depending on what is assumed to be known." (page 283)
"All probability is conditional." (page 284)
"Once we jettison the bureaucracy of frequentist statistics, we can spend more time doing actual science." (page 287)
"Getting rid of the useless concepts (significance testing, estimators, sufficient and ancillary statistics, stochastic processes) will amount to cutting out probably 90 percent of the standard statistics curriculum. It might even mean giving up on statistics as a separate academic discipline altogether, but that's alright. Probability as a topic should rightfully split time between its parents, math and philosophy, the way logic does. Bayesian statistical inference contains exactly one theorem of importance anyway, and its practical techniques can be taught in a single semesterlong course in applied math. There needn't be a whole university department dedicated to it, any more than there needs to be a department of the quadratic formula." (page 287)
"We should no more be teaching pvalues in statistics courses than we should be teaching phrenology in medical schools." (page 293)
"Joseph Berkson called this the “interocular traumatic test”; you know what the data means when the conclusion hits you right between the eyes." (page 297)
That's quoted from "Bayesian statistical inference for psychological research."
That source cites "J. Berkson, personal communication, July 14, 1958" and goes on:
"The interocular traumatic test is simple, commands general agreement, and is often applicable; wellconducted experiments often come out that way. But the enthusiast's interocular trauma may be the skeptic's random error. A little arithmetic to verify the extent of the trauma can yield great peace of mind for little cost." (page 217)
"The results of experiments, particularly surprising or controversial ones, can be trusted only if the experiments are known to be sound; however, as is often the case, an experiment is known to be sound only if it produces the results we expect. So it would seem that no experiment can ever convince us of something surprising. This situation was anticipated by the ancient Greek philosopher Sextus Empiricus. In a skepticism of induction that predated David Hume's by 1,500 years, he wrote: “If they shall judge the intellects by the senses, and the senses by the intellect, this involves circular reasoning inasmuch as it is required that the intellects should be judged first in order that the sense may be judged, and the senses be first scrutinized in order that the intellects may be tested [hence] we possess no means by which to judge objects.”" (page 301)
Say a number in Significant Figures with rightmost significant digit \( D \times 10^N \) has uncertainty with standard deviation \( \sigma \times 10^N \), and assume errors are always uncorrelated.
So the number 12.3, with three significant figures, has uncertainty \( \sigma \times 0.1 \), and 2.48 has \( \sigma \times 0.01 \). Adding them gives 14.8, which has the same uncertainty as 12.3. By the Variance Sum Law, the true uncertainty is \( \sigma \times 0.1005 \), but that's pretty close to \( \sigma \times 0.1 \). In this way, the usual rule for adding with significant figures is often reasonable-seeming.
With many numbers of the same precision, however, the usual rules are more problematic. If you add 1.2 + 3.4 + 5.6 + 7.8, the result 18.0 implies \( \sigma \times 0.1 \), but in fact uncertainty has doubled to \( \sigma \times 0.2 \). Significant Figures has no way to convey this, because it only communicates in powers of ten.
Adding and subtracting 100 numbers with the same precision, then, should give a result with exactly one fewer significant figure. With 25 numbers the standard deviation could arguably “round up” to the next power of ten. It may not be common to add so many numbers with significant figures, but even with just a few, Sig Figs is a coarse approximation of correct propagation of uncertainty.
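These claims follow from the Variance Sum Law for uncorrelated errors: standard deviations add in quadrature, so n numbers of the same precision carry a combined uncertainty of sqrt(n) times the individual one. A quick check (just arithmetic):

```python
import math

sd = 0.1  # each number carries uncertainty sigma * 0.1

# Four numbers (the 1.2 + 3.4 + 5.6 + 7.8 example): uncertainty doubles.
assert abs(math.sqrt(4 * sd**2) - 0.2) < 1e-12

# One hundred numbers: uncertainty grows tenfold, so exactly one
# significant figure is lost.
assert abs(math.sqrt(100 * sd**2) - 1.0) < 1e-12

# The 12.3 + 2.48 example: combined uncertainty is about sigma * 0.1005.
combined = math.sqrt(0.1**2 + 0.01**2)
```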
The final digit is “significant but not certain” — which shouldn't mean we know nothing about what the digit is likely to be, and shouldn't mean that there's no chance the neighboring digit could be wrong. A purely digit-based interpretation is unnatural.
A particular instrument (or other source) may generate measurements with a different distribution of uncertainty. In such cases, that information should be captured specifically, and Significant Figures alone is not sufficient. For measurements where there isn't more specific information, a Gaussian distribution is a good choice.
Even when given a precise interpretation, Significant Figures is a system tied to base ten numbers and manual measurement that asks a single number to convey both value and uncertainty. There's only so much one number can mean. Sig Figs is better than not tracking precision at all, but better yet is to be explicit about precision.
I would love to find other sources or interpretations that agree or disagree with this definition. So far I haven't found anything explicit enough to really compare one way or another. References (and feedback of any kind) are especially welcome here!
Visualization code is on GitHub.
I've wanted this to be standard functionality for a long time; six years ago I briefly started a project that included it in its proof of concept, as in this screenshot:
The interactive example above is the hvPlot Scatter example saved to HTML as in Saving plots (code is on GitHub) and then copy-pasted into my blog Markdown file. The interactive plots unfortunately don't show up in GitHub's Jupyter Notebook preview.
hvPlot is just one corner of HoloViz and related packages, but the interactive scatterplot functionality alone is enough for me to be a big fan.