# Give the model epistemology

Monday October 5, 2020

Jaskie has a nice paper, *A Modified Logistic Regression for Positive and Unlabeled Learning*. I like that it extends the usual logistic function in a way that captures the idea that labels may be incorrect. I wonder about extending the idea to a multi-class softmax, and about its relationship with label smoothing.

Here's the usual logistic function.

\[ \frac{ 1 }{ 1 + e^{-x} } \tag{1} \]

What Jaskie does is introduce another term in the denominator:

\[ \frac{ 1 }{ 1 + b^2 + e^{-x} } \tag{2} \]

The \( b \) term is squared so that it is non-negative, and it is learned during model fitting. The effect is that the maximum score for a positive example is less than one.
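A minimal NumPy sketch of Equations 1 and 2 (the function names are mine, and `b` is shown as a fixed value rather than a fitted parameter):

```python
import numpy as np

def sigmoid(x):
    # Equation 1: the usual logistic function.
    return 1.0 / (1.0 + np.exp(-x))

def modified_sigmoid(x, b):
    # Equation 2: Jaskie's modification. The b term is squared so the
    # extra denominator term is non-negative; in the paper it is
    # learned during fitting, here it is just passed in.
    return 1.0 / (1.0 + b**2 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0, 50.0])
print(sigmoid(x))                # approaches 1 for large x
print(modified_sigmoid(x, 0.5))  # approaches 1 / (1 + 0.25) = 0.8
```

With `b = 0.5`, the score saturates at 0.8 rather than 1: even an unambiguous positive has only an 80% chance of carrying a positive label.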

This makes sense in the positive-unlabeled setting because Jaskie differentiates between whether an example *is* positive and whether an example is *labeled* positive. We only observe the labels, and we know that some positive examples may not be labeled, so the probability, given the example, that it is positively labeled is less than one.

It's somewhat surprising that this seems to work quite well, as demonstrated (for example) in Jaskie's Figure 6. (SLR is Standard Logistic Regression, MLR is Jaskie's Modified Logistic Regression.)

Let's speculate about extending this idea to the multi-class setting. First, just moving things around, Equation 2 becomes Equation 3.

\[ \frac{ e^x }{ (1 + b^2) e^x + 1 } \tag{3} \]

That form makes it easy to see the analogy between the usual logistic function and the multi-class softmax. Then the natural thing to suggest is Equation 4.

\[ \frac{ e^{x_i} }{ \sum_j (1 + b_j^2) e^{x_j} } \tag{4} \]
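Equation 4 could be sketched like this; the function name and the max-subtraction stabilization are my additions, and the per-class learned \( b_j \) is the speculative part:

```python
import numpy as np

def modified_softmax(x, b):
    # Hypothetical multi-class analogue of Equation 4: each class j
    # gets its own b_j, which inflates that class's denominator term,
    # so the maximum attainable score for class i is 1 / (1 + b_i**2).
    x = x - np.max(x)                 # standard numerical stabilization
    weights = (1.0 + b**2) * np.exp(x)
    return np.exp(x) / np.sum(weights)

x = np.array([50.0, 0.0, 0.0])        # class 0 dominates the logits
b = np.array([0.5, 0.0, 0.0])
print(modified_softmax(x, b))         # class 0 score caps near 0.8
```

One consequence worth noting: the scores no longer sum to one whenever any \( b_j \) is nonzero, which arguably fits the interpretation that some probability mass belongs to unlabeled positives.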

The semi-supervised setting, in which there's a pile of unlabeled data in addition to the labeled data, is pretty common, especially when labeling is expensive. If something relatively simple like this gave performance benefits similar to Jaskie's binary case, it could easily become very popular. (Assume you also have a "garbage" class, in the open-set sense.)

There's probably a similar extension to keep the negative-example scores from going all the way to zero, if you think some of the true negatives might be incorrectly labeled positive (or otherwise mislabeled, in the multi-class setting).

I wonder about the relationship between these techniques and label smoothing... With smoothed labels you still have fixed (or at least sort of balanced) target probabilities, whereas the techniques above allow learning where the targets should be... But I don't think you'd get label smoothing's limits on logit growth with the techniques above... It would be neat if these techniques gave you good calibration *and* informative logits, in the distillation sense. Hmm!
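The contrast can be made concrete with a small numeric sketch. The label-smoothing targets below use the standard uniform formulation; the `b` values are hypothetical fitted parameters, not results from anywhere:

```python
import numpy as np

K = 3          # number of classes
eps = 0.1      # label-smoothing strength

# Label smoothing: targets are fixed ahead of time by a chosen eps.
smoothed_target = np.full(K, eps / (K - 1))
smoothed_target[0] = 1.0 - eps        # true class gets 1 - eps
print(smoothed_target)

# Modified-softmax ceiling: the per-class cap 1 / (1 + b_j**2) is
# learned during fitting rather than fixed in advance.
b = np.array([0.33, 0.0, 0.0])        # hypothetical fitted values
ceiling = 1.0 / (1.0 + b**2)
print(ceiling)                        # class 0 caps near 0.9
```

The smoothed targets are the same for every example of a class by construction, while the learned ceilings can end up wherever the data pushes them.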