# AkshayDiary

## April 2016

### Using Partitioning Hypotheses for Uniform Convergence

If we have a set of S specialists that partition the data, then the failure of each specialist's error estimate is independent of the failure of any other specialist's. So uniform convergence is a conjunction of independent events in this case. Suppose each such event has failure probability ${\displaystyle \delta \leq \frac{1}{2}}$, and ${\displaystyle S := \frac{C}{\delta}}$.

Then the chance of uniform convergence is ${\displaystyle (1 - \delta)^S = (1 - \delta)^{\frac{C}{\delta}} = \left((1 - \delta)^{\frac{1}{\delta}} \right)^C \geq 4^{-C} }$

And ${\displaystyle C = K \log \left( \frac{1}{1-\delta} \right)}$

## January 2016

Working on first part of ITA talk (second will depend on what does best experimentally). Also working on two main fronts towards submissions. First (boosting) is a definite submission. Second is the direction that includes multi-phase/active learning algorithms.

### Experiments with a boosting-like algorithm

Summary so far:

• Baselines: supervised boosting approaches, other approaches, and our algorithm in NIPS
• New algorithm is very simple:
• Assign pseudo-labels to unlabeled examples each iteration: clipped examples get assigned their sign, and hedged examples random noise.
• Concatenate this with the labeled dataset, and proportionally weight the points.
• Call the weak learner with this weighted dataset to get the next classifier in the ensemble.
• New algorithm (is it a boosting algorithm? I think so) beats all other approaches in most cases after enough iterations of boosting (~100 seems to be the median that's enough).

--Yoavfreund 13:12, 25 January 2016 (PST) The term "Boosting" is now associated with any kind of coordinate-wise gradient descent. A narrow definition of boosting is "An algorithm that can transform a weak learner (in the PAC sense) to a strong learner". This algorithm does not satisfy this criteria. Beyond that, this a question of PR. Not a bad idea to use the term adaboost or boosting, but we have to make sure that the name indicates clearly that this is a semi-supervised algorithm.

• Step size is decreasing: 1/sqrt(t) , which works fine across a range of settings.
• "Line search" for step size, i.e. taking tiny steps with one hypothesis until its edge falls below 0 -- does not work at all empirically!
• Variability in the convergence is caused by varying the regularization used for the weak learning class; too complicated a class is bad, but too simple also appears to lead to quite slow convergence, even with correction in terms of convergence bounds.
• Currently using decision trees as the weak learning class, and adjusting the maximum depth to get the above conclusions about class complexity. Decision stumps don't converge very quickly on the datasets I've tried, because there is a bug (or an oddity?) in the decision tree classifier class.
• Choosing from a finite ensemble, of the trees in a pre-trained random forest and their complements, does not work well.

--Yoavfreund 21:38, 25 January 2016 (PST) Please provide more details:

• on the choice of the weak learning class: Are you using nodes in a tree? How is the tree constructed? Are you trying something else like balls?
• On the experiments: information about the datasets you are experimenting on, how many examples, how many labeled, the dimension of the problem, the number of classes, the change in performance as the unlabeled set is improved, comparison to other learning methods.

Also working on paper writing here:

• I will first motivate the algorithm as being the coordinate descent update in the space of all classifiers in the ensemble. But since this can effectively be combinatorially big (and scales UP with the amount of unlabeled data), it is not a good way to get meaningful convergence bounds here.
• Instead we will use the drifting games analysis. First we solve the fixed-time analysis, showing that the potential is an expectation over random walks. Then we remark that the whole thing can be done in variable time, which is an analysis of a variant of this algorithm with a finite difference being used instead of a derivative (I can prove this). I may be able to bound the gap between the two using the convexity of the slack function.
• Finally, show very good experimental results as above. One possible bonus: I've started playing a bit with TensorFlow, and might be able to throw in deep nets as the classifiers and run a big job on EC2 (I have datasets in mind; to discuss this week).

### Experiments with a multi-phase algorithm

--Yoavfreund 18:19, 7 February 2016 (PST) I don't quite follow. What is the space partitioning and how is it generated? Here is a simple idea: run a tree learning algorithm to generate a tree with some MAX_DEPTH (using the labeled examples and the hallucinated labels). Consider each node in the tree as a specialist. Compute a lower bound on the true correlation b by subtracting from b something like:

depth of node / sqrt(number of labeled examples in support)


Add only those specialists for whom b>0. To add them use either a single gradient step or a line search for the minimum.

## Recent progress (Nov. 2015)

(As per what we discussed at Nov. 4 meeting, with many updates and additions)

Summary (see below for details on both):

1. Abstaining: almost done, had an idea for a basic greedy sampling strategy.
2. Incremental/algorithmic: Meanwhile in thinking about how a label query can deliver targeted information to improve predictions on other test examples, I realized something involving specialists. Given an ensemble H for the incremental algorithm, we can just consider all possible specialists (that aren't too small) derived from any hypothesis in H. This could be statistically plausible, and is computationally reasonable.

• Abstaining:
• Almost done polishing draft of theory paper (colt again?), but not final readable yet, need to reorganize it a bit. [1]
• It is interesting to observe the marginal effect of raising the abstain rate a small amount ${\displaystyle \epsilon }$ from zero. Calculations basically yield that (a) among minimax optimal strategies, this is equivalent to abstaining ${\displaystyle \epsilon }$ on the minimum-margin example only; (b) the error probability that would be incurred on this ${\displaystyle \epsilon }$ fraction of the data is always at least the error on the rest of the data.
• Active learning improves performance in two ways: (1) by querying "difficult" examples to avoid being judged on them, and (2) by intelligently using the label information. This shows that even if we discount the effect of the second, the first is significant (at least for a polynomial improvement in #labels). So we can loosely motivate a simple, basic greedy min-margin sampling strategy without assumptions on label structure.
• Incremental/algorithmic:
• Allow arbitrary specialists to be added by calling a weighted batch binary classification oracle each iteration to determine the next hypothesis to add. This happens to be like the weak learner in boosting.
• Specialists from trees / examples; add all of the combinatorially many which predict on "enough" data. Uniform convergence on these is guaranteed by PAC-Bayes analysis with uniform prior over examples, noting that the labeled quantity to be estimated (b^T sigma) can be written as an average over examples. (I have to iron out what role ||sigma||_1 plays here, since it's not a distribution; needs bounding separately). To be precise, should use a PAC-Bayes Bernstein inequality to calculate b, so that specialists that are correct on their whole domain need just 1 label.
• Analysis (to write up): I can prove that greedy coordinate descent like this converges at a O(1/t) rate after t examples, to the minimum of the slack function. If the hypothesis class is expressive enough, this will be (close to) zero, and we will have a strong learning algorithm. This is likely, because our hypothesis class contains all possible large specialists derived from a possibly infinite ensemble; it is very expressive.
• When estimating b, calculate deviation bounds for each leaf separately to minimize the effect of small leaves (this seems to improve performance a little)
• Early stopping regularization?

## Recent Progress

(all files mentioned are in the github repo abstain-mmx)

5/21/15

Cython+out-of-core almost working as of Thursday morning, rewrote the code to do so and it is a lot more linear now, dividing the process into 3 parts (labeled data, unlabeled data, and test) with the labeled data being further split into two:

• Train the forest (on just enough labeled data to learn a fine partitioning of the data, say into 1000 per tree).
• Calculate b (and ${\displaystyle \rho}$) from the rest of the training data

I am making this process, except the initial training step, completely sequential and out-of-core.

Visualizations and ancillary observations/statistics that I will include (pending space cuts):

• Margin distributions, on at least a few datasets
• Gap between V and true error as measured with unknown test labels. This is an indication of how precise the game is, and if the constraints are ~feasible, low gap => good self-bounding, this is a big deal because p << n!
• Illustration of what happens with a very fine partitioning, when statistics are ignored, and test labels are used, ignoring the statistics.
• Illustration of what happens with a lot of random noise thrown in, and assumed by the algorithm.

5/10/15

Currently, I am preparing the COLT camera-ready because the deadline for this is the same as for NIPS, and so I would like to send this out asap.

This involves adding the example we made for the rebuttal, making minor changes as reviewers suggest, and editing the figure for the subgradient conditions. I am also keeping a separate copy that I'd like to merge, doing something I have found much cleaner and simpler in the proof of the minimax equilibrium - using direct Lagrangian duality and subgradient conditions, instead of LP duality. Reasons for this:

• Repartitioning constraints into LP form hides the fact that we are only really optimizing in p << n dimensions, because the LP has dimension O(n+p).
• This allows all the duality in the paper to be applied only via the minimax theorem, and also seems easier for people to understand, I believe.
• The Lagrange duality formulation can be extended by direct computation to other convex loss functions (I have done this, as we discussed before).

The computation of z^* is the only proof that becomes more opaque this way, but it follows from KKT conditions.

### Adding Specialists to Beat the Majority Vote

#### Margin Distribution

When the leaf-specialist algorithm is used, the average predictions (AP) of the test examples w.r.t. ${\displaystyle \sigma^*}$ end up being concentrated more and more on {-1,0,1}. Here is an extreme case, when leaves are grown out fully without regard for statistical considerations, so that p > n (more discussion in Sec. 3 of gameSpecialists.pdf): [2]

Here is what happens in the more typical situation when we take statistics into account and limit the minimum size of each leaf: [3]. The margin there is concentrated on +/- 1, but there is still dispersion.

It turns out this is the typical case (another typical example, with leaf specialists added, is at [4]. ) Typical tree-learning algorithms are only competitive when they are run all the way to overfitting; we cannot overfit to run our algorithm, but our bias correction results in better performance anyway.

I expected that the more information the constraints contain, the fewer non-zero-AP hedged examples there will be. So I believe I understand why the hedged examples are concentrated at zero; the constraints specify the labeling well enough that little extra hedging against different solutions is required. But I do not really understand why the clipped examples have APs concentrated at +/- 1, rather than sometimes being higher in magnitude.

Here is another example of the protein resp dataset again: [5]. This is the one in the doc table. It makes 22.6% error overall, but only 10.1% error on the 52.9% of the data that is clipped.

The log loss error is comparable or better to SGD, but results in a far different margin distribution, much smoother as we might expect: [6]. But the results are similar.

A fair comparison to random forests, involves growing out trees well past statistical overfitting. We therefore try this for

Other points/summary:

• Good news: we have an algorithm that works.
• Better news: clipped examples consistently have significantly lower error than typical.
• Changes to b are overwhelmed by changes to the large number of other variables in the problem.
• We are not quite competitive to RFs in log loss or square loss - I tried both. But those algorithms are pending revision.
• Have not tried data-poor situations because I am not sure how to justify them.

[Update: minibatch SGD is now working, it took a few days to go through several open-source LP solvers with tons of compatibility and speed issues, finding none to be satisfactory (in hindsight, this was unnecessary). So I wrote a minibatch SGD routine to do the job. It often finds a quite different ${\displaystyle \sigma^*}$ to the LP solver, but generates ~ the same predictions, as confirmed by repeated testing on several datasets. We use cross-validation to find an SGD learning rate that works.]

#### Leaf Specialists for Trees and Forests

This is described in gameSpecialists.pdf, along with experimental results. On a high level, there appear to be two phenomena at work here:

• The tree structure hierarchically partitions the data into somewhat pure regions.
• Our algorithm does error correction on each of these regions, and fuses them into an optimal prediction consistent with all constraints on the game.

#### Questions

• Is there a better way to subsample leaves that will be the best specialists? Surely not all are required.
• Is it worthwhile to use some base unsupervised hierarchical clustering (e.g. Ward's) on the unlabeled data to generate a partition, just to illustrate the benefits? This is reminiscent of "Hierarchical Sampling for Active Learning" by Daniel and Sanjoy.
• How well do we perform if we only use some of the leaves, not all? Just the pure ones?
• How can we use alternating decision trees' intermediate leaves?
• What if we minimize quantile loss over the forest? How does this compare to the state-of-the-art "Quantile Regression Forests"?

## Plan for Efficient Optimization of Slack Function

• Uniform convergence bound on slack function, over weightings in the confidence set. Confidence set defined ignoring computational constraints.
• Conf set = ${\displaystyle \{ \sigma \geq 0^p : \gamma (\sigma) \leq \gamma(\sigma^*) + \epsilon \} }$
• Assumption: best ${\displaystyle \sigma}$ is significantly better than ERM
• If time: localize only to confidence set, because slack function is unbounded
• Devise an approximation of the confidence set that admits an efficient sampling rule for unlabeled data
• Margin-based looks appealing, and works by duality-based arguments.
• How can we do better than SGD? Independent of the other problems
• Active learning
• How do we use less labeled data? Active learning by successively focusing on nested subsets of the data?

## Papers on active learning with perceptrons

• Analysis of Perceptron-Based Active Learning / Sanjoy Dasgupta, Adam Tauman Kalai and Claire Monteleoni

A nice analysis to basically show that fast convergence is possible for the active perceptron. But uniform distribution on the unit sphere assumed. This leads to an analysis of the angles between a random data point on the sphere and hypothesis vectors at various times. This will not directly translate for us.

We instead need there to be a low probability of data "close" to the optimal hypothesis. This will lead to quick estimation of the integral and quick convergence of the integral with ${\displaystyle \eta }$.

We can take the realizable case as a starting point, because it is easier in some ways. For instance, if we assume that the theoretically best separator is given to us, then there are less smoothness/monotonicity problems with moving away from this optimal predictor.

But how to move around the nonuniformity of the linear classifier? We cannot just take angles, as is the case with this analysis.

• Worst-case analysis of selective sampling for linear classification / Cesa Bianchi, Gentile

This is a randomized algorithm for activizing the perceptron, based solely on the margin. The above comments regarding the bias, etc. still apply, though. Also, it should be possible to do this deterministically.

## General Active Learning Research Plan

• Two ways of making infinite hypothesis classes tractable: ${\displaystyle \epsilon }$-covers and approximation of the exponential weights integral (Laplace, maybe?)
• Also two algorithms which can be analyzed: Castro-Nowak and Perceptron
• Will write up what happens with Castro-Nowak using ${\displaystyle \epsilon }$-covers: should be able to show efficient active learning without Tsybakov conditions (i.e. even in cases with unfavorable conditional label probabilities).
• Can also work on what happens with perceptron using these covers