The road to wisdom, BFGS NLP

Page content

First, let me tell you about the newsletter button I added :D . You can subscribe to get notified when I post :). It’s down and right, please feel compelled to press it :D .

The road to wisdom?—Well, it’s plain

and simple to express:


and err

and err again,

but less

and less

and less.

— Piet Hein

The above words are from a grook I like.

Sometimes I spend all day trying to count

the leaves on a single tree. To do this I

have to climb branch by branch and

write down the numbers in a little book.

So I suppose, from their point of view,

it’s reasonable that my friends say: what

foolishness! She’s got her head in the clouds


But it’s not. Of course I have to give up,

but by then I’m half crazy with the wonder

of it — the abundance of the leaves, the

quietness of the branches, the hopelessness

of my effort. And I am in that delicious

and important place, roaring with laughter,

full of earth-praise.

-Mary Oliver,Foolishness? No, it’s not

Autumn is great in Romania, lots of beautiful light and nice colored leafs. I enjoyed great walks in the park, but naturally I started thinking about: what would it take for a machine to detect leafs in a picture. I actually don’t know how natural this thought is :)) but I said it’s a good problem for the weekend, brushing up on my maths and maybe if I could write it in a tutorial manner it could help somebody.

This article will not be about poems, of course, but instead it will be about implementing a technique that enables you to get to where you want to be :). Following the MUI article I felt like I needed to do something about nonlinear optimizations.

So we basically have 2 parts: getting the characteristics of an image and learning which characteristics describe a leaf and which don’t.

This being a binary decision problem (is it a leaf or is it not) we don’t need segmentation or something fancy so we could use GIST descriptors for the characteristics.

If you can not understand anything in the above sentence, don’t worry, we will talk about Gabor extraction and characteristics, in a future article (probably the next one :) ) .

We will discuss the second part:

We have an n dimensional descriptor extracted from a picture (float/double/real array) how do we classify it into 2 categories?

We could think of this like we have a bunch of classes (types of stuff) and for any picture we have a bunch of probabilities for each class. Basically given a picture (characteristic vector) we have probabilities associated to all the classes. In our case, the problem being binary, we just have 2 classes: Class A “the image is a leaf” and Class B “the image is not a leaf”. The classes are exclusive so the probability that our image is in class A is 1 - the probability our image is in class B. Easy enough soo far.

We established the main idea, now in comes:

Logistic regression

Why Logistic regression?

For our problem (binary classification) there are a lot of methods to achieve what we want: SVM, logistic regression, neural networks etc.

Empirically, machine learning people say that when starting with a binary classification problem you first test logistic regression (because it is the easiest to implement and most of the cases it is sufficiently good) then if this does not work you try linear svm and then other methods like neural networks etc.

What is Logistic regression?

Regression can be described by this formula :

$$ E(Y|x) = {\beta}_0 + {\beta}_1 x$$

We can translate this equation to the following sentence : Given a variable x and an outcome Y, the expected probability that the variable x is in the category Y (or the variable x is connected to the outcome Y) is a linear combination of x and betas.

Our problem then revolves around finding the beta coefficients that best fit our known expectations, so we can then apply the formula to get new expectations. The learning part gets the beta coefficients and the prediction part applies the formula using the betas (the eternal recipe for almost all machine learning algorithms).

Note that x values range between $-\infty$ and $+\infty$.

Now the way we link $E(y|x)$ to the linear beta combination is the key factor in deciding between regression types.

Possible choices: $$ y = \sum{{\beta}_i}$$ linear regression $$ y = \frac{1}{1-e^{\sum{{\beta}_i}}}$$ logistic regression $$ y = e^{\sum{{\beta}_i}}$$ poisson regression

There are a lot of models to choose from (Cox and Snell 1989) but the most chosen is the logistic distribution. When facing theories about logistic regression always bare in mind that it is actually chosen this way, and it is not derived. If you are inclined to mathematically reach the logistic formula you are thinking about it the wrong way. Basically it’s all about the link function you choose in regression, of course this choice needs to satisfy the linear equation but it is not the only choice.

Why model E(Y|x) using a logistic distribution?

  • From a mathematical point of view, it is extremely flexible and easily used
  • It lends itself to a clinically meaningful interpretation

So we then go for

$$ \frac{e^{{\beta}_0+{\beta}_1 x}}{1 + e^{{\beta}_0 + {\beta}_1 x}} $$

We can plot this function (it is also called logit or sigmoid ), it is S-shaped and beautiful.

Let $$ P(x) = \frac{e^{{\beta}_0+{\beta}_1 x}}{1 + e^{{\beta}_0 + {\beta}_1 x}} $$ so we can write things compactly

Back to our model

If we log P(x) we get $$ \log{\frac{P(x)}{1-P(x)}} = {\beta}_0 + {\beta}_1 x$$

Nice! So we find out that our choice has the many desirable properties of the linear regression model (not by chance mind you :) ). It is linear in its parameters, may be continuos and may range from $-\infty$ to $+\infty$ depending on the range of $x$.

Finding Beta (fitting the logistic regression model)

So we have a sample of $n$ independent observations: a bunch of pairs $(x_i,y_i)$ where, $ x_i$ is a characteristic vector and $y_i$ is the the inclusion of $x_i$ in class $y_i$ .

So usually we get this system of equations where we can apply least square method to get the solutions, but because our model has dichotomous outcome we can’t do that. We need to use another method, and the most used is called maximum likelihood. This method tries to maximize the probability of obtaining the observed set of data by tweaking the unknown parameters (in our case Betas). To use this method we will first construct the likelihood function. This function expresses the probability of the observed data as a function of the unknown parameters.

Getting the maximum likelihood function

In our case Y is 0 or 1 (because we have 2 categories). This also implies that we have P(X) for Y = 0 and 1-P(X) for $Y = 1$ . So we form this function: $$ l(\beta) = \prod{P(x_i)^{y_i} [ {1-P(x_i)}]^{1-y_i}} $$

It’s intuitive enough, does not need a lot of explanations. Now we see that we have a product of terms that have different powers. Solving this would be really hard. Fortunately we only need the maximum so we can apply a function that preserves monotony and transforms powers into multiplications and multiplications into sums. Yes, you figured it out, we log everything !

$$ L(\beta) = log(l(\beta)) = \sum{y_i \log{P(x_i)} + ({1-y_i}) + \log{(1- P(x_i))}}$$

Looks nicer!

Now to solve this (find beta that maximizes $L(\beta)$ ) or to put it bluntly find $argmax_{\beta} (L)$, we differentiate $L(\beta)$. We know that it is sufficient because $L(\beta)$ is quadratic so finding it’s maximum implies finding where the derivative is zero.

Now we have a different system to solve: $$ [Y_i- P(x_i)] = 0$$

and $$ x_i[Y_i- P(x_i)] = 0$$

We can tell that this function is quadratic (and if you don’t believe me, just look at the plot)

Solving these equations

You can’t find a closed form for solving these equations. The main reason is the mix of transcendental functions (P(x)) with rational functions. Because you can’t figure stuff mathematically you need to do some kind of iterative solver that reaches a good approximation of the solution.

We have this nonlinear system and we need a solution. In come nonlinear optimization techniques! Knowing these are quadratic we can easily figure out we should use a quadratic optimization technique :D . Fortunately we have some at our disposal :D.

I understood a lot of software programs use iterative weighted least squares procedure, but I don’t really like it and it is shown that BFGS is faster converging for likelihood equations (my implementation takes about 16 iterations).


People usually stop here with the explanations because they say that in this day and age, we just use algorithms that some other people built and it’s enough. Because I like going deeper, I will also try to explain the BFGS method.

Firstly we need to understand that Newton was awesome :) . He figured out a method for finding roots of a differentiable function that is actually still used. Most quadratic methods today are improvements of his method. Let’s see what the Newton method does:

For the single variable case:

We basically try to get a sequence of steps from an initial step $x_0$ that converges towards some $ x^* $ value that satisfies $ f(x^{* })= 0 $ .We call $x^{ * }$ a stationary point.

To do the above we use the Taylor expansion of the function:

$$ f(x) = f_T ({x_n + \Delta x}) \approx f(x_n) + f^{’}(x_n) \Delta x + \frac{1}{2} f^{’’}(x_n)\Delta x^2 $$

So we want $\Delta x$ such that $x_n + \Delta x$ is actually $x^*$

$$ 0 = \frac{d}{d \Delta x} ( {f(x_n) + f^`(x_n) \Delta x + \frac{1}{2} f^{’’}(x_n) \Delta x^2}) = f^{’}(x_n) + f^{’’}(x_n) \Delta x $$

So we get this update rule: $$ x_{n+1} = x_n + \Delta x = x_n - f^{’}(x_n)/f^{’’}(x_n) \Delta x$$

We see that we need f to be continuos and twice differentiable.

We of course, need to do this in higher dimensions (our x has N elements) so the formula extends well if we replace the first derivative with the gradient vector and the second derivative with the hessian matrix.

Multi variable newton method

So if x is an n-dimensional vector (like in our case) we have the newton update rule:

$$ x_{n+1} = x_n - [H_{f(x_n)}]^{-1} \nabla f(x_n)$$

Where $H_{f(x_n)}$ is the Hessian matrix of $f$ evaluated in $x_n$ and $\nabla f(x_n)$ is the gradient evaluated in $x_n$. The Gradient is the vector of partial derivatives of f(x) respect to $x_i$ so $\nabla f(x_n) \in \mathbb{R}^N$

$$ (\nabla f)_{i} \equiv \frac{\partial f}{\partial x_i} $$

The Hessian matrix is the matrix containing all the combinations of partial second derivatives so $H_{f(x)} \in \mathbb{R}^{NXN}$

$$ H_{ij} \equiv \frac{\partial^{2} f}{\partial x_{i} \partial x_{j} } $$

From the multivariate formula above we understand we need to inverse a big matrix (the hessian). This is not an easy task and event if the newton method in theory converges really fast (quadratic convergence rate) the inversion of the matrix slows the process down soo fast that the rate of convergence does not matter that much.

So people started using newtons method and improving it, mainly doing 2 things:

  • choosing an $\alpha$ (step size) between 0 and 1 to weight the update (just because converting from analytical to discreet may imply that the step is to big and we may miss the point, reach a forever oscillating situation) . Choosing a good $\alpha$ is an optimization problem in itself.

$$ x_{n+1} = x_n - \alpha [H_{f(x_n)}]^{-1} \nabla f(x_n) , \alpha \in (0,1)$$

  • finding ways of approximating the hessian inverse so we don’t actually need to calculate it (a lot of cvasi-newton methods)

For an easier comparison of cvasi-newton methods for line search, we usually look at the way the search direction is calculated. So we make a note of the Newton search direction:

$$ p^N_{k} = -\nabla^2 f_k^{-1} \nabla f_k$$

BFGS cvasi-newton method

Broyden-Fletcher-Goldgarb-Shanno algorithm is named after all the people that found it in the same time. This is a really cool story about how sometimes minds synchronize and come up with the same idea. It’s the most popular quasi-newton algorithm.

So the derivation starts from forming the quadratic model of the function we wish to minimize/maximize :

$$ m_k(p) = f_k + \nabla f^T_k p + \frac{1}{2} p^T B_k p $$

note that $B_k$ is actually an approximation of our Hessian, so it is symmetric and positive definite that will be updated per iteration.

So we find that bfgs has this search direction

$$ p_k = -B_k^{-1} \nabla f_k$$

This gives us the update formula for our betas (x here): $$ x_{k+1} = x_k + \alpha_k p_k $$

$\alpha_k$ needs to satisfy the Wolfe conditions. It is actually the step size which we will determined per iteration. Basically we replace the Hessian matrix with our approximation.

How do we update our Hessian approximation?

Davidon proposed to update it in a simple manner to account for the curvature measured during the most recent step. Suppose that we have generated a new iterate $x_{k+1}$ and wish to construct a new quadratic model, of the form:

$$ m_{k+1}(p) = f_{k+1} + \nabla f^T_{k+1} p + \frac{1}{2} p^T B_{k+1} p$$

What requirements should we impose on $B_{k+1}$ based on the knowledge we have gained during the latest step? One reasonable requirement is that the gradient of $m_{k+1}$ should match the gradient of the objective function f at the latest two iterates $x_k$ and $x_{k+1}$. Since $\nabla m_{k+1}(0)$ is actually $\nabla f_{k+1}$ , the second of these conditions is satisfied automatically. The first condition can be written:

$$ \nabla m_{k+1} (-\alpha_k p_k) = \nabla f_{k+1} - \alpha_k B_{k+1} p_k = \nabla f_k$$ So we actually have: $$ B_{k+1} \alpha_k p_k = \nabla f_{k+1} - \nabla f_k$$

So Let $s_k = x_{k+1} - x_k$ and $y_k=\nabla f_{k+1} - \nabla f_k$

$$ B_{k+1} s_k = y_k $$

So this is known as the secant equation

Given the displacement $s_k$ and the change of gradients $y_k$. the secant equation requires that the symmetric positive definite matrix $B_{k+1}$ map $s_k$ into $y_k$. This will be possible only if $s_k$ and $y_k$ satisfy the curvature conditions $$ s_k^T y_k > 0 $$

Ok so there are multiple $B_k$ that satisfy the secant equation. To narrow them down and choose a good approximation we formulate this problem:

$$min_B ||B - B_k ||$$ $B = B^T$ and $B s_k = y_k$

So we try to minimize the difference between B and $B_k$. Different norms in this inner minimization problem, gives rise to different quasi-newton methods. A norm that we use to simply our lives is the weighted Frobenius norm. $$||A||_w \equiv ||W^{\frac{1}{2}} A W^{\frac{1}{2}} ||_F $$ where $|| \dot ||_F$ is defined by

$$ ||C||F^2 = \sum \sum c{ij}^2 $$

So the weight $W$ ca be chosen as any matrix satisfying the relation $Wy_k = s_k$

So we get the solution:

$$ B_{k+1} = (I - \sigma_k y_k s_k^T)B_k(I - \sigma_k s_k y_k^T) + \sigma_k y_k y_k^T$$ and $$ \sigma_k = \frac{1}{y_k^T s_k} $$

so $$ H_k = B_k^{-1}$$

This is great! We can calculate the search direction just by matrix-vector multiplication. Using the Sherman-Morrison-Woodbury formula, we can derive the following expression for the update of the inverse Hessian approximation $H_k$

$$ H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}$$

So this is good but we can do better, this is not actually the BFGS method, but it is DFP.

We trace back and make a different argument solving the same problem we did above for the inverse Hessian matrix.

$$ H_{k+1} y_k = s_k$$ $$ min_H || H - H_k||$$ $ H = H^T$ and $H y_k = s_k$ W do the same weighted Frobenius norm and we get the solution:

$$H_{k+1} = (I - \rho_k s_k y_k^T)H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T (BFGS)$$ $$ \rho_k = \frac{1}{y_k^T s_k} $$


So we need to choose the initial $H_0$ and we can implement the algorithm. Unfortunately picking the initial approximation has no magic formula, I use the identity matrix because it is easier, but some professional software use an approximation of the inverse hessian calculated by finite differences at $x_0$. Using the identity matrix could sometimes lead to bad hessian approximation that make the matrix non semi-positive definite, as you will see in my code, when this happens we need to drop the approximation (this can lead to not finding a solution).

BFGS algorithm

Given a starting point $x_0$, convergence tolerance $\epsilon > 0$, inverse Hessian Approximation $H_0$

  • $ k = 0 $

  • while $ ||\nabla f_k || > \epsilon$

    • Compute search direction $p_k = -H_k \nabla f_k$

    • Set $x_{k+1} = x_k + \alpha_k p_k$ where $\alpha_k$ is computed from a line search procedure to satisfy the Wolfe conditions

    • Define $s_k = x_{k+1} - x_k$ and $y_k = \nabla f_{k+1} - \nabla f_k$

    • Compute $H_{k+1} = (I - \rho_k s_k y_k^T)H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T (BFGS)$

    • $k = k +1$

We notice that there are no $O(n^3)$ operations in the above algorithm ! The rate of convergence is super-linear, and the algorithm is neat!

So we have the whole thing figured out, this is the way we will get all our Betas :)

On to the implementation


I didn’t feel like writing my own linear algebra library because there are a lot of them already, and it’s not all that interesting for me, so I used Eigen . I liked how the code reads when using Eigen but I don’t know for sure if it’s the fastest. I saw that you can compile it with BLAS and it’s good enough for me :) . I will probably do a gpu version of the computation heavy parts (if it is necessary) but it’s fast enough on my CPU so far.

The implementation is straight forward. The only notable difference between it and the presented theory is that we modify the input to have a column of ones so we can do our formulas easier (we will not hold $ \beta_0$ separately so our probability function will actually be

$$ p = e^{\frac{x \beta}{1 + e^{x \beta}}} $$

So we have the Log likelihood function and its gradient implementation

    class FunctionLogLikelyhood
	PVMatrix allx;
	PVMatrixT allxT;
	ObservationsVector	 y;
	PVector TestBetas;

	// don't forget, first value of x (or last) needs to be 1
	void GenerateTestData()
		y	 = ObservationsVector(SAMPLES_SIZE,1);

		TestBetas = PVector::Random();

		for (unsigned int i = 0; i < SAMPLES_SIZE; i++)
			PVector genvec = PVector::Random();

			genvec(0, 0) = 1.f;
			y(i) = probabilityFunction(genvec, TestBetas);
			allx.row(i) = genvec;
		allxT = allx.transpose();
	// populate allx and y
	void Load()

	//p = e^(x * beta)/(1 + e^(x * beta))
	MReal inline probabilityFunction(const PVector& x, const PVector& betas){return 1. / (1.f + exp(;}

	//get the log likelyhood at Beta
	MReal inline Eval(const PVector& betas){return (1.0 / (1.0 + exp(-(allx * betas).array())) - y.array()).matrix().squaredNorm();}

	//get log likelyhood gradient at beta
	void Gradient(const PVector& beta, PVector& gradientOut)
		const ObservationsVector Values = (1.0 / (exp(-(allx *beta).array()) + 1.0)).matrix() - y;
		gradientOut = allxT*Values;


and our BFGS algorithm with a small unit test system (because we can’t test on real data until we do the image characteristics):

	FILE * f = fopen("tests.txt", "w+");
	unsigned int test = 0;
	unsigned int fails = 0;
	for (; test < NUMBER_OF_TESTS; test++)

		PVector  results;

		HessianMatrix BKinv = HessianMatrix::Identity();

		// we start with a random initial estimation
		PVector Betas = PVector::Random();

		const unsigned int MAX_NUM_ITERATIONS = 100;

		MReal gradientNorm = 1.f;
		MReal firstStepAlpha = 0.05f;
		unsigned int iter = 0;
		for (; iter < MAX_NUM_ITERATIONS && gradientNorm >= 0.001; iter++)
			PVector GradientK;
			g_function.Gradient(Betas, GradientK);
			gradientNorm = GradientK.norm();
			//Compute the search direction
			PVector PK = -1 * BKinv * GradientK;

			MReal phi =;

			if (phi > 0) // hessian not positive definite
				BKinv = HessianMatrix::Identity();
				PK = -1 * GradientK;

			MReal alpha = (iter==0)? firstStepAlpha : line_searchAlpha(PK, Betas);
			//Update estimation of beta
			Betas = Betas + alpha * PK;
			// get the sk -> this is Betas_(k+1) - Betas_k
			PVector SK = alpha * PK;

			PVector GradientK1;
			g_function.Gradient(Betas, GradientK1);
			// get YK  -> this is Gradient_(k+1) - Gradient_k
			PVector YK = GradientK1 - GradientK;

			PVectorT YKT = YK.transpose();
			PVectorT SKT = SK.transpose();
			//Update BK
			MReal YKT_SK =;

			if (YKT_SK == 0.0f)

		    //Update BKinv
			MReal SKT_YK = SKT*YK;

			BKinv = BKinv + ((SKT_YK + (YKT*BKinv)*YK)*(SK*SKT)) / (SKT_YK*SKT_YK) - (((BKinv*YK)*SKT) + ((SK*YKT)*BKinv)) / SKT_YK;
		if (iter == MAX_NUM_ITERATIONS)

		printf("----------------# %d #---------------\n", test);
		fprintf(f, "----------------# %d #---------------\n",test);
		fprintf(f, "iter:%d error: %f\n", iter, (g_function.TestBetas - Betas).norm());
	fprintf(f, "NUMBER OF FAILS: %d", fails);

So this is it! The code is simple, the project is a good one for the weekend. I will continue with finding the image characteristics and probably do a similar tutorial like article.

You can check the github of this project, but the code will probably change, so this commit is 0b77483 :).