A key part of OCR-based invoice data extraction is automatically detecting which of the words and phrases returned by OCR are meaningful. The solution should return only key invoice data, in a standardised format. Learn how the Naive Bayes algorithm can help you achieve this goal.

## Using Bayes’ Theorem to “understand” the meaning of text

Take a look at our previous guides where we discussed how Optical Character Recognition (OCR) extracts text from images and how the TF-IDF algorithm can help you understand keyword importance (weights).

In this article we will dig deeper into invoice data capture techniques in order to better understand the meaning of the OCR text, for example, to understand what category the text belongs to.

Many algorithms have shown success in this area, from simple logistic regression to the much more complicated Support Vector Machine (SVM) algorithm. However, we will focus our attention on the Naive Bayes algorithm.

We have chosen this algorithm because it is very effective for similar Natural Language Processing (NLP) problems and it’s also fast and interpretable.

## What is Bayes’ Theorem?

Formulated by the Reverend Thomas Bayes in the 18th century, Bayes’ Theorem describes the probability of a given event based on what has already happened.

In other words, it shows the probability that A will occur given that B has already occurred. It uses conditional probabilities in order to establish the chance of an event occurring based upon some prior evidence.

Mathematically, Bayes’ Theorem can be stated as follows:

$$P(A|B) = {P(B|A)P(A) \over P(B)}$$

The denominator can be expanded to:

$$P(B) = {P(B|A)P(A) + P(B|\neg A)P(\neg A)}$$

Finally, the entire equation takes the form:

$$P(A|B) = {P(B|A)P(A) \over P(B|A)P(A) + P(B|\neg A)P(\neg A)}$$

The components of the equation are as follows:

P(A|B) – the probability of A happening given B has happened
P(B|A) – the probability of B happening given A has happened
P(A) – the probability of A happening
P(B) – the probability of B happening
P(¬A) – the probability of A NOT happening (or 1-P(A))
P(B|¬A) – the probability of B happening given A has NOT happened
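
The expanded form of the theorem translates directly into a few lines of code. The following is a minimal Python sketch; the function name `bayes_posterior` and its parameter names are our own choices for illustration and do not come from any library.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) using the expanded form of Bayes' Theorem."""
    p_not_a = 1.0 - p_a  # P(¬A) = 1 - P(A)
    # Denominator: P(B) = P(B|A)P(A) + P(B|¬A)P(¬A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    if p_b == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return p_b_given_a * p_a / p_b
```

A quick sanity check: if B is equally likely whether or not A happened, the evidence tells us nothing, so `bayes_posterior(0.3, 0.2, 0.3)` returns the prior of 0.2 (up to floating-point rounding).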

## Bayes’ Theorem in practice: sick or not?

Let’s assume there is a fairly uncommon disease: only 5% of the population has it (which means that 95% of the population does not). There is also a test for this disease that is 99% accurate: it comes out positive for 99% of sick people and negative for 99% of healthy people.

We can ask the following question: What is the probability that I have the disease if the test came out positive?

First, let’s rewrite Bayes’ Theorem so that it pertains better to this specific case:

$$P(sick|+) = {P(+|sick)P(sick) \over P(+|sick)P(sick) + P(+|healthy)P(healthy)}$$

Probability that the test is positive if you are sick: P(+|sick) = 99%
Probability that the test is positive if you are NOT sick: P(+|¬sick) = P(+|healthy) = 1%
Probability that you are sick: P(sick) = 5%
Probability that you are NOT sick: P(¬sick)=P(healthy) = 95%

Then, let’s plug these values into the equation and see what happens.

$$P(sick|+) = {(0.99)(0.05) \over (0.99)(0.05) + (0.01)(0.95)}$$
$$= {0.0495 \over 0.059} = {0.839}$$

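So, according to Bayes’ Theorem, you can be only about 84% sure that you have the disease after a positive result, even though the test is 99% accurate! This sounds bizarre, but it follows from the numbers: healthy people outnumber sick people 19 to 1, so even a 1% false-positive rate produces a noticeable share of all positive results.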
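
One way to make this result more intuitive is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical population of 10,000 people (a number chosen only to keep the counts round) and applies the same 5% and 99% figures.

```python
population = 10_000  # hypothetical population size, chosen for round numbers

sick = population * 5 // 100          # 5% have the disease    -> 500
healthy = population - sick           # the remaining 95%      -> 9500

true_positives = sick * 99 // 100     # sick and test positive    -> 495
false_positives = healthy * 1 // 100  # healthy but test positive -> 95

all_positives = true_positives + false_positives
p_sick_given_positive = true_positives / all_positives

print(f"{true_positives} of {all_positives} positive tests belong to sick people")
print(f"P(sick|+) = {p_sick_given_positive:.3f}")  # P(sick|+) = 0.839
```

The 95 false positives come from the much larger healthy group, which is why they make up about 16% of all positive results.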

## How to use Bayes’ Classifier to categorise texts

So how does this help machines to better understand texts?

So far all we know how to do is find conditional probabilities. The good news is that this is pretty much all we really need to be able to classify data into different categories. Here’s an example.

In the article on the TF-IDF algorithm, we showed how weights can be applied to words based on their occurrences.

The following table shows two people, Alice and Bob (our categories), and words they typically use in their emails with their weights applied using a TF-IDF algorithm.

| Alice's words | Weight | Bob's words | Weight |
|---|---|---|---|
| love | 0.7 | taxes | 0.5 |
| dog | 0.4 | accounting | 0.3 |
| accounting | 0.1 | love | 0.1 |

If we receive a new message, can we decide whether it was sent from Alice or Bob just based upon the words in the message?

The answer is a resounding YES!

We can achieve this by using what we know about Bayes’ Theorem. All we have to do is take each word from the message and check the probability that it came from either Alice or Bob. Finally, we multiply all the probabilities together and see who has the highest one.

## Time to do the math

Let’s say we receive a new message: “I love accounting!”.

We can calculate the probabilities to try and guess who the sender is. We’ll assume that we receive an equal number of messages from Alice and Bob so that:

$$P(Alice) = P(Bob) = 0.5.$$

We ask ourselves: “What is the probability that Alice sent the word ‘love’?” Using the probability information from the table and Bayes’ Theorem we can write:

$$P(Alice|love) = {P(love|Alice)P(Alice) \over P(love)}$$
$$= {P(love|Alice)P(Alice) \over P(love|Alice)P(Alice) + P(love|Bob)P(Bob)}$$
$$= {(0.7)(0.5) \over (0.7)(0.5) + (0.1)(0.5)}$$
$$= {0.875}$$

So there is an 87.5% chance that the word ‘love’ was sent by Alice. Well, what about Bob?

$$P(Bob|love) = {P(love|Bob)P(Bob) \over P(love|Bob)P(Bob) + P(love|Alice)P(Alice)}$$
$$= {0.125}$$

There is only a 12.5% chance that Bob sent the word ‘love’.
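
The per-word calculation can be sketched in Python as follows. The names `weights`, `priors` and `word_posteriors` are our own, and the sketch assumes that a word missing from a sender's list has a weight of 0 for that sender.

```python
# TF-IDF weights from the table above, used here as P(word|sender)
weights = {
    "Alice": {"love": 0.7, "dog": 0.4, "accounting": 0.1},
    "Bob": {"taxes": 0.5, "accounting": 0.3, "love": 0.1},
}
priors = {"Alice": 0.5, "Bob": 0.5}  # equal number of messages from each


def word_posteriors(word):
    """Return P(sender|word) for every sender."""
    # Numerator of Bayes' Theorem: P(word|sender) * P(sender).
    # Assumption: a word absent from a sender's list has weight 0.
    joint = {s: weights[s].get(word, 0.0) * priors[s] for s in weights}
    total = sum(joint.values())  # P(word), the denominator
    if total == 0:
        raise ValueError(f"'{word}' is not in anyone's vocabulary")
    return {s: joint[s] / total for s in joint}


print(word_posteriors("love"))
```

For ‘love’ this prints values of approximately 0.875 for Alice and 0.125 for Bob, matching the hand calculation.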

Let’s check the other word, ‘accounting’.

After doing the same calculation we see the probability that it was sent by Alice is 25% and the probability that it was sent by Bob is 75%. We will ignore the word ‘I’ since it holds little or no weight (as we discussed in the previous article) in deciding who sent the message.
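
Written out in full, the calculation for ‘accounting’ uses the weights 0.1 for Alice and 0.3 for Bob from the table:

$$P(Alice|accounting) = {(0.1)(0.5) \over (0.1)(0.5) + (0.3)(0.5)} = {0.05 \over 0.2} = {0.25}$$
$$P(Bob|accounting) = {(0.3)(0.5) \over (0.3)(0.5) + (0.1)(0.5)} = {0.15 \over 0.2} = {0.75}$$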

Finally, we multiply all of our results and see who has the highest probability. So:

$$P(message_{Alice}) = {P(Alice|love)P(Alice|accounting)}$$
$$= {(0.875)(0.25)} = {0.21875}$$

$$P(message_{Bob}) = {P(Bob|love)P(Bob|accounting)}$$
$$= {(0.125)(0.75)} = {0.09375}$$

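We can see that Alice has the higher result, so we categorise the message as coming from Alice. Note that the two results do not add up to 1: they are relative scores rather than true probabilities. Dividing each by their sum (0.3125) gives 0.7 for Alice and 0.3 for Bob. We should now have a pretty good feel for how this works.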
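
Putting the steps together, a whole message can be scored with a short Python sketch. The function name `classify` and the simple regex tokeniser are our own assumptions, not a prescribed implementation; words that no sender has a weight for (such as ‘I’) are skipped.

```python
import re

# TF-IDF weights from the table, used here as P(word|sender)
weights = {
    "Alice": {"love": 0.7, "dog": 0.4, "accounting": 0.1},
    "Bob": {"taxes": 0.5, "accounting": 0.3, "love": 0.1},
}
priors = {"Alice": 0.5, "Bob": 0.5}


def classify(message):
    """Score each sender by multiplying the per-word posteriors."""
    words = re.findall(r"[a-z]+", message.lower())  # naive tokeniser
    scores = {sender: 1.0 for sender in weights}
    for word in words:
        joint = {s: weights[s].get(word, 0.0) * priors[s] for s in weights}
        total = sum(joint.values())
        if total == 0:
            continue  # word unknown to every sender: skip it
        for s in scores:
            scores[s] *= joint[s] / total  # multiply by P(sender|word)
    best = max(scores, key=scores.get)
    return best, scores


sender, scores = classify("I love accounting!")
print(sender, scores)
```

This picks Alice, with scores of approximately 0.21875 for Alice and 0.09375 for Bob. Note that in this sketch a single word with a weight of 0 for a sender drives that sender's whole score to 0.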

Even without doing the math we should intuitively see that the sentence “I love accounting and taxes” was most likely sent by Bob, while “I love my dog” has a greater probability of coming from Alice.

## Why is Naive Bayes called naive?

The term “naive” refers to the assumption that the words (or features) are independent of one another, so the algorithm really has no concept of the context of the message.

Each word is treated independently and only the probabilities are multiplied.

The sentence: “I love my dog but I hate taxes” would have the exact same numerical result as “I hate my dog but I love taxes”.

To us, it is obvious that these two messages were most likely sent by different people but to the Naive Bayes algorithm there is no difference between them.
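
A short Python sketch makes this concrete. Once word order is discarded, the two messages reduce to exactly the same collection of words, so any calculation based only on those words must give the same result. The helper name `bag_of_words` is our own.

```python
from collections import Counter


def bag_of_words(message):
    """Count how often each word occurs, discarding word order."""
    return Counter(message.lower().split())


a = bag_of_words("I love my dog but I hate taxes")
b = bag_of_words("I hate my dog but I love taxes")

print(a == b)  # True: both messages contain exactly the same words
```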

## Benefits of using Naive Bayes

Despite this shortcoming and its very simple approach, the Naive Bayes algorithm gives really good results on some rather difficult NLP problems.

On some text classification tasks it has even been shown to provide better results than more complicated algorithms such as SVM.

It is also very fast because the probabilities are calculated directly instead of using an iterative process of training (required by Neural Networks and other algorithms).
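
To illustrate what “calculated directly” means, here is a minimal Python sketch of training by counting. It uses raw relative word frequencies rather than the TF-IDF weights used elsewhere in this article, and the training messages are hypothetical, invented only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical training messages, invented for this example
training = [
    ("Alice", "love my dog"),
    ("Alice", "love accounting"),
    ("Bob", "taxes and accounting"),
    ("Bob", "love taxes"),
]

counts = defaultdict(Counter)
for sender, text in training:  # a single pass over the data, no iterations
    counts[sender].update(text.split())

# P(word|sender) = occurrences of the word / total words from that sender
likelihoods = {
    sender: {word: n / sum(c.values()) for word, n in c.items()}
    for sender, c in counts.items()
}

print(likelihoods["Alice"]["love"])  # 0.4 (2 of Alice's 5 words)
```

Every probability comes from one pass of counting and one division, which is why there is nothing to iterate over.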

A final benefit of using Naive Bayes is that it requires very little adjustment to go from two classes to more. We could have just as easily had Alice, Bob, and Charlie sending messages.
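
As a sketch of how little changes, the Python example below adds a third sender. Alice's and Bob's weights come from the table; Charlie's weights are hypothetical values invented for this example, and the equal priors are an assumption.

```python
# Alice and Bob use the weights from the table; Charlie's weights are
# hypothetical values invented for this example.
weights = {
    "Alice": {"love": 0.7, "dog": 0.4, "accounting": 0.1},
    "Bob": {"taxes": 0.5, "accounting": 0.3, "love": 0.1},
    "Charlie": {"invoice": 0.6, "taxes": 0.3, "love": 0.2},
}
prior = 1 / len(weights)  # assumption: all three send equally many messages


def posteriors(word):
    """Return P(sender|word) for any number of senders."""
    joint = {s: w.get(word, 0.0) * prior for s, w in weights.items()}
    total = sum(joint.values())  # the denominator now sums over all senders
    if total == 0:
        raise ValueError(f"'{word}' is not in anyone's vocabulary")
    return {s: p / total for s, p in joint.items()}


print(posteriors("love"))
```

The only change from the two-sender case is one more entry in `weights`; the denominator simply sums over all senders. For ‘love’ the result is approximately 0.7 for Alice, 0.1 for Bob and 0.2 for Charlie.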

Some other algorithms, SVM among them, are inherently two-class and need extra work to classify more than two categories. They use techniques like one-vs-rest or one-vs-one to break the problem into several two-class problems and then combine the different outcomes.
