There are many tools in the developer’s toolbox when it comes to automatic data extraction. A good example is the TF-IDF algorithm (Term Frequency – Inverse Document Frequency), which helps the system understand the importance of keywords extracted using OCR. Here’s how TF-IDF can be used for invoice and receipt recognition.
The previous guides in our data extraction series discussed:
- what OCR (Optical Character Recognition) is: a process that turns a picture file (.jpg, .pdf, etc.) into a raw text file;
- how the Bayes’ theorem algorithm can help you understand OCRed text.
In this article we focus on further techniques for making this text file “understandable” to a computer.
For this purpose, we must delve into the world of NLP, or Natural Language Processing. We will focus mainly on how to transform our file of raw text into a format that our algorithm can easily understand.
In a nutshell, TF-IDF is a technique for understanding how important a word is in a document, often used as a weighting factor in numerous use cases. TF-IDF takes into consideration how frequently a word appears in a single document relative to how frequent that word is in general. Search engines, for example, can use TF-IDF to determine which results are most relevant for a search query.
How can an algorithm understand words?
All algorithms are mathematical in nature and therefore do not understand words, only numbers. Thus, in order for the algorithm to analyse the results OCR returns, we need to change raw text into numerical values.
The first step is a thorough “cleaning” of the data to prepare it for the transformation.
The problem with most raw text received from an OCR is that it contains a lot of unwanted information. We need to remove this in order to get the best possible set of words for our algorithm.
Dropping all punctuation, making all words lowercase, and removing extra spaces are just some of the procedures that are used to optimise the input data.
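As a rough sketch of what such cleaning might look like (the function name and exact steps are our own simplification, not a fixed recipe):

```python
import re
import string

def clean_text(raw: str) -> str:
    """Lowercase, strip punctuation, and collapse extra whitespace."""
    text = raw.lower()
    # Drop all punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("I don't like  taxes, or accounting!"))
# i dont like taxes or accounting
```

Real OCR output usually needs more work than this (fixing mis-read characters, dropping layout artefacts), but the idea is the same: normalise the text before counting anything.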
Let’s take three example sentences that we can use to better understand how they can be turned into numbers that will be understandable for our algorithm:
- I don’t like taxes or accounting.
- I hate doing my taxes.
- I love accounting and I love my dog.
Bag-of-words and eliminating stop-words
Once the data is nice and clean, we are ready to make the transformation from word to number. We can do this by using a very simple algorithm called bag-of-words (BoW).
This process counts all the occurrences of different words in each document and makes a table out of them. Later, the count of each word in each document is used to classify the document into the proper category.
We can see an example of this below, using our 3 sentences.
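A minimal bag-of-words sketch in plain Python (no external libraries, and with the sentences already cleaned as described above):

```python
from collections import Counter

docs = [
    "i dont like taxes or accounting",
    "i hate doing my taxes",
    "i love accounting and i love my dog",
]

# Count word occurrences per document
counts = [Counter(doc.split()) for doc in docs]

# Build the shared vocabulary and print the count table
vocab = sorted(set(word for doc in docs for word in doc.split()))
for i, c in enumerate(counts, start=1):
    print(f"Doc {i}:", {word: c[word] for word in vocab})
```

In practice a library such as scikit-learn’s `CountVectorizer` does the same thing, but the underlying idea is just this counting table.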
The problem with the BoW procedure is that common words such as “the”, “is”, “it”, etc. are repeated much more often than words with greater meaning. In our example, each sentence contains the word “I”, which doesn’t give much context to the sentences.
The solution to this problem lies in a special list of stop-words: a list of the most commonly used English words. The BoW algorithm is set up to ignore these stop-words and focus on words that have greater significance for the meaning of the document.
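Filtering against a stop-word list is a one-liner. The short list below is purely illustrative; real lists (such as those shipped with NLTK or scikit-learn) contain well over a hundred entries:

```python
# A tiny illustrative stop-word list; real ones are much longer
STOP_WORDS = {"i", "my", "and", "or", "the", "is", "it", "a", "an"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("i love accounting and i love my dog".split()))
# ['love', 'accounting', 'love', 'dog']
```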
Having a list like this increases the efficiency of the classification algorithm used later. But what happens when we have documents in a different language, or words that appear in many documents but are not on the stop-word list?
The TF-IDF algorithm is the perfect solution to this problem. Its name can be broken down into two parts: TF, which stands for term frequency, and IDF, which stands for inverse document frequency. Let’s take a closer look at each part of the procedure to better understand it.
Term Frequency
“Term frequency” counts how many times a word appears in a document.
It is the exact same count we used when we discussed BoW, with one possible difference: when calculating the term frequency, we can divide by the total number of words in the document so that longer documents do not have a greater influence than shorter ones. Here we will use the simpler, unnormalised version.
TF(word) = number of times the word appears in a document
Inverse Document Frequency
This term is key to giving the proper weight to each of the words in our vocabulary. It is calculated by taking the logarithm of the total number of documents divided by the number of documents that contain the given word. Without this part, each word in our vocabulary would have the exact same weight.
IDF(word) = log(total number of documents / number of documents containing the word)
We multiply the two values together and receive our table of words, all properly scaled. Now words that appear in many documents (e.g. “the”, “is”, “it”) will have a reduced weight. This process will also catch words like “total”, which can be very prevalent in accounting documents but might not have the importance they would in other contexts. The TF-IDF procedure will automatically assign a reduced weight to such a word, since it appears in nearly all documents. This would not be possible with the BoW technique unless we manually added the word to our stop-words list. Now that we have reduced the weight of frequently occurring terms, words like “petrol” will have a greater value and improve the future classification.
Let’s see how our table has changed after applying the TF-IDF transformation.
We can see right away that the word “I” now has a weight of 0 and is no longer an important word in our set of documents. However, the word “love”, which appears twice in Doc 3 and doesn’t appear in any other document, now has a very high weight.
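The whole procedure fits in a few lines of plain Python. This sketch uses the simple unnormalised term frequency and the natural logarithm; note that because “i” appears in all three documents, its IDF is log(3/3) = 0 and its weight drops to zero, exactly as described:

```python
import math
from collections import Counter

docs = [
    "i dont like taxes or accounting",
    "i hate doing my taxes",
    "i love accounting and i love my dog",
]
tokenised = [doc.split() for doc in docs]
vocab = sorted(set(w for doc in tokenised for w in doc))

# IDF(word) = log(total documents / documents containing the word)
n_docs = len(docs)
idf = {w: math.log(n_docs / sum(w in doc for doc in tokenised)) for w in vocab}

# TF(word) = raw count of the word in the document (simple version)
tfidf = []
for doc in tokenised:
    tf = Counter(doc)
    tfidf.append({w: tf[w] * idf[w] for w in vocab})

print(round(tfidf[0]["i"], 3))     # 0.0   -- appears in every document
print(round(tfidf[2]["love"], 3))  # 2.197 -- twice in Doc 3, nowhere else
```

Production implementations (e.g. scikit-learn’s `TfidfVectorizer`) use smoothed variants of the IDF formula, but the behaviour is the same: ubiquitous words sink towards zero, distinctive words rise.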
TF-IDF algorithm improvements
Even using this improved strategy we can stumble across another problem which has to do with context.
As we know, there is a huge difference between terms such as “like” and “don’t like”; however, an algorithm that only counts and weighs words cannot see this. So how do we get around such an issue?
We must introduce bigrams or even trigrams into our algorithm.
A bigram is a pair of adjacent words rather than a single word. This way the algorithm will treat “like” and “don’t like” as two separate things and assign each a different value depending on how often it appears in the documents.
Adding bigrams substantially increases our vocabulary size, but it provides a huge boost in the categorical decision process.
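A bigram extractor is equally simple. In practice, libraries such as scikit-learn expose this through an `ngram_range` parameter, but the idea is just pairing each token with its neighbour:

```python
def bigrams(tokens):
    """Pair each token with its successor, e.g. 'dont like' -> ('dont', 'like')."""
    return list(zip(tokens, tokens[1:]))

print(bigrams("i dont like taxes".split()))
# [('i', 'dont'), ('dont', 'like'), ('like', 'taxes')]
```

These pairs are then counted and weighted by TF-IDF exactly like single words, so “dont like” and “like” end up as separate entries in the vocabulary.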
Across this article and the two guides on data extraction and analysis we’ve written so far, we have:
- taken an invoice or a receipt and changed the picture file into a text file using OCR
- taken the raw text and cleaned it
- transformed the raw text into numerical values that a Machine Learning algorithm can understand, using the TF-IDF algorithm
The final thing left to do is to choose the best algorithm to classify each document into a specific category. Next time we will look at this final step, which takes our set of words and sorts the documents into specific categories so that invoice data capture provides the most accurate results.