Data extraction tools are handy for accountants, making sure they don’t waste their time on manual data entry. To make this possible, you first need to understand what OCR is and how it reads and understands expense-related documents. Read on as we reveal the intricacies of how arbitrue extracts data from invoices and receipts.
OCR – what you need to know
To put it simply, OCR (Optical Character Recognition) is a process used to turn an image file into a text file. We can treat this process as a type of compression, since text documents require significantly less space than image files such as JPEGs or scanned PDFs.
OCR techniques are already used in many different fields. Some examples where OCR is thriving are devices to help the visually impaired, algorithms that translate handwriting into text documents, automatic number plate recognition, and, of course, data entry for business documents.
Here at arbitrue we use it on a daily basis to quickly read and classify your documents, eliminating the manual work required to transfer data by hand from invoices into accounting systems. In other words, arbitrue does all the heavy lifting, while you can focus on analysing finances and advising your clients.
How can you extract text from images?
The first thing we have to take into consideration is how to properly prepare the scanned document so that its content can best be converted into text.
1. Optimise the file
We will fix the following things:
- colour (everything has to be black and white),
- fill spaces (if a pixel is surrounded by black, make it black; if surrounded by white, make it white).
Now we are ready to extract the text.
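The optimisation step above can be sketched in a few lines of Python. This is a minimal illustration, not arbitrue’s actual pipeline: the image is a plain list of rows of grayscale values, and the threshold of 128 is an illustrative choice.

```python
# Minimal sketch of step 1: binarise a grayscale image, then clean up
# isolated pixels by flipping any pixel whose neighbours all disagree
# with it. The 128 threshold is an illustrative assumption.

def binarise(image, threshold=128):
    """Map every pixel to 0 (black) or 1 (white)."""
    return [[0 if px < threshold else 1 for px in row] for row in image]

def fill_isolated(image):
    """Fill spaces: black surrounded by white becomes white, and
    white surrounded by black becomes black."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            neighbours = [image[ny][nx]
                          for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                          if 0 <= ny < h and 0 <= nx < w]
            if neighbours and all(n != image[y][x] for n in neighbours):
                out[y][x] = 1 - image[y][x]
    return out

scan = [[30, 40, 200],
        [35, 220, 45],
        [210, 50, 38]]
bw = binarise(scan)       # 0 = black ink, 1 = white background
clean = fill_isolated(bw) # lone bright specks inside dark areas are filled
```

Real preprocessing also handles deskewing and noise removal, but thresholding and hole-filling capture the core idea.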
2. Extract individual letters
Once our document is prepared, the next step is to cut out the individual letters. An algorithm scans across the document and extracts all the black objects that are surrounded by white space. Each one of these objects will be treated as a single letter.
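A common way to implement this scan is a connected-component search. The sketch below uses a flood fill over a binary image (0 = black ink) and treats only horizontally and vertically adjacent pixels as connected, which is a simplification.

```python
# Minimal sketch of step 2: find each connected group of black pixels
# (value 0) with a flood fill; each group is treated as one letter.
# Diagonal neighbours are ignored here, a simplifying assumption.

def extract_letters(image):
    """Return a list of components, each a set of (row, col) coordinates."""
    h, w = len(image), len(image[0])
    seen, letters = set(), []
    for y in range(h):
        for x in range(w):
            if image[y][x] == 0 and (y, x) not in seen:
                stack, blob = [(y, x)], set()
                while stack:
                    cy, cx = stack.pop()
                    if (cy, cx) in seen:
                        continue
                    seen.add((cy, cx))
                    blob.add((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and image[ny][nx] == 0:
                            stack.append((ny, nx))
                letters.append(blob)
    return letters

page = [[0, 1, 0],
        [0, 1, 0],
        [1, 1, 0]]
print(len(extract_letters(page)))  # prints 2: two separate black objects
```

Each returned blob can then be cropped to its bounding box and passed to the recognition step.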
3. Match the pattern to each letter
Now that we have our letters cut out, we have to identify them. This is done in several different ways depending on the OCR system.
The easiest way is to use a filter of different fonts to try to match the pattern. Let’s look at a specific example.
If we extract a shape that looks like this: B, we need to identify it as the capital letter “B”.
To come to this conclusion, the algorithm overlays every letter and number on the object. The filter that returns the best overlap determines which letter or number is chosen. It is very important to have a wide variety of fonts available to create flexible filters, so that the OCR can pick the best match.
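The overlap idea can be sketched as a pixel-agreement score. The tiny 3×3 “templates” below are illustrative stand-ins for real font glyphs, and the scoring is the simplest possible version of pattern matching.

```python
# Minimal sketch of step 3 (template matching): score each font
# template by how many pixels agree with the extracted shape, then
# keep the best-scoring label. 0 = black ink, 1 = white background.

def overlap(shape, template):
    """Fraction of pixels where shape and template agree."""
    total = sum(len(row) for row in shape)
    agree = sum(s == t
                for srow, trow in zip(shape, template)
                for s, t in zip(srow, trow))
    return agree / total

def best_match(shape, templates):
    """Return the label of the template with the highest overlap."""
    return max(templates, key=lambda label: overlap(shape, templates[label]))

templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
}
shape = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]  # a slightly noisy "I"
print(best_match(shape, templates))        # prints I
```

A production system would compare against many fonts and sizes, but the principle of picking the filter with the best overlap stays the same.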
There are other ways for the OCR to recognise text. It can use feature detection which focuses on recognising individual elements of a letter.
For example, the letter A can be recognised because it is created from three separate lines: /, \, and —. This method works better because we do not need a huge number of saved filters in different fonts in order to recognise a letter. The features that are used can be created manually or automatically by using neural networks.
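Feature detection can be sketched as matching sets of detected strokes against a catalogue. The feature names and the letter descriptions below are illustrative assumptions, not a real feature extractor.

```python
# Minimal sketch of feature detection: each letter is described by the
# strokes it is built from, and a shape is recognised by comparing its
# detected features against the catalogue. The stroke names and letter
# descriptions are illustrative assumptions.

LETTER_FEATURES = {
    "A": {"diag_left", "diag_right", "horizontal_bar"},
    "V": {"diag_left", "diag_right"},
    "T": {"vertical", "horizontal_bar"},
}

def recognise(features, catalogue=LETTER_FEATURES):
    """Return the letter whose description best matches the input:
    most shared features, fewest mismatched ones."""
    def score(letter):
        expected = catalogue[letter]
        return len(features & expected) - len(features ^ expected)
    return max(catalogue, key=score)

# Strokes found in a scanned glyph: /, \ and a crossbar
print(recognise({"diag_left", "diag_right", "horizontal_bar"}))  # prints A
```

Because only a handful of features describe each letter, no library of per-font filters is needed, which is exactly the advantage described above.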
The algorithm doesn’t know what humans know
However, there are also difficulties associated with OCR.
For one, many OCRs cannot read a document that is crooked or upside down. For us, it is obvious that a document is askew; for an algorithm, however, this is a foreign concept.
The shapes cut out from the document will no longer fit nicely to any given filter, so the algorithm will return either nonsense or nothing at all. If the document is too blurry or the contrast is not sharp enough, then again the algorithm will not be able to identify the letters on the page.
13 and B can look the same to the algorithm
If, for example, we had the letters “Th”, where the top right of the “T” touched the upper left of the “h”, our algorithm would treat this as a single letter and try to find a proper filter for this object.
Obviously, this would result in a poor guess for our letter. This problem can be minimised by using feature detection instead of whole-letter filters.
Another challenge arises exactly from the opposite effect. If our B from earlier was written in such a way that the vertical line didn’t connect with the rest of the letter, then the algorithm would most likely interpret this as the number 13 because it would treat this as two separate letters.
As humans, we can use our reasoning to know that the number 13 probably doesn’t belong at the beginning of a word such as “Best”. Our algorithm, however, may have trouble coming to the same conclusion. Some more advanced OCR systems do take language and context into account, which is essential in this case.
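One simple form of such context awareness is dictionary-based correction of confusable character groups. The confusion table and the word list below are illustrative assumptions, not a full language model.

```python
# Minimal sketch of context-aware correction: when a recognised token
# is not a known word, try substituting confusable character groups
# (such as "13" for "B") and prefer a variant that is a known word.
# The confusion table and word list are illustrative assumptions.

CONFUSIONS = {"13": "B", "0": "O", "1": "l"}
WORDS = {"best", "boat", "old", "list"}

def correct(token, words=WORDS, confusions=CONFUSIONS):
    """Return the token unchanged if it is a known word, otherwise the
    first confusion-substituted variant that is a known word."""
    if token.lower() in words:
        return token
    for wrong, right in confusions.items():
        candidate = token.replace(wrong, right)
        if candidate.lower() in words:
            return candidate
    return token

print(correct("13est"))  # prints Best: "13est" is not a word, "Best" is
print(correct("list"))   # prints list: already a word, left unchanged
```

Real systems use statistical language models rather than a fixed word list, but the idea of letting context arbitrate between visually similar readings is the same.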
The future of OCR
The development of OCR techniques continues, and the models keep getting better. This is what we’re currently exploring here at arbitrue to eliminate the need for manual verification.
Advanced deep learning models that use Convolutional Neural Networks to predict the letters are at the head of the pack in optimising OCR. As the field continues to develop, manual data entry will become a thing of the past.
Once we have all our documents nicely read by our OCR, we can take the next step and use that data to classify them. To make this happen, we have to dive into the world of Natural Language Processing (NLP), the next concept to understand once you know what OCR is.
Now the context of the read documents plays an important part in helping to choose the proper category for each document. With these advanced algorithms, arbitrue can categorise huge amounts of documents without manual intervention.
One of the tools that can help the algorithm understand the importance of keywords is TF-IDF (Term Frequency – Inverse Document Frequency). And to get an even better understanding of OCRed text, you can use the Naive Bayes algorithm, which is based on Bayes’ theorem.
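TF-IDF itself fits in a few lines: a word scores highly when it appears often in one document but rarely across the whole document set. The three one-line “documents” below are illustrative.

```python
# Minimal sketch of TF-IDF: term frequency in one document multiplied
# by the inverse document frequency across the whole set. The sample
# "documents" are illustrative token lists.

import math

def tf_idf(term, doc, docs):
    """TF-IDF weight of `term` in `doc`, relative to the set `docs`."""
    tf = doc.count(term) / len(doc)
    containing = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / containing)
    return tf * idf

docs = [
    ["invoice", "total", "vat", "total"],
    ["receipt", "total", "cash"],
    ["contract", "party", "party"],
]
# "vat" appears in only one document, so it is distinctive for it;
# "total" appears in two of three documents, so it scores lower.
print(tf_idf("vat", docs[0], docs))
print(tf_idf("total", docs[0], docs))
```

Words like “invoice number” or “VAT” end up with high weights for expense documents, which is what makes them useful signals for categorisation.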