How were documents processed before the advent of modern AI? What are the modern AI approaches to document processing? What is Hypatos’ approach to using AI in document processing?
Manual document processing is a major cost driver in organizations. With the advance of modern AI techniques such as deep learning, it is now possible to automate the majority of it. At Hypatos, we combine NLP and machine vision to automatically extract machine-readable information from documents, then validate and process the extracted data to make back-office tasks more efficient.
Before modern AI, documents were processed with inflexible templates that achieved 10-20% straight-through processing (STP), meaning that only 10-20% of invoices could be handled by templates without any human intervention.
Let’s examine invoices as an example, since they are typical semi-structured documents and quite common. Companies exchange millions of invoices every day. Invoices are not machine-readable, but they follow a high-level format and include certain fields. They are usually generated by individual suppliers using a specific template.
A straightforward solution is to define a template that is unique to each sender and describes the layout of that sender's invoices. A template-based system needs to see example documents beforehand and is unlikely to accurately handle documents from unseen templates. Even for the same supplier, multiple templates may be needed, as purchase orders can be quite different. This makes the template approach unsuitable for large enterprises or businesses with a sizable number of invoices.
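To make the limitation concrete, here is a minimal sketch of what a template-based extractor might look like. The supplier names, field names, and regular expressions are all hypothetical and chosen only for illustration; real template systems typically also rely on page coordinates.

```python
import re

# Hypothetical template for one supplier: each field is located by a regex
# that assumes this supplier's exact wording.
ACME_TEMPLATE = {
    "invoice_number": re.compile(r"Invoice No\.\s*(\S+)"),
    "total": re.compile(r"Total due:\s*([\d.,]+)"),
}


def extract_with_template(text, template):
    """Return the fields the template can find; missing fields map to None."""
    fields = {}
    for name, pattern in template.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# An invoice from the supplier the template was written for.
known = "ACME Corp\nInvoice No. 2024-001\nTotal due: 1,250.00"
# An invoice from a supplier with a different (unseen) layout and wording.
unseen = "Globex GmbH\nRechnung Nr: 77\nSumme: 980,00"

print(extract_with_template(known, ACME_TEMPLATE))
print(extract_with_template(unseen, ACME_TEMPLATE))
```

The template finds both fields on the invoice it was written for and nothing on the unseen one, so every new supplier layout means another template to write and maintain.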
NLP and machine vision are the most useful AI techniques for document processing, but their performance is limited when they are used in isolation to process documents.
Alternatively, Natural Language Processing (NLP) techniques have become popular for processing and understanding natural language text and for information extraction tasks such as named entity recognition. One possible solution is to use a recurrent neural network (RNN), which operates on 1D serialized text.
The significant shortcoming of 1D RNN models is the lack of layout information, as the latent relation between words is impacted not only by the sequential order, but also by how those words are visually arranged. For example, words in a table should be treated differently compared to words in a paragraph like this one.
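A small sketch can show how serialization discards layout. The `Word` type, the coordinates, and the reading-order rule (top to bottom, then left to right) are assumptions made for this example, not a description of any particular OCR output.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: int  # left edge of the bounding box
    y: int  # top edge of the bounding box


def serialize(words):
    """Flatten words into reading order: top to bottom, then left to right."""
    ordered = sorted(words, key=lambda w: (w.y, w.x))
    return [w.text for w in ordered]


# A two-column table: header row "Item | Qty", value row "Widget | 2".
table = [Word("Item", 0, 0), Word("Qty", 200, 0),
         Word("Widget", 0, 20), Word("2", 200, 20)]
# The same four words written on a single line of running text.
sentence = [Word("Item", 0, 0), Word("Qty", 40, 0),
            Word("Widget", 80, 0), Word("2", 140, 0)]

print(serialize(table))
print(serialize(sentence))
```

Both layouts serialize to the same sequence, so a model that only sees the 1D text cannot tell that "2" sits in the "Qty" column of the table.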
Layout information is crucial for understanding structured documents. See below a sample invoice in which every word is shown only as a gray bounding box. Can you guess where the line-items are located?
If you could not, that is probably because you have not seen many invoices before. They are located in the middle section, as seen below. This is the same invoice, but with the text shown instead of bounding boxes.
Even without reading the detailed text, a human who has seen invoices before can easily guess where the sender and recipient address blocks and the line-items are located.
The limitations of using NLP and machine vision in isolation led us to develop a novel 2D artificial neural network model for document processing. In our model, the input invoices are not viewed as a text sequence. Instead, they are embedded into a higher-dimensional matrix representation using a pre-trained embedding model. The convolutional layers come after the embedding layers, and the last layer maps each pixel to an entity space.
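The general idea of a 2D representation can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions and not Hypatos' actual architecture: the vocabulary, entity classes, grid size, and layer widths are hypothetical, and the embedding table and convolution weights are random stand-ins for pre-trained and trained parameters, so the predicted labels are meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = {"<empty>": 0, "Invoice": 1, "Widget": 2, "2": 3, "Total": 4}
ENTITIES = ["other", "line_item", "total"]  # hypothetical entity classes
EMBED_DIM = 8

# Stand-in for a pre-trained embedding table (random here).
embeddings = rng.normal(size=(len(VOCAB), EMBED_DIM))
embeddings[0] = 0.0  # empty cells carry a zero vector


def to_grid(words, height, width):
    """Place each word's embedding in every grid cell its box covers."""
    grid = np.zeros((height, width, EMBED_DIM))
    for text, (row0, col0, row1, col1) in words:
        grid[row0:row1, col0:col1] = embeddings[VOCAB[text]]
    return grid


def conv3x3(grid, kernel):
    """Zero-padded 3x3 convolution; kernel shape is (3, 3, in_ch, out_ch)."""
    h, w, _ = grid.shape
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for dr in range(3):
        for dc in range(3):
            out += padded[dr:dr + h, dc:dc + w] @ kernel[dr, dc]
    return out


# Words with boxes given as (row0, col0, row1, col1) on a coarse 6x8 grid.
words = [("Invoice", (0, 0, 1, 4)), ("Widget", (2, 0, 3, 3)),
         ("2", (2, 5, 3, 6)), ("Total", (4, 0, 5, 3))]
grid = to_grid(words, height=6, width=8)

# Two untrained convolutional layers; the last one has one channel per entity.
hidden = np.maximum(conv3x3(grid, rng.normal(size=(3, 3, EMBED_DIM, 16))), 0.0)
logits = conv3x3(hidden, rng.normal(size=(3, 3, 16, len(ENTITIES))))
labels = logits.argmax(axis=-1)  # one entity index per grid cell

print(grid.shape, logits.shape, labels.shape)
```

Because the convolutions see each cell together with its neighbors, a prediction for one position can depend on what is printed above, below, and beside it, which is the spatial context a 1D sequence model lacks.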
Based on the above document understanding pipeline, we have built a powerful information extraction engine that significantly outperforms approaches based on sequential text or templates, in particular for line-item-related entities, as seen below:
To compare our results against competitors, feel free to check out our latest benchmark. And if you have document-based processes, please contact us to automate them.