Data extraction: First step to automated document processing

What is data extraction? Why is data extraction important now? What are the alternatives to automated data extraction? What is the performance of data extraction solutions? How can you automate data extraction at your company?

Processing incoming documents makes up a large part of back office activity, and it can be automated with today’s technology. Document processing depends on data extraction and improves with the quality and quantity of the extracted data. Better document data extraction therefore enables companies to automate more advanced processing.

What is data extraction?

Data extraction is the act of retrieving data from documents and other data sources. Wikipedia provides a slightly more formal definition: “data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration)”.

Why is data extraction important now?

With today’s technology, most documents can be processed automatically once they are converted into structured data. Data extraction quality is therefore the biggest obstacle to automating back office activities worth trillions of dollars.

Structured data in a back office is typically processed by machines. For example, once an invoice is entered into a well-configured Enterprise Resource Planning (ERP) tool like SAP, payments can be automatically completed, and system records can be automatically prepared.

Data extraction remains the obstacle to back office automation because manual data extraction is not comprehensive. Due to the prohibitive cost of manual extraction, companies extract only the critical fields from documents, capturing just a few percent of the total information available in them. This limited information enables automation of the most critical processes, such as payment in the case of an invoice. However, other important processes such as VAT compliance validation or account prediction remain manual because the necessary data is never extracted from the documents.
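To make this concrete, here is a minimal sketch of a downstream check that cannot run when only payment fields are extracted. The field names (net_amount, vat_rate, vat_amount), the tolerance and the sample values are all assumptions made for illustration; real VAT compliance validation covers far more than this arithmetic consistency check.

```python
from decimal import Decimal

# Hypothetical field names; a real extraction schema will differ.
REQUIRED_FIELDS = ("net_amount", "vat_rate", "vat_amount")


def vat_amount_consistent(fields, tolerance=Decimal("0.01")):
    """Check that net_amount * vat_rate matches vat_amount.

    Returns True or False when the check can run, and None when a
    required field was never extracted from the document.
    """
    if any(fields.get(name) is None for name in REQUIRED_FIELDS):
        return None
    expected = Decimal(fields["net_amount"]) * Decimal(fields["vat_rate"])
    return abs(expected - Decimal(fields["vat_amount"])) <= tolerance


# Only the fields needed for payment were keyed in manually.
payment_only = {"supplier": "ACME", "total": "119.00"}

# A fuller extraction also captured the VAT breakdown.
full_extraction = {
    "supplier": "ACME",
    "total": "119.00",
    "net_amount": "100.00",
    "vat_rate": "0.19",
    "vat_amount": "19.00",
}

check_payment_only = vat_amount_consistent(payment_only)
check_full = vat_amount_consistent(full_extraction)
```

With the payment-only record the function returns None, meaning the step stays manual; with the fuller record the same check can run automatically.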

What are the alternatives to automated data extraction?

Manual data extraction and template-based solutions are the most widely available alternatives.

Template-based solutions allow companies to create templates that capture data from documents following a specific format. Given the wide variety of document formats, they rarely achieve high rates of automation.
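The limitation can be illustrated with a small sketch. It assumes a template is simply a set of regular expressions written for one supplier's layout; the patterns, field names and sample texts are hypothetical and do not describe any particular product.

```python
import re

# Hypothetical template written for a single invoice layout.
TEMPLATE = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "total": re.compile(r"Total:\s*([\d.]+)"),
}


def extract_with_template(text, template):
    """Return the captured value per field, or None where a pattern fails."""
    result = {}
    for field, pattern in template.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


# A document that follows the layout the template was built for.
known_layout = "Invoice No: A-1001\nTotal: 250.00"

# The same kind of information in a layout the template has never seen.
other_layout = "Rechnungsnummer 77-B\nAmount due EUR 250,00"

matched = extract_with_template(known_layout, TEMPLATE)
missed = extract_with_template(other_layout, TEMPLATE)
```

The template captures both fields from the first document and nothing from the second, so every new layout needs its own template.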

Manual processing has its own flaws: it is slow, expensive, inaccurate and demotivating.

What is the performance of data extraction solutions?

Automated data extraction quality is increasing, and it is becoming a superior alternative to manual or template-based data extraction. For example, in our benchmark study, Hypatos deep learning technology was able to extract ~50 fields per invoice correctly. Given this trend, companies need to automate data extraction both to reap its direct benefits and to enable increased automation in document processing.

Image recognition is a good example of an area where machine learning research has led to significant performance improvements in the last few years. Though there are fewer benchmarks on document extraction, our client discussions indicate that automated data extraction performance has improved dramatically over the past ~5 years.

Performance of machine learning solutions has improved significantly in the past 5 years (Source: Benchmarks.AI)

How can you automate data extraction at your company?

Given the availability of high-performance solutions and the potential automation benefits, the need to automate data extraction is obvious. However, most large companies deal with hundreds of different forms, so it is important to identify which processes to automate first. An initial data extraction project that delivers outsized benefits in a short amount of time can help convince management to automate more processes and trigger a transformation in the company’s efficiency.

Ideally, you should start with high-volume, complex documents that involve the most advanced processing steps and for which off-the-shelf solutions exist. More formally, the metrics to pay attention to are:

Current cost of data extraction: While this is hard to estimate exactly, it depends on document volume and document complexity. Relying on just those 2 metrics can give you an estimate of which documents are the most expensive for your company.
Current cost of advanced document processing: What does your team do after extracting the data? The number of additional manual processing steps after data extraction is a good proxy for the effort spent on advanced document processing. For example, in the case of invoices, the typical processing steps that can be automated after complete data extraction include VAT compliance checks and account prediction for invoices that cannot be matched to POs.
Availability of data extraction solutions: If this is a document that almost every organization receives, such as a receipt, invoice, order or pay slip, there is likely to be a solution for it. At Hypatos, we have solutions for all of these categories. Where we do not have a solution, we work with companies to build custom machine learning models for their documents. For example, we are supporting a major energy company in processing forms it receives from its franchises.
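
One way to turn these three metrics into a rough ranking is sketched below. The scoring formula, the 1 to 5 complexity scale, the 0.5 discount for documents without an off-the-shelf solution and all sample figures are assumptions made for illustration; they are not measured data, and the weights should be replaced with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class DocumentType:
    name: str
    monthly_volume: int      # documents received per month
    complexity: int          # rough 1-5 rating of layout and field complexity
    manual_steps_after: int  # manual processing steps after extraction
    off_the_shelf: bool      # is a ready-made extraction solution available?


def priority_score(doc):
    """Higher score means a stronger candidate to automate first."""
    extraction_cost = doc.monthly_volume * doc.complexity
    processing_cost = doc.monthly_volume * doc.manual_steps_after
    # Assumed discount: custom models take longer to deliver benefits.
    availability = 1.0 if doc.off_the_shelf else 0.5
    return (extraction_cost + processing_cost) * availability


# Illustrative figures only.
candidates = [
    DocumentType("receipt", 20000, 1, 0, True),
    DocumentType("franchise form", 500, 5, 2, False),
    DocumentType("invoice", 10000, 4, 3, True),
]

ranked = sorted(candidates, key=priority_score, reverse=True)
ranked_names = [doc.name for doc in ranked]
```

With these made-up figures, invoices rank first because they combine high volume, high complexity and several manual follow-up steps with a readily available solution.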

We have helped numerous Fortune 500 companies and the Big 4, the largest consumers of financial documents, automate data extraction. Let us know your document-related challenges.
