What is data extraction?
Why is data extraction important now?
What are the alternatives to automated data extraction?
What is the performance of data extraction solutions?
How can you automate data extraction at your company?
Processing incoming documents makes up a large part of back-office activities, and it can be automated using today’s technology. Document processing requires data extraction and improves with the quality and quantity of extracted data. Therefore, better document data extraction enables companies to automate higher levels of advanced processing.
Data extraction is the act of retrieving data from documents and other data sources. Wikipedia provides a slightly more formal definition: “data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).”
With today’s technology, most documents can be automatically processed if they can be converted into structured data. Therefore, data extraction quality is the biggest obstacle to the automation of back-office activities worth trillions of dollars.
Structured data in a back office is typically processed by machines. For example, once an invoice is entered into a well-configured Enterprise Resource Planning (ERP) tool like SAP, payments can be automatically completed, and system records can be automatically prepared.
Data extraction remains the obstacle to back-office automation because manual data extraction is not comprehensive. Due to the prohibitive cost of manual data extraction, companies extract only critical fields from documents, recording a few percent of the total information available in them. This limited information enables automation of the most critical processes, for example payment in the case of an invoice. However, other important processes such as VAT compliance validation or account prediction remain manual because the necessary data is not extracted from documents.
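To illustrate why downstream checks depend on the extracted fields, here is a minimal Python sketch of an arithmetic consistency check on VAT amounts. The field names (`net`, `vat_rate`, `vat_amount`, `gross`) and the one-cent tolerance are assumptions made for this example. Real VAT compliance validation involves far more than this arithmetic, but even this simple check cannot run if only the gross total was captured.

```python
from decimal import Decimal

def vat_is_consistent(net, vat_rate, vat_amount, gross, tolerance=Decimal("0.01")):
    """Check that the extracted VAT fields agree with each other.

    All field names and the tolerance are illustrative assumptions.
    """
    # VAT implied by the net amount and rate, rounded to cents.
    expected_vat = (net * vat_rate).quantize(Decimal("0.01"))
    vat_matches = abs(expected_vat - vat_amount) <= tolerance
    gross_matches = abs(net + vat_amount - gross) <= tolerance
    return vat_matches and gross_matches

# Hypothetical extracted invoice fields.
invoice = {
    "net": Decimal("100.00"),
    "vat_rate": Decimal("0.19"),
    "vat_amount": Decimal("19.00"),
    "gross": Decimal("119.00"),
}

print(vat_is_consistent(**invoice))
```

If the VAT amount or rate is missing from the extracted data, a person has to open the document and check it by hand, which is the situation described above.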
Manual data extraction and template-based solutions are the most widely available alternatives.
Template-based solutions allow companies to create templates that capture data from documents following a specific format. Given the wide variety of document formats, they rarely achieve high rates of automation.
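The limitation of templates can be seen in a small Python sketch. The template below is a hypothetical set of regular expressions written for one supplier's invoice layout; the patterns, field names, and sample documents are all invented for illustration and do not describe any particular product.

```python
import re

# Hypothetical template for one supplier's invoice layout (illustrative only).
TEMPLATE = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "total": re.compile(r"Total:\s*([\d.,]+)"),
}

def extract_with_template(text, template):
    """Return the fields the template finds; fields it cannot find map to None."""
    result = {}
    for field, pattern in template.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result

# A document in the format the template was written for.
matching_doc = "Invoice No: A-1001\nTotal: 250.00"
# A document carrying the same information in a different layout and language.
other_format = "Rechnung Nr. 77\nGesamtbetrag 250,00 EUR"

print(extract_with_template(matching_doc, TEMPLATE))
print(extract_with_template(other_format, TEMPLATE))
```

The first document yields both fields, while the second yields none, even though it contains the same kind of information. Each new layout needs its own template, which is why coverage tends to stay low when formats vary widely.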
Manual processing has its own flaws: it is slow, expensive, inaccurate and demotivating.
Automated data extraction quality is increasing, and it is becoming a superior alternative to manual or template-based data extraction. For example, in our benchmark study, Hypatos deep learning technology was able to extract ~50 fields per invoice correctly. Given this trend, companies need to automate data extraction, both to reap the direct benefits of automated extraction from documents and to enable increased automation in document processing.
Image recognition is a good example of an area where machine learning research has led to significant improvements in performance in the last few years. Though there are fewer benchmarks on document extraction, our client discussions indicate that automated data extraction performance has improved dramatically over the past ~5 years.
Given the availability of high-performance solutions and the potential automation benefits, the need to automate data extraction is obvious. However, most large companies deal with hundreds of different forms, so it is important to identify which processes to automate first. An initial data extraction project that delivers outsized benefits in a short amount of time can help convince management to automate more processes and trigger a transformation in the company’s efficiency.
Ideally, you should start by automating high-volume, complex documents with the most advanced processing steps, for which you can find off-the-shelf solutions. More formally, the metrics to pay attention to are document volume, document complexity, the number of downstream processing steps, and the availability of off-the-shelf solutions.
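One way to make this prioritization concrete is a simple scoring sketch in Python. The scoring formula, the 1 to 5 complexity scale, and the example document types and numbers below are all assumptions chosen for illustration. They are not a method prescribed by this article, and any real weighting would need to reflect your own processes.

```python
from dataclasses import dataclass

@dataclass
class DocumentType:
    name: str
    monthly_volume: int      # documents received per month
    complexity: int          # assumed scale: 1 (simple) to 5 (complex)
    processing_steps: int    # downstream steps that depend on extracted data
    off_the_shelf: bool      # is a ready-made extraction solution available?

def priority_score(doc: DocumentType) -> int:
    """Illustrative score: favors high volume, complexity and processing depth."""
    if not doc.off_the_shelf:
        # Without an off-the-shelf solution, a quick first win is unlikely.
        return 0
    return doc.monthly_volume * doc.complexity * doc.processing_steps

# Hypothetical candidates.
candidates = [
    DocumentType("expense receipts", 4000, 2, 2, True),
    DocumentType("custom contracts", 500, 5, 5, False),
    DocumentType("supplier invoices", 10000, 3, 4, True),
]

ranked = sorted(candidates, key=priority_score, reverse=True)
for doc in ranked:
    print(doc.name, priority_score(doc))
```

A multiplicative score is used here so that a document type must do reasonably well on every metric to rank highly. A weighted sum would be an equally defensible choice.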
We have helped numerous Fortune 500 companies and the Big 4, the largest consumers of financial documents, automate data extraction. Let us know your document-related challenges.