Using Zonal OCR to Extract Data Fields From Scanned Documents

Zonal Optical Character Recognition (OCR), also sometimes referred to as Template OCR, is a technology used to extract text located at a specific location inside a scanned document. In this article we’ll explain how Zonal OCR works and how it can be used to automate data-entry workflows.

Most of today’s PDF scanning solutions offer out-of-the-box Optical Character Recognition (OCR) capabilities, which convert your scanned images into searchable and editable PDF documents. In some cases, however, a simple OCR system is not enough. For example, you may not be interested in the whole text of a document and only want to pull certain text elements located at specific positions.

This is where a technology called “Zonal OCR” (also referred to as Template OCR) comes into play. Zonal OCR lets you extract only the important data fields from a scanned document and store the extracted values in a structured database. Popular use cases for Zonal OCR include converting PDF to Excel and automated invoice processing.

How does Zonal OCR software work?

First, let’s talk a bit about what the term actually means. You have probably already read about OCR and how it is used to convert scanned documents into searchable and editable documents. But having the whole text of the document accessible is only the first step.

Zonal OCR goes one step further. Instead of only converting your scanned images into text, a Zonal OCR system can be trained to understand the structure and hierarchy of your document. By defining “zones”, it is possible to teach a zone-based OCR system to distinguish certain data fields from each other.

Let’s imagine your business receives hundreds of purchase orders or sales orders every week. Thanks to their consistent layout, it’s easy to teach a Zonal OCR system where certain data fields can be found. More advanced systems like Docparser can apply PDF data extraction to various layouts, for example in the case of invoice OCR processing.

To sum it up: Zonal OCR is a special type of Optical Character Recognition which extracts only certain text data fields from a document. The extraction is based on “zones” which are defined by the user prior to scanning.
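The idea of zones can be sketched in a few lines of Python. This is a minimal illustration, not Docparser’s implementation: it assumes a generic OCR engine has already returned each word together with its bounding box in page pixels, and the names `Box`, `Word` and `extract_zone`, as well as all coordinates, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page pixels, from (left, top) to (right, bottom)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """One recognised word plus its position (hypothetical OCR output shape)."""
    text: str
    box: Box


def extract_zone(words, zone):
    """Return the words whose centre point lies inside the zone, joined by spaces.

    Words are ordered top-to-bottom, then left-to-right. This assumes that
    words on the same text line share the same `top` coordinate.
    """
    hits = []
    for word in words:
        centre_x = (word.box.left + word.box.right) / 2
        centre_y = (word.box.top + word.box.bottom) / 2
        if zone.contains(centre_x, centre_y):
            hits.append((word.box.top, word.box.left, word.text))
    hits.sort()
    return " ".join(text for _, _, text in hits)


# Made-up OCR output for one scanned purchase order.
words = [
    Word("Purchase", Box(50, 40, 150, 60)),
    Word("Order", Box(160, 40, 220, 60)),
    Word("PO-10423", Box(420, 40, 520, 60)),
    Word("Acme", Box(50, 120, 100, 140)),
    Word("Corp", Box(110, 120, 150, 140)),
]

# Zones the user "drew" once for this layout (hypothetical coordinates).
zones = {
    "po_number": Box(400, 30, 560, 70),
    "client_name": Box(40, 110, 300, 150),
}

fields = {name: extract_zone(words, zone) for name, zone in zones.items()}
print(fields)
```

Testing the centre point of each word, rather than requiring the whole box to fit, makes the sketch tolerant of words that slightly overlap the zone border.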

Training Zonal OCR software

Training a Zonal OCR system means defining where each data field can be found inside a document. This only needs to be done once; the locations (zones) of the data fields are then saved in a template.
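A saved template could look something like the following sketch. The JSON layout, the field names and the helper functions `dump_template` and `load_template` are assumptions made for illustration; real products store their templates in their own formats.

```python
import json

# A hypothetical template: one named zone per data field,
# each stored as [left, top, right, bottom] in page pixels.
template = {
    "name": "purchase_order_v1",
    "zones": {
        "po_number": [400, 30, 560, 70],
        "client_name": [40, 110, 300, 150],
    },
}


def dump_template(template):
    """Serialise the template to a JSON string so it can be stored and reused."""
    return json.dumps(template, indent=2, sort_keys=True)


def load_template(text):
    """Parse a stored template and reject zones that are not valid rectangles."""
    data = json.loads(text)
    for field, zone in data["zones"].items():
        if len(zone) != 4:
            raise ValueError(f"zone {field!r} needs exactly four coordinates")
        left, top, right, bottom = zone
        if not (left < right and top < bottom):
            raise ValueError(f"zone {field!r} is not a valid rectangle")
    return data


stored = dump_template(template)
restored = load_template(stored)
print(restored["zones"]["po_number"])
```

Validating the zones on load catches a mis-drawn or corrupted template before it is applied to a whole batch of documents.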

Once you have trained your system properly, the zone templates can be used for scanning further documents. And this is where Zonal OCR really shines.

Big batches of documents with the same layout can be processed in a snap once the system has been trained properly. Need to extract client names and reference numbers from hundreds of quotes, purchase orders or sales orders? No problem at all. Once you have set up your master template, all you need to do is feed more documents to the system.
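Batch processing with one master template can be sketched as follows. This is an illustrative example under stated assumptions: each document is represented by made-up OCR output as `(text, left, top, right, bottom)` tuples, and `ZONES`, `words_in_zone`, `process_batch` and `to_csv` are hypothetical names, not part of any real product.

```python
import csv
import io

# Master template: field name -> (left, top, right, bottom) in page pixels.
ZONES = {
    "client_name": (40, 110, 300, 150),
    "reference": (400, 30, 560, 70),
}


def words_in_zone(words, zone):
    """Join the words whose centre point falls inside the zone."""
    left, top, right, bottom = zone
    picked = [
        w for w in words
        if left <= (w[1] + w[3]) / 2 <= right and top <= (w[2] + w[4]) / 2 <= bottom
    ]
    picked.sort(key=lambda w: (w[2], w[1]))  # top-to-bottom, then left-to-right
    return " ".join(w[0] for w in picked)


def process_batch(documents, zones):
    """Apply the same zones to every document and return one row per document."""
    rows = []
    for doc_id, words in documents.items():
        row = {"document": doc_id}
        row.update({name: words_in_zone(words, zone) for name, zone in zones.items()})
        rows.append(row)
    return rows


def to_csv(rows, fieldnames):
    """Render the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two quotes with the same layout (made-up OCR output).
documents = {
    "quote-001.pdf": [
        ("Acme", 50, 120, 100, 140),
        ("Corp", 110, 120, 150, 140),
        ("Q-2001", 420, 40, 500, 60),
    ],
    "quote-002.pdf": [
        ("Globex", 50, 120, 120, 140),
        ("Q-2002", 420, 40, 500, 60),
    ],
}

rows = process_batch(documents, ZONES)
csv_text = to_csv(rows, ["document", "client_name", "reference"])
print(csv_text)
```

Because the zones are defined once and shared, adding more documents to the batch requires no further setup, which is the property the paragraph above describes.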

How Docparser uses Zonal OCR methods

Setting up a Zonal OCR system is straightforward in most cases. As the data extraction is based on the location inside the document, most solutions offer a visual “zone definition” process. The screenshot below shows the setup process in Docparser. All the user has to do is draw a rectangle around the area where the data field is located.

[Screenshot: zonal OCR zone definition in Docparser]

This process is then repeated for each data field the user wants to extract. In a typical scenario, the user defines a handful of zones, each of which yields one extracted data field.

Where Zonal OCR tends to fail

Most Zonal OCR systems are purely location based. The advantage of such systems is that the setup is very easy. As mentioned above, the user only needs to draw a rectangle (zone) around a specific area and the setup is done.

This, however, covers only a subset of cases. In reality, extracting data from semi-structured documents is a bit more complex. To give you a better picture, let’s look at some examples. The following cases cannot be handled by a simple Zonal OCR system:

  • Extracting compound data fields (e.g. First + Last Name, Postal Address, …)
  • Repeating data fields (e.g. Multiple product numbers, …)
  • Table data
  • Data fields with variable positions (e.g. Invoice totals, …)
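The last case in the list can be illustrated with a small sketch. It assumes made-up OCR output as `(text, left, top, right, bottom)` tuples, a hypothetical zone drawn around the invoice total on a short invoice, and a simple keyword-based alternative; none of this is Docparser’s actual method.

```python
import re

# Zone drawn around the total on a one-line invoice (hypothetical coordinates).
TOTAL_ZONE = (460, 300, 560, 330)


def text_in_zone(words, zone):
    """Purely location-based extraction: join words centred inside the zone."""
    left, top, right, bottom = zone
    picked = [
        w for w in words
        if left <= (w[1] + w[3]) / 2 <= right and top <= (w[2] + w[4]) / 2 <= bottom
    ]
    picked.sort(key=lambda w: (w[2], w[1]))
    return " ".join(w[0] for w in picked)


def page_text(words):
    """Rebuild plain text lines, assuming words on a line share the same top."""
    lines = {}
    for text, left, top, right, bottom in words:
        lines.setdefault(top, []).append((left, text))
    return "\n".join(
        " ".join(text for _, text in sorted(row)) for _, row in sorted(lines.items())
    )


def total_by_keyword(words):
    """Keyword-based extraction: find the amount that follows the label 'Total'."""
    match = re.search(r"\bTotal:?\s+(\d[\d.,]*)", page_text(words))
    return match.group(1) if match else None


short_invoice = [
    ("Widget", 50, 250, 120, 270), ("19.00", 480, 250, 540, 270),
    ("Total:", 380, 305, 440, 325), ("19.00", 480, 305, 540, 325),
]

# Two extra line items push the total further down the page.
long_invoice = [
    ("Widget", 50, 250, 120, 270), ("19.00", 480, 250, 540, 270),
    ("Gadget", 50, 280, 120, 300), ("5.50", 480, 280, 540, 300),
    ("Gizmo", 50, 305, 110, 325), ("7.25", 480, 305, 540, 325),
    ("Total:", 380, 360, 440, 380), ("31.75", 480, 360, 540, 380),
]

zone_short = text_in_zone(short_invoice, TOTAL_ZONE)
zone_long = text_in_zone(long_invoice, TOTAL_ZONE)
keyword_long = total_by_keyword(long_invoice)
print(zone_short, zone_long, keyword_long)
```

On the longer invoice the fixed zone silently picks up a line-item price instead of the total, while the keyword search still finds the right amount. The keyword approach in turn depends on the label being present and recognised correctly.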

This is why Docparser offers a powerful set of features which go beyond the capabilities of a classic Zonal OCR system. With these tools, Docparser compensates for the shortcomings of classic Zonal OCR systems. Start a free trial and give it a try today! 🙂

25 thoughts on “Using Zonal OCR to Extract Data Fields From Scanned Documents”

  1. Hi, I am looking for software that will search a PDF & Word document and then fill in fields such as name, address, number etc. The application is that we download a hundred CVs a day, all in different formats. Rather than copying and pasting name fields into the Excel spreadsheet, which we then upload into our CRM, we would like software to do it, maybe based on name, address, phone fields etc. Does your software do this?

    1. Hi Simon, great question! To be completely honest, I don’t think that Docparser is a great fit for this use case at the moment. We focus mainly on getting data out of documents with a fixed structure, or that are at least semi-structured. There are, however, companies that focus entirely on getting data out of CVs. The software category for this is “CV parser”, and you should find some solutions by searching on Google.

  2. Hello ~ I am looking for a solution for extracting data from tax returns. Would tax returns be a consistent enough format for this product? I’m assuming we’d have to “teach” it all the different forms (i.e. 1040, 1041, 1065, 1120S, state forms, etc.)?

    1. Hi Christie! Yes, tax returns should be something Docparser can handle. And yes, you need to create a set of parsing rules for each document layout. I would suggest you create a free trial account and give it a try. We’ll be more than happy to help you set up your parsing rules.

      1. Hi,
        I am looking for a solution for extracting data from receipts in image format (e.g. jpg, jpeg, png, …).
        Can the solution be used for images, and can it be trained for different receipt formats?

        Thanks.

      2. Hi Myat! Thanks for the great question! We are actually working on supporting image file formats such as JPG, PNG and TIFF. And yes, you can train Docparser to extract data from various document layouts. However, we don’t support photos of receipts which are not well aligned. Docparser was designed to process documents which either come as native PDFs or were scanned with a flatbed scanner.

  3. Hello,

    My work primarily involves extracting information from PDF documents (scientific publications) and populating an extraction grid in Excel with this data. Can docparser:

    A. locate specific keywords/data in a PDF document and highlight them?
    B. extract that specific information from the PDF document into a structured Data extraction grid in excel?

    Ideally, the extracted data from the PDF would be assigned to a specific column (an example data element to be extracted might be “mean age”, which first, would be highlighted in the PDF after the OCR/docparser would identify it in the document, and then, that information would be extracted and precisely placed in the Excel column labelled “Mean Age”).

    As part of the data extraction, I often conduct literature reviews and need to repeat this process for multiple PDF documents so any method that could expedite the process would be very useful.

    1. Hi Briain, thanks a lot for reaching out! While we can extract data from documents which come in a semi-structured format, we can’t fill in specific cells in a given Excel document. You can, however, extract data fields based on keywords and then download the data as an Excel / CSV file generated by our system.

  4. DocParser,

    Our document (SSN card, driver’s license) may come from fax or email attachment (scanned ssn card or driver’s license ). We are looking for a solution that can integrate with fax and email, extract the data to json format. Does DocParser support that? If not all, what are supported?

    Thanks,

    1. Hi Zhenwu, thanks a lot for reaching out and for your interest in Docparser! Our solution offers a built-in email reception feature. You just need to forward your documents to the email address which comes with your parser and they will be processed right after reception. Regarding the faxes, you would need to obtain a digital copy of your fax in PDF, JPG, PNG or TIFF format. Chances are high that your fax solution already provides such functionality. Once you have a digital copy of your fax, you can forward it to Docparser by email or use our API to import it.

    1. Hi Chris! Unfortunately our solution Docparser does not recognize handwritten text. There are other OCR solutions that might be able to help you, but please note that you won’t reach a very high OCR accuracy rate for handwritten text. When using OCR for handwritten text, you always need to double-check the results.

  5. Hi .

    We deal with the extraction of addresses (usually near the top left corner) from A4 documents. Can your system retrieve this and isolate it from the rest of the text on the document. The documents are fairly regimented, but not perfectly structured. Ideally, we would like to provide this via an API integration. Is this possible?

    1. Hi David! Docparser works best if your documents have the exact same format. If you have an address at the exact same position, you can use our address normalization filter. We can extract data from variable positions, but unfortunately this method is better suited to data following a fixed pattern (e.g. invoice numbers). I would suggest searching for “CV parser” software. It sounds like a CV parser could be helpful to you.

  6. Hello,
    I am looking for a way to capture addresses for return mail. Just to clarify, this system can capture addresses, correct?
    When the data is captured, can it be exported to an excel file or a database?
    Also, can this system be used on an ios or android device?
    Thanks!

    1. Hi Lauren, thanks for the question! Yes, Docparser can parse postal addresses. However, our address parser only works if the addresses are located in the exact same position inside a document. Unfortunately, we don’t provide an SDK for iOS or Android. Docparser is a web-based application focused on “documents” (native PDF and scanned documents) and we don’t support parsing of photos taken with a camera at this point in time.

  7. I NEED TO EXTRACT PRODUCT DATA FROM AN EMAIL ATTACHMENT. THE FILE IS A SHIPPING RECEIPT PDF. I WOULD NEED TO EXTRACT THE SHIPPED DATA ITEMS AND SAVED INTO AN EXCEL FILE WITH THE EMAIL NAME. IS THIS POSSIBLE?

  8. Hi DocParser,

    I am trying to convert a handwritten form to digital. The form structure remains the same, but there are some boxes which are filled in by the user (handwritten). Is there any way to segment out the handwritten part?
    If I am able to take that part out, I am planning to use OCR. Please let me know. Thanks

    1. Hi Pinaki! Thanks for the great question. Unfortunately, Docparser is not able to recognize handwritten text reliably. To my knowledge, all OCR solutions for handwritten text (ICR) require human validation due to low accuracy levels. I would suggest that you look into employing a data entry service such as keyers-net.com.

  9. Will docparser be able to scan multiple sections of a page? For example, if there were 4 receipts on every page, could it scan each receipt separately and save it to the receipt number.pdf?

    1. Hi Debbie! Thanks for the great question! Docparser works best when one document equals one set of data. You can still define multiple regions and get all the data at once, but we won’t be able to separate out part of the document and save it under a new filename.

  10. Hi,
    Can your software be useful for extracting data from a handwritten document that has a column and over a hundred rows?

    1. Hi Dipo, thanks a lot for reaching out and for your interest in Docparser! Unfortunately, we don’t offer handwriting detection.
