Zonal OCR

Using Zonal OCR to Extract Data Fields From Scanned Documents

Zonal Optical Character Recognition (OCR), sometimes also called Template OCR, is a technology for extracting text found at a specific position inside a scanned document. In this article we’ll explain how Zonal OCR works and how it can be used to automate data-entry workflows.

Most of today’s document and PDF scanning solutions offer out-of-the-box Optical Character Recognition (OCR) capabilities that convert your scanned images (JPG, PNG, or TIFF files) into searchable and editable PDF documents. In some cases, however, a simple OCR system is not enough, for example if you are not interested in the whole text of a document but only want to pull certain text elements located at specific positions.

This is where a technology called “Zonal OCR” comes into play. Zonal OCR lets you extract only the important data fields from a scanned document and store the extracted values in a structured database. Popular use cases for Zonal OCR include converting PDF to Excel and automated invoice processing.

How does Zonal OCR software work?

First, let’s talk a bit about what the term actually means. You have probably already read about OCR and how it is used to convert scanned documents into searchable and editable documents. But having the whole text of the document accessible is only the first step.

Zonal OCR goes one step further. Instead of only converting your scanned images into text, a Zonal OCR system can be trained to understand the structure and hierarchy of your document. By defining “zones”, it is possible to teach a zone-based OCR system to distinguish certain data fields from each other.

Let’s imagine your business receives hundreds of purchase orders or sales orders every week. If they share a consistent layout, it’s easy to teach a Zonal OCR system where certain data fields can be found. More advanced systems like Docparser can apply PDF data extraction to various layouts, for example in the case of invoice OCR processing.

To sum it up: Zonal OCR is a special type of Optical Character Recognition which extracts only certain text data fields from a document. The extraction is based on “zones” which are defined by the user prior to scanning.
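To make the idea of zones more concrete, here is a minimal Python sketch. It assumes that an OCR engine has already returned every recognized word together with its bounding box; the Word and Zone classes, the field names and the sample coordinates are all made up for illustration and are not how Docparser or any particular product implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word plus its bounding box, in page pixels (assumed OCR output)."""
    text: str
    left: int
    top: int
    width: int
    height: int


@dataclass(frozen=True)
class Zone:
    """A user-defined rectangle that is expected to contain one data field."""
    name: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, word: Word) -> bool:
        # A word belongs to the zone when its center point lies inside it.
        cx = word.left + word.width / 2
        cy = word.top + word.height / 2
        return self.left <= cx <= self.right and self.top <= cy <= self.bottom


def extract_zone(words: list[Word], zone: Zone) -> str:
    """Return the text of all words inside the zone, top to bottom, left to right."""
    hits = [w for w in words if zone.contains(w)]
    hits.sort(key=lambda w: (w.top, w.left))
    return " ".join(w.text for w in hits)


# Hypothetical OCR output for one purchase order page.
page = [
    Word("Purchase", 40, 30, 90, 20),
    Word("Order", 140, 30, 60, 20),
    Word("PO-10023", 480, 30, 100, 20),
    Word("Acme", 40, 120, 55, 20),
    Word("Corp", 100, 120, 50, 20),
]

# Hypothetical zones, as a user might draw them.
zones = [
    Zone("po_number", 450, 20, 600, 60),
    Zone("customer", 30, 110, 300, 150),
]

fields = {z.name: extract_zone(page, z) for z in zones}
print(fields)
```

Everything outside the two zones (here the “Purchase Order” heading) is simply ignored. Sorting by the top coordinate is a simplification; real scans usually need some tolerance for words on the same line whose boxes are a few pixels apart.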

Training Zonal OCR software

Training a Zonal OCR system basically means defining where each data field can be found inside a document. This needs to be done only once; the locations (zones) of the data fields are then saved in a template.

Once you have trained your system properly, the zone templates can be used for scanning further documents. And this is where Zonal OCR really shines.

Big batches of documents with the same layout can be processed in a snap once the system has been trained properly. Need to extract client names and reference numbers from hundreds of quotes, purchase orders or sales orders? No problem at all. Once you have set up your master template, all you need to do is feed more documents to the system.
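As a rough illustration of “train once, then feed more documents”, the Python sketch below stores a template as JSON and applies it to a small batch. The template format, the field names and the sample words are assumptions made for this example; zones are stored as fractions of the page size so that the same template still fits scans of different resolutions.

```python
import csv
import io
import json

# Hypothetical template: each zone is (left, top, right, bottom) expressed as a
# fraction of the page width and height.
TEMPLATE_JSON = """
{
  "name": "purchase_order_v1",
  "zones": {
    "customer":  [0.05, 0.10, 0.50, 0.16],
    "reference": [0.70, 0.02, 0.98, 0.07]
  }
}
"""


def apply_template(template, page_size, words):
    """Extract one value per zone from words given as (text, x_center, y_center)."""
    page_w, page_h = page_size
    row = {}
    for field, (left, top, right, bottom) in template["zones"].items():
        # Scale the relative zone to this page's pixel dimensions.
        x0, y0, x1, y1 = left * page_w, top * page_h, right * page_w, bottom * page_h
        inside = [
            (y, x, text)
            for text, x, y in words
            if x0 <= x <= x1 and y0 <= y <= y1
        ]
        # Sorting the (y, x, text) tuples puts the words in reading order.
        row[field] = " ".join(text for _, _, text in sorted(inside))
    return row


template = json.loads(TEMPLATE_JSON)

# Two made-up documents with the same layout, scanned at different resolutions.
batch = [
    ((1000, 1400), [("Acme", 80, 180), ("Corp", 150, 180), ("PO-10023", 850, 60)]),
    ((2000, 2800), [("Globex", 180, 360), ("PO-10024", 1700, 120)]),
]

rows = [apply_template(template, size, words) for size, words in batch]

# Write the structured result as CSV, one row per document.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(template["zones"]))
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The template is the only thing that has to be defined by hand. Every additional document with the same layout becomes one more row in the output.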

How Docparser uses Zonal OCR methods

Setting up a Zonal OCR system is straightforward in most cases. As the data extraction is based on the location inside the document, most solutions offer a visual “zone definition” process. The screenshot below shows the setup process of Docparser. All the user has to do is draw a rectangle around the area where the data field is located.

[Screenshot: defining a Zonal OCR zone in Docparser]

This process is then repeated for each data field the user wants to extract. In a typical scenario, the user needs to define a handful of zones, which then result in the same number of extracted data fields.

Where Zonal OCR tends to fail

Most Zonal OCR systems are purely location-based. The advantage of such systems is that the setup is very easy. As mentioned above, the user only needs to draw a rectangle (zone) around a specific area and the setup is done.

This, however, covers only a subset of cases. In reality, extracting data from semi-structured documents is a bit more complex. To give you a better picture, let’s look at some examples. The following cases cannot be handled by a simple Zonal OCR system:

  • Extracting compound data fields (e.g. First + Last Name, Postal Address, …)
  • Repeating data fields (e.g. Multiple product numbers, …)
  • Table data
  • Data fields with variable positions (e.g. Invoice totals, …)
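To illustrate the last point: an invoice total moves down the page as the number of line items grows, so a fixed zone misses it. One common workaround, sketched below in Python, is to anchor on a label in the recognized text instead of a position. The sample invoices and the pattern are invented for this example, and this is a generic technique rather than a description of how Docparser works internally.

```python
import re

# Hypothetical full-page OCR text of two invoices. The total sits on a
# different line in each because the number of line items varies.
invoice_a = """Invoice INV-001
Widget      2 x 10.00
Total: 20.00 EUR
"""

invoice_b = """Invoice INV-002
Widget      2 x 10.00
Gadget      1 x 5.50
Shipping    4.50
Total: 30.00 EUR
"""

# Anchor on the label "Total" at the start of a line instead of a fixed zone.
TOTAL_PATTERN = re.compile(
    r"^Total:?\s+(\d+(?:[.,]\d{2})?)",
    re.MULTILINE | re.IGNORECASE,
)


def find_total(text):
    """Return the amount that follows a 'Total' label, or None if there is none."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None


print(find_total(invoice_a))
print(find_total(invoice_b))
```

The same idea extends to repeating fields, for example by collecting every match of a product-number pattern instead of only the first one.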

This is why Docparser offers a powerful set of features which goes beyond the capabilities of a classic Zonal OCR system. By offering more sophisticated tools, Docparser compensates for the shortcomings of classic Zonal OCR systems. Start a free trial and give it a try! 🙂

40 thoughts on “Using Zonal OCR to Extract Data Fields From Scanned Documents”

  1. Hi, I am looking for software that will search a PDF & Word document then fill in fields such as name, address, number etc. The application is that we download a hundred CVs a day, all in different formats. Rather than copying and pasting name fields into the Excel spreadsheet, which we then upload into our CRM, we could get software to do it, maybe based on name, address, phone fields etc. Does your software do this?

    1. Hi Simon, great question! To be completely honest, I don’t think that Docparser is a great fit for this use-case at this moment in time. We focus mainly on getting data out of documents with a fixed structure, or that are at least semi-structured. There are however companies focusing their work entirely on getting data out of CVs. The software category for this would be “CV Parser” and you should find some solutions when searching in Google.

  2. Hello ~ I am looking for a solution for extracting data from tax returns. Would tax returns be a consistent enough format for this product? I’m assuming we’d have to “teach” it all the different forms (i.e. 1040, 1041, 1065, 1120S, state forms, etc.)?

    1. Hi Christie! Yes, tax returns should be something Docparser can handle. And yes, you need to create a set of parsing rules for each document layout. I would suggest you create a free trial account and give it a try. We’ll be more than happy to help you set up your parsing rules.

      1. Hi,
        I am looking for a solution for extracting data from receipt in image format (eg. jpg, jpeg,png,..).
        Can the solution be used for image and can be trained for different receipt format?

        Thanks.

      2. Hi Myat! Thanks for the great question! We are actually working on supporting image file formats such as JPG, PNG and TIFF. And yes, you can train Docparser to extract data from various document layouts. However, we don’t support photos of receipts which are not well aligned. Docparser was designed to process documents which either come in native PDF format or were scanned with a flatbed scanner.

  3. Hello,

    My work primarily involves extracting information from PDF documents (scientific publications) and populating an extraction grid in Excel with this data. Can docparser:

    A. locate specific keywords/data in a PDF document and highlight them?
    B. extract that specific information from the PDF document into a structured Data extraction grid in excel?

    Ideally, the extracted data from the PDF would be assigned to a specific column (an example data element to be extracted might be “mean age”, which first, would be highlighted in the PDF after the OCR/docparser would identify it in the document, and then, that information would be extracted and precisely placed in the Excel column labelled “Mean Age”).

    As part of the data extraction, I often conduct literature reviews and need to repeat this process for multiple PDF documents so any method that could expedite the process would be very useful.

    1. Hi Briain, thanks a lot for reaching out! While we can extract data from documents which come in a semi-structured format, we can’t fill in specific cells in a given Excel document. You can however extract data fields based on keywords and then download the data in an Excel / CSV file generated by our system.

    2. Hi,

      please can you recommend a software that supports extraction of data from image file formats such as JPG, PNG and TIFF

  4. DocParser,

    Our document (SSN card, driver’s license) may come from fax or email attachment (scanned ssn card or driver’s license ). We are looking for a solution that can integrate with fax and email, extract the data to json format. Does DocParser support that? If not all, what are supported?

    Thanks,

    1. Hi Zhenwu, thanks a lot for reaching out and your interest in Docparser! Our solution offers a built-in email reception feature. You just need to forward your documents to the email address which comes with your parser and they will get processed right after reception. Regarding the faxes, you would need to obtain a digital copy of your fax in PDF, JPG, PNG or TIFF format. Chances are high that your fax solution already provides such functionality. Once you have a digital copy of your fax, you can forward it to Docparser by email or use our API to import it.

    1. Hi Chris! Unfortunately our solution Docparser does not recognize handwritten text. There are other OCR solutions that might be able to help you, but please note that you won’t reach a very high OCR accuracy rate for handwritten text. When using OCR for handwritten text, you always need to double-check the results.

  5. Hi .

    We deal with the extraction of addresses (usually near the top left corner) from A4 documents. Can your system retrieve this and isolate it from the rest of the text on the document. The documents are fairly regimented, but not perfectly structured. Ideally, we would like to provide this via an API integration. Is this possible?

    1. Hi David! Docparser works best if your documents have the exact same format. If you have an address at the exact same position, you can use our address normalization filter. We can extract data from variable positions, but unfortunately this method is more suited to data following fixed patterns (e.g. invoice numbers). I would suggest searching for “CV parser” software. It sounds like a CV parser could be helpful to you.

  6. Hello,
    I am looking for a way to capture addresses for return mail. Just to clarify, this system can capture addresses, correct?
    When the data is captured, can it be exported to an excel file or a database?
    Also, can this system be used on an ios or android device?
    Thanks!

    1. Hi Lauren, thanks for the question! Yes, Docparser can parse postal addresses. However, our address parser only works if the addresses are located in the exact same position inside each document. We don’t provide any SDK for iOS or Android unfortunately. Docparser is a web-based application focused on “documents” (native PDF and scanned documents) and we don’t support parsing of photos taken with a camera at this point in time.

  7. I NEED TO EXTRACT PRODUCT DATA FROM AN EMAIL ATTACHMENT. THE FILE IS A SHIPPING RECEIPT PDF. I WOULD NEED TO EXTRACT THE SHIPPED DATA ITEMS AND SAVED INTO AN EXCEL FILE WITH THE EMAIL NAME. IS THIS POSSIBLE?

  8. Hi DocParser,

    I am trying to convert handwritten form to digital. Form structure remains same but there are some boxes which are filled by user(handwritten). Is there anyway to segment out the handwritten part?
    If I am able to take that part out, i am planning to use OCR . Please let me know. Thanks

    1. Hi Pinaki! Thanks for the great question. Unfortunately Docparser is not able to recognize handwritten text reliably. To my knowledge, all OCR solutions for handwritten text (ICR) require human validation due to low accuracy levels. I would suggest that you look into employing a data entry service such as keyers-net.com.

  9. Will docparser be able to scan multiple sections of a page? For example, if there were 4 receipts on every page, could it scan each receipt separately and save it to the receipt number.pdf

    1. Hi Debbie! Thanks for the great question! Docparser works best when one document equals one set of data. You can still define multiple regions and get all the data at once, but we won’t be able to split off part of the document and save it under a new filename.

  10. Hi,
    Can your software be useful for extracting data from a handwritten document that has a column and over a hundred rows?

    1. Hi Dipo, thanks a lot for reaching out and your interest in Docparser! Unfortunately we don’t offer handwriting detection.

    1. Hi Dino! Thanks for the kind words and your question. I’m afraid Docparser is not a good fit for your use-case. Docparser is all about getting data from recurring documents with fixed layouts (e.g. Purchase Orders, Invoices, …). I’m sorry for the bad news and hope you’ll find a solution to your problem soon.

  11. Hello, I have a scenario with 20 document types, all in a standard format, all going into one email box. Can Docparser identify which format (of the 20) a specific document without any QR/barcode, based solely on the document format? Moreover, can the parsed document then be sent to a different mailbox based on the format type? Finally, can any non-identified documents be sent to an alternate (21st) mailbox? Thank you.

    1. Hi Shawn, thanks a lot for reaching out and the great question!

      At this point, Docparser does not have a built-in “send email” function. While Docparser can receive your documents by email, we can only send out parsed data (incl. a link to the document) via HTTP requests (webhook integrations). You can however use one of our integration partners (Zapier, MS Flow, Workato, Stamplay, …) to send out emails including the document as an attachment.

      Whether or not you can identify a document layout with Docparser highly depends on the content and layout of the documents. In most cases, it’s possible to find simple rules like “if … is present in the top left corner, classify as type A”. I would suggest creating a free trial account and just giving it a spin. Our support staff will be more than happy to help you with the setup once you give us access to your documents.

  12. I’m trying to identify empty fields in OCR forms where the fields are to be filled out by hand. I don’t care what the data is in the field only that it is not empty as a quality check for completeness. Can this tool do that?

    1. Hi Amy! I’m afraid Docparser is not a good fit for your use-case. Right now, Docparser is not capable of detecting handwritten text accurately. This means that form fields would quite often be classified as empty, even though they are filled with handwritten text. I’m sorry for the bad news!

  13. I have an old ephemeris that I want to preserve. The info is mostly table data with many glyphs and symbols. Can docuparser be trained to read this info and save it in PDF and other editable files?

    1. Thanks for the question! It’s difficult to answer this question without having a look at some sample documents. I would recommend creating a free trial account and giving Docparser a spin. Our support team will be happy to assess your sample documents once they are uploaded to your free trial account.

  14. Firstly, Can docparser convert semi structured documents into structured spreadsheets in batches (nightly) from a specified folder? I would like to automate the zonal OCR process for a set of documents.
    Secondly, Is docparser purely Cloud based or deployed On premise too?

    Thanks!

    1. Hi Mukund! Yes, Docparser can handle semi-structured documents to a certain degree. You can read more about extracting data from semi-structured documents in our knowledge base. And yes, Docparser can import documents from a watched folder (Dropbox, Google Drive, …) and process documents in batches. To answer your last question, Docparser is a cloud-only solution. I would recommend starting a free trial and giving it a try! 🙂

  15. Hi,
    I am looking for a solution to use a camera phone to scan a prescription to get name, address, medication etc. Can the software work on mobile? Scanned by hand, by a customer not a flatbed scanner?
    Thanks.
    Ian

    1. Hi Ian, thanks for the question! Unfortunately Docparser is not a good fit for your use-case. We specialize in getting data from scanned and PDF documents.

Comments are closed.