Zonal OCR, or Zonal Optical Character Recognition, also sometimes referred to as Template OCR, is a technology used to extract text located at a specific location inside a scanned document. This article will explain how Zonal OCR works and how it can automate data-entry workflows.
Most of today’s document and PDF scanning tools offer out-of-the-box Optical Character Recognition (OCR) capabilities that convert your scanned images (JPG, PNG, or TIFF files) into searchable and editable PDF documents. In some cases, however, a simple OCR system is not enough: for example, when you are not interested in the whole text of a document and instead want to pull certain text elements located at specific positions.
These are the situations where a technology called “Zonal OCR” (also referred to as Template OCR) comes into play. Zonal OCR extracts only the essential data fields from a scanned document and stores the extracted values in a structured database. Widespread use cases for Zonal OCR include converting PDF to Excel and automated invoice processing.
Use Zonal OCR to Extract Data Fields From Scanned Documents
Convert your scanned images into searchable PDF documents!
Try Docparser for free. No credit card required.
Table of Contents
- The History of the PDF
- How does Zonal OCR software work?
- How can Docparser help?
- Frequently Asked Questions (FAQs)
The History of the PDF
Let’s add in a little history lesson for you. Adobe invented the PDF in 1993. It was popular in desktop publishing workflows.
The PDF was a proprietary format until it was released as an open standard in 2008 and published by the International Organization for Standardization as ISO 32000.
Now, there are multiple editions of the PDF.
A PDF combines vector graphics, text, and raster graphics. Basic types of PDF content include:
- Typeset text stored as content streams (not encoded in plain text)
- Raster graphics for photographs
- Vector graphics for illustrations
The skeleton of a PDF combines three technologies:
- A font-embedding system for fonts to travel along with the document
- A subset of the PostScript page description programming language
- A structured storage system for bundling these elements into a compact file
How do PDFs store data?
A PDF (Portable Document Format) is a file format that combines vector and raster graphics in one compact file. A single PDF can contain several pages. The most important thing to remember is that PDFs preserve layers and feature attributes.
PDF files keep their visual quality because the format preserves vector graphics. They also store all map information in a single file, making it easy to share content with users around the globe who may or may not have access to the internet.
How to Extract Data from a PDF?
Docparser can extract data from your PDF in the following three ways:
Manual text extraction is tedious and time-consuming, and many of the tasks are repetitive. Automated extraction simplifies the process, saving you time and resources.
Again, table extraction is time-consuming and mundane. Docparser offers batch PDF processing, a built-in OCR, and lets you create custom data extraction rules.
Some documents involve images, graphics, or scanned content. For example, when researchers publish one of their articles, they may turn their document into a PDF for reference purposes.
What is PDF/A? Is that a type of PDF?
PDF/A is an archival format of a PDF. It embeds all fonts used throughout the document within the PDF file. This means that someone reading your file won’t need to have the same fonts you chose for the file.
PDF/A exists to address the need to electronically archive documents for preservation purposes. This way, documents are retrieved and rendered with consistency and predictability.
How does Zonal OCR software work?
First, let’s talk a bit about what the term means. You probably already read about OCR and how it converts scanned documents into searchable and editable documents. But having the whole text of the document accessible is only the first step.
Zonal or Template OCR goes one step further. Instead of only converting your scanned images into text, a software system can be trained to understand the structure and hierarchy of your document. By defining “zones,” it is possible to teach a zone-based OCR system to distinguish specific data fields from each other.
Let’s imagine your business receives hundreds of purchase orders or sales orders every week. Thanks to a consistent layout, it’s easy to teach a Zonal OCR system where specific data fields can be found. In addition, more advanced solutions like Docparser can extract PDF data from documents with varying layouts, as is common in invoice OCR processing.
What is Zonal OCR?
To sum it up: Zonal OCR is a particular type of Optical Character Recognition that extracts only specific text data fields from a document. The extraction is based on “zones” defined by the user before scanning.
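The idea can be sketched in a few lines of Python. This is a minimal illustration, not Docparser's implementation: it assumes an OCR engine has already returned each recognized word with a bounding box, and the `Word`, `Zone`, and `extract_fields` names as well as the sample coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and its bounding box (hypothetical OCR output)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


@dataclass(frozen=True)
class Zone:
    """A named rectangle drawn by the user around one data field."""
    name: str
    x: float
    y: float
    w: float
    h: float

    def contains(self, word: Word) -> bool:
        # A word counts only if its whole box lies inside the zone.
        return (
            word.x >= self.x
            and word.y >= self.y
            and word.x + word.w <= self.x + self.w
            and word.y + word.h <= self.y + self.h
        )


def extract_fields(words, zones):
    """Return {zone name: text found inside that zone}."""
    fields = {}
    for zone in zones:
        inside = [w for w in words if zone.contains(w)]
        # Simplification: read top-to-bottom, then left-to-right.
        inside.sort(key=lambda w: (w.y, w.x))
        fields[zone.name] = " ".join(w.text for w in inside)
    return fields


# Made-up page: a client name top left, a reference top right, a footer.
words = [
    Word("ACME", 40, 30, 60, 12),
    Word("Corp", 105, 30, 45, 12),
    Word("PO-10442", 420, 30, 80, 12),
    Word("Thank", 40, 700, 50, 12),
    Word("you", 95, 700, 30, 12),
]
zones = [
    Zone("client_name", 30, 20, 200, 30),
    Zone("reference", 400, 20, 150, 30),
]

fields = extract_fields(words, zones)
print(fields)
```

Everything outside the two zones, such as the footer text, is simply ignored, which is the difference from full-text OCR.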
Training your software
Training a Zonal OCR system means defining where all data fields can be found inside a document. This process needs to be done only once, and the locations (zones) of the data fields are then saved in a template.
Once you train your system correctly, use the zone templates for scanning further documents. And this is where Zonal OCR shines.
Once the system is adequately trained, big batches of documents with the same layout can be processed in a snap. Need to extract client names and reference numbers from hundreds of quotes, purchase orders, or sales orders? No problem at all. Once you set up your master template, all you need to do is feed more documents to the system.
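As a rough sketch of the "define once, reuse for every document" workflow, the snippet below stores a template as JSON and applies it to a small batch. The template format, the field names, and the word boxes are assumptions made for illustration; they do not describe how Docparser stores its templates.

```python
import json

# Hypothetical template: saved once, reused for every document of this layout.
TEMPLATE_JSON = json.dumps({
    "layout": "purchase_order_v1",
    "zones": [
        {"name": "client_name", "x": 30, "y": 20, "w": 200, "h": 30},
        {"name": "reference", "x": 400, "y": 20, "w": 150, "h": 30},
    ],
})


def in_zone(word, zone):
    """True if the word's bounding box lies completely inside the zone."""
    return (
        zone["x"] <= word["x"]
        and zone["y"] <= word["y"]
        and word["x"] + word["w"] <= zone["x"] + zone["w"]
        and word["y"] + word["h"] <= zone["y"] + zone["h"]
    )


def apply_template(template, words):
    """Extract one value per zone from the OCR words of a single document."""
    row = {}
    for zone in template["zones"]:
        hits = sorted(
            (w for w in words if in_zone(w, zone)),
            key=lambda w: (w["y"], w["x"]),
        )
        row[zone["name"]] = " ".join(w["text"] for w in hits)
    return row


def process_batch(template_json, batch):
    """Load the saved template once and run it over every document."""
    template = json.loads(template_json)
    return [
        {"file": name, **apply_template(template, words)}
        for name, words in batch.items()
    ]


def word(text, x, y, w=60, h=12):
    """Helper to build a made-up OCR word box."""
    return {"text": text, "x": x, "y": y, "w": w, "h": h}


batch = {
    "po_001.pdf": [word("ACME", 40, 30), word("Corp", 105, 30),
                   word("PO-10442", 420, 30, w=80)],
    "po_002.pdf": [word("Globex", 40, 30), word("PO-10443", 420, 30, w=80)],
}

rows = process_batch(TEMPLATE_JSON, batch)
for row in rows:
    print(row)
```

Each document becomes one structured row, which is the shape you would then write to a spreadsheet or database.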
Where does Zonal OCR tend to fail?
Most Zonal OCR systems are purely location-based. The advantage of such systems is that the setup is straightforward. The user only needs to draw a rectangle (zone) around a specific area, and the setup is done.
Purely location-based extraction, however, covers only a subset of cases. Extracting data from semi-structured documents is a bit more complex. To give you a better picture, let’s look at some examples. A simple Zonal OCR system cannot handle the following cases:
- Extracting compound data fields (e.g., First + Last Name, Postal Address, …)
- Repeating data fields (e.g., Multiple product numbers, …)
- Table data
- Data fields with variable positions (e.g., Invoice totals, …)
This is why Docparser offers a powerful set of features that goes beyond the capabilities of a classic Zonal OCR system. By providing sophisticated tools, Docparser compensates perfectly for the shortcomings of traditional methods.
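One common way to handle fields that move around or repeat is to anchor on the text itself rather than on a position. The sketch below is a generic pattern-based approach and not a description of Docparser's features; the label wording ("Total", "Total due") and the product number format are assumptions chosen for the example.

```python
import re
from decimal import Decimal

# Assumed label and number format; real invoices need their own patterns.
TOTAL_PATTERN = re.compile(
    r"^\s*Total(?:\s+due)?\s*:?\s*\$?\s*([\d,]+\.\d{2})\s*$",
    re.IGNORECASE | re.MULTILINE,
)
# Assumed product number format: two capital letters, a hyphen, four digits.
PRODUCT_PATTERN = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def find_total(page_text):
    """Return the invoice total wherever its line appears, or None."""
    match = TOTAL_PATTERN.search(page_text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(",", ""))


def find_product_numbers(page_text):
    """Return every product number on the page (a repeating field)."""
    return PRODUCT_PATTERN.findall(page_text)


# Two invoices with the same fields; the total sits on a different line.
short_invoice = (
    "Invoice 2041\n"
    "AB-1001  Widget  40.00\n"
    "Subtotal: 40.00\n"
    "Total: 44.00\n"
)
long_invoice = (
    "Invoice 2042\n"
    "AB-1001  Widget  40.00\n"
    "CD-2002  Gadget  1,200.00\n"
    "EF-3003  Cable  9.50\n"
    "Subtotal: 1,249.50\n"
    "Total due: $1,374.45\n"
)

print(find_total(short_invoice), find_product_numbers(short_invoice))
print(find_total(long_invoice), find_product_numbers(long_invoice))
```

Because the total is located by its label, it is found whether the invoice has one line item or three, a case where a fixed rectangle would miss it.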
What is Full OCR versus Zonal OCR?
Full OCR is another type of optical character recognition. First, software, like Docparser, reads the entire document. Then, it places a text layer on top of the PDF document. The text layer allows the searching of the full content of the document.
Full OCR is best for reports, contracts, or any document where essential words and phrases need to be searchable inside the document management system.
Zonal OCR creates zones in documents and sets specific margins for entire pages. The data is extracted from these specified areas, and anything cropped out is left out. Any characters partially entered in zonal fields can’t be read.
Creating these zones or “smart zones” optimizes data extraction and accuracy and allows the user to set formatting rules for advanced document processing.
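To make the point about partially covered characters concrete, here is a small sketch that checks how a word's bounding box relates to a zone. The `Box` type, the `classify` and `expand` helpers, and the coordinates are hypothetical; the sketch only illustrates why a zone drawn too tightly clips a value and how a margin avoids that.

```python
from collections import namedtuple

# x, y = top-left corner; w, h = width and height (same units as the scan).
Box = namedtuple("Box", "x y w h")


def intersection_area(a, b):
    """Area shared by two boxes, or 0 if they do not overlap."""
    left = max(a.x, b.x)
    top = max(a.y, b.y)
    right = min(a.x + a.w, b.x + b.w)
    bottom = min(a.y + a.h, b.y + b.h)
    if right <= left or bottom <= top:
        return 0
    return (right - left) * (bottom - top)


def classify(word, zone):
    """Classify a word box as 'inside', 'clipped', or 'outside' the zone."""
    covered = intersection_area(word, zone)
    if covered == 0:
        return "outside"
    if covered == word.w * word.h:
        return "inside"
    return "clipped"


def expand(zone, margin):
    """Grow a zone by the same margin on every side."""
    return Box(zone.x - margin, zone.y - margin,
               zone.w + 2 * margin, zone.h + 2 * margin)


reference = Box(420, 30, 80, 12)   # e.g. the word "PO-10442"
footer = Box(40, 700, 50, 12)      # text far away from the zone
tight_zone = Box(400, 20, 60, 30)  # drawn too narrow for the reference
wide_zone = expand(tight_zone, 40)

print(classify(reference, tight_zone))  # clipped
print(classify(reference, wide_zone))   # inside
print(classify(footer, tight_zone))     # outside
```

A check like this can flag clipped values for review instead of silently returning a truncated field.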
How can Docparser help?
Setting up a zonal or template OCR system is straightforward in most cases. As the data extraction is based on the location inside the document, most solutions offer a visual “zone definition” process.
For example, in Docparser’s setup process, all the user must do is draw a rectangle around the area where the data field is located.
This process is repeated for each data field that the user wants to extract. In a typical scenario, the user needs to define a handful of zones, resulting in the equivalent number of extracted data fields.
Frequently Asked Questions (FAQs)
What are some common uses of Docparser?
We’re glad you asked our favorite question to answer. Docparser extracts data from PDFs. There are many uses, but some cases work better than others.
Docparser is best for batches of documents with similar layouts and structures. It’s best if you limit your document layouts to 1-100.
Standard document types include:
- Form-Based Contracts
- HR & Admin Documents
- Product Catalogues & Price Lists
- Bank Statements
- Fillable PDF Forms
- Accounts Payable & Invoice Processing and Automation
- Purchase & Sales Orders
- Shipping & Delivery Orders
- Various other transactional business documents
Do you offer webhooks and cloud integrations?
Webhooks and cloud integrations automatically import documents. Use them to import your files from a cloud storage provider or to copy the parsed data to a Google Spreadsheet, database, CRM, or API.
We allow various integrations. For starters, we allow direct integrations with third parties. These are easy to set up. For example, we can automatically import your documents to the application you want or send the parsed data to the app you connected.
Integration platforms allow you to send your parsed data to dozens of applications. You can also import your documents from different sources. Automating your data flow was never easier!
We recommend the following:
Webhooks are a form of cloud integration targeted towards developers. Webhooks are custom HTTP requests triggered each time a new document is parsed. The request is sent to an HTTP endpoint that can be defined in the format of your choice. Webhooks are triggered after we parse your document. We offer simple webhooks which let you define a target URL or advanced webhooks, giving you complete control over the HTTP request. Find out more here.
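For developers, the receiving side of a webhook can be sketched with Python's standard library alone. The snippet assumes the webhook sends a JSON object in a POST request; the payload field names, the port, and the `parse_payload`, `WebhookHandler`, and `run` names are illustrative assumptions, not Docparser's documented format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_payload(body: bytes) -> dict:
    """Decode a webhook body, assuming it is a JSON object."""
    data = json.loads(body.decode("utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


class WebhookHandler(BaseHTTPRequestHandler):
    """Accepts POST requests and prints the parsed fields."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            record = parse_payload(body)
        except ValueError:
            # Malformed or non-object body: reject the request.
            self.send_response(400)
            self.end_headers()
            return
        # A real endpoint would write the record to a database or CRM here.
        print("parsed document received:", record)
        self.send_response(200)
        self.end_headers()


def run(port=8000):
    """Start the endpoint locally; call run() to try it out."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()


# Hypothetical payload, shaped like the fields defined in a zone template.
example = parse_payload(b'{"client_name": "ACME Corp", "reference": "PO-10442"}')
print(example["reference"])
```

The target URL configured for the webhook would point at wherever this handler is deployed. Returning a non-200 status for malformed input makes delivery problems visible on the sending side.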
Can I cancel my account anytime?
Yes, you can cancel your account at any time. You can also upgrade or downgrade your paid subscription. When you cancel, your subscription is automatically terminated, and there are no additional payments required.
Cancel your subscription on our Subscription Plan page. Even though you’ve cancelled your subscription, we don’t close your account so that you can have access to your parsed data. Your account can also be entirely deleted from our system if you’d like.
How long do you store my data?
For as long as you specify. We store the original files and the parsed data for one month. After this, we destroy the data associated with the actual file and parsed data.
You can set a data retention timeline value between 0-120 days. Zero days of retention means your data is deleted immediately. However, in case of a Webhook error, we keep the data for one week for debugging.
Can I pause my subscription?
Yes. You can pause your subscription at any time for either 3- or 6-month intervals. When you pause, unused credits will be lost. Parsing of your documents is also queued until the account is reactivated, either after the pause or if you decide to come back sooner than anticipated.
Are there options for downloading parsed data?
Yes. Docparser offers two ways to download and export parsed data.
- Download as one single document.
- Download as multiple documents.
What languages does the OCR engine support?
We support a variety of languages. For text-based PDFs, we pull the text directly from the file. Please view our list of languages here.
In conclusion, Zonal OCR automates tedious processes of indexing fields. You can set up batches in Docparser to define, read, convert, and automatically populate fields within the specified zones. This reduces the amount of manual labor needed in the extraction process.
Zonal OCR helps you capture relevant documents from various file formats. You can also quickly shift all your business documents to paperless processing, making data accessible, searchable, and editable.
Hi, I am looking for software that will search a PDF & work document then fill in fields such as name address number etc. The application is that we download a hundred CV’s a day, all in different formats, rather than copying and pasting name fields into the excel spreadsheet, which we then upload into our crm, we can get software to do it, maybe based on Name, address, phone fields etc, does your software do this?
Hi Simon, great question! To be completely honest, I don’t think that Docparser is a great fit for this use-case at this moment in time. We focus mainly on getting data out of documents with a fixed structure or at least a semi-structured one. There are however companies focusing their work entirely on getting data out of CVs. The software category for this would be “CV Parser” and you should find some solutions when searching in Google.
Hello ~ I am looking for a solution for extracting data from tax returns. Would tax returns be a consistent enough format for this product? I’m assuming we’d have to “teach” it all the different forms (i.e. 1040, 1041, 1065, 1120S, state forms, etc.)?
Hi Christie! Yes, tax returns should be something Docparser can handle. And yes, you need to create a set of parsing rules for each document layout. I would suggest you create a free trial account and give it a try. We’ll be more than happy to help you set up your parsing rules.
I am looking for a solution for extracting data from receipt in image format (eg. jpg, jpeg,png,..).
Can the solution be used for image and can be trained for different receipt format?
Hi Myat! Thanks for the great question! We are actually working on supporting image file formats such as JPG, PNG and TIFF. And yes, you can train Docparser to extract data from various document layouts. However, we don’t support photos of receipts which are not well aligned. Docparser was designed to process documents which come either in native PDF format or were scanned with a flatbed scanner.
My work primarily involves extracting information from PDF documents (scientific publications) and populating an extraction grid in Excel with this data. Can docparser:
A. locate specific keywords/data in a PDF document and highlight them?
B. extract that specific information from the PDF document into a structured Data extraction grid in excel?
Ideally, the extracted data from the PDF would be assigned to a specific column (an example data element to be extracted might be “mean age”, which first, would be highlighted in the PDF after the OCR/docparser would identify it in the document, and then, that information would be extracted and precisely placed in the Excel column labelled “Mean Age”).
As part of the data extraction, I often conduct literature reviews and need to repeat this process for multiple PDF documents so any method that could expedite the process would be very useful.
Hi Briain, thanks a lot for reaching out! While we can extract data from documents which come in a semi-structured format, we can’t fill in specific cells in a given Excel document. You can however extract data fields based on keywords and then download the data in an Excel / CSV file generated by our system.
please can you recommend a software that supports extraction of data from image file formats such as JPG, PNG and TIFF
Hi Young, Docparser does support JPG, PNG and TIFF images. 🙂
Our document (SSN card, driver’s license) may come from fax or email attachment (scanned ssn card or driver’s license ). We are looking for a solution that can integrate with fax and email, extract the data to json format. Does DocParser support that? If not all, what are supported?
Hi Zhenwu, thanks a lot for reaching out and your interest in Docparser! Our solution offers a built-in email reception feature. You just need to forward your documents to the email address which comes with your parser and they will get processed right away after reception. Regarding the faxes, you would need to obtain a digital copy of your fax in PDF, JPG, PNG or TIFF format. Chances are high that your fax solution already provides such functionality. Once you have a digital copy of your fax, you can forward it to Docparser by email or use our API to import it.
Thanks a lot for the quick response. I will try it out.
Hi Can it recognize hand written in a form?
Hi Chris! Unfortunately our solution Docparser does not recognize handwritten text. There are other OCR solutions that might be able to help you, but please note that you won’t reach a very high OCR accuracy rate for handwritten text. When using OCR for handwritten text, you always need to double-check the results.
We deal with the extraction of addresses (usually near the top left corner) from A4 documents. Can your system retrieve this and isolate it from the rest of the text on the document. The documents are fairly regimented, but not perfectly structured. Ideally, we would like to provide this via an API integration. Is this possible?
Hi David! Docparser works best if your documents have the exact same format. If you have an address at the exact same position, you can use our address normalization filter. We can extract data from variable positions, but unfortunately this method is more suited for data following a fixed pattern (e.g. invoice numbers). I would suggest searching for a “CV parser” software. It sounds like a CV parser could be helpful to you.
this is the API you are looking for.
No need to use template and it work with any document
I am looking for a way to capture addresses for return mail. Just to clarify, this system can capture addresses, correct?
When the data is captured, can it be exported to an excel file or a database?
Also, can this system be used on an ios or android device?
Hi Lauren, thanks for the question! Yes, Docparser can parse postal addresses. However, our address parser only works if the addresses are located in the exact position inside a document. We don’t provide any SDK for iOS or Android unfortunately. Docparser is a web-based application focused on “documents” (native PDF and scanned documents) and we don’t support parsing of photos taken with a camera at this point in time.
I NEED TO EXTRACT PRODUCT DATA FROM AN EMAIL ATTACHMENT. THE FILE IS A SHIPPING RECEIPT PDF. I WOULD NEED TO EXTRACT THE SHIPPED DATA ITEMS AND SAVED INTO AN EXCEL FILE WITH THE EMAIL NAME. IS THIS POSSIBLE?
Hi Dianne, thanks for the question! Yes, this is possible. Docparser comes with a built-in email reception feature and you can use it for shipping receipts: https://docparserprod.wpengine.com/solutions/shipping-delivery-reports
I am trying to convert handwritten form to digital. Form structure remains same but there are some boxes which are filled by user(handwritten). Is there anyway to segment out the handwritten part?
If I am able to take that part out, i am planning to use OCR . Please let me know. Thanks
Hi Pinaki! Thanks for the great question. Unfortunately Docparser is not able to recognize handwritten text reliably. To my knowledge, all OCR solutions for hand-written text (ICR) require human validation due to low accuracy levels. I would suggest that you look into employing a data entry service such as keyers-net.com.
Will docparser be able to scan multiple sections of a page? For example, if there were 4 receipts on every page, could it scan each receipt separately and save it to the receipt number.pdf
Hi Debbie! Thanks for the great question! Docparser works best when one document equals to one set of data. You can still define multiple regions and get all the data at once, but we won’t be able to separate the part of the document and save it under a new filename.
Can your software be useful for extracting data from a handwritten document that has a column and over a hundred rows?
Hi Dipo, thanks a lot for reaching out and your interest in Docparser! Unfortunately we don’t offer handwriting detection.
Hi! Great info on the blogpost.
Is DocParser a good fit for scanning pdf catalogs? E.g. this: https://www.konzum.hr/Katalozi/Konzum-katalog-24.5.-29.5.-cetvrtak-24.svibanj-2018.-12-00-00-utorak-29.svibanj-2018.-12-00-00
Essentially, it’s just a bunch of items in a store with a name, sales price, and a regular price. I’d rather not work on my own OCR engine if one already exists, so I’m curious about your product.
Hi Dino! Thanks for the kind words and your question. I’m afraid Docparser is not a good fit for your use-case. Docparser is all about getting data from recurring documents with fixed layouts (e.g. Purchase Orders, Invoices, …). I’m sorry for the bad news and hope you’ll find a solution to your problem soon.
Hello, I have a scenario with 20 document types, all in a standard format, all going into one email box. Can Docparser identify which format (of the 20) a specific document without any QR/barcode, based solely on the document format? Moreover, can the parsed document then be sent to a different mailbox based on the format type? Finally, can any non-identified documents be sent to an alternate (21st) mailbox? Thank you.
Hi Shawn, thanks a lot for reaching out and the great question!
At this point, Docparser does not have a built-in “send email” function. While Docparser can receive your documents by email, we can only send out parsed data (incl. a link to the document) with HTTP request (webhook integrations). You can however use one of our integration partners (Zapier, MS Flow, Workato, Stamplay, …) to send out emails including the document as an attachment.
Whether or not you can identify a document layout with Docparser highly depends on the content and layout of the documents. In most cases, it’s possible to find simple rules like “if … is present in the top left corner, classify as type A”. I would suggest to create a free trial account and just give it a spin. Our support staff will be more than happy to help you with the setup once you give us access to your documents.
I’m trying to identify empty fields in OCR forms where the fields are to be filled out by hand. I don’t care what the data is in the field only that it is not empty as a quality check for completeness. Can this tool do that?
Hi Amy! I’m afraid Docparser is not a good fit for your use-case. Right now, Docparser is not capable of detecting hand-written text accurately. This means that form fields would be classified as empty quite often, even though they are filled with hand-written text. I’m sorry for the bad news!
I have an old ephemeris that I want to preserve. The info is mostly table data with many glyphs and symbols. Can docuparser be trained to read this info and save it in PDF and other editable files?
Thanks for the question! It’s difficult to answer this question without having a look at some sample documents. I would recommend to create a free account and give Docparser a spin. Our support team will be happy to assess your sample documents once they are uploaded to your free account.
Firstly, Can docparser convert semi structured documents into structured spreadsheets in batches (nightly) from a specified folder? I would like to automate the zonal OCR process for a set of documents.
Secondly, Is docparser purely Cloud based or deployed On premise too?
Hi Mukund! Yes, Docparser can handle semi structured documents to a certain degree. You can read more about extracting data from semi structured documents in our knowledge base. And yes, Docparser can import documents from a watched folder (Dropbox, Google Drive, …) and process documents in batches. To answer your last question, Docparser is a cloud only solution. I would recommend to create a free account and give it a try! 🙂
I am looking for a solution to use a camera phone to scan a prescription to get name, address, medication etc. Can the software work on mobile? Scanned by hand, by a customer not a flatbed scanner?
Hi Ian, thanks for the question! Unfortunately Docparser is not a good fit for your use-case. We specialize in getting data from scanned and PDF documents.
I have several hundred pdf files I need to read through and pick one field off the save that pdf file with that field I grabbed. Can this program do that?
Thanks for reaching out and for your interest in Docparser!
Our app was built to ingest documents you send us, extract specific data points from them, and then make that data available as a file download or a webhook.
If your documents have consistent structure we can build filters to extract that one field you’re looking for, or if it’s unstructured and you’re looking for a specific keyword, you could use a tag document rule to look for a keyword and return a value if it’s found:
I would recommend creating a free account (no credit card required) and letting us know if you run into trouble getting set up!
Is doc parser capable of pulling information off of multiple pages in 1 file? In our case we batch scan tickets and want to pull the unique information off of each ticket without having to scan each ticket individually.
Thanks for reaching out! Our app supports extraction of data from up to 30 pages in a single document by default. I would recommend creating an account (no credit card required) and let us know if you have any questions or run into trouble getting set up at [email protected]