Docparser is a PDF scanner that uses OCR to extract data from PDF documents. It lets you convert PDF to Excel, convert PDF to JSON, and even update cloud platforms through integrations.
What is OCR on a scanner?
Optical Character Recognition (OCR) is a technology that extracts data from scanned documents. The result is text that you can edit, update, or aggregate with other tools for data analysis and a range of other uses.
In essence, OCR converts scanned images of text, whether typed, printed, or handwritten, into … well … text. OCR is typically used to extract text from photos, passports, and scanned documents. It is often used to “digitize” the recognized text so that it can later be edited, searched, or aggregated for analysis.
OCR often involves several steps
Pre-processing improves the chances that the text is recognized correctly. De-skewing (straightening a tilted scan) is one of the most widely used techniques, and layout analysis, which targets specific zones of the PDF, also matters when you need a high degree of OCR accuracy. In addition, converting grayscale and color images to black and white (binarization) leaves only two possible values per pixel, which increases the chance of successfully extracting the text from the source.
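To make the binarization step concrete, here is a minimal sketch in plain Python. It assumes the image is a list of rows of grayscale values from 0 to 255 and uses Otsu's method to pick the threshold. Both the function names and the choice of Otsu's method are illustrative assumptions; the article does not say which binarization technique Docparser uses.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance.

    `pixels` is a list of rows, each a list of integer grayscale values.
    Illustrative helper only; not part of any Docparser API.
    """
    # Build a histogram of grayscale intensities.
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg = 0  # pixels at or below the candidate threshold
    sum_bg = 0     # sum of their intensities
    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue  # no dark pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break     # every pixel is already in the dark class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map every pixel to 0 (black) or 255 (white) using the threshold."""
    return [[0 if value <= threshold else 255 for value in row]
            for row in pixels]


# Tiny made-up "scan": dark text pixels (30-40) on a light background (200-220).
image = [[30, 40, 200],
         [35, 210, 220]]
threshold = otsu_threshold(image)
black_and_white = binarize(image, threshold)
```

A real pipeline would normally apply this kind of thresholding with an image library rather than nested lists, but the logic is the same: pick a cut-off, then reduce every pixel to one of two values.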
A Complete Cloud-Based OCR PDF Scanning Solution
If you have PDFs that contain text and you need OCR data extraction from them, a subscription with Docparser puts you in the driver's seat.
Whether you want to extract information from scanned PDF invoices or purchase orders, or automate the receipt of payroll PDFs for your bookkeeper, we’ve got you covered. We use the best OCR software available, which currently supports 46 languages. An example of a scanned Japanese and English PDF, before and after parsing, is shown below:
Current languages supported with our PDF OCR:
Languages Supported | Languages Supported |
English | Indonesian |
Afrikaans | Italian |
Albanian | Japanese |
Basque | Korean |
Brazilian (Portuguese) | Latin |
Bulgarian | Latvian |
Byelorussian | Lithuanian |
Catalan | Macedonian |
Chinese Simplified | Malay |
Chinese Traditional | Moldavian |
Croatian | Norwegian |
Czech | Polish |
Danish | Portuguese |
Dutch | Romanian |
Esperanto | Russian |
Estonian | Serbian |
Finnish | Slovak |
French | Slovenian |
Galician | Spanish |
German | Swedish |
Greek | Tagalog |
Hungarian | Turkish |
Icelandic | Ukrainian |
16 Responses
Hello!
I have ~36,000 scanned PDFs that I want to parse. I just need to check whether a short sentence exists in each of them. As a result, I want to see true/false.
Do you have a solution for me?
Hi Andrey, this definitely sounds like something we can do. You can use our “Tag Document” parsing rule to search for a specific phrase and then output a custom value (e.g. true) when the phrase is present. I would suggest creating a free trial and contacting our support staff once you have uploaded a couple of sample documents.
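For readers who want to see the idea behind such a true/false check, here is a minimal sketch in Python. It is not Docparser's “Tag Document” rule or any Docparser API. The helper `contains_phrase` is a hypothetical name, and the sketch assumes the OCR text of each document is already available as a string.

```python
import re


def contains_phrase(ocr_text, phrase):
    """Return True if `phrase` occurs in `ocr_text`.

    The match ignores case and treats any run of whitespace (spaces,
    line breaks) as equivalent, because OCR output often wraps lines
    in the middle of a sentence. Hypothetical helper for illustration.
    """
    words = phrase.split()
    if not words:
        return False  # an empty phrase never counts as a match
    # Escape each word and allow any whitespace between consecutive words.
    pattern = r"\s+".join(re.escape(word) for word in words)
    return re.search(pattern, ocr_text, flags=re.IGNORECASE) is not None


# Made-up OCR output with a line break inside the phrase.
sample_text = "Payment  due\nwithin 30 days of receipt."
found = contains_phrase(sample_text, "due within 30 DAYS")
```

Tolerating whitespace and case differences is a deliberate choice here, since an exact substring match would miss phrases that the scan splits across lines.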
ABBYY FineReader can automatically process new image files placed in a so-called ‘hot folder’. Do you have similar functionality? I would be receiving multiple image files during the day and would like to have them converted automatically upon receipt.
Thanks
Hi Benjamin! Thanks for the question! Docparser is all about automating workflows, and there are several ways to import your documents. As Docparser is a cloud solution, you do need to make sure that the documents get uploaded to us. For example, you can use our integration partners to import documents from your cloud storage provider (Dropbox, Google Drive, Box, …) or automatically forward incoming emails to your parser.
It is not clear to me whether Docparser can read handwritten text (say, within a pre-printed form).
How do you handle error reporting, for example when the OCR is unsure of a letter or digit in a word and a human needs to review it and ‘teach’ the system the correct character?
Thanks
Hi James! Thanks for the great question. Docparser does not recognize handwritten text at this point in time. As you already pointed out, OCR for handwritten text comes with a high error rate, so human validation is mandatory. This is not something we have built into Docparser yet, but adding a validation interface is definitely something we would like to do in the future. If you sign up for a free account, you’ll be informed about product updates.
Hi,
Just wondering if Docparser can parse multiple scanned receipts and organise them in an Excel worksheet as fields with their corresponding data. The receipts are from ATMs, issued when withdrawing money. I need to keep track of when and how much was withdrawn from the account.
Thank you.
Al.
Hi Ali, thanks a lot for reaching out and for your interest in Docparser! If your receipts are scanned properly (well aligned, with an office scanner), Docparser should be able to get the data you need. However, Docparser does not do a great job with documents that were “scanned” with a photo camera. I would suggest you create a free account and give it a try. If you experience any issues during the setup, please don’t hesitate to contact our support staff.
Hello,
What automatic feeder scanner do you recommend?
Hi Papila, thanks for reaching out! Docparser does not have specific guidelines on automatic feeder scanners. As long as the output is at least 300 DPI and follows our scan requirements, it should work.
We have 5,000 analyst reports in PDF and want to extract all the content (text, tables and images) into JSON format. Is this possible with Docparser?
Hi Audrey! Thanks a lot for reaching out and for your interest in Docparser! Whether or not Docparser is a good fit depends on how your documents are structured and what data you want to extract. I would suggest creating a free account and uploading a couple of sample files. You can then reach out to our customer support and they’ll be happy to check whether Docparser is a good fit.
Brazilian is not a language…
You are totally right! We added “Portuguese” to the entry to make it clearer.
Do you support converting a PDF without a text layer, via OCR, to a PDF with a text layer (e.g., PDF/A)?
Hi Ryan, yes, Docparser produces “Sandwich PDFs” as a by-product. However, Docparser was primarily designed to pull specific data points from your documents. If you are only looking for a PDF/A generation tool, you will probably be better off with OCRmyPDF, as it’s free and designed for this purpose.