Optical Character Recognition - Digitizing Records (OCR)

About Authors:

Dr. Charles Florin, Data Sciences Leader. Charles has worked extensively on image processing and dynamic stochastic processes, with deep expertise in image-based analytics and pattern recognition. He holds a PhD in Applied Mathematics.

Gaurav Johari, Business Development Leader. Gaurav has worked in management consulting, business transformation and analytics-driven initiatives across industry verticals and functions. He holds a Master of Business Administration.

1 Introduction to Optical Character Recognition for Medical Records

Organizations across all industries are in the process of digitizing records such as patient records, client forms, billing records, maintenance records and electronic health records (EHR) to reduce storage costs, increase information security and enable downstream analytics. Nevertheless, when organizations transition from paper to electronic copies, their legacy data in paper format needs to be scanned and made intelligible to analytics algorithms.

Optical Character Recognition (OCR) is the process by which scanned copies of records are analyzed, characters recognized, and made available for editing, search and analytics. OCR involves a series of advanced Computer Vision and Pattern Recognition techniques.

Karvy Analytics combines the technical expertise needed to achieve the desired results from state-of-the-art OCR with the Big Data expertise needed to do so efficiently and securely on a large volume of documents.

2 Challenges in Digitizing Records

Organizations across all industries are in the process of digitizing records such as patient records for healthcare, billing records, maintenance logs or client forms. After digitization, the cost of storage and communication is greatly reduced, and information security is greatly improved, especially in the context of confidential information. In a digital format, a document may be shared instantaneously with all relevant parties, without the need to manage paper file rooms. Furthermore, when digital documents are available in a text-searchable format, their content can be searched for specific information without spending time combing through a large volume of paper documents. In addition, recent years have seen a dramatic drop in the cost of storage, giving the upfront costs of digitization (scanning, OCR and filing) a positive return on investment.

Furthermore, digital records are the foundation of downstream analytics. These analytics programs are especially important for large organizations such as healthcare Providers and Payers [1] responding to industry trends such as data integration, consumerization, patient risk stratification, value-based care and cost reduction. Analytics services for healthcare providers address challenges such as reducing re-admission rates to maximize revenues and patient satisfaction, moving to value-based reimbursements through the monitored implementation of specific quality measures, and deriving actionable insights from data integrated across multiple sources. Analytics services for healthcare payers address challenges such as stratifying members’ risk, identifying opportunities for gap closure through preventive measures, fraud prevention, and cost optimization.

When organizations transition from paper to digital record systems, one of the first steps is to scan the available legacy paper data. However, scanning alone does not render the text intelligible for algorithms to edit, search or analyze: the scans are photographic images of the text, not the text itself. Making the text intelligible to algorithms is a prerequisite for any further analytics.

Modern forms are designed to make it easy for OCR systems to analyze scanned images and turn them into machine-encoded documents. These forms typically include “comb fields” to constrain the position and spacing of characters, and check boxes when a limited number of choices are available. Other techniques such as 1D or 2D barcodes are becoming increasingly popular, along with Quick Response codes (QR code, ISO/IEC 18004) [Figure 1]; a short decoding sketch follows the figure.

Figure 1 Examples of tools used to optimize the digitization of forms: (a) 1D barcode, (b) QR code, (c) comb text field, (d) check boxes
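
As an illustration, decoding a barcode or QR code embedded in a scanned form takes only a few lines with off-the-shelf libraries. The sketch below uses Pillow and pyzbar; the file name is an assumption for illustration.

```python
# A minimal sketch, assuming a scanned form saved as "scanned_form.png"
# that contains one or more barcodes or QR codes.
from PIL import Image
from pyzbar.pyzbar import decode

for symbol in decode(Image.open("scanned_form.png")):
    # symbol.type is e.g. "QRCODE" or "CODE128"; symbol.data is raw bytes.
    print(symbol.type, symbol.data.decode("utf-8"))
```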

OCR consists of analyzing the scanned image, detecting the characters it contains, and transforming the text image into machine-encoded characters, words and sentences. The OCR process is presented below in section 3.

3 Structure of OCR systems

OCR systems have been investigated since the 1920s and draw on research fields such as computer vision, pattern recognition and artificial intelligence. The typical OCR process can be split into three fundamental steps (image pre-processing, character recognition, and post-processing correction), which are detailed below:

3.1 Pre-processing a scanned image for OCR

Although typical commercial OCR systems report a character recognition success rate of 90-99%, much of the performance depends on the scanning process and the preparation of the image before the actual character recognition is performed. It is therefore imperative to optimize the scanning process itself in order to maximize OCR accuracy. The scanning device should be cleaned and maintained regularly, and all scanned documents should have the same orientation.

After a paper document is digitally scanned, the resulting image is processed through a series of algorithms based on Computer Vision techniques. These pre-processing steps aim to correct irregularities from the scanning process as much as possible; a minimal sketch of such a pipeline follows the list below.

  • De-Skewing: a process by which the bounding box of the scanned document is detected and the image is rotated so that the document sits in the typical upright position, as most forms follow a standard A4 format [Figure 2].
  • De-noising: a process by which ‘noise’ introduced by the scanning device is reduced or eliminated. Noise may take a variety of forms, such as salt-and-pepper noise, dark lines and other image artifacts.
  • Character enhancing: de-noising algorithms may cause some characters to lose their edges. It may therefore be necessary to enhance the characters’ edges by applying a sharpening mask such as a gradient edge filter.
  • Histogram Equalization: depending on the scanning device, some parts of the image may be more exposed than others, which is why some parts of the resulting document appear brighter than others. Histogram equalization is a technique that balances out the brightest and darkest regions of an image [Figure 3].
  • Page Segmentation: if the scanned image contains a region outside the document, that region should be detected and segmented away, as it could otherwise lead to additional errors during character recognition.
  • Page Layout Analysis: with the generalization of Big Data infrastructures, record digitization projects now often involve millions of documents formatted in hundreds, sometimes thousands, of different layouts. These layouts may include not only text but also pictures, schemas and charts. To further improve the accuracy of the line-word-character segmentation, the documents are automatically classified based on their layouts [4].
  • Line-Word-Character segmentation: OCR systems work best when individual characters are identified and isolated. To do so, pattern recognition is applied to detect text blocks, then text lines, and finally individual characters [Figure 4].
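
The sketch below strings several of these steps together with OpenCV: de-skewing from the minimum-area rectangle around the ink pixels, median-filter de-noising, histogram equalization, and binarization ahead of segmentation. The parameter values and the angle convention are illustrative assumptions (OpenCV versions differ in how minAreaRect reports angles), not a production pipeline.

```python
# A minimal pre-processing sketch using OpenCV and NumPy.
import cv2
import numpy as np

def preprocess(path):
    # Load the scan as a grey-scale image.
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # De-skewing: estimate the page angle from the minimum-area rectangle
    # around the dark (ink) pixels, then rotate the page upright.
    coords = np.column_stack(np.where(img < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:            # classic [-90, 0) angle convention;
        angle = -(90 + angle)  # newer OpenCV versions may differ
    else:
        angle = -angle
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    img = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                         borderMode=cv2.BORDER_REPLICATE)

    # De-noising: a median filter removes salt-and-pepper noise.
    img = cv2.medianBlur(img, 3)

    # Histogram equalization: balance bright and dark regions.
    img = cv2.equalizeHist(img)

    # Binarization prepares the page for line-word-character segmentation.
    return cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, 15)
```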

3.2 Character Recognition

Once the position of an individual character in the scanned document is known, pattern recognition techniques are applied at the character’s position to identify the character within a known alphabet. These pattern recognition techniques may be divided into feature-based and feature-less techniques.

The feature-based techniques rely on explicit characteristics of the character such as vertical and horizontal lines, loops, and line intersections. The features of a specific character are compared with the features of a known set of characters to identify the most similar one. Feature-based techniques have been used for handwritten documents, where a character’s appearance may vary greatly depending on the document’s author. Typical algorithms include k-nearest neighbors, Support Vector Machines and Neural Networks [2].
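
As a minimal illustration of the feature-based approach, the sketch below computes crude hand-crafted features (row and column ink histograms, a rough proxy for strokes such as vertical and horizontal lines) and classifies them with k-nearest neighbors. The scikit-learn digits dataset stands in for segmented character images; a real system would use richer features.

```python
# A minimal feature-based recognition sketch with scikit-learn.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def features(img8x8):
    # Explicit features: total ink per row and per column.
    return np.concatenate([img8x8.sum(axis=0), img8x8.sum(axis=1)])

digits = load_digits()
X = np.array([features(im) for im in digits.images])
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.25, random_state=0)

# Classify each character by its most similar known neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("k-NN accuracy on held-out digits: %.3f" % knn.score(X_test, y_test))
```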

The feature-less techniques have gained popularity for documents that use a consistent set of character fonts, or when the scanning process and pre-processing steps successfully eliminate inconsistencies in the image. These techniques identify the most likely character based solely on the grey-scale values of the character block. At the time of writing, the best results on standard datasets are achieved with techniques relying on Convolutional Neural Networks [3], a type of feed-forward Neural Network with multiple hidden layers. Convolutional Neural Networks tend to minimize the effect of errors made while pre-processing the image [3]. On the MNIST database of hand-written digits, derived from data of the US National Institute of Standards and Technology, a technique based on Convolutional Neural Networks achieves an error rate of 0.21% [5].
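
A minimal Convolutional Neural Network for 28x28 character blocks might look like the PyTorch sketch below. The layer sizes are illustrative assumptions; the 0.21% MNIST result [5] relies on a much more heavily engineered and regularized network.

```python
# An illustrative CNN for grey-scale character blocks, sketched in PyTorch.
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                    # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                    # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, n_classes),          # one score per character class
        )

    def forward(self, x):
        # x: a batch of grey-scale character blocks, shape (N, 1, 28, 28).
        return self.net(x)

# Forward pass on a dummy batch of eight character blocks.
logits = CharCNN()(torch.randn(8, 1, 28, 28))
print(logits.shape)  # torch.Size([8, 10])
```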

3.3 Post-Processing Corrections

Even the best OCR algorithms still produce errors. For instance, 99% accuracy, or 1 error in 100 characters, may lead to unnecessary costs or information loss. Typical errors include substitution errors (e.g. reading an ‘m’ as an ‘n’), deletion errors (e.g. skipping a character within a word), and insertion errors (e.g. adding a character that was not part of the original text). For that reason, post-processing corrections are often performed after character recognition. Short of hiring a team of transcriptionists to manually check and correct the documents, the post-processing corrections have to be automated. The corrections typically consist of a spell check based on a specialized dictionary, such as a language-specific medical dictionary, or another context-based lexicon adapted to the specific text being digitized.

A dictionary-based approach replaces each word with the most likely word in a dictionary. The likelihood is defined using a distance metric between words, such as the Damerau-Levenshtein distance [6]. However, purely dictionary-based corrections do not account for the contextual information contained in the sentence. For instance, “Medical Mÿstory” may be corrected to “Medical Mystery”, as “Mystery” is among the most likely candidates for “Mÿstory” when analyzed out of context. A context-based correction, however, should point to “Medical History” as the correct answer.
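
The sketch below implements the Damerau-Levenshtein distance [6] (in its common optimal-string-alignment form) and compares two dictionary candidates; the tiny dictionary is an assumption for illustration. Note that “mÿstory” lands at the same distance from both “history” and “mystery”, so a purely dictionary-based corrector cannot disambiguate.

```python
# A minimal dictionary-based correction sketch using the
# Damerau-Levenshtein distance [6].
def damerau_levenshtein(a, b):
    # d[i][j]: edit distance between a[:i] and b[:j], allowing insertions,
    # deletions, substitutions and adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

for candidate in ("history", "mystery"):
    print(candidate, damerau_levenshtein("mÿstory", candidate))
# Both print distance 2: out of context, the corrector cannot decide.
```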

Context-based approaches typically rely on Statistical Language Modeling (SLM), a Natural Language Processing (NLP) technique that models common expressions and word co-occurrences. In the previous example, a Provider- or Payer-specific lexicon should clearly point to “Medical History” as the correct answer. A co-occurrence matrix captures the affinity between a word and an expression; in our example, “History” would be more strongly associated with “Medical” than “Mystery” would be. A co-occurrence matrix thus models a statistical distribution of word associations in common expressions. That model is then used by techniques such as Pointwise Mutual Information [7] to assign the most probable word given the context of the sentence.
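
A toy version of this idea is sketched below: pointwise mutual information (PMI) [7], computed from a hand-built co-occurrence table, decides between two candidates at equal edit distance. The counts are invented for illustration, not drawn from a real corpus.

```python
# A toy context-based corrector using pointwise mutual information (PMI).
import math

# counts[(w1, w2)]: how often w2 follows w1 in the (assumed) training corpus.
counts = {("medical", "history"): 900, ("medical", "mystery"): 30,
          ("murder", "mystery"): 500, ("murder", "history"): 10}
unigrams = {"medical": 2000, "history": 1500, "mystery": 800, "murder": 600}
total = sum(unigrams.values())

def pmi(w1, w2):
    # PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) )
    p_joint = counts.get((w1, w2), 0.5) / total   # 0.5: crude smoothing
    p1, p2 = unigrams[w1] / total, unigrams[w2] / total
    return math.log(p_joint / (p1 * p2))

# Given OCR output "Medical Mÿstory" and two candidates at the same edit
# distance, pick the one with the stronger association with "medical".
best = max(["history", "mystery"], key=lambda w: pmi("medical", w))
print(best)  # "history"
```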

3.4 Big Data testing and validation

Even the most accurate OCR systems do not perform at 100% accuracy, and a precise estimate of the accuracy level is necessary before committing further resources to downstream analytics. However, such an estimate may be costly to obtain because of the sheer size of the datasets, the variety of the digitized documents and the fact that the output data may be unstructured.

For that reason, Karvy Analytics has developed an automated testing and validation framework that performs a statistical evaluation of the OCR accuracy level, stratified by document layout class. This ensures that the documents with the least accurate OCR output may be further processed, avoiding unnecessary costs and errors in downstream analytics.
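
For illustration only, a stratified estimate of this kind could look like the sketch below, where a small transcribed sample per layout class yields a per-layout accuracy and confidence margin. All names and data structures are assumptions, not Karvy’s actual framework.

```python
# An illustrative sketch of stratified accuracy estimation per layout class.
import random
import statistics

def estimate_accuracy(docs_by_layout, sample_size=30):
    """docs_by_layout maps a layout class to a list of (ocr_text, truth) pairs."""
    report = {}
    for layout, docs in docs_by_layout.items():
        sample = random.sample(docs, min(sample_size, len(docs)))
        # Crude per-document character accuracy: fraction of aligned positions
        # that match. A real evaluation would use an edit-distance alignment
        # to account for insertions and deletions.
        accs = [sum(a == b for a, b in zip(ocr, truth)) / max(len(truth), 1)
                for ocr, truth in sample]
        mean = statistics.mean(accs)
        sem = (statistics.stdev(accs) / len(accs) ** 0.5) if len(accs) > 1 else 0.0
        report[layout] = (mean, 1.96 * sem)  # estimate and ~95% margin
    return report

# Layouts whose lower bound (mean - margin) falls below a chosen threshold
# can be routed to additional correction or manual review.
```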

4 Application of OCR across Industry

Optical Character Recognition has a wide range of applications across industry verticals, and these have been gaining traction since OCR’s inception. Lately, with exponential advances in imaging technology, computational processing and Big Data technologies, the applicability and adoption of OCR have improved significantly. Below are some of the use cases that Karvy Analytics believes will transform the landscape of digitization and analytics in these industries:

1. Insurance

  • a. Fraud Mitigation: Leverage OCR to extract digital information from physical records (legacy and current) and assess the relevant parts to detect anomalies between the contract, the claims, the claim conditions and the underlying scenarios.
  • b. Risk Adjustments: Identify the risks associated with an insurance policy (health or personal) through digital extraction of data from claims documents, and identify the conditions associated with high-, medium- and low-risk profiles to optimally assess the premium.

2. Healthcare

  • a. Patient Record Digitization: Most patient records are in physical format and contain rich information about the patient’s history, medical conditions and inhibitions which, when digitized and recorded as structured data, can be analyzed for downstream real-time medical assessment, re-admission analysis and treatment anomalies.
  • b. Disease Diagnosis and Patients at Risk: OCR combined with Computer-Aided Diagnosis (CAD) can help identify critical medical conditions from digital records and assist in identifying pre-conditions.

3. Banking

  • a. Customer Record Digitization: Offline surveys, contract forms and customer documents in digital format contain rich information which, when codified, enriches the customer 360 profile and helps assess customer information in real time across sources.

4. Manufacturing / Telecommunications

  • a. Vendor Contract Assessment: Offline surveys, contract forms and vendor documents in digital format contain rich information which, when codified, enriches the vendor 360 profile and helps assess vendor information in real time across sources.

5. Public / eGovernance

  • a. Public Record Digitization: Most government records are either on paper or are scanned images of paper records which, when codified using OCR, can transform eGovernance initiatives.

6. Road Safety

  • a. Interstate Vehicle Records: Digitization and codification of vehicle records across states can help reduce thefts, criminal activities and terrorism involving stolen vehicles, which can be identified using the codified records.
  • b. Toll Identification: Helps identify the type and volume of vehicles for differential tolls.

5 Karvy Analytics OCR Solutions to Automatically Digitize Paper Records

Through its partnerships with over 500 companies, the Karvy Group has processed personal information from over 70 million records over the last 30+ years [8], making it one of the most trusted companies in the world for the treatment of personal or confidential information.

Karvy Analytics Limited has a globally distributed team of experts who deliver a range of solutions based on Big Data, advanced analytics (statistical and mathematical modeling techniques), social analytics, and mobile descriptive analytics for new business insights. More specifically, with both industry-specific and OCR expertise, the team has developed an OCR platform tailored to the challenges of enterprise-wide digital conversion. Karvy’s solution leverages Big Data infrastructure to securely support a large amount of data, with time and cost efficiency. With a proprietary testing and validation testbed, our OCR servers convert digitally scanned documents into standard compressed formats based on standard text files, MS Word, XML files, JSON documents, PDF and PDF/A. The OCR platform can be further leveraged for downstream analytics developed either by our clients independently or in partnership with Karvy Analytics.

6 Conclusion

Optical Character Recognition (OCR) is becoming a key enabler as companies invest in electronic records. Acknowledging the value of legacy data, OCR is the stepping stone toward text-based data analytics. Thanks to Karvy Analytics’ experience in Computer Vision, Pattern Recognition and Artificial Intelligence, the team has developed the infrastructure necessary to support OCR at a large scale for its clients. Whether the digital conversion project is task-specific or enterprise-wide, Karvy Analytics’ OCR infrastructure can be tailored to enable client organizations to gain from full-text analytics of their legacy documents.

For further information, contact us on our website at http://www.karvyanalytics.com or by email at contactus@karvyanalytics.com.

7 References

  1. Why Health Care May Finally Be Ready for Big Data, N. D. Shah and J. Pathak, Harvard Business Review, December 3, 2014
  2. Character Recognition in Natural Images, T. E. de Campos et al., in Proceedings of VISAPP 2009
  3. Reading Digits in Natural Images with Unsupervised Feature Learning, Y. Netzer et al., in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011
  4. Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval, A. W. Harley et al., in Proceedings of ICDAR 2015
  5. Regularization of Neural Networks using DropConnect, L. Wan et al., in Proceedings of ICML 2013
  6. A Technique for Computer Detection and Correction of Spelling Errors, F. Damerau, Communications of the ACM 7(3): 171-176, March 1964
  7. Speech and Language Processing, D. Jurafsky and J. H. Martin, Pearson Prentice Hall, 2009
  8. Specific corporate partnerships and services are detailed at http://www.karvy.com