Introduction to High End Forms Processing
by Dan Elam, eVisory
Introduction
eVisory provides unbiased analysis and implementation support services for imaging and forms processing systems to clients including American Express. This short white paper is intended as a generic discussion of forms processing technology.
Dan Elam is the author of this white paper. He is chairman of the AIIM/ISO C15.9 Standards Committee for OCR technology and an internationally recognized expert on forms processing technology. He is also the founder of Contemplor, the industry newsletter for forms processing technology, and the editor of the industry standard for forms encoding. He has participated in complex OCR design efforts for systems for the IRS, US Post Office, CIGNA, Delta Dental of California, and others.
Forms Processing Technology Discussion
Ever since typewriters and computers became a staple of offices around the world, many have searched for ways to reduce or eliminate the vast amounts of information that must be keyed by hand. Character recognition technology was first pioneered in the 1960s. By the late 1970s a few systems were being sold commercially.
Unfortunately, many of those early systems failed to deliver on their promise and today many believe that the technology still has significant problems. While these systems are complex, the core technology has advanced considerably and significant savings can be achieved.
In considering the applications for OCR and forms processing technology, it is helpful to review the key issues and challenges for the technology. There are four steps to successfully capturing the data for most forms: Forms Recognition, Form Removal, Name & Address Block Recognition (for free-form forms), and Character Recognition.
Forms recognition relies on specific locations on the form or line geometry information to determine the form type. Without the form type, the system does not know where on the form to look for the data.
Form removal is extremely important for most forms processing systems. High volume systems rely on colored drop out boxes to remove the form. Nearly all of the early systems required drop out inks, but today’s systems have sophisticated algorithms to remove black lines and do not require drop out inks.
To achieve the best possible performance for a forms processing system the forms usually must be redesigned. This is especially true for handprinted forms where touching characters are more likely.
In some cases the redesign of a form can have a profound impact on the ability to collect information. Careful attention must be provided to ensure that the users accept the new forms and use them properly. In some cases, forms may be photocopied by users. These situations must be anticipated prior to the redesign of the forms.
Forms redesign is extremely important from an accuracy standpoint. A properly redesigned form can increase accuracy by 50% while reducing the number of characters that must be manually reviewed. In some cases, careful redesign can eliminate virtually all OCR errors while also reducing the number of characters that must be reviewed.
A properly designed form should accomplish several objectives. Most importantly, it should encourage the user to fill out the form completely and accurately. To do this, forms should be visually appealing with a minimum of fields and instructions. The form must be easy to fill out and the form must not require the user to write differently from normal.
Secondly, the form itself can improve OCR accuracy by supporting validation routines and by keeping users from writing characters that touch.
In some situations the form may be processed with a number of other forms intermixed into the scan batch. Forms used in this environment should include barcodes or special marks that help the system identify the form correctly.
Finally, a properly designed form should be suited for the application it is designed to serve. For example, key fields should not be placed on the crease of the piece of paper. Forms which are often photocopied should not rely on colored ink. Many of these design issues can only be handled by experienced forms designers.
The first step to redesigning a form is to determine what information must be collected on the form. Short forms such as deposit slips and credit card transactions usually only capture a minimum of information, but larger forms such as HCFA-1500s, tax returns, and others frequently ask for more information than is input to a system.
Some information is critical to the form. This information is frequently used to index a record. If the field is incorrect, the transaction may be lost or may be impossible to process. Social security numbers are probably the most common example. The key is to determine the exact accuracy that is needed. A seemingly good 99.5% character accuracy rate means that nearly 13% of all fields will contain an error.
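The arithmetic behind that last figure can be sketched as follows. This is a minimal illustration that assumes every character is recognized independently, which real engines only approximate. The field lengths are assumptions for the example: the 13% figure is reproduced by a field of about 28 characters, while a 9-digit field such as a social security number would fail less often.

```python
def field_error_rate(char_accuracy: float, field_length: int) -> float:
    """Probability that a field contains at least one misread character.

    Assumes each character is recognized independently with the same accuracy.
    """
    return 1.0 - char_accuracy ** field_length


# Field lengths below are illustrative assumptions, not values from the paper.
for length in (9, 28):
    rate = field_error_rate(0.995, length)
    print(f"{length:>2} characters: {rate:.1%} of fields contain an error")
```

The same function can be run in reverse during form design: given the field error rate a process can tolerate, it shows how much character accuracy (or how much manual review) a long field needs.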
Validation routines can substantially improve OCR accuracy. Validation routines check the OCR results against a database or perform a mathematical check of the data. For example, in a field for age the system could automatically reject anyone less than 15 years old since they might not be allowed to fill out the forms. Some systems include advanced routines to test the OCR engine’s second-best guess. For processing retirement papers, for example, the best guess “15” might fail, but the second-best guess of “75” might pass the validation routines and be used to avoid an error or manual review.
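The second-best-guess idea can be sketched in a few lines. This is an illustration only: the function names, the confidence values, and the 50-to-110 age rule are all hypothetical, and a real engine would expose its alternative guesses through its own interface.

```python
from typing import Callable, Optional, Sequence, Tuple

Guess = Tuple[str, float]  # (candidate text, confidence between 0.0 and 1.0)


def first_valid_guess(guesses: Sequence[Guess],
                      is_valid: Callable[[str], bool]) -> Optional[str]:
    """Return the highest-confidence candidate that passes validation.

    Returns None when no candidate passes, meaning the field needs manual review.
    """
    for text, _confidence in sorted(guesses, key=lambda g: g[1], reverse=True):
        if is_valid(text):
            return text
    return None


def plausible_retirement_age(text: str) -> bool:
    # Hypothetical business rule: applicants are between 50 and 110 years old.
    return text.isdecimal() and 50 <= int(text) <= 110


# Hypothetical engine output for the age field on a retirement form.
guesses = [("15", 0.62), ("75", 0.58)]
print(first_valid_guess(guesses, plausible_retirement_age))
```

Here the best guess "15" fails the rule, so the routine falls through to "75", which passes. If neither guess passed, the field would be routed to an operator.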
Some validation routines can be improved by working with the form itself. In the age example, the user can be asked for their age and asked to enter their birthdate. The system could then calculate their age based on their birthday and use it to compare the two fields. This sort of interfield validation requires more input by the form users, so it should only be used for important fields.
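A minimal sketch of this interfield check is shown below. The function names and the optional tolerance are assumptions made for the example. A tolerance of one year might be reasonable when forms are processed some time after they are filled out.

```python
from datetime import date


def age_on(birthdate: date, as_of: date) -> int:
    """Age in whole years on the given date."""
    years = as_of.year - birthdate.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years


def ages_agree(stated_age: int, birthdate: date, as_of: date,
               tolerance: int = 0) -> bool:
    """True when the recognized age matches the age implied by the birthdate."""
    return abs(age_on(birthdate, as_of) - stated_age) <= tolerance


# Hypothetical recognized values: birthdate 15 June 1950, form dated 1 March 2000.
birthdate = date(1950, 6, 15)
form_date = date(2000, 3, 1)
print(ages_agree(49, birthdate, form_date))  # the two fields are consistent
print(ages_agree(19, birthdate, form_date))  # a "4" misread as "1" is caught
```

When the two fields disagree, the system cannot tell which one is wrong, so both would normally be sent for review.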
In certain situations, validation routines are the only way to resolve certain characters. Consider the following examples:
1 l I 7
Figure 1 Similar Character Examples
The “1” and the “7” can be identified as long as the image quality is good and the characters are not skewed, but resolving the capital “i” and the lowercase “L” can be extremely difficult. Validation routines can use the context of the field to identify these characters properly.
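One simple form of such a routine uses the field type. The sketch below assumes a field that is known to be numeric, so any look-alike letter can be read as the digit it resembles. The mapping table and function name are hypothetical, and commercial systems use far richer rules.

```python
from typing import Optional

# Hypothetical table of letters that resemble the digit "1".
DIGIT_LOOKALIKES = {"l": "1", "I": "1"}


def resolve_numeric_field(raw: str) -> Optional[str]:
    """Resolve look-alike characters in a field known to contain only digits.

    Returns the corrected text, or None if the field still fails the numeric check
    and should be rejected for manual review.
    """
    fixed = "".join(DIGIT_LOOKALIKES.get(ch, ch) for ch in raw)
    return fixed if fixed.isdecimal() else None


print(resolve_numeric_field("4l7I2"))  # look-alikes replaced by "1"
print(resolve_numeric_field("4x7"))    # "x" has no numeric reading: reject
```

The same approach does not work for a free-text field, where both the capital "i" and the lowercase "L" are legitimate, which is why such fields often need dictionary or database lookups instead.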
The single biggest challenge for forms processing systems is to process touching characters. For example, a touching letter “r” and “n” can easily be mistaken for a letter “m”.
rn
Figure 2 Touching Letter Confusion
This is true even for machine printed characters that are nearly touching.
rn
Figure 3 Machine Print Touching Characters
Ideally, users should write characters separately in order to avoid these problems. Systems are available to process unconstrained characters, but these can be very expensive and always have lower accuracy rates.
First Name | Last Name
Hair Color | Last Book Read
Dog’s Name | Bird’s Name
Figure 4 Unconstrained Fields
Forms which use unconstrained fields are the easiest for the users to fill out. However the results can often be difficult for a human data entry operator and virtually impossible for a forms processing system.
The best approach is to separate the characters by one of several different methods. The optimum method involves colored ink on the form that drops out when it is scanned. Most scanners and copiers use a colored light that reflects from the paper to produce an image. If light from a green bulb, for example, hits green ink, the ink does not show up on the image. If the user has written in the exact same shade of green ink, the user’s information would also not show up. This is the easiest and most widely used method for forms drop out and requires the least computing power. The drawback is that the forms must be pre-printed using the specific ink colors. Printers may not substitute inks and the forms may be difficult (or impossible) to photocopy. PC-based programs which print forms cannot print in color and therefore cannot use this method. These forms must be handled through electronic forms removal, which is discussed later.
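The drop-out effect can be illustrated with a deliberately simplified model. The sketch assumes the scanner sees only the reflectance in the color of its lamp, represented here by one channel of an RGB pixel, and marks a pixel as dark when that reflectance falls below a threshold. The pixel values, the threshold, and the function name are all assumptions. Real scanners and inks have spectral behavior this model ignores.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
CHANNEL = {"red": 0, "green": 1, "blue": 2}


def scan_with_colored_light(image: List[List[RGB]], light: str,
                            threshold: int = 128) -> List[List[int]]:
    """Return a bitonal image: 1 = dark (kept), 0 = white (dropped out)."""
    idx = CHANNEL[light]
    return [[1 if pixel[idx] < threshold else 0 for pixel in row]
            for row in image]


WHITE = (255, 255, 255)  # blank paper
GREEN = (0, 200, 0)      # hypothetical drop-out ink used for the boxes
BLACK = (0, 0, 0)        # the user's pen

# One row of a field: box line, paper, pen stroke, paper, box line.
row = [GREEN, WHITE, BLACK, WHITE, GREEN]
print(scan_with_colored_light([row], "green"))  # [[0, 0, 1, 0, 0]]
print(scan_with_colored_light([row], "red"))    # [[1, 0, 1, 0, 1]]
```

Under the green lamp only the pen stroke survives. Under a red lamp the green box lines appear as dark pixels, which is why the ink color must be matched to the scanner.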
Color choice is often determined by the scanner. Particular green inks work with most Fujitsu and TDC scanners. Red is preferred for ScanOptics high volume scanners. Some scanners, such as the Kodak ImageLink 923, are equipped with white bulbs that do not have any drop out features (except white). These scanners can usually be retrofitted with a colored bulb (red is most common) but some may invalidate the warranty. The TDC scanners currently in use for OSOS will drop out certain green inks without any modifications.
There are two methods used to encourage forms users to separate their characters. The first is to draw an entire box around the field in the colored ink.
Figure 5 Blank Social Security Field
The user writes in the box. For virtually all forms processing systems, it doesn’t matter if they actually write outside the box: the box is merely present to encourage them to keep the characters from touching.
Figure 6 Filled-in Field
When scanned, the colored boxes disappear, leaving only the image of the user’s data.
Figure 7 Field After Scanning
Some applications use a white form with colored lines around the boxes. The examples above use this method. Other methods paint the entire background in the drop-out color. This makes the user data fields stand out from the background. The method requires slightly more sophisticated printing capabilities and can be difficult for forms which have many instructions or field labels.

Figure 8 Colored Background
Another method is to use “tick marks”. Tick marks are useful because they can be used for both drop out and non-drop out forms. Tick marks are often used for business reply cards since they take up less room than full constrainment boxes. Tick marks are not as accurate as full constrainment boxes, but they provide a good compromise.
Figure 9 Drop Out Tick Marks Field
Users enter the data into the fields and try to separate their characters. Since the tick marks are closer together, touching characters are more common with this approach, but it still assists the engine since the engine knows how many characters can fit within a certain linear distance.
Figure 10 Filled-in Field
The problem with non-drop out forms is that the field lines interfere with the character recognition process. This is true for both constrainment boxes and tick marks that are not printed in drop-out inks. Consider the following example.
Figure 11 Interference Example
Electronic forms removal technology can remove the form from the image, but what happens when a form line intersects a character? Some systems leave a little extra data while others faithfully remove the form and then attempt to rebuild the character. Avoiding this problem is the major reason why colored drop-out forms are preferred.
It should be easy to redesign the corporate filings to operate properly in an OCR environment. Slightly better performance can be achieved using colored drop-out forms, but this is not absolutely necessary.
Some forms with free-form information require special name and address recognition modules in order to process their images. This is very complex technology costing millions of dollars. These routines work by finding a group of pixels which looks similar to a name and address block and then processing it for OCR.

Figure 12 Name and Address Block Recognition
Ironically, while the forms may be difficult to process overall, the actual character recognition is not that difficult even for low end technology.
Most people assume that character recognition accuracy is the single most important aspect to achieving cost savings from forms processing systems. Although important, character recognition accuracy is NOT usually the most significant aspect of cost savings.
The whole idea of accuracy is often distorted and misrepresented, especially by the vendors who sell the technology. A complete discussion of how OCR engines operate is beyond the scope of this white paper, but here are the basics.
When a bitmap character is submitted to an OCR engine, the engine tries to match the character to patterns that it has defined in order to determine the best match.
Scanned character: Y

Candidate | Confidence
Y | 80%
U | 65%
V | 71%
v | 40%
u | 27%
The engine algorithms return a ‘confidence value’ for each character. If the best guess character is above a certain threshold, the character is considered to be correct. In the example above, if the threshold is set to 75%, the OCR engine would accept the capital ‘Y’ as the correct character.
To increase accuracy rates, the thresholds are set higher, resulting in more characters that must be manually reviewed. If the threshold were set to 85%, the character would be considered a ‘reject’ and would be forced into a manual review mode. More rejected characters result in higher accuracy, but also in higher data correction costs. The most cost-efficient systems use different thresholds for different fields in order to spend labor only on those fields which are truly critical.
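The accept-or-reject decision can be sketched as follows, using the confidence values from the first example above. The function name, the field names, and the per-field threshold table are hypothetical. They only illustrate how a critical field can be held to a stricter threshold than a less important one.

```python
from typing import Dict, List, Optional, Tuple

Candidate = Tuple[str, float]  # (character, confidence between 0.0 and 1.0)

# Confidence values from the first example: the scanned character is a capital Y.
CANDIDATES: List[Candidate] = [
    ("Y", 0.80), ("U", 0.65), ("V", 0.71), ("v", 0.40), ("u", 0.27),
]

# Hypothetical per-field thresholds: critical fields are held to a stricter standard.
FIELD_THRESHOLDS: Dict[str, float] = {"ssn": 0.85, "comments": 0.75}


def decide(candidates: List[Candidate], threshold: float) -> Optional[str]:
    """Accept the best guess if it meets the threshold, otherwise reject (None)."""
    best_char, best_conf = max(candidates, key=lambda c: c[1])
    return best_char if best_conf >= threshold else None


for field, threshold in FIELD_THRESHOLDS.items():
    result = decide(CANDIDATES, threshold)
    print(field, "->", result if result is not None else "reject for manual review")
```

With a 75% threshold the engine accepts the "Y". With an 85% threshold the same character is rejected and an operator must key it.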
Sometimes the OCR engines make mistakes, however. Note the confidence values in the example below.
Scanned character: Y

Candidate | Confidence
Y | 70%
U | 65%
V | 71%
v | 80%
u | 27%
In this example, the engine algorithms suggest that the lowercase ‘v’ is the correct choice. Situations where the OCR engine believes that a character is correct when it is not are called ‘substitution errors’. Substitution errors are very difficult to correct since these characters are not submitted to a user for manual review. As a result, these characters only get corrected in a slow downstream process. Substitutions can be reduced by raising the threshold and rejecting more characters. The trade-off between rejects and substitutions can be plotted and is often called a ‘reject-substitution curve’.
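The points on such a curve can be computed from a test deck whose true characters are known. The sketch below is a minimal illustration with invented sample data. A real evaluation would use thousands of characters per field type.

```python
from typing import List, Sequence, Tuple

# (engine's best guess, confidence, true character). Hypothetical test data.
Sample = Tuple[str, float, str]


def reject_substitution_curve(
        samples: Sequence[Sample],
        thresholds: Sequence[float]) -> List[Tuple[float, int, int]]:
    """For each threshold, count rejects and substitutions.

    A reject is any character whose confidence is below the threshold.
    A substitution is an accepted character that is actually wrong.
    """
    curve = []
    for t in thresholds:
        rejects = sum(1 for _, conf, _ in samples if conf < t)
        substitutions = sum(1 for guess, conf, truth in samples
                            if conf >= t and guess != truth)
        curve.append((t, rejects, substitutions))
    return curve


samples: List[Sample] = [
    ("Y", 0.95, "Y"),
    ("v", 0.80, "Y"),  # confident but wrong
    ("1", 0.72, "1"),
    ("I", 0.66, "l"),  # wrong, with middling confidence
    ("7", 0.55, "7"),
]

for threshold, rejects, substitutions in reject_substitution_curve(
        samples, [0.50, 0.70, 0.85]):
    print(f"threshold {threshold:.2f}: {rejects} rejects, {substitutions} substitutions")
```

As the threshold rises from 0.50 to 0.85, the substitutions fall from two to zero while the rejects climb from zero to four, which is the trade-off the curve describes.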

Figure 13 Reject-Substitution Curve (vertical axis: Rejects; horizontal axis: Substitutions)
Each OCR engine has its own characteristic reject-substitution curve. Generally speaking, today’s systems substitute one character for every 3,000 to 5,000 that they pass without rejection (that is, for every 3,000 to 5,000 characters considered ‘correct’). It is important to note that today’s OCR engines are actually more accurate than human beings when it comes to looking at a single character. (Humans still perform better overall since they can look at the characters in context.)
Today’s systems use sophisticated inter-field calculations and validation routines to further reduce substitutions and lower rejects. These validation routines are often vertical market-specific or custom developed for each application. Most systems have special rules for context to help further narrow the gap between machines and humans.
Most turnkey forms processing systems use the same OCR engines. As a result, the ability to correct rejects quickly is often the most important aspect when it comes to the true cost of a system.
As you can see, there is no such single thing as ‘accuracy’. A vendor could claim to have 100% accuracy just by rejecting 100% of the characters. Other vendors commonly claim to have a 98% accuracy rate on hand printed characters. In this case they are likely counting the reject rate and ignoring the substitution errors.
Substitution errors can be reduced by increasing the number of characters rejected for manual review. Unfortunately, every character rejected is one which a human must carefully review.
A typical data entry operator can key about 11,900 characters per hour of alphanumeric text from a piece of paper. Many people assume that it is faster to key from image than from paper, but most key-from-image solutions are slower due to subtle ergonomic issues. Early key-from-image systems for checks (numeric only) had keystroke rates as low as 1,500 characters per hour.
There are basically two common methods for correcting rejected characters today. Individual characters are often corrected using a ‘ribbon editor’ mode. This mode permits key operators to have look-ahead and very fast keystroke rates. Often used for machine printed characters, ribbon editor correction modes can achieve keystroke rates near 10,000 keystrokes per hour.

Figure 14 Ribbon Editor Error Correction
Most vendors now offer these types of ‘ribbon editors’, but there are still minor differences that can change keystroke rates dramatically.
All systems also offer ‘context mode’ error correction. Context mode is useful because it allows users to see the character in context with the surrounding characters. This reduces mistakes, but is significantly slower.

Figure 15 Context Mode Error Correction
Operators key the highlighted character and the next image appears. Because of the ergonomic changes, keystroke rates for this mode are usually between 3,500 and 5,000 keystrokes per hour.
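The effect of these keystroke rates on labor can be estimated with simple arithmetic. In the sketch below the keystroke rates come from the figures quoted in this paper, while the batch size and the 5% reject rate are assumptions chosen only for illustration.

```python
def correction_hours(total_characters: int, reject_rate: float,
                     keystrokes_per_hour: float) -> float:
    """Operator hours needed to key the rejected characters."""
    return total_characters * reject_rate / keystrokes_per_hour


# Assumed workload: one million characters with 5% rejected for review.
TOTAL_CHARACTERS = 1_000_000
REJECT_RATE = 0.05

# Keystroke rates quoted in this paper (characters per hour).
RATES = {
    "ribbon editor": 10_000,
    "context mode (fast)": 5_000,
    "context mode (slow)": 3_500,
}

for mode, rate in RATES.items():
    hours = correction_hours(TOTAL_CHARACTERS, REJECT_RATE, rate)
    print(f"{mode}: {hours:.1f} hours")

# For comparison: keying every character from paper at 11,900 per hour.
print(f"full manual entry: {TOTAL_CHARACTERS / 11_900:.1f} hours")
```

Under these assumptions the same set of rejects takes roughly two to three times as long to correct in context mode as in a ribbon editor, which is why the correction interface weighs so heavily on the true cost of a system.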