character recognition testing
Handprint Sample Form Account
hsf at magi.ncsl.nist.gov
Tue Apr 24 10:44:17 EDT 1990
The National Institute of Standards and Technology (NIST),
formerly the National Bureau of Standards (NBS), has developed
a database for testing handprint character recognition.
The database is on an ISO-9660 formatted CD and is
described briefly below. Please forward this to
interested parties.
__________________________________________________________________
NIST Handprint Database
The NIST handprinted character database consists of 2100
pages of bilevel (black and white) image data of
handprinted numerals and text with a total character count
of over 1,000,000 characters. Data is compressed using CCITT
Group 4 compression, and decompression software is provided
in C.
The total image database, in uncompressed form, contains
about 3 Gigabytes of image data, with 273,000 numerals
and 707,700 alphabetic characters. The handprinting
sample was obtained from a selection of field data
collection staff of the Bureau of the Census, with a
geographic sampling corresponding to the population
density of the United States. The geographical sampling
was done because previous national samples of
handprinted material have suggested that there are
significant regional differences in handprinting style.
Most of the individuals who participated in the sampling
are accustomed to filling out forms relatively neatly,
and so this sample may represent a "best possible" sample
of handprinting. Even so, the range of characters and the
spatial placement of those characters are broad enough to
present very difficult challenges to the image
recognition systems currently available or likely to be
available in the near future.
Typical Use
This test data set was designed for multiple uses in the
area of image (character) recognition. The problem of
computer recognition of document content from images is
usually broken down into three operations. First, the
relevant areas containing text are located. This is
usually referred to as field isolation. Next, the entire
field image containing one or more characters is broken
into the images of individual characters. This process is
usually referred to as segmentation. Finally, these
isolated characters must be correctly interpreted. The
images in the database are designed to test all three
of these processes.
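
The three operations described above can be sketched on a toy bilevel image. This is only an illustration, not the software distributed with the database: the function names, the field box coordinates, and the width-based "classifier" are all hypothetical, and real handprint needs far more robust methods at each stage.

```python
from typing import List, Tuple

# A bilevel image as rows of pixels: 1 = ink, 0 = background.
Image = List[List[int]]


def isolate_field(page: Image, box: Tuple[int, int, int, int]) -> Image:
    """Field isolation: crop the box (top, left, bottom, right) from the page."""
    top, left, bottom, right = box
    return [row[left:right] for row in page[top:bottom]]


def segment(field: Image) -> List[Image]:
    """Segmentation: split the field at columns that contain no ink."""
    if not field:
        return []
    width = len(field[0])
    has_ink = [any(row[c] for row in field) for c in range(width)]
    chars: List[Image] = []
    start = None
    # The trailing False closes a character that touches the right edge.
    for c, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = c
        elif not ink and start is not None:
            chars.append([row[start:c] for row in field])
            start = None
    return chars


def recognize(char: Image) -> str:
    """Recognition: hypothetical stand-in that labels by width only."""
    return "1" if len(char[0]) == 1 else "0"


def read_field(page: Image, box: Tuple[int, int, int, int]) -> str:
    """Run the three stages in order on one field of a page."""
    return "".join(recognize(ch) for ch in segment(isolate_field(page, box)))


# Toy page containing a one-pixel-wide stroke followed by a small loop.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

print(read_field(page, (1, 1, 4, 7)))
```

Splitting at blank columns only works when characters do not touch or overlap, which is one reason segmentation of real handprint is difficult.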
The test data can be used for any one of the three
operations, although it is important to recognize that
the success of each step in this process depends on the
success of the steps before it.
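
A rough way to see this dependence is to multiply per-stage success rates. The sketch below assumes the stages fail independently, which is a simplification, and the rates are made-up numbers for illustration, not measurements from this database.

```python
from typing import Iterable


def end_to_end_rate(stage_rates: Iterable[float]) -> float:
    """Overall success rate if every stage must succeed independently."""
    overall = 1.0
    for rate in stage_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each rate must lie between 0 and 1")
        overall *= rate
    return overall


# Hypothetical rates for field isolation, segmentation, and recognition.
rates = [0.98, 0.95, 0.90]
print(round(end_to_end_rate(rates), 4))
```

Under these assumed rates, the full pipeline succeeds less often than its weakest stage, so errors in field isolation or segmentation limit what any recognizer can achieve downstream.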
For further information, contact:
Joan Sauerwine
301-975-2208
FAX 301-975-2183