character recognition testing

Tue Apr 24 10:44:17 EDT 1990

       The National Institute of Standards and Technology (NIST),
       formerly the National Bureau of Standards (NBS), has developed
       a database for testing handprint character recognition.
       The database is on an ISO-9660 formatted CD and is
       described briefly below. Please forward this to
       interested parties.

__________________________________________________________________

                        NIST Handprint Database

       The NIST handprinted character database consists of 2100
       pages of bilevel (black and white) image data of hand
       printed numerals and text, with a total character count of
       over 1,000,000 characters. Data is compressed using CCITT
       Group 4 compression, and decompression software is provided
       in C.
       
       The total image database, in uncompressed form, contains
       about 3 Gigabytes of image data, with 273,000 numerals
       and 707,700 alphabetic characters. The handprinting
       sample was obtained from a selection of field data
       collection staff of the Bureau of the Census, with a
       geographic sampling corresponding to the population
       density of the United States. The geographical sampling
       was done because previous national samples of 
       handprinted material have suggested that there are
       significant regional differences in handprinting style.
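
       As a rough sense of scale, the size figures quoted above
       can be turned into a per-page estimate. The sketch below
       assumes that "about 3 Gigabytes" means 3 x 10^9 bytes and
       that the uncompressed bilevel images are packed at one bit
       per pixel; neither assumption is stated in the announcement.

```python
# Back-of-envelope page size from the announcement's figures.
TOTAL_BYTES = 3 * 10**9   # assumption: "about 3 Gigabytes" taken as 3e9 bytes
PAGES = 2100              # page count given in the announcement
BITS_PER_PIXEL = 1        # assumption: packed bilevel, one bit per pixel

bytes_per_page = TOTAL_BYTES / PAGES
pixels_per_page = bytes_per_page * 8 / BITS_PER_PIXEL

print(f"{bytes_per_page / 1e6:.2f} MB per page")
print(f"{pixels_per_page / 1e6:.1f} million pixels per page")
```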

       Most of the individuals who participated in the sampling
       are accustomed to filling out forms relatively neatly,
       and so this sample may represent a "best possible" sample
       of handprinting.  Even so, the range of characters and
       spatial placement of those characters is broad enough to
       present very difficult challenges to the image
       recognition systems currently available or likely to be
       available in the near future.

                              Typical Use

       This test data set was designed for multiple uses in the
       area of image (character) recognition. The problem of
       computer recognition of document content from images is
       usually broken down into three operations. First, the
       relevant areas containing text are located. This is
       usually referred to as field isolation. Next, the entire
       field image containing one or more characters is broken
       into the images of individual characters. This process is
       usually referred to as segmentation. Finally, these
       isolated characters must be correctly interpreted. The
       images in the database are designed to test all three
       of the processes.
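
       The segmentation step described above can be illustrated
       with a minimal sketch. It is not the method used by NIST or
       supplied on the CD; it is a simple column-projection split,
       assuming the field has already been isolated and is held as
       a list of rows with 1 for ink and 0 for background. The
       names segment_field and crop are hypothetical.

```python
def segment_field(field):
    """Split a bilevel field image into character column spans.

    `field` is a list of equal-length rows (1 = ink, 0 = background).
    Returns half-open (start_col, end_col) spans, one per run of
    consecutive columns that contain ink.
    """
    if not field:
        return []
    width = len(field[0])
    # A column "has ink" if any row has a 1 in that column.
    has_ink = [any(row[x] for row in field) for x in range(width)]

    spans = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                 # a character begins here
        elif not ink and start is not None:
            spans.append((start, x))  # blank column ends the character
            start = None
    if start is not None:
        spans.append((start, width))  # character runs to the right edge
    return spans


def crop(field, span):
    """Cut the columns of one span out of the field image."""
    start, end = span
    return [row[start:end] for row in field]


# Toy 4x8 field holding a "1" and a "7" separated by blank columns.
FIELD = [
    [0, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
]

for span in segment_field(FIELD):
    print(span, crop(FIELD, span))
```

       A split of this kind fails when neighboring characters
       touch or overlap, which is one reason the spatial placement
       of handprinted characters makes segmentation difficult.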

       The test data can be used for any one of the three
       operations, although it is important to recognize that
       the success of each step in this process depends on the
       success of the steps before it.
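
       The dependence between steps can be made concrete with a
       small calculation. The sketch below assumes that errors in
       the three operations are independent, so the end-to-end
       success rate is the product of the per-step rates. The
       rates shown are hypothetical and are not measurements from
       this database.

```python
def end_to_end_rate(stage_rates):
    """Multiply per-stage success rates (assumes independent errors)."""
    rate = 1.0
    for r in stage_rates:
        rate *= r
    return rate


# Hypothetical rates: field isolation, segmentation, recognition.
rates = [0.99, 0.95, 0.90]
print(f"end-to-end success rate: {end_to_end_rate(rates):.3f}")
```

       Under these assumed figures the whole process succeeds
       less often than its weakest single step.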
       
       
       For further information, contact:

			Joan Sauerwine
			301-975-2208
			FAX 301-975-2183