Digital Collections and Initiatives

Best Practices for Digitization: Optimized for OCR

Text-based material* | Archival Master File | Optical capture resolution | Bit depth | Embedded color/gray profile | Notes
------------------------------ | ---------------------- | -------------------------- | --------- | ------------------------------- | -----
Printed computer documents | Uncompressed TIFF v.6 | 400 PPI | 8/24 | Adobe RGB (1998)/Gray Gamma 2.2 |
Typed documents | Uncompressed TIFF v.6 | 400 PPI | 8/24 | Adobe RGB (1998)/Gray Gamma 2.2 |
Printed publication matter | Uncompressed TIFF v.6 | 400 PPI | 8/24 | Adobe RGB (1998)/Gray Gamma 2.2 |
Printed matter on microform | Uncompressed TIFF v.6 | 3500 PPI** | 8 | Gray Gamma 2.2 | **Accounts for magnification ratio

*Image specs optimized for OCR

This standard defines best practices for digitizing text-based material that is slated for Optical Character Recognition (OCR) processing, either immediately after digitization or at some point in the future. It prioritizes legibility.

While material digitized according to these standards will often be in fairly robust condition, it may be vulnerable due to inherent vice in the medium (such as brittle paper or deterioration of a film substrate). Conservation assessment is recommended prior to digitization.

Digitization standards for preservation

Collections that contain any color (in images or text) should generally be photographed entirely in color, although project stakeholders may decide otherwise. Material digitized for OCR will often benefit from grayscale capture with modest contrast enhancement. Master files for materials that were originally produced in grayscale or bitonal form, such as microforms, should always be grayscale.

For microforms, a higher capture resolution is required because the image on film is a reduced copy of the original document. For example, an 8.5 x 11-inch document captured on 35mm microfilm appears at approximately 0.12 of its original size, so a 3500 PPI scan of the film corresponds to 3500 x 0.12 = 420 PPI relative to the original document. Dedicated scanners are required for this type of imaging.
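The microform arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 0.12 factor is the approximate figure from the example, not a fixed property of 35mm film.

```python
def effective_ppi(scan_ppi: float, reduction_factor: float) -> float:
    """Resolution relative to the original document for a given film scan.

    reduction_factor is the size of the image on film divided by the size
    of the original (about 0.12 in the example above; an assumption here).
    """
    if scan_ppi <= 0 or reduction_factor <= 0:
        raise ValueError("scan_ppi and reduction_factor must be positive")
    return scan_ppi * reduction_factor


def required_scan_ppi(target_ppi: float, reduction_factor: float) -> float:
    """Film scan resolution needed to reach target_ppi on the original."""
    if target_ppi <= 0 or reduction_factor <= 0:
        raise ValueError("target_ppi and reduction_factor must be positive")
    return target_ppi / reduction_factor


# The worked example from the text: a 3500 PPI film scan at a 0.12 factor.
print(round(effective_ppi(3500, 0.12)))  # 420
```

Running the calculation in the other direction with `required_scan_ppi(400, 0.12)` gives roughly 3333 PPI, which suggests why the 3500 PPI figure leaves some headroom over the 400 PPI used for paper originals.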

Master File Format: All master files must be uncompressed Tagged Image File Format (TIFF) version 6 files, in either “little endian” (IBM PC) or “big endian” (Mac) byte order. In addition, all files must pass JHOVE format validation.
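As a rough pre-check before JHOVE, the byte order and compression of a file can be read directly from the TIFF header and first image file directory (IFD). The sketch below uses only the Python standard library; the function names are hypothetical, it inspects only the first IFD, and it is not a substitute for JHOVE validation.

```python
import struct


def read_tiff_header(data: bytes):
    """Return (struct byte-order prefix, offset of first IFD) for a classic TIFF."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    if data[:2] == b"II":      # little endian ("IBM PC")
        endian = "<"
    elif data[:2] == b"MM":    # big endian ("Mac")
        endian = ">"
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (magic number is not 42)")
    return endian, ifd_offset


def compression_code(data: bytes) -> int:
    """Return the Compression tag (259) of the first IFD; 1 means uncompressed."""
    endian, ifd_offset = read_tiff_header(data)
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry_start = ifd_offset + 2 + 12 * i   # each IFD entry is 12 bytes
        (tag,) = struct.unpack_from(endian + "H", data, entry_start)
        if tag == 259:
            # Compression is a SHORT stored at the start of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry_start + 8)
            return value
    return 1  # TIFF 6.0 default when the tag is absent: no compression


def is_uncompressed_tiff(data: bytes) -> bool:
    return compression_code(data) == 1
```

To use it, read the file in binary mode (for example `data = open(path, "rb").read()`, where `path` is a placeholder) and pass the bytes to `is_uncompressed_tiff`. Truncated files raise `struct.error`, which a caller would treat as a failed check.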

Resolution: Image capture resolution is measured in pixels per inch (PPI). This should be a true optical resolution; the lens and pixel array in the capture device should be capable of creating an image file to the required resolution specification without interpolation.
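One way to reason about true optical resolution is to divide the number of sensor pixels along an edge by the physical width of the area being captured. The following is a minimal sketch with hypothetical function names and example numbers; it ignores lens quality, which also limits the detail a device can actually resolve.

```python
def optical_ppi(sensor_pixels: int, field_inches: float) -> float:
    """Pixels per inch delivered natively when sensor_pixels span field_inches."""
    if sensor_pixels <= 0 or field_inches <= 0:
        raise ValueError("sensor_pixels and field_inches must be positive")
    return sensor_pixels / field_inches


def needs_interpolation(sensor_pixels: int, field_inches: float,
                        required_ppi: float) -> bool:
    """True if the device cannot reach required_ppi without upsampling."""
    return optical_ppi(sensor_pixels, field_inches) < required_ppi


# Hypothetical example: a 6800-pixel edge covering a 17-inch field of view.
print(optical_ppi(6800, 17.0))               # 400.0
print(needs_interpolation(6800, 17.0, 400))  # False
```

With a 6000-pixel edge over the same 17 inches, the native resolution drops to about 353 PPI, so reaching 400 PPI would require interpolation and would not meet the specification.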

Bit Depth:

Color (RGB):

  • Images are captured natively in 24-bit RGB RAW or TIFF format and Master Files are exported as 24-bit TIFF files with the “Adobe RGB (1998)” color profile embedded.

Grayscale:

  • Master Files are saved in 8-bit mode and should be embedded with the “Gray Gamma 2.2” profile.
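The bit depth and profile requirements above lend themselves to a small lookup table that a quality-control script could consult. This sketch assumes the caller has already read the bit depth and embedded profile name from the file by some other means; the dictionary layout and function name are hypothetical.

```python
# Requirements taken from the Color (RGB) and Grayscale items above.
MASTER_SPECS = {
    "color": {"bits_per_pixel": 24, "profile": "Adobe RGB (1998)"},
    "grayscale": {"bits_per_pixel": 8, "profile": "Gray Gamma 2.2"},
}


def check_master(mode: str, bits_per_pixel: int, profile_name: str) -> list:
    """Return a list of human-readable problems; an empty list means a match."""
    if mode not in MASTER_SPECS:
        raise ValueError(f"unknown mode: {mode!r}")
    spec = MASTER_SPECS[mode]
    problems = []
    if bits_per_pixel != spec["bits_per_pixel"]:
        problems.append(
            f"expected {spec['bits_per_pixel']}-bit, found {bits_per_pixel}-bit"
        )
    if profile_name != spec["profile"]:
        problems.append(
            f"expected profile {spec['profile']!r}, found {profile_name!r}"
        )
    return problems


print(check_master("grayscale", 8, "Gray Gamma 2.2"))  # []
```

Returning a list of problems rather than a single true/false value makes it easier to report every mismatch for a file in one pass.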

Editing:

  • All images should be cropped to show the entire item, leaving a small border of background around it so that the full extent of the page or object is visible. Black borders are preferred, but there are exceptions, such as dark originals, outsourced projects or image file collections created by a third party.

  • While sophisticated image viewers can easily rotate an image, the master image file should be oriented properly. For bound materials with pages of varying orientation, default to the orientation of the binding.
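For the cropping guidance above, the crop rectangle can be computed by expanding the item's bounding box by a fixed margin and clamping it to the image edges. This is a minimal sketch: the function name is hypothetical, the box is assumed to be (left, top, right, bottom) in pixels, and the 50-pixel margin is an arbitrary example because the guideline only asks for a "small" border.

```python
def crop_box_with_border(item_box, border_px, image_size):
    """Expand item_box by border_px on every side, staying inside the image.

    item_box is (left, top, right, bottom) in pixels; image_size is
    (width, height). Both conventions are assumptions for this sketch.
    """
    if border_px < 0:
        raise ValueError("border_px must not be negative")
    left, top, right, bottom = item_box
    width, height = image_size
    return (
        max(0, left - border_px),
        max(0, top - border_px),
        min(width, right + border_px),
        min(height, bottom + border_px),
    )


# A page detected at (100, 120)-(3500, 4520) inside a 3600 x 4600 capture.
print(crop_box_with_border((100, 120, 3500, 4520), 50, (3600, 4600)))
# (50, 70, 3550, 4570)
```

Clamping matters when the item sits close to the edge of the frame: the border simply shrinks on that side instead of producing coordinates outside the image.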