Handbook for Digital Projects:
A Management Tool for Preservation and Access
Developing Best Practices:
Guidelines from Case Studies
Introduction
This chapter contains six case studies that move the reader from the theoretical views of how digitization should be conducted to the actual practice of planning, executing, and evaluating projects. Some of the sections focus primarily on the experiences of one institution, while other sections are composites of what has been learned from various situations.
The case studies include descriptions of what has worked and has not worked. Wherever possible, authors have included tips for those who are beginning new scanning projects.
1. Working with Printed Text and Manuscripts
Stephen Chapman
Harvard University Library
Look at the growing body of network-accessible books, journals, and archives from cultural institutions and commercial publishers and you will discover that electronic text is not all alike. Some collections are searchable, others are not; some have high-quality color reproductions, others limit their content to black-and-white (1-bit) images; some support go-to-page and go-to-section navigation, and many simply provide page-forward, page-back functionality. Rather than present a single case study of one type of electronic text, this section presents a composite case study of the challenges raised by several types of text conversion and the guidelines that have emerged in response to them.
Since costs among all of these versions vary widely, the first job for the budget-conscious manager is to select the least-expensive electronic publication model appropriate to the collection(s) she or he has selected for digitization. Generally speaking, electronic text falls into three categories:
Page Images -- These digital photocopies are created by scanning printed pages or microfilm. Page images are not searchable. They may be black-and-white, grayscale, or color. Assume that black-and-white (1-bit) page images are the least expensive products of text digitization, but be sure to account for the associated costs of the structural metadata that is needed in order to make the images suitable for browsing and on-line navigation.
Full Text -- In order for printed text to become fully searchable electronic text (full text), the letters on the original pages must be translated to machine-processible ASCII. There are two ways to do this: either by typing from the original (known as keying) or by using an optical character recognition (OCR) program to convert page images to ASCII. The first process is manual, the second automated. Since keying can easily be ten times more expensive than scanning-plus-OCR, page images are often made to facilitate the creation of full text. When these two products (full text and page images) are made, there is the added advantage of being able to present a facsimile version of the original page -- with fonts, formatting, marginalia, and illustrations intact -- in response to a search. In other words, the ASCII is used to create an index for searching, and only the page images are delivered to the screen or printer.
One might ask: If scanning and OCR are so much cheaper than keying, why consider keying at all? First, OCR is viable only for page images of machine-printed text. Handwritten originals must be keyed to become searchable. Second, OCR accuracy decreases as the complexity of originals increases (number of fonts, number of columns, illustrations accompanying text), and as the quality of the page images decreases. Therefore, if near 100% accuracy of searching is required, it might be less expensive to key than to undertake the three-step process of scan, OCR, and correct OCR errors. Several reliable studies report that a trained technician can correct 6-10 pages per hour. Depending upon salary, this task alone could easily exceed the cost of keying.
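The trade-off described above can be put in rough numbers. The following sketch is illustrative only: the hourly wage and the per-page prices for scanning, OCR, and keying are hypothetical placeholders, not figures from any project. Only the correction rate of 6-10 pages per hour comes from the text.

```python
def correction_cost_per_page(hourly_wage, pages_per_hour):
    """Labor cost of correcting OCR errors on one page."""
    return hourly_wage / pages_per_hour


def scan_ocr_correct_cost(scan_ocr_per_page, hourly_wage, pages_per_hour):
    """Per-page cost of the three-step route: scan, OCR, then correct."""
    return scan_ocr_per_page + correction_cost_per_page(hourly_wage, pages_per_hour)


# Hypothetical inputs: replace with quotes and salaries from your own project.
HOURLY_WAGE = 15.00        # assumed technician wage per hour
SCAN_OCR_PER_PAGE = 0.25   # assumed price of scanning plus uncorrected OCR
KEYING_PER_PAGE = 2.00     # assumed vendor price for keying one page

for rate in (6, 10):       # correction rates cited in the text, pages per hour
    total = scan_ocr_correct_cost(SCAN_OCR_PER_PAGE, HOURLY_WAGE, rate)
    cheaper = "keying" if KEYING_PER_PAGE < total else "scan + OCR + correct"
    print(f"{rate} pages/hour: corrected OCR costs {total:.2f} per page; cheaper route: {cheaper}")
```

With these assumed prices the cheaper route changes between the slow and fast correction rates, which is why a sample test on the actual materials is worth running before committing to either approach.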
Encoded Text, or Full Text with Mark-up -- This third publishing model for text conversion is the most expensive, but also the most functional and flexible in the online environment. Like plain full text, encoded text production requires keying or OCR of page images to create ASCII. The final step is to encode attributes of a given document by placing Standard Generalized Markup Language (SGML) tags around selected text. There are hundreds of SGML elements that can be used for encoding. The Text Encoding Initiative (TEI) Guidelines refer to a subset that has been used widely for publications in the humanities. Texts usually are encoded at one or both of the following levels: (1) structural: referring to divisions such as chapters within books, articles within journals, poems within anthologies; or (2) descriptive: referring to elements such as dates, names of persons or places, and occupations. When a properly configured search interface/application is coupled with an SGML database, encoding makes fielded searching possible (e.g., find "slavery" in captions), and can also be used to control the presentation of the document -- including multiple representations if desired.
Note: It is not necessary to create page images in order to produce encoded text if (1) keying is an affordable approach to production, and (2) your goal is to present modern rather than facsimile pages to the screen or printer.
After deciding which electronic products satisfy the project requirements, the manager's second task should be to specify the outcome for the source materials after conversion. Since the printed originals are also products of text conversion, it is important to determine whether they should emerge exactly as they began or whether alterations are acceptable. It is significantly easier to create a project budget and plan of work if disposition decisions -- related to access policies, materials housing and location, and even deaccessioning -- are made at the outset.
Decisions about the appropriate outcomes for the source materials inform, if not determine, the handling guidelines for scanning. Materials that will be kept, particularly if they are to remain as circulating copies, may need to be assessed, cleaned, repaired, or rehoused at some point in the project. On the other hand, materials that will be moved to offsite storage or even discarded allow for a greater range of options in scanning techniques.
Questions about handling and disposition are particularly important for bound materials. Disbound pages, even when highly brittle, can either be scanned on flatbed scanners or can be automatically fed to sheetfeed scanners (with straight paper paths). In other words, it is much less expensive to scan pages than to scan books. As of 1999, production statistics gathered from a number of projects indicate that although technicians feeding or turning pages by hand can scan up to five pages per minute, they typically average between two and three. Auto-feed scanners, on the other hand, can scan two sides of a page in a single pass. Using the same output settings (e.g., 600 dpi 1-bit TIFF), these scanners produce 20 images per minute. Thus, assessments of source materials are critical because whenever manual feeding (or page turning) is required, scanning prices are tied directly to labor costs. In this model, improvements in scanning technology can only result in better quality, but not higher speed. When auto-feeding is allowed, technology improvements can result not only in higher quality but also higher speeds, and therefore lower unit costs.
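To see what these rates mean for a project schedule, the sketch below converts them into capture hours. The collection size is a hypothetical example; the rates are the ones quoted above. The estimate covers capture time only and ignores preparation, quality control, and rescans.

```python
def scanning_hours(total_images, images_per_minute):
    """Hours needed to capture a given number of page images at a steady rate."""
    return total_images / images_per_minute / 60


# Hypothetical collection: 100 volumes of 300 pages each.
TOTAL_IMAGES = 100 * 300

# Rates taken from the text: manual scanning averages two to three pages per
# minute; an auto-feed scanner produces about 20 images per minute.
for label, rate in (("manual, 2 per minute", 2),
                    ("manual, 3 per minute", 3),
                    ("auto-feed, 20 per minute", 20)):
    print(f"{label}: {scanning_hours(TOTAL_IMAGES, rate):.0f} hours")
```

Multiplying the resulting hours by an operator's wage gives a first approximation of the labor component of the scanning price.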
Decisions regarding appropriate handling are complex, and any method must be tested and confirmed with a sample of materials before undertaking full production. No single best system has emerged for text scanning. Auto-feed, flatbed, overhead, or even digital camera systems are all viable. When selecting a scanner or writing specifications for a service bureau, handling requirements should be specified first, then image quality and speed. Scanning software plays an important role in these areas. For example, the same input settings -- e.g., 600 dpi 1-bit TIFF -- on different scanners will produce different results on output. Batch settings often distinguish high-price from low-price systems and are critical for high-volume applications.
Rules of Thumb
Although there are many variables associated with selecting the best methods to create page images and/or full text, there are fortunately some rules of thumb common to many text conversion projects.
- To minimize costs of creating and maintaining page images, 1-bit scanning with lossless compression should be used whenever possible; permitting the use of auto-feed scanners is the least expensive way to produce images of high enough quality for OCR, printing, and/or computer output microfilm (COM). Quality from all 1-bit scanners -- sheetfeed, flatbed, and overhead -- is the product of engineering (hardware, optics), software, and operator skill, so be sure to confirm that resolution requirements cited in one project work equally well for the materials and scanner you have selected in yours.
- When grayscale or color scanning is preferred for machine-printed text, use a scanner or digital camera with enough spatial resolution to capture the lines, edges, and other details of the source materials. Compare the costs and quality of line-array and area-array systems to determine which produces acceptable quality at the lowest cost. If OCR is required, fairly sophisticated image processing (following scanning) will be needed to generate 1-bit files from the grayscale or color scans.
- When conservation assessment and/or treatment is mandated for the source materials, conservators should participate in selecting the scanning equipment that will be used and in writing the handling guidelines for the project.
- Image quality and quality control requirements relate directly to the disposition of the source materials. Quality requirements will be higher for projects where reduced access to, or even replacement of, the originals is required. Costs, ironically, may be lower, since auto-feeding may be viewed as a more acceptable technique for these items than for unique materials in good condition.
- Costs of document preparation (excluding conservation treatment), metadata creation, and quality control are likely to exceed the cost of scanning, particularly for 1-bit imaging.
- Given the design of overhead scanners, as well as the limited depth of field in many digital cameras, bound volumes will be less expensive to scan if they can be opened fully (180 degrees). Text printed near or into the gutter margin is always difficult to capture -- as handling requirements increase, so will the costs.
- Oversize pages (particularly when the longest dimension is greater than 17") are always more expensive to scan. High-quality digital reproduction of text becomes more difficult with direct scanning; newspapers, for example, have routinely been microfilmed first in order to produce page images of adequate quality.
- Many image enhancement techniques, such as despeckling and deskewing, can be automated following scanning. Image processing is important not only to the appearance of page images, but also to their optimization for OCR.
- The structural metadata needed to organize page images may be created before, during, or after scanning. Given the idiosyncrasies of pagination and organization of many historic collections, one should expect these tasks to be manual, or semiautomated at best.
- Requirements for full text accuracy and depth of encoding result from a careful analysis of the source materials and consultation with the community(ies) interested in using the digital collections.
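The storage argument behind the first rule of thumb can be checked with simple arithmetic. The sketch below computes the uncompressed size of a single page image at different bit depths. The page dimensions are a hypothetical example, and the calculation ignores file-format overhead and row padding.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size of one uncompressed page image in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Hypothetical page size: 6 x 9 inches, scanned at 600 dpi.
for label, bits in (("1-bit", 1), ("8-bit grayscale", 8), ("24-bit color", 24)):
    size_mb = uncompressed_bytes(6, 9, 600, bits) / 1_000_000
    print(f"{label}: {size_mb:.1f} MB before compression")
```

Because file size scales directly with bit depth, an 8-bit grayscale image starts out eight times larger than its 1-bit equivalent, and lossless compression then reduces 1-bit text pages further.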
The following table summarizes the decisions that have the most important impact on quality and cost in text conversion projects. Many guidelines have been proposed from case studies, and these have been generalized for the table. As discussed in other chapters, however, good management begins by setting goals, not by blindly following guidelines. Relate your decisions to your publication objectives and preferred outcomes for the source materials, and the scanning guidelines and costs will naturally follow.
KEY QUALITY AND COST DECISIONS FOR DIGITIZED TEXT
Product | Examples of Key Decisions | Guidelines
Source Materials
Handling
* Contact with glass permitted
    All scanners are viable
* Bound volumes must be supported during scanning (opened less than 180°)
    Face-up scanning required, with appropriate cradle/book support
Disposition
* Maintain standard of access: return in original format to original location
    Identify resources available for treatment. If staffing and funding are available, for example, to assess, disbind, and rebind materials, then compare costs of scanning pages versus scanning books before selecting best approach.
* Reduce access by changing circulation policy or by relocating
    To save cost, auto-feed if feasible, but budget for necessary preparation material and rehousing costs.
* Severely reduce or even eliminate access by creating digital images of replacement quality and/or by disposing source materials after scanning
    Requirements for quality control and metadata must be explicitly defined (consider use of technical targets); disbinding might be most appropriate in these circumstances.
Preparation
* Facilitate highest quality scanning at the lowest cost
    Segregate materials into batches whenever feasible (e.g., by size; or by content -- text, illustration, mixed, color)
Page Images
Specifications for master (archival) images
* Achieve tone reproduction appropriate to source materials and output requirements
    When black-and-white (1-bit) fails to capture essential information, use scanners that sample 12 bits per pixel and output at least 8 bits per pixel for grayscale and 24 bits per pixel for color.
* For machine-printed text, achieve detail reproduction needed to meet output requirements (screen, print, OCR)
    400-600 dpi commonly used; threshold and image processing capabilities also critical to image quality, especially for 1-bit images; post-scan enhancements can increase OCR accuracy
* For handwritten manuscripts and soft-edge type, such as photostats, achieve detail reproduction needed to meet output requirements (screen, print, zoom)
    300 dpi minimum for 1-bit, 200-400 dpi minimum for grayscale and color
* Use open format
    TIFF
* Use safe compression
    Group 4 (lossless) compression for 1-bit, none for grayscale and color images
* Implement quality control program
    Confirm that all files for object have been received, sequence is correct, metadata is complete and correct (100%); check image quality on screen, in print, or both (sample)
Specifications for delivery images (derivatives)
* Print, computer-output microfilm (COM)
    Master images (high-resolution TIFFs), PDF, or PostScript
* On-screen images
    Legibility generally achieved at 80-120 dpi; minimize file size by using fewer than the full 8 bits for GIF whenever possible (e.g., 4-bit); if 8 to 24 bits are required, consider JPEG
Specifications for navigation
* Page-forward, page-back
    Include sequence field in image database, or embed sequence in filenames
* Go-to page
    Include page number field in image database, or embed page number in filenames (the latter is generally a more expensive solution)
* Go-to section
    Include feature or feature code field in image database, or mark up full text (see below)
Full Text
Specification for accuracy (characters only)
* 100%
    Get prices for keying first, then conduct sample OCR test of page images
* Less than 100%
    Conduct sample OCR test of page images and review acceptability of output; avoid need to correct OCR-generated text at all costs
Marked-up Text
Specification for accuracy (characters and formatting)
* Fidelity to original required or desirable
    Keying/encoding may be the least expensive approach; test scanning/OCR only if the original layout and fonts are relatively simple
Specification for encoding
* Accommodate attributes of materials in hand while using practices endorsed by broader community
    Consult TEI Lite and create DTD to accommodate structural divisions and descriptive features in the texts in hand; local interpretations of the general guidelines are possible
Sources
Bicknese, Douglas A. Measuring the Accuracy of the OCR in the Making of America. Winter 1998. http://moa.umdl.umich.edu/moaocr.html
Guthrie, Kevin M. "JSTOR: From Project to Independent Organization." D-Lib Magazine (July-August, 1997). http://www.dlib.org/dlib/july97/07guthrie.html
Morrison, Alan, Michael Popham and Karen Wikander. "Creating and Documenting Electronic Texts: A Guide to Good Practice." AHDS Guides to Good Practice. Arts and Humanities Data Service, 2000. http://ota.ahds.ac.uk/documents/creating/
Text Encoding Initiative (TEI). "The TEI Consortium Homepage." http://www.ctei-c.org/
University of Virginia Library Electronic Text Center. The Electronic Text Center: On-Line Helpsheets. http://etext.lib.virginia.edu/helpsheets/sgmlscan.html
2. Working with Photographs
Franziska Frey
Image Permanence Institute
Why are Photographs Different?
There are several issues that set photographs apart from other documents for scanning.
Permanence
The materials that make up photographs are not chemically stable. These materials include silver or dyes as image-forming materials; paper, celluloid, or other plastics as base materials; and gelatin, albumen, or collodion as binders. Environmental influences such as light, chemical agents, heat, humidity, and storage conditions affect and destroy photographic materials. The only reliable method to preserve them for a long period of time is dark storage at low temperature and low humidity.
Faced with deterioration in the form of color dye fading, vinegar syndrome in acetate film, and degrading and flammable nitrate film, collection managers are debating whether it is better to invest in improved storage or in reformatting.
Complexity
Many digitization projects for photographs grew out of projects primarily dealing with text. This approach can lead to problems because images have to be treated quite differently when digitized.
The main goal when digitizing text documents is legibility. However, there are many different aspects of quality to be considered when digitizing images. In addition, finding aids for images are quite complex. Research is still underway to determine how best to facilitate effective searches.
Survey of Collection
Before a digitization project starts, the collection should be carefully surveyed. Not only the images but also the cataloguing system should be evaluated. In the long run, the inadequacy of the current image-description methods and the enormous amount of cataloging yet to be done with image collections will be the factors that restrict progress toward a digital future, rather than the lack of suitable imaging technology.
Types of Photographs
Photographs can be classified into two groups according to whether the image is viewed by reflected or transmitted light. The most important difference between the two is in dynamic range -- the difference between the lightest and the darkest areas of the image. Reflection prints of any type usually have a smaller dynamic range than negatives. Color transparencies have the largest dynamic range.
Negatives -- Negative collections especially profit from digitization since this makes them easily accessible. Millions of negatives are never used only because their image content is not readily available to the user. A printing process is needed to get a positive image. Therefore, not only the public but often even the collection managers themselves do not know what a negative collection contains. It has already been shown that as soon as negatives are scanned and a positive image can be viewed, their use grows enormously, almost at once. A huge number of older negatives are glass plate negatives. Choosing to digitize them reduces the risk of loss through breakage because they only have to be handled once.
Color Type -- Another way to classify images is by color type. Depending on the color type, images will be scanned in black-and-white or color.
Full Color -- Most of today's photographs are taken in full color. However, this trend only dates back to the mid-1960s. This means that the majority of collections will not include many color photographs, a fact that will change as more color photographs come into archives and libraries.
Monochromatic Color -- A large number of photographs to be scanned will be monochromatic color (Reilly). Many 19th-century photographic print processes have characteristic colors, e.g., the purplish-brown colors of albumen prints and the blue color of cyanotypes. Such colors help scholarly interpretation by conveying information about process and providing clues to the degree of deterioration the photographs may have suffered. Keeping this color information in the digital file is important since it is an inherent characteristic of the picture.
Black and White -- Black and white photographs taken in monochrome are either neutral black in color or have no significant visual information conveyed by the color of the images. Primarily, these are negatives or modern silver-gelatin developed-out prints made in the 20th century.
Electronic Photography
More and more, collections include images that never had a film original. Caring for electronic originals requires collection managers to pay attention to new specialties such as file formats, intellectual property law, high-speed data transfer technology, and database management.
Formats
Image collections often will have a variety of formats, although certain formats (e.g., negatives, prints) can predominate. This variety requires the use of versatile scanning equipment.
Condition of Collection
A collection survey prior to scanning will help with decisions about what should be selected. It also can lead to a plan to control conditions of the original collection in the future, for example by providing better storage facilities or enclosure materials. Preparing a collection for scanning often includes an improvement of the physical conditions of the collection.
Size of Collection
The size of the collection also influences the scanning method and parameters. If the collection is very small, you can choose a time-consuming scanning method. A good example is the National Gallery in London, which scanned every painting using a special multispectral camera. Since the collection consists of only 3,000 paintings, it was possible to scan everything several times. This is not a possible solution for a collection that consists of thousands of images that will most likely not be rescanned within the near future. In addition, with larger collections the workflow has to be planned carefully.
Goals of Digitization
As the digitization of large collections is not likely to be attempted more than once a generation due to cost, educated decisions about the scanning and archiving processes are imperative. The term archival implies that all digitized images are not only optimized for current work flows and imaging devices but will continue to be usable on future, as yet unknown delivery and output systems (Frey & Süsstrunk, 1996; Frey, 1997; Frey & Süsstrunk, 1997).
One of the big issues that institutions should consider prior to implementing a project is the anticipated use of their digital image collections. There is a consensus within the preservation community that a number of image files must be created from every photograph to meet a range of uses. First, an archive or master image should be created. The archival master file should represent the highest quality the institution can afford. It should not be treated for any specific output and should be left uncompressed or compressed in a lossless manner. It will also require an intensive quality review. From this archive file, various derivatives will be calculated. These derivative files are meant to be used. Speed of access and transmission and suitability for certain purposes are the main issues to consider in the creation of these derivative files.
Scan from Duplicate or Original?
A decision has to be made whether to scan from the original or a duplicate. There are advantages and disadvantages to each approach. Because every generation of photographic copying involves some quality loss, using intermediates immediately implies some decrease in quality. Intermediates may also serve some other purposes, however; for example, they might serve as masters for photographic reference copies or as preservation surrogates.
This leads to the question of whether the negative or the print should be used for digitization, assuming both are available. Quality will always be best if the first generation of an image (i.e., the negative) is used. However, there may be big differences between the negative and the print, mainly in the domain of fine-art photography. The artist often spends a lot of time in the darkroom creating the print. The results of all this work are lost if the negative, rather than the print, is scanned, and the outcome of the digitization will be disappointing.
Quality Control
Subjective Visual Inspection
The best approach to digital image quality control includes, on the one hand, subjective visual inspection and, on the other hand, objective measurements performed in software and on the digital files themselves. Efforts should be made to standardize the procedures and equipment for subjective evaluation.
In most cases the first evaluation of a scanned image will be made by viewing it on a monitor. The viewer will decide whether the image on the monitor fulfills the goals that have been stated at the beginning of the scanning project. This is important because human judgment decides the final acceptability of an image. Looking at images and judging their quality has always been a complex task. The viewer has to know what he/she is looking for. It should be emphasized that subjective quality control must be executed on calibrated equipment in an appropriate, standardized viewing environment.
As the image is viewed on the monitor, defects such as dirt, half images, skew, laterally reversed images, and visual sharpness can be detected. In some cases it might be necessary to go back and redo the scanning.
Evaluating Digital Image Files
On the other hand, objective image quality parameters must be employed. You can accomplish this by scanning special targets and evaluating them in specialized software (Gann, 1999; Holm, "Survey," 1996).
The targets and software to evaluate them are not just for vendor checking -- they serve to guarantee the long-term usefulness of the digital files and to protect the investments of the institutions.
Image Quality Framework
When looking at image quality, the whole image processing chain has to be examined (Holm, "Factors," 1996). Besides the scanning system, you also need to look at compression, file formats, image processing for various usage, and system calibration. Image quality is affected by the sequence of applying different image processing steps, including compression. Image processing done before storing the images can affect the quality of future processing. For example, it is recommended not to sharpen the archival master file before storing.
Each of the main image quality parameters needs special targets for the different forms of images (e.g., prints, transparencies). The targets should consist of the same material as the materials that will be scanned -- photographic film and paper.
These targets are a vital part of the image quality framework. After targets are scanned they are evaluated with a software program. Some software components exist as plug-ins to full-featured image browsers, others as stand-alone programs. However, it has to be clearly stated that some of the targets and the software to evaluate them are not yet commercially available.
Targets can be incorporated into the workflow in various ways. Full versions of the targets might be scanned every few hundred images and then linked to specific batches of production files, or smaller versions of the targets might be included with every image. As more institutions initiate digitization projects, having an objective tool to compare different scanning devices will be more and more important.
Tone Reproduction
Tone reproduction is the single most important parameter for determining the quality of an image. If the tone reproduction of an image is right, users will generally find the image acceptable, even if some of the other parameters are not optimal. Capture and display must occur for the concept of tone reproduction to exist. This means that an assumption must be made regarding the final viewing device. Three mutually dependent attributes affect tone reproduction: the opto-electronic conversion function (OECF), dynamic range, and flare. The OECF shows the relationship between the optical densities of an original and the corresponding digital values of the file. Dynamic range refers to the capacity of the scanner to capture extreme density variations. The dynamic range of the scanner should meet or exceed the dynamic range of the original. Flare is generated by stray light in an optical system. Flare reduces the dynamic range of a scanner.
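Dynamic range is commonly stated in optical density units, where density is the base-10 logarithm of the ratio of incident to transmitted (or reflected) light. Under that standard definition, the sketch below converts a density range into a luminance ratio and checks whether a scanner's range covers that of an original. The density figures are hypothetical examples, not measurements of any film or scanner.

```python
import math


def density_to_ratio(density_range):
    """Luminance ratio spanned by a range of optical densities (D = log10 of the ratio)."""
    return 10 ** density_range


def ratio_to_density(ratio):
    """Inverse conversion: optical density range for a given luminance ratio."""
    return math.log10(ratio)


def scanner_covers_original(scanner_range, original_d_min, original_d_max):
    """True if the scanner's density range meets or exceeds the original's."""
    return scanner_range >= original_d_max - original_d_min


# Hypothetical figures for illustration only.
print(density_to_ratio(2.0))                    # a density range of 2.0 is a 100:1 ratio
print(scanner_covers_original(3.0, 0.2, 3.6))   # original spans 3.4, scanner only 3.0
```

A check of this kind uses the scanner's measured range, which flare can make smaller than the manufacturer's stated figure.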
Color Reproduction
Several color reproduction intents can apply to a digital image. Perceptual intent, relative colorimetric intent, and absolute colorimetric intent are the terms often associated with the International Color Consortium (ICC). Perceptual intent is to create a pleasing image on a given medium under given viewing conditions. Relative colorimetric intent is to match, as closely as possible, the colors of the reproduction to the colors of the original, taking into account output media and viewing conditions. Absolute colorimetric intent is to reproduce colors as exactly as possible, independent of output media and viewing conditions.
Most of the available solutions for measuring and controlling color reproduction are geared towards the pre-press industry. However, when an image is scanned for archival purposes, the future use of the image is not yet known. Operator judgments regarding color and contrast cannot be reversed in a 24-bit RGB color system. Any output mapping different from the archived image's color space and gamma must be considered. Nevertheless, saving raw scanner data can create problems if the scanner characteristics are not well known and profiled.
One of the decisions is which color space to use. A color space is a geometric representation of colors in space, usually of three dimensions. The reason for the three dimensions is the human visual system that has three independent receptors and is therefore a three dimensional system. The most important attribute of a color space in an archival environment is that it be well defined. Furthermore, keep in mind that there is more than one solution to the problem. The right color space depends on the purpose and the use of the digital images (Süsstrunk, Buckley & Sven, 1999).
Resolution
A review of past digital projects has shown that people are most concerned about spatial resolution. This is not surprising, because of all the weak links in digital capture, spatial resolution has been the best understood by most people. Technology has evolved, however, and today reasonable spatial resolution is no longer extremely expensive, nor does it cost a lot to store large data files. Spatial resolution is the parameter that defines detail and edge reproduction in an image. Details can be, for example, single hairs in a portrait. Good edge reproduction is critical for the visual sharpness of an image. Spatial resolution of a digital image, i.e., the amount of detail an image contains, is usually defined by the number of pixels per inch (ppi). The higher the number of pixels per inch, the more fine detail can be transferred from the original image to the digital file.
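The relationship between ppi, original size, and pixel count is simple arithmetic, sketched below. The negative size, enlargement, and output resolution are hypothetical values chosen for illustration. The calculation says nothing about how much real detail the film or the scanner optics can actually deliver.

```python
def pixel_dimensions(width_in, height_in, ppi):
    """Pixel width and height of a scan of an original of the given size."""
    return round(width_in * ppi), round(height_in * ppi)


def required_scan_ppi(original_in, output_in, output_ppi):
    """Scan resolution needed so an enlarged output still has output_ppi pixels per inch."""
    return output_in * output_ppi / original_in


# Hypothetical example: a 4 x 5 inch negative scanned at 1000 ppi.
print(pixel_dimensions(4, 5, 1000))

# Enlarging the 5-inch side to 20 inches at an assumed 300 ppi output.
print(required_scan_ppi(5, 20, 300))
```

Working backward from the largest intended output in this way gives a lower bound on the scan resolution for the master file.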
To find the equivalent number of pixels that describe the information content of a specific photographic emulsion is not a straightforward process. Format of the original, film grain, film resolution, exposure, and processing techniques have to be taken into consideration to accurately determine the actual information content of a specific picture.
The best measure of detail and resolution is the Modulation Transfer Function (MTF). The MTF is a graphical representation of image quality that eliminates the need for decision making by the observer.
Noise
Noise refers to random signal variations associated with detection and reproduction systems. In conventional photography, noise in an image is the graininess that can be perceived; it is seen most easily in areas of uniform density. Noise is an important attribute of electronic imaging systems, and standardization will assist users and manufacturers in determining the quality of images being produced by these systems. Test results for noise are twofold. First, the noise level of the system can be determined, indicating how many bit levels of the image data are actually useful. Second, for image quality considerations, the signal-to-noise (S/N) ratio is the important factor to know. S/N evaluations show the effect of random noise on scan quality. Random noise, rather than bit depth, is the primary limiting factor of the tonal resolution of the scanner. The test can consist of scanning a grayscale target twice and subtracting one scan from the other. The subtraction should remove all non-noise components (that is, the image information), so the standard deviation of the result is a good measure of random noise (Gann, 1999).
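The two-scan noise test can be sketched in a few lines of Python. This is a minimal illustration rather than a calibrated procedure: the function name noise_from_two_scans and the pixel values are hypothetical, and the two scans are assumed to be registered pixel for pixel. Because independent noise in two scans adds in quadrature, the sketch also divides the result by the square root of 2 to estimate the noise of a single scan.

```python
import math
import statistics

def noise_from_two_scans(scan_a, scan_b):
    """Estimate random noise from two scans of the same grayscale target.

    scan_a, scan_b: equal-length sequences of pixel values (hypothetical
    data, assumed to be registered pixel for pixel).
    Returns (std_of_difference, estimated_single_scan_noise).
    """
    if len(scan_a) != len(scan_b):
        raise ValueError("scans must have the same number of pixels")
    # Subtracting removes the image content common to both scans,
    # leaving only the random variation between them.
    diff = [a - b for a, b in zip(scan_a, scan_b)]
    std_diff = statistics.pstdev(diff)
    # Independent noise in two scans adds in quadrature, so divide by sqrt(2).
    return std_diff, std_diff / math.sqrt(2)

# Made-up pixel values from one uniform gray patch, scanned twice.
first = [128, 130, 127, 129, 128, 131, 126, 129]
second = [129, 128, 128, 130, 127, 129, 128, 128]
std_diff, per_scan = noise_from_two_scans(first, second)
print(f"std of difference: {std_diff:.2f}, per-scan noise: {per_scan:.2f}")
```

In practice the comparison would be run on each patch of the grayscale target, since noise often varies with density.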
Costs
Budget Items
There are a variety of costs to consider (Puglia, 1999):
- Selection
- Preparation
- Cataloging/Description/Indexing
- Preservation/Conservation
- Production of Intermediates
- Digitization
- Quality Control of Images and Metadata
- Network Infrastructure
- Ongoing Costs of Maintenance of Images and Data
Initial Costs
Digital conversion accounts for approximately one-third of the initial costs. Other costs, primarily those connected to cataloging, administration and quality control, account for the remaining two-thirds.
Ongoing Costs
Often, project planning and budgeting stop after the creation of the digital assets. However, an important part of the budget involves the costs for refreshing and migration and for the support of systems. All of this can be put together under the umbrella of digital asset management. It is difficult at this point in time to come up with exact numbers for this process. However, since both the archival community and the graphic industry are taking this approach, more and more real numbers will be available soon. Currently, it is estimated that 5% to 10% of the initial costs per image must be budgeted on a yearly basis to maintain the images into the future, even though migration and file conversion are not done on a yearly basis.
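The rules of thumb for initial and ongoing costs can be combined into a rough planning calculation. The sketch below is only an illustration: the function name project_budget, the 30,000 conversion figure, and the ten-year horizon are hypothetical inputs, and the one-third share and the 5% to 10% yearly range are taken from the estimates quoted in this section.

```python
def project_budget(conversion_cost, years, upkeep_low=0.05, upkeep_high=0.10):
    """Rough budget from the rules of thumb in the text.

    conversion_cost: cost of digital conversion alone (hypothetical input).
    years: planning horizon for maintenance.
    Assumes conversion is about one-third of initial costs and that
    5% to 10% of initial costs is needed each year for upkeep.
    """
    initial = conversion_cost * 3        # conversion is ~1/3 of initial costs
    other = initial - conversion_cost    # cataloging, administration, QC, ...
    upkeep_range = (initial * upkeep_low * years,
                    initial * upkeep_high * years)
    return {"initial": initial, "other": other, "upkeep": upkeep_range}

budget = project_budget(conversion_cost=30_000, years=10)
print(budget)
```

At the upper estimate, ten years of upkeep costs as much as the entire initial project, which is why budgeting should not stop at the creation of the digital assets.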
In-house vs. Outsourcing
Many pilot projects with image collections have been used to build up an in-house scanning facility. Although this is feasible for a small project, in many cases it will be better, and indeed necessary, to establish a good relationship with a vendor and outsource the whole imaging process. Even this approach requires a good knowledge of the imaging process, because all the parameters for imaging and building the system will have to be established by the institutions themselves. As the chapter on vendor relations emphasizes, it is very important to establish a good relationship with the vendor.
Conclusions
Many of the problems arising from the need to scan for an unknown future use are not yet solved, and there is a great deal of uncertainty about how to proceed. Those responsible for some of the large digital reformatting projects report the same problem: Rapid changes in technology make it difficult to choose the best time to set up a reformatting policy that will not be outdated tomorrow.
The lack of communication between the technical field and institutions remains a formidable obstacle. Both institutions and industry are interested in a dialogue, but there is no common language. It cannot be emphasized enough that if institutions fail to communicate their needs to the hardware and software industries, they will not obtain the tools they need for their special applications. Archives and libraries should know that they are involved in creating the new standards. Today, it seems that whoever is first on the market with a new product is creating a de facto standard for competitors. Furthermore, time to create new standards is very short; industry will not wait years to introduce a product simply because people cannot agree on a certain issue.
A digital project cannot be looked at as a linear process in which one task follows another. Rather, it must be viewed as a complex structure of interrelated tasks in which each decision has an influence on another one. The first step in penetrating this complex structure is to thoroughly understand each single step and find metrics to quantify it. Once this is done, the separate entities can be put together in context. We are still in the first round of this process, but with the benefit of experience gathered from various digital projects, we are reaching the point where we can look at the complex system as a whole.
Sources
Arms, Caroline, ed. Enabling Access in Digital Libraries. Washington, DC: Council on Library and Information Resources, 1999.
Ester, Michael. Digital Image Collections: Issues and Practice. Washington, DC: Commission on Preservation and Access, 1996.
Frey, Franziska. "Digital Imaging for Photographic Collections: Foundations for Technical Standards," RLG DigiNews (December 1997). [Online] http://www.rlg.org/preserv/diginews/diginews3.html#com
Frey, Franziska and James Reilly. Digital Imaging for Photographic Collections: Foundations for Technical Standards (November 1999). [Online] www.rit.edu/~661www1/sub_pages/frameset2.html
Frey, Franziska and Sabine Süsstrunk. "Color Issues to Consider in Pictorial Image Data Bases." Proceedings IS&T's Fifth Color Imaging Conference, pp. 112-15. Scottsdale, AZ, November 17-20, 1997.
------. "Image Quality Issues for the Digitization of Photographic Collections." Proceedings IS&T's 49th Annual Conference, pp. 349-53. Minneapolis, MN, May 19-24, 1996.
Gann, Robert G. Desktop Scanners. Upper Saddle River, NJ: Prentice Hall, 1999.
Holm, Jack. "Factors to Consider in Pictorial Digital Image Processing," Proceedings IS&T's 49th Annual Conference, pp. 298-304. Minneapolis, MN, May 19-24, 1996.
------. "Survey of Developing Electronic Photography Standards." Standards for Electronic Imaging Technologies, Devices, and Systems, SPIE, Critical Reviews of Optical Science and Technology Series 61 (1996): 120-52.
Puglia, Steve. "The Costs of Digital Imaging," RLG DigiNews (October 1999). [Online] http://www.rlg.org/preserv/diginews
Reilly, James. Care and Identification of 19th-Century Photographic Prints. Rochester, NY: Eastman Kodak Publication, 1986.
Stephenson, Christie and Patricia McClung, eds. Delivering Digital Images--Cultural Heritage Resources for Education. Los Angeles, CA: The Getty Information Institute, 1998.
Süsstrunk, Sabine, Robert Buckley and Steve Sven. "Standard RGB Color Spaces," Proceedings IS&T's Seventh Color Imaging Conference (November 1999), Scottsdale, AZ.
3. An OCR Case Study
Eileen Gifford Fenton
JSTOR, University of Michigan
What is OCR?
Optical character recognition, or OCR, is the process that converts the text of a printed page to a digital file. This is accomplished by using an OCR software package to process a digital image of the printed page. The software first analyzes the layout of text on the page and divides the text into zones that usually correspond approximately to paragraphs. Next, the order of the paragraphs is determined and then the analysis of the characters begins. Most OCR applications work by looking at character groups, i.e., words, and comparing these to a dictionary included with the application. When a match is found, the software prints the appropriate word to the text file; when a match cannot be made confidently, the software makes a reasonable assumption and flags the word as a low confidence output. Where a word or character cannot be read at all, the default character for illegible text is inserted as a placeholder.
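The dictionary-comparison step described above can be illustrated with a toy sketch. It does not reproduce how any particular OCR product works: the word list, the 0.8 similarity cutoff, the "~" placeholder, and the function name resolve_word are all assumptions made for illustration, and Python's standard difflib module stands in for the recognizer's own confidence measure.

```python
import difflib

DICTIONARY = ["optical", "character", "recognition", "printed", "page"]
ILLEGIBLE = "~"  # hypothetical placeholder for unreadable text

def resolve_word(candidate, dictionary=DICTIONARY, cutoff=0.8):
    """Return (output_word, low_confidence_flag) for one recognized word."""
    if not candidate:
        return ILLEGIBLE, True       # nothing could be read at all
    word = candidate.lower()
    if word in dictionary:
        return word, False           # confident dictionary match
    # No exact match: take the closest dictionary entry, if any is close
    # enough, and flag the result for manual review.
    close = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    if close:
        return close[0], True
    return word, True                # keep the raw reading, but flag it

for raw in ["printed", "recogniti0n", "", "xqzv"]:
    print(repr(raw), "->", resolve_word(raw))
```

The flag is what makes later manual review practical: a reviewer can go straight to the low-confidence words instead of rereading every page.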
OCR is an effective means to read modern typeface captured in high quality page images. Though OCR software has improved significantly over the last decade, OCR does not yet deal effectively with non-Arabic characters or nonmodern type and frequently struggles to translate small print, certain fonts, and complex page layouts. The accuracy of OCR packages varies widely among applications and across different source materials.
JSTOR and OCR
JSTOR, an independent not-for-profit organization headquartered in New York, NY, has undertaken the large-scale task of converting and maintaining digital versions of journal backfiles and developing access tools that allow searching of both full text and indexed components within each issue. To date, JSTOR has converted over 4 million pages from over 100 journal titles. Over 500 academic libraries in North America and abroad have signed on as institutional participants.
JSTOR began digitizing journal back runs in the fall of 1994 with only minimal staff devoted to production activities. Since those early days both productivity levels and staffing have increased. Currently, JSTOR prepares approximately 200,000-250,000 new pages for digitization each month. The JSTOR production staff has grown to a group of 20 distributed between operations at the University of Michigan and Princeton University. Several other units at JSTOR including Library Relations, Publisher Relations, User Services, Technology Support and Development, and an administrative group complement the work of the production group.
Each journal page digitized by JSTOR is processed by an OCR application, and the resulting text files are used to support the full text searching offered to JSTOR users. In order to ensure that search results are as accurate as possible, each OCR text file is manually reviewed and corrected to a targeted accuracy level prior to being added to the database. Eliminating this manual review could reduce production costs. However, it has proven to be an essential step for assuring both the overall quality of the database and the accuracy of scholars' full-text searches.
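One common way to express a targeted accuracy level is character accuracy: the share of characters in a corrected reference text that the OCR output reproduces, counting each insertion, deletion, or substitution as one error. The sketch below is not JSTOR's procedure; the sample strings, the 99% target, and the function names are hypothetical, and the error count is the standard Levenshtein edit distance.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def character_accuracy(reference, ocr_text):
    """Share of reference characters reproduced correctly (can be negative
    if the OCR output contains many spurious insertions)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1 - edit_distance(reference, ocr_text) / len(reference)

TARGET = 0.99  # hypothetical accuracy target
reference = "optical character recognition"
ocr_text = "optica1 charactor recogniti0n"
accuracy = character_accuracy(reference, ocr_text)
print(f"{accuracy:.1%}", "meets target" if accuracy >= TARGET else "needs review")
```

Measuring a sample of pages this way before and after manual correction gives a concrete basis for deciding how much review a project can afford.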
Key Points When Considering OCR
Digital projects vary widely in content, aim, and scale, and OCR may not be the right solution for all. When considering OCR, it is useful to weigh the following.
1) Select technology that will enhance your ability to meet the objectives of the project.
If the project goal includes delivering converted text files to the user, you will want to think very carefully about using OCR. No OCR product is perfect. Text errors will be present in files displayed to users. As a result, you will want to thoughtfully determine the OCR accuracy level required to meet particular goals. If you are using the text files only to support searching, and they will not be displayed to the user, you may be able to tolerate lower accuracy. Decisions about accuracy should take into account the characteristics of the source material. Non-English text, mathematical or chemical symbols, and other special characters are not successfully translated by OCR applications, and their presence should be factored into your decision.
2) Scale matters -- a lot.
The appropriate approach for generating text files is affected dramatically as you move from a 20,000-page project to a 200,000-page project to a 2,000,000-page project, even if the goals of the projects are the same. Similarly, the costs generated by text file production also change dramatically with scale.
3) There is no right answer.
Solutions will be driven by the goal of the project. However, it is difficult to generalize from one project to another even when project goals may be similar. Very specific characteristics such as the nature and quality of the source materials, the available budget, and the time allotted for the project will significantly impact decisions.
4) Costs will be higher than you expect.
Even the most carefully planned projects including OCR will experience surprises. Initially selected software may not perform on actual data as it did on test data. You may find processing limitations in the full production phase that were hidden during the pilot phase. Expanding an application's dictionary to include specialized terms may prove to be more difficult than originally anticipated. Any number of unexpected developments may impact production timeframes and therefore budgets. It is helpful if an allowance for these unexpected developments can be built in from the beginning of the project.
5) The answer that is right for today may not be right in the future.
OCR software capabilities have developed significantly over the last five to ten years and improvements continue to be made. The dynamic nature of this technology means that projects of more than just a few months' duration may benefit by continuing to evaluate new products as they become available to determine if greater cost-benefit possibilities have developed.
Sources
Rice, Stephen V., George Nagy, and Thomas A. Nartker. Optical Character Recognition: An Illustrated Guide to the Frontier. Kluwer Academic Press: part of Kluwer International Series in Engineering and Computer Science Secs 501. 1999.
Until 1997, the Information Science Research Institute at the University of Nevada, Las Vegas, conducted an annual assessment of selected OCR products. Information on their Technology Assessment Program is available at http://www.isri.unlv.edu/info/technology/
Website: www.jstor.org Readers will find information about JSTOR's mission and history, a description of the contents of the database, and information on institutions participating in JSTOR's work. Also available is a description of our production process, technical information of general interest, and a link to a demonstration of the database.
4. Digitization of Maps and Other Oversize Documents
Janet Gertz
Columbia University Libraries
Paper maps (and other oversize documents such as architectural drawings) contain a wealth of fine details composed of graphic and textual elements. They include:
- The drawing of the location
- The use of graphic entities like elevation lines or symbols for cities of different sizes
- Printed names of countries and other features
- Color, which carries information through varied patterns and intensities.
When they are large, maps with fine details present special difficulties for digitization. There can be a huge disparity between the size of the document and the size of the smallest meaningful element that must be made visible online or in printouts. Fine detail requires high resolution scanning, and the result is very large file size. File manipulation, storage, delivery, and display all become much more complicated.
Even the mechanics of scanning are affected. Many flatbed scanners have size limitations and cannot handle large maps. Scanning may require film intermediaries such as 4x5 transparencies or single-frame microfiche, where the original object fills the body of the microfiche. Thirty-five mm slides are too small to fully capture details on large maps. When originals are not only oversized but also brittle, working from a film intermediary will put less strain on the fragile original. Some loss of quality will result because the film version is one generation removed from the original. However, fully legible images can be produced from film intermediaries, always, of course, given that the transparency or microfiche is itself carefully made and then scanned with sufficient resolution and appropriate tonality.
Scanning Parameters
Determining capture parameters follows the same rules as for other documents.
- Decide on the appropriate tonality, usually gray-scale or color. Color on most printed maps is important as a coding device, not for its precise hue as it is in art works. Nevertheless, a standard color bar should be included during scanning even when sophisticated color management is not a requirement. (For a discussion, see Ester, 1996.)
- Identify the smallest meaningful element, often a thin line.
- Determine how many pixels are needed to capture the smallest element legibly.
- Calculate the necessary resolution, often 200-300 dots per inch when scanning in 24-bit color. For a discussion of resolution and related issues, see Gertz et al. (1996) and Allen (1998).
- Whether the map is digitized directly or through a film intermediary, always include a ruler in the image so that dimensions and distances are unambiguous.
As an example, consider a hypothetical map two feet across and three feet long.
- The smallest textual elements are numbers less than 1 mm high that record elevations.
- The smallest meaningful elements are the thin lines used to indicate elevation.
- Ten different colors serve as codes, patterned as dots, parallel vertical and horizontal lines, and other graphic devices.
A scanned version of acceptable quality would permit users working on screen and with printouts to:
- Read the 1 mm text
- See unbroken elevation lines
- Clearly distinguish all color code patterns.
Assume a minimum of 200 dpi and 24-bit color is needed to achieve legibility of the 1-mm text and the lines and code patterns. For a map 36" wide, 200 dpi multiplies out to 7,200 dots across the surface of the map. If a film intermediary is used, then the effective resolution must be calculated as well. Effective resolution refers to resolution relative to the size of the original document. A transparency still requires 7,200 dots across the map to capture the same degree of detail. The map on the transparency is perhaps only 4" wide. It must be scanned at 1,800 dpi to get the same level of detail.
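The arithmetic in this example is easy to reproduce for other map sizes. The following sketch assumes nothing beyond the figures already given (a 36" map, 200 dpi, a 4" transparency, and 1 mm text); the function names are hypothetical.

```python
MM_PER_INCH = 25.4

def dots_across(original_width_in, dpi):
    """Total dots needed across the original at the chosen resolution."""
    return original_width_in * dpi

def intermediary_scan_dpi(original_width_in, target_dpi, image_width_in):
    """Scan resolution for a film intermediary that preserves the effective
    resolution (target_dpi) relative to the original document."""
    return dots_across(original_width_in, target_dpi) / image_width_in

def pixels_per_element(element_mm, dpi):
    """Pixels spanning the smallest meaningful element at a given dpi."""
    return element_mm / MM_PER_INCH * dpi

print(dots_across(36, 200))                   # dots across the 36" map
print(intermediary_scan_dpi(36, 200, 4))      # dpi for the 4" transparency
print(round(pixels_per_element(1, 200), 1))   # pixels covering 1 mm text
```

At 200 dpi the 1 mm elevation numbers span just under 8 pixels, which is worth checking against sample scans before settling on a resolution.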
To calculate the file size, use the formula given in Kenney and Chapman (1996), p. 20.
formula: (height x width x bit depth x dpi²) / 8
original map: (24" x 36" x 24 x 200²) / 8 = 103,680,000 bytes
transparency: (2.667" x 4" x 24 x 1800²) / 8 = 103,680,000 bytes
The product of digitizing oversize documents is clearly a series of very large files. This has implications for the image creator in terms of storage, retrieval, and display. Large files take up a great deal of storage space. Enough memory must be available for images to be loaded and manipulated. Backing up files, creating derivatives, and transmitting files absorb a significant amount of time and storage media.
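The file-size formula can be checked with a short script. This is a sketch of the uncompressed size only (headers and compression are ignored); the function name is hypothetical, and the transparency height is written as 8/3 inches, which is the exact value behind the rounded 2.667".

```python
def uncompressed_size_bytes(height_in, width_in, bit_depth, dpi):
    """Uncompressed image size: (height x width x bit depth x dpi^2) / 8."""
    bits = height_in * width_in * bit_depth * dpi ** 2
    return bits / 8

original = uncompressed_size_bytes(24, 36, 24, 200)
transparency = uncompressed_size_bytes(8 / 3, 4, 24, 1800)  # 2.667" x 4"
print(f"{original:,.0f} bytes; {transparency:,.0f} bytes")
```

Both routes give the same size because the intermediary is scanned at a proportionally higher resolution; the number of pixels describing the map does not change.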
Problems of Access to Scanned Large Maps
The nature of these files also translates into problems for users trying to access and navigate within digital images of large maps.
- The high-resolution image in which all of the details are visible is too big for users to access or manipulate easily, given current delivery mechanisms and the capacity of common computers.
- When derivatives of the original high-resolution files are provided for access, they are often JPEG versions with considerably reduced resolution. If the resolution is low enough to make files easy to access, the finer details in the images may become illegible.
- Only part of the map image fits on screen at one time. When using the paper document, readers orient themselves to salient features through peripheral vision while focusing closely on details. On screen, it is easy to become disoriented because most of the image is not visible.
- With a paper map, it takes a single glance to follow features such as roads or boundaries from one edge to the other, but on screen it takes continued scrolling. Comparing widely separated details becomes awkward at best if they are not visible simultaneously.
Benefits of Scanning Large Maps
Despite these difficulties, there are a variety of ways to benefit from scanning large maps.
- Use the high-resolution images to produce high quality printouts to replace brittle originals.
- Derive lower resolution versions from the high-resolution master images to serve as reference-quality images and reduce unnecessary handling of brittle originals.
- Put the images on CDs and view them directly rather than trying to deliver them over a network. The workstation must be capable of handling the large files.
- Scan large maps in sections to generate a group of high-resolution files of manageable size. This entails use of software packages for managing the separate files and concatenating them as the user moves from one to the next. It also can complicate the creation of high-resolution printouts.
- Investigate some of the new compression software that permits the user to access a lower resolution image and then zoom into higher resolutions without manipulating the whole high-resolution file locally. One such product is Lizardtech's Multi-Resolution Seamless Image Database (see http://www.lizardtech.com); a number of other packages are available.
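For the option of scanning in sections, a rough planning calculation shows how many sections a map needs if each file is to stay under a chosen size. The sketch below is illustrative only: the 20,000,000-byte limit and the function name section_grid are hypothetical, sizes are uncompressed, and no overlap between neighboring sections is allowed for, although real projects usually need some.

```python
import math

def section_grid(height_in, width_in, dpi, bit_depth, max_bytes):
    """Near-square grid (rows, cols) whose sections each stay under max_bytes
    when stored uncompressed. Overlap between sections is ignored."""
    total = height_in * width_in * bit_depth * dpi ** 2 / 8
    sections_needed = math.ceil(total / max_bytes)
    long_side = math.ceil(math.sqrt(sections_needed))
    short_side = math.ceil(sections_needed / long_side)
    # Put the larger count along the longer edge of the map.
    if width_in >= height_in:
        rows, cols = short_side, long_side
    else:
        rows, cols = long_side, short_side
    return rows, cols, total / (rows * cols)

rows, cols, per_section = section_grid(24, 36, 200, 24, 20_000_000)
print(f"{rows} x {cols} sections, {per_section:,.0f} bytes each")
```

The number of sections also determines how many files the viewing software must manage and join, so it is worth estimating before scanning begins.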
In Conclusion
- The size of the original, in proportion to the size of the smallest meaningful element, determines the needed resolution.
- File size governs the ability to store, retrieve, and display an image.
- Excellent images will fail to satisfy users if they cannot be accessed or if equipment and software are not well suited to working with large images. Speed and smoothness of scrolling and zooming are important.
- Planning the user interface must be part of initial project design.
To view a selection of approaches to scanned maps, see:
- American Memory project: http://memory.loc.gov/ammem/gmdhtml/gmddigit.html
- British Columbia Archives and Records Service: http://www.bcars.gs.gov.bc.ca/cartogr/general/maps/html
- Library of Virginia: http://image.vtls.com/BPW
- University of Connecticut: http://magic.lib.uconn.edu/magic/exhibits/
- Atlantic Neptune: http://mercator.cogs.nscc.ns.ca/neptune.html
- National Oceanic and Atmospheric Administration: http://chartmaker.ncd.noaa.gov/ocs/text/MAP-COLL.htm/
- David Rumsey Associates: http://www.davidramsey.com
Sources
Allen, David. "Creating and Distributing High Resolution Cartographic Images," RLG DigiNews 2:4, 1998. http://lyra.rlg.org/preserv/diginews/diginews2-4.html#feature
Ester, Michael. Digital Image Collections: Issues and Practice. Washington, DC: Commission on Preservation and Access, 1996.
Gertz, Janet, Robert Cartolano, and Susan Klimley. Oversize Color Images Project, Columbia University, 1996. http://www.columbia.edu/dlc/nysmb/
Kenney, Anne, and Stephen Chapman. Digital Imaging for Libraries and Archives. Ithaca, NY: Department of Preservation and Conservation, Cornell University Library, 1996.
5. Working with Microfilm
Paul Conway
Yale University Library
Preservation microfilm can be an excellent source-medium for digital conversion projects if certain caveats are taken into consideration. This section describes what librarians and archivists need to know about working with existing microfilm to produce high-quality digital images that can be displayed as images and/or processed with OCR conversion software.
Background -- Project Open Book
Microfilm has been used as a medium for preservation and access since the 1930s. By the middle of the 1980s, international standards fully defined the archival qualities of preservation microfilm (Fox, 1996). The Research Libraries Group, working in close association with the American Library Association, established procedures for creating film that meets or exceeds archival standards (Elkington, 1992). By the end of 1999, the National Endowment for the Humanities had provided partial support for the preservation of over 800,000 brittle volumes on microfilm. The nation's collection of preservation microfilm is the first and one of the largest virtual libraries in the world (Conway, Selecting, 1996).
In the early 1990s, Don Willis, one of the most prominent experts on the creation of preservation microfilm, proposed that it was technologically and economically feasible to create high-quality digital images from microfilm (Willis, 1992). At the time he wrote, few people outside the commercial sector -- and no U.S. archivists or librarians -- were in a position to test the hypotheses that Willis proposed. The conversion of microfilm was largely confined to corporations that needed to convert legacy files from microfilm (typically, case files and standard office documents) on a highly selective basis. What was needed was a systematic exploration of the issues associated with tapping the content of hundreds of thousands of brittle books, newspapers, and serials preserved on 35 millimeter microfilm now housed in research libraries and archives around the country. If it proved feasible to obtain high quality images at a reasonable cost from the nation's corpus of preservation film, then this material could be added to what was then expected to be a national digital image resource.
Yale University Library, with the assistance of the Commission on Preservation and Access and the National Endowment for the Humanities, accepted the job of developing a sequence of projects, collectively termed Project Open Book, to test Don Willis's hypotheses (Waters, 1991). Yale designed and implemented Project Open Book in close association with Cornell University, which at the time was also deeply engaged in digital imaging R&D, using books as the principal conversion source. Yale adopted Cornell's recommendations for base line image quality and then went on to develop a complex cost study to test the underlying economic assumptions of the imaging process. Project Open Book defined the relationship between quality and cost. The project established rules of thumb for maximizing quality and baseline cost estimates for the microfilm conversion process (Conway, "Yale," 1996).
Since the Yale project has been completed, additional projects have contributed to the general microfilm-scanning knowledge base. Additionally, several service bureaus have begun offering conversion services to libraries and archives. These commercial organizations are able to meet or exceed quality expectations at a cost-per-image that is not as low as the benchmarks identified by Yale, yet still fairly cost effective. In 1999, the principal investigators of the Cornell and Yale projects pooled their knowledge of the hybrid approach and developed a set of recommendations for converting brittle books from either film or the original item (Chapman, Conway & Kenney, 1999). Together, these developments make it possible to recommend best practices for certain kinds of materials on film and to identify when microfilm is not the best source.
Image Quality Considerations
Image quality is the first concern. High contrast 35-mm microfilm, produced according to ANSI/AIIM specifications to a Quality Index level of at least 5 (on a scale of 1 to 8) has the equivalent digital resolution of at least 800 dots per inch (dpi). It is not yet possible (nor may it be necessary) to achieve this level of scanning across the full width of the 35 mm microfilm frame. High resolution scanning from microfilm varies from 300 to 600 dpi. Bit depth ranges from bitonal (1 bit per pixel) to full gray (8 bits per pixel). Scanners for color roll film (a relative rarity in libraries) are not available commercially, although such technology is an important part of the movie industry (Kenney & Chapman, 1996).
Because of the high risk of damage, master microfilm negatives (1N) should never be used as a scanning source. Research at Yale and in Germany has shown that the same level of image quality can be obtained from a duplicate negative (2N) without placing the master negative in jeopardy (Weber, 1997). If only a positive use copy (3P) is available, it is possible to obtain a readable digital image, although some detectable drop-off in image quality should be expected.
Characteristics of the original source document and characteristics of the microfilm medium strongly influence the quality of the individual images and the total image product. Here are some highlights.
Characteristics of the Original Source
(e.g., book, document, print, map)
- High contrast between text (ink) and surface (paper) yields best results
- Discolored, damaged, uneven edges of paper complicate scanner setup
- Bleed-through of text from verso limits threshold options
- Foxing, mold, stains, and fire and water damage may be accentuated by scanning
- Tight gutters in bound volumes distort film and digital imagery unless corrected
- Fold-outs and oversize inserts may not be represented in digital form as accurately as the baseline document (in-line modifications to scanner settings are required)
Characteristics of the Microfilm
Image Quality
- Polarity: negative microfilm yields higher quality images than positive film
- Density: medium contrast (dMax ca. .90) to high contrast (dMax ca. 1.30) film results in higher quality images than low contrast (dMax ca. .80) negatives. RLG dMin guideline (< .25) holds.
- Reduction ratio: lower is better; accurate recording of ratio is crucial for reproduction at original size
- Skew: minimize or eliminate -- no greater than 5 degrees from parallel
Product Quality
- Consistent placement: minimize or eliminate centerline weaving
- Duplicate images: duplicate images bracketing illustrations have minimal impact
- Splices: eliminate splices inside a given volume on the reel
- Dimensions of original: record accurately on bibliographic target
- Blank frames: eliminate or reduce quantity wherever possible
- Orientation: A2 position provides most consistent product with some scanners; one full frame per image is generally preferable.
- Test charts: incorporate RIT Alphanumeric Test Chart into scanner setup routine
Bottom Line on the Quality of Bitonal Scanning
- Nature, quality, and value of complex illustrations determine the appropriateness of bitonal scanning; if illustrations are vital and complex, then bitonal scanning may not be appropriate.
- Crispness of text (printed or hand-written) is essential for legibility of the digital image.
- No appreciable improvements occur in image quality with continuous tone film scanned in bitonal mode
Conversion Cost Issues
Imaging costs are driven by scanner pricing structures, labor costs, and the overall speed of the conversion system. It is somewhat difficult to compare scanner speeds by studying manufacturer specifications. The throughput speed of a given scanner is a product of at least three factors:
- Image resolution (the lower the resolution the faster the output)
- Electrical engineering (fast refresh rate of the CCD array and fast data transfer rate equals fast output)
- Mechanical engineering (more rigorous film transport mechanisms provide for quicker throughput).
In its complex study, Project Open Book examined the cost of the imaging process in terms of equipment and human resources (Conway, D-Lib Magazine, 1996). The cost model factored in the actual costs of hardware, software, integration support, and optical storage media and then converted these costs to a range of per-book and per-image costs. Most importantly, the Yale study assessed costs for each of the processing steps of the conversion process.
The Yale study identified a number of factors that contribute to variation in costs, including the following:
- The impact of original source and microfilm characteristics varies among process steps.
- Most time-consuming conversion steps (scanning in continuous mode, indexing, scanner setup, and file transfer) are not greatly influenced by original source or microfilm factors.
- Original source characteristics influence costs more than microfilm characteristics.
- Original source and microfilm characteristics, combined, have dramatic impact on quality but only marginal impact on costs.
- Pre-scan inspection of microfilm (a relatively inexpensive processing step) is an important mechanism for predicting quality control challenges, but is not sufficient for identifying significant scanning and indexing complexities that arise during the conversion process.
Characteristics of the Original Source
(e.g., book, document, print, map)
- Characteristics of the original source that have a large impact on quality (e.g., faded text, bleed through) have little impact on the cost of digital conversion.
- The number of pages in the chunk of material being scanned has a significant financial impact on all conversion processes.
- Books without tables of contents or page numbers pose significant indexing challenges (and thus higher costs), but also complicate prescan inspection and all aspects of quality control.
- The presence of illustrations is only one of many factors that combine to explain variation in the cost of the most time-consuming processing steps.
- The costs of quality control processes carried out during scanning, indexing, and final acceptance are strongly influenced by original source characteristics (e.g., tight gutter margins, cropped text, illustrations).
- Preparation of a bound volume prior to microfilming (e.g., disbinding, careful cropping) can significantly reduce the cost of setting scanner parameters.
Characteristics of the Microfilm
- Reduction ratio is the single most important microfilm characteristic influencing costs. The smaller the ratio the lower the conversion cost.
- Skewed microfilm images, an all-too-common factor, increase the cost of scanning, quality control, and inspection.
- Splices inside a given volume influence the cost of several important steps, but occur too infrequently to matter much.
- The cost-per-item of scanner set up is not influenced by any characteristics of microfilm.
- Density variation has no impact on the cost of conversion.
- Investment in better quality microfilm has only marginal cost-reduction benefits.
Service Bureaus
Vendors can do the hard work. It is not necessary to purchase microfilm scanning hardware and software for in-house use in order to accomplish the conversion of microfilm. A number of companies in the United States offer conversion services, including:
- Preservation Resources of Bethlehem, PA <http://www.oclc.org/oclc/presres/index.htm>
- Northern Micrographics of La Crosse, WI <http://normicro.com>
- microMedia Imaging Systems, Inc. of Lake Success, NY.
Sources of information on service bureaus include Imaging Magazine <http://www.imagingmagazine.com> and the Association for Information and Image Management (AIIM) <http://www.aiim.org>. You must be a member ($125 individual) to take advantage of AIIM's excellent library and referral services.
It is very important to test the products (deliverables) of a service bureau before finalizing a contract. Most service bureaus will conduct scanning tests for free or for a modest fee as part of a competitive bidding process. It is your responsibility to specify the quality level of the digital images in terms of resolution, dynamic range (bit depth), and postscan image processing (e.g., deskew, despeckle, and tone adjustment). It is also your responsibility to specify whether it is acceptable for the vendor to use equipment that uses synthetic resolution tools to offset the resolution limitations of the equipment. Finally, it is also your responsibility to specify the characteristics of the output files in terms of file format, naming conventions and directory structures, and delivery mechanism (e.g., CD-ROM, FTP server, magnetic tape).
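One way to make such a specification testable is to write it down as data and check sample deliverables against it. The following Python sketch assumes a hypothetical naming convention (volume identifier, underscore, four-digit sequence number, file extension) and hypothetical quality values; none of these come from a particular vendor or standard, and the image properties are assumed to have been read from the delivered files by some other tool.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DeliverableSpec:
    """What the contract asks for (all default values are hypothetical)."""
    resolution_dpi: int = 600
    bit_depth: int = 1                       # 1 = bitonal, 8 = grayscale
    file_format: str = "tif"                 # expected file extension
    allow_synthetic_resolution: bool = False


def find_problems(spec: DeliverableSpec, image: Dict[str, object]) -> List[str]:
    """Return a list of ways one delivered image departs from the spec."""
    problems = []
    # Assumed convention: <volume id>_<4-digit sequence>.<extension>
    pattern = rf"[a-z0-9]+_\d{{4}}\.{re.escape(spec.file_format)}"
    if not re.fullmatch(pattern, str(image["name"])):
        problems.append("file name does not follow the naming convention")
    if image["resolution_dpi"] != spec.resolution_dpi:
        problems.append("resolution differs from the specification")
    if image["bit_depth"] != spec.bit_depth:
        problems.append("bit depth differs from the specification")
    if image["synthetic_resolution"] and not spec.allow_synthetic_resolution:
        problems.append("synthetic resolution was used but is not permitted")
    return problems


spec = DeliverableSpec()
good = {"name": "vol0012_0001.tif", "resolution_dpi": 600,
        "bit_depth": 1, "synthetic_resolution": False}
bad = {"name": "Page 1.TIF", "resolution_dpi": 400,
       "bit_depth": 1, "synthetic_resolution": True}

print(find_problems(spec, good))
for problem in find_problems(spec, bad):
    print("-", problem)
```

Running a check like this over the vendor's free or low-cost test batch gives both parties an unambiguous list of discrepancies to resolve before the contract is signed.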
Equipment Options
If you are working with a contractor to accomplish your imaging goals, it will not be necessary to purchase scanning equipment. Nevertheless, you can and should learn as much as you can about the capabilities of scanning equipment by contacting the manufacturers of hardware and software systems.
Hardware/software capabilities must be understood in order to develop quality and cost specifications, regardless of whether a scanning program is carried out in the library. Scanning results will vary across machines, however, depending on how the software for a given machine defines the thresholds (analogous to contrast settings on a photocopier), sets the various filter options, and applies various algorithms that interpret and adjust pixel encoding. The more that is known about how the scanner interprets and codes what it sees, the better the resulting images. Ultimately, quality specifications, technology capabilities, and the visual characteristics of the original source combine to determine the quality and cost of the image product.
The following five companies either manufacture or resell four systems for microfilm scanning in the United States. In general, hardware and software are bundled as a single package. The amount of customization that can be specified by the buyer for either hardware or software varies from none (Minolta) to extensive (Amitech). The amount of end-user control over the equipment also varies widely. It is important to view and test equipment in real-world settings before purchasing equipment. The best way to undertake this testing is to ask hardware companies for a short list of client-references in the area and then contact these references directly.
Amitech Corporation <http://www.amitech.com>
5501 Backlick Road
Suite 200
Springfield, VA 22151
Phone: 703-256-2020 Fax: 703-256-9153
Amitech resells three of the four microfilm scanners (Mekel, SunRise, Wicks & Wilson) that are presently available and also provides a variety of customizable software packages that control the scanner operation and carry out various postscan data management tasks (e.g., deskew, despeckle, compression).
Mekel Engineering, Inc. <http://www.mekel.com>
2800 Saturn Street, Suite B
Brea, CA 92821-6201
Phone: 714-996-5600 Fax: 714-996-5696
The Mekel M500 is the premier high-speed microfilm conversion product. It is capable of handling 35-mm or 16-mm roll film. The Mekel M560 is the associated hardware for fiche scanning.
Minolta Corporation <http://www.minoltausa.com/eprise/main/MinoltaUSA/MUSAContent/index.htm>
101 Williams Drive
Ramsey, NJ 07446
Phone: 800-964-6658
Minolta manufactures the MS 3000 Microform Scanner, which can handle a full suite of formats if the transport mechanism is changed. The scanner is highly automated and provides limited operator flexibility.
SunRise Imaging, Inc. <http://www.sunriseimaging.com>
1250 N. Tustin
Anaheim, CA 92807
Phone: 714-632-2160 Fax: 714-632-2161
The SunRise ProScan III is the most complex and comprehensive microfilm scanner on the market. It converts in both bitonal and grayscale modes and can handle a variety of formats depending on the configuration of the film support mechanisms.
Wicks & Wilson, Inc. <http://www.amitech.com>
Morse Road, Basingstoke
Hampshire RG22 6PQ, England
Phone: 011 44 1256 842211
The Wicks & Wilson 4000 and 4001 Scanstations are the newest arrivals to the U.S. market. They are manufactured in England by a company that specializes in high-tech imaging applications, such as virtual reality gloves. At publication time, the Wicks & Wilson machines are available only through Amitech. The manufacturer claims high-resolution scanning and ease of use as key features.
Further Research Needed
Research needs to be done to certify the feasibility of converting nonbook materials, especially newspapers and manuscripts. Additionally, the challenges of working with microfilm that has not been created with rigorous archival standards are not well understood, including:
- Older film
- Scratched or damaged film
- 16 mm film
- Continuous tone film
- Positive polarity film
- Third generation film.
Conclusion
In the past decade, microfilm-scanning technology has matured to the point where you have distinct options regarding hardware and software capabilities, as well as choices about the quality of the end products and the cost of the technology. Quality is increasing; per-image costs are declining. You should have confidence that the digital image conversion of primarily text-based materials from preservation microfilm is both technically feasible and economically competitive with other conversion technologies.
Sources
Chapman, Stephen, Paul Conway, and Anne R. Kenney. Digital Imaging and Preservation Microfilm: The Future of the Hybrid Approach for the Preservation of Brittle Books. Washington, DC: Council on Library and Information Resources, 1999. [Online] Available: http://www.clir.org/programs/cpa/hybridintro.html#description
Conway, Paul. "Selecting Microfilm for Digital Preservation: A Case Study from Project Open Book." Library Resources & Technical Services 40 (January 1996): 67-77.
-----. "Yale University Library's Project Open Book: Preliminary Research Findings." D-Lib Magazine (February 1996). [Online]. Available: http://www.dlib.org/magazine.html
-----. Conversion of Microfilm to Digital Imagery: A Demonstration Project. New Haven, CT: Yale University Library, 1996.
Conway, Paul and Shari Weaver. The Setup Phase of Project Open Book. Washington, DC: Commission on Preservation and Access, June 1994. [Online]. Available: http://www.clir.org/pubs/reports/conway/conway.html.
Elkington, Nancy, ed. RLG Preservation Microfilming Handbook. Mountain View, CA: The Research Libraries Group, Inc., 1992.
Fox, Lisa, ed. Preservation Microfilming: A Guide for Librarians & Archivists, 2nd ed. Chicago: American Library Association, 1996.
Kenney, Anne R. and Stephen Chapman. Digital Imaging for Libraries and Archives. Ithaca, NY: Cornell University Library, 1996.
Waters, Donald J. From Microfilm to Digital Imagery. Washington, DC: Commission on Preservation and Access, June 1991.
Waters, Donald J. and Shari Weaver. The Organizational Phase of Project Open Book. Washington, DC: Commission on Preservation and Access, September 1992. http://www.clir.org/pubs/reports/openbook/openbook.html/openbook.html
Weber, Hartmut and Marianne Dorr. Digitization as a Method of Preservation? Final Report of a Working Group of the Deutsche Forschungsgemeinschaft. Washington, DC and Amsterdam: Commission on Preservation and Access and European Commission on Preservation and Access, 1997.
Willis, Don. A Hybrid Systems Approach to Preservation of Printed Materials. Washington, DC: Commission on Preservation and Access, 1992. http://www.clir.org/pubs/reports/willis/index.html
6. Cooperative Imaging: Scans Well with Others
Steven D. Smith
Amigos Library Services, Inc.
Digital imaging technology can assist libraries, archives, and museums in achieving a level of cooperation never before possible. Institutions traditionally have cooperated in filling voids within local collections -- microfilming archives and offering them for sale, supplying missing journal issues, and, most obviously, participating in interlibrary loan. However, digital imaging offers the ability to create virtual collections from items held at a number of geographically dispersed institutions. It also enables a single network interface, giving researchers access to materials without concern for their physical location. Cooperative projects using digital imaging also can link primary source materials with secondary resources to provide users with a strong collection capable of satisfying the requirements of all but the deepest research.
What is Cooperative Digital Imaging?
Cooperative imaging can take a number of forms. At their most basic, cooperative projects have consisted of institutions pooling resources to purchase one or more imaging workstations for use by all participants, or using their aggregate buying power to secure lower per-image conversion costs from service bureaus. Another possibility is for institutions to scan and network images independently but provide a single access point for all collections (a large-scale example is the Association of Research Libraries [ARL] Digital Image Database http://www.arl.org/did/).
The type of cooperation most often associated with digital imaging creates the virtual collections described above. Examples include Research Libraries Group's Studies in Scarlet project (http://www.rlg.org/scarlet/sis.html) and the Library of Congress' American Memory project (http://memory.loc.gov/ammem/). In both cases, lead organizations provided the leadership and guidelines (and even partial funding), and contribution of collections was opened to libraries and archives across the country. Although these examples represent the efforts of the large research libraries, the activity is open to libraries, museums, and historical societies with all sizes and types of collections. In fact, the advantages of cooperation for small institutions may be greater than for larger research libraries.
Why Cooperate?
The main and obvious reason for cooperation is to provide users with enhanced access to collections. But there are additional reasons that benefit the institutions themselves. Cooperation offers opportunities to:
- Share expertise
- Save costs on conversion
- Increase opportunities for funding
- Heighten visibility for the collections by linking with similar collections and to other institutions.
Perhaps the biggest selling point for smaller institutions is the ability to share expertise. Several institutions can work together to solve problems of converting paper- and film-based collections to a digital format and networking these collections, along with the attendant problems of cataloging and creating metadata.
How Does Cooperation on Digital Imaging Differ from Cooperative Microfilming Projects?
The biggest difference between microfilming and digital imaging projects is complexity. In addition, there are established procedures and standards for microfilming, whereas we are still learning about optimal digitizing methods (hence this publication). Preservation microfilming, while requiring the participation of selectors and catalogers, is largely an undertaking of preservation reformatting staff. Selector and cataloger expertise and involvement are certainly necessary, but these staff members are not asked to do anything out of the ordinary. Digital imaging requires much wider participation within every institution, drawing on a greater variety of staff and, in particular, on systems personnel.
In addition, digitization projects are not as fixed as microfilming, where the end product is essentially just cataloged and shelved. Imaging projects are not completed with the creation of digital images and their associated metadata (a complex issue in and of itself). The technical and administrative issues of networking and providing access are legion, and they must be considered and resolved before the first page hits the platen. They include the following questions:
- Will the images be available via the Internet?
- How will rights be managed?
- Will computer-searchable text be provided along with the images of textual items?
- Who is responsible for maintaining access to the images?
- Who owns the aggregate collection of digital images?
But digital imaging can result in a more useful end product than microfilming -- one that allows simultaneous access to collections by multiple users.
In addition to the benefits discussed above, cooperative projects increase the chances of obtaining outside funding, as many grant agencies have demonstrated a preference for coordinated, multi-institution projects. Cooperative projects may actually prove less expensive (i.e., more cost effective) on a per-image basis, as many of the costs relating to imaging are not so dependent on the number of images or participants and, if outsourcing, the conversion cost per image can be less.
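The per-image argument can be illustrated with simple arithmetic. In the Python sketch below, all figures are invented, and it assumes for simplicity that the fixed costs (for example, database design and the web interface) do not grow as partners join; in practice coordination adds some overhead, so the real saving would be smaller.

```python
def per_image_cost(fixed_costs: float, conversion_rate: float, images: int) -> float:
    """Fixed project costs spread over all images, plus the per-image conversion rate."""
    if images <= 0:
        raise ValueError("images must be positive")
    return fixed_costs / images + conversion_rate


# Hypothetical figures, chosen only to show the shape of the calculation.
FIXED_COSTS = 30_000.0   # costs that do not depend on the number of images
SOLO_RATE = 0.50         # vendor price per image for a small job
VOLUME_RATE = 0.40       # assumed lower vendor price for a pooled, larger job
IMAGES_EACH = 10_000     # images contributed by each institution
PARTNERS = 3

solo = per_image_cost(FIXED_COSTS, SOLO_RATE, IMAGES_EACH)
shared = per_image_cost(FIXED_COSTS, VOLUME_RATE, IMAGES_EACH * PARTNERS)

print(f"One institution alone: {solo:.2f} per image")
print(f"Three institutions together: {shared:.2f} per image")
```

Most of the difference in this example comes from spreading the fixed costs over three times as many images; the assumed volume discount from the vendor contributes the rest.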
From the standpoint of the user, cooperative projects are more likely to produce a desirable end product, both in terms of content (using the most relevant items from several collections) and form (benefiting from shared expertise, database design, intellectual access, web interface, and so forth). This is especially true for smaller institutions, whose collections can become more useful to the researcher when they pull together or work with larger institutions.
Concluding Thoughts -- or How NOT to Cooperate
There are many examples of successful cooperative imaging projects: Studies in Scarlet, American Memory, the Colorado Digitization Project (http://coloradodigital.coalliance.org/), and the various implementations of Making of America (see, for example, http://moa.cit.cornell.edu/MOA/moa-mission.html and http://sunsite.berkeley.edu/moa2/). Much can be learned from these examples, but it is also worth considering those projects that fail to get off the ground -- or to move from planning to implementation.
The most common causes of such failure include the reluctance to commit resources (especially staff, and especially staff with technical expertise), the desire to wait for industry standards to appear before moving forward, and the failure to define project objectives. Digital imaging is a resource-intensive activity and cannot be undertaken without the commitment of staff. The absence of formal standards is no reason to hesitate: although there are few standards relating to digitization, best practices and other guidelines are appearing.
Particularly difficult with cooperative projects is the last point: developing firm project objectives. Many institutions are interested in undertaking imaging because it is a hot activity. They are only able to state a project's purpose in vague and unmeasurable terms related to improved access. For a project to be successful, it must have firm, quantifiable objectives.
On a broader level, cooperative imaging projects fail because planners have yet to establish independently what role imaging will play within their own institution. Institutions need to confront the complexities of and the myriad issues raised by digital imaging -- networking, metadata applications, database creation and maintenance, rights, reference services for networked users -- before a project is planned. Although it is unlikely that such issues can be resolved before a project can begin, they must be understood by all parties before moving forward.
When all is said and done, a cooperative project may seem on the surface to complicate an already complicated activity. But cooperation offers the considerable advantage of bringing together a larger number of experts with greater and more varied knowledge and experience than a single institution could ever field, thus increasing the chances of success.
One final reminder: Although the examples cited above involve some of the largest libraries in the country, cooperative imaging is open to institutions of any size. In fact, smaller institutions have much to gain from cooperation and much to offer. In brittle books microfilming, the best or only copy of an item is often located outside of the participating research libraries. The same is true with digital projects. Small and medium-sized libraries, as well as specialized libraries such as museum or medical libraries, have much to contribute when it comes to archives, manuscripts, photographs, serials, and other desirable materials. Whether participation is decided by geographic proximity, library type, or simply by joining with other institutions with similar or sympathetic holdings, all institutions can take part in an activity that brings collections together to a degree never before possible.
Summary of Key Points
- Define scope of project, including appropriate collections and level of indexing.
- Define roles of participating institutions.
- Define areas of responsibility for each institution.
- Establish measurable objectives to evaluate success of project upon completion.
- Agree on long-term maintenance of digital images and associated metadata.
Northeast Document Conservation Center
Copyright 2000. Northeast Document Conservation Center