CS169 - Software Engineering

Project Specification and Requirements

OCRchie
Project Team CS169-AB
James Hopkin
Kathey Marsden
Archie Russell
Cynthia Si Tian

Introduction
System Model
Data Representations
System Evolution
Functional Requirements and Solutions
NonFunctional Requirements
Glossary
Appendix A
Appendix B

I. Introduction

While there are currently many OCR Software packages available for retail sale, to the best of our knowledge, there are no publicly available free OCR libraries. Our goal is to produce a basic OCR library in C++ which will be freely available on the Internet to all who want to use it. Our goals are to

  1. expand OCR research by allowing researchers to build on an existing package and
  2. provide OCR libraries that can easily be integrated into application software.

The general function of the software will be to translate scanned document TIFF files into ASCII text. The program will first provide the capability to deskew the image (if scanned at an angle). Then all or part of the document can be translated into ASCII text. The system will have the capability to "learn" and store new characters which will be used in future comparisons. In addition we will be creating a simple TCL/TK interface for learning new characters, and demonstrating the use of our OCR package.

This software package is to be based in large part on the Lisp OCR system as written by professor Fateman. We are using his general algorithms for this OCR project. We would like to thank him for the copious support, input, and source code he has given us already.


II. System Model

General I/O and Data Flow -
Input Files:

Data Representation:

The Page

Our primary data object is a Page. A Page can represent either an entire scanned page as read in from the TIFF file or some subsection of that page that has been extracted for character recognition. Page inherits from RLEMap, a class which has the run length encoded representation of the TIFF file and various access, manipulation and I/O routines (as specified in assignment 1 and Appendix A).

A Page has two significant data members in addition to those of the RLEMap.

The first is an array of lists of connected components. Each element of the array points to a theoretical row of text. Each connected component normally represents a section of the RLEMap which corresponds to one character. Connected components, however, do not hold any graphic representation, but rather contain information about the section of the RLEMap from which they were extracted. This information includes the upper left point and lower right point, as well as a property vector for the region. (Property vectors are outlined in Appendix B). Finally, a connected component has a field for its ASCII identification and a confidence rating low, medium, or high.

|fRowComponents[0]->"A"->"E"->"B" ....
|fRowComponents[1]->"E"->"A" .....
|fRowComponents[2]->"A"->"E"->"B" ...
| .
|fRowComponents[fnumRows -1]->

The other major data member of a Page is a list of CharGroups. A CharGroup represents a group of connected components whos property vectors are similar enough to classify them as the same character. A CharGroup class contains a property vector which is the average of all of the property vectors of the components belonging to that group and a list of pointers to candidate components. Initial groups are created from the previously learned characters. A heuristic "distance" is calculated from each candidate component to the existing groups. If the distance is within a predefined threshold, the component is added to the group, otherwise a new group is created with the ASCII identification initialized as NULL.

Each Component is assigned to exactly one CharGroup. The idea is that the user interface would allow unidentified groups to be given an ASCII identification. In the above example there would be three character groups, one with a list of pointers to all of the A's one to the B's and one to the E's. (Note: We could allow group reassignment but this would require doubly linked lists and might be ugly.)

Both the Row Components and CharGroups are null upon creation of the page but are built up by member functions. Detailed data representation information is included in Appendix B.) The User Interface

(Current big holes in representation for white space between words and disjoint characters (i,j,;,:)

The user interface will be constructed in TCL/TK. We feel that TCL/TK will provide an easy GUI to illustrate our OCR library.

System Evolution:

Our simple OCR system will make certain assumptions about the type of data that will be encountered. In future releases, we hope that some of these restrictions can be relaxed.

  1. Only TIFF files will be supported. We are using the Sam Leffler/ Silicon Graphics TIFF library for reading TIFF files. GIF and JPEG will not be supported.
  2. Because of our TCL/TK user interface, an XWindows environment is required to run the complete software package. Also the TIFF library mentioned above must be available for your machine.

Beyond these first two restrictions our plan is to make our code very portable. To use the OCR suite on a different machine with a different TIFF library would entail only modifying the readMap and writeMap functions.

Additional Restrictions

3) The software will correct skewed images of up to ten degrees. Skewed images of a greater angle cannot be corrected by the system.

4) Only single column text based, black on white images are supported.

5) Only the English character set is expected and supported by the software.

6) Text cannot be underlined (really bad for connected components) or highlighted. Nice neat standard black text on white paper is best.

Functional Requirements Specifications and Solution

The functional requirements and specifications of our system can be divided into four categories, 1) Skew detection/correction, 2)Page layout analysis, 3) Character recognition and 4) The user interface. These are outlined below.

  1. Skew detection/correction.

The lines in a bitmap may not be aligned with text lines of the document due to human factors and the limitations of mechanical devices. The bitmap is usually skewed at a slight angle. A deskewing process is needed to detect a skew angle and produce a new unskewed bitmap. Below is an analysis of the different methods we may use.

There are three basic methods available: projection methods, Hough transform methods and nearest-neighbor methods. The projection methods are most efficient among those three methods but are proved to effective only up to a skew angle of 10 degrees and less accurate. Hough transform methods are much more accurate than projections methods but considerable slower than the projection methods. The Hugh transform methods are effective to a skew angle of 15 degrees. The nearest-neighbor methods are more accurate than both methods and have no limitation on a skew angles, but the cost is very higher and the methods are much more sensitive to the layout of a text page. Although there are many improvements on the those methods, due to the limitation of time and resources, we can not fully investigate them.

Since we assume a bitmap from a scanned text page is skewed slightly up to an angle of 10 degrees and the limitation of time and resources, we pick the projection methods as our prototype method to deskew a bitmap. A further improvement will be made based on the projection methods.

Roughly, at different skew angles, the black dots along each row of bitmaps are counted. An angle dominates the other angle if at this angle, the number of rows with most dots is far great than other angles statistically. The skew angle is the angle which dominates other angle.

The process will be implemented at the increment of 1 degree.

The difficulty in the process is to recalculate the distribution of black dots after a slight rotation of a bitmap. This is the main obstacle to limit the capability of this method.

One feature is to consider an interactive deskewing process.

There are two utility functions:

  1. deskew(): to find a skew angle;
  2. unskew(): to produce a bitmap without the zero skew angle.
  3. Page Layout Analysis.

Very little page layout analysis will be done. However, the end of text rows and paragraphs will be detected and appropriate carriage returns entered in the ASCII file.

3) Character Recognition (??? Need stronger specification )

  1. Segments page into individual characters and represents them as connected components.
    • A determination is made as to the top and bottom of each row of text.
    • A vertical projection is used to separate characters within a row and determine the connected components which are the black pixels of each character.
      • Each text row in the Page is represented as a list of connected components.
  2. Create a property vector for each connected component. A 27 dimensional property vector is created for each connected component.
    • The region is subdivided into a 5X5 grid
      • The gray scale of each of the 25 regions is determined.

        The gray scale is a number between 0 (all white) and 255 (all black).

      • The gray scale for each region is stored in the first 25 dimensions of the property vector.
      • Two other properties the height/width ratio and the absolute height (in pixels) of the character are determined in placed in the last two dimensions of the property vector.
  3. The property vector of each Component is matched against a list of prelearned character groups for identification. A CharGroup is a class inherited from Component and a list of pointers to components, so each CharGroup has its own property vector.
    • Preset distance thresholds are set for low, medium and high confidence. User will be able to select confidence rating. Also an acceptable height/width ratio difference and character height difference is determined. These last two properties will be used as a quick check before calculating a distance between the Component and CharGroup.
      • The distance between the Component and each CharGroup is calculated as follows:

        if height/width ratio difference is unacceptable

        distance = +oo else if height difference is unacceptable

        distance = +oo else

        for region = 0 to 24

        distance += square(grayScale(CharGroup,region) - grayScale(Component, region)) endif

      • If distance < threshold then add Components address to CharGroups list of components. re-average CharGroups property vector. Set Components asciiId to be that of the CharGroup.
    • If no character groups match create a new group
  4. User Interface The user interface allows the user to
  5. Load a file of learned characters (maybe automatic)
  6. Load a TIFF file for OCR
  7. Deskew the image.
  8. Mark the section of the file to be translated into ASCII text.
  9. Initiate the OCR process.
  10. Identify ASCII value of unrecognized character groups.
  11. Save the ASCII file.
  12. Save the updated list of learned characters to file

III. NonFunctional Requirements

Product requirements:

Speed and Memory Efficiency
Our focus in writing this software is functionality, rather than speed or memory efficiency. Of course, the reality is that time and space are valuable resources, but they are, in fact, valuable resources that OCARchie (at least in its initial implementation) will consume with a voracious appetite. Even so, OCArchie will hold only one scanned image in memory. It is assumed that user must have enough memory to hold a page representation in three formats:

  1. RLE representation for skew detection and component extraction,
  2. Component representation, including a 27 dimensional property vector for each component.
  3. RLEMap representation for display by TCL/TK

Process Requirements

Implementation and Portability
Software is to be written in C++ and TCL/TK. The software must be portable to other Unix systems when recompiled, provided the Silicon Graphics TIFF library is present.

Delivery
Software must be completed, tested and documented by December 1995.

IV. Glossary
OCR-Optical Character Recognition

RLE-Run Length Encoding. Pixels from each row of a TIFF image are

        represented as a list of black pairs.  Each pair contains the
        starting and ending column of a group of black pixels.

TIFF-Grafic storage of scanned image. Accessed by Silicon Graphics

library. (???? Don't know what TIFF stands for)

APPENDIX A
Functional requirements specification

RLEMap functional specification

Since Page inherits from RLEMap, the functions listed below are of course available to Page as well. Listed below each requirement is the member function which will perform the function, a description of arguments, constraints, effects and return values.

There are several supporting classes which are referred to in this section. These are outlined briefly here and in more detail in Appendix B.

Point - Consists of an X and Y coordinate.

Component - consists of an upper-left point, lower right-point, the number of black bits in the component, the property vector, an ASCII identification and a confidence rating.

Components - a linked list of objects of type Component. CompPtrs - a linked list of pointers to Components.

CharGroup - Inherits from Component and CompPtrs. As described in the previous section, the property vector of the character group is the average of each of the components pointed two.

RLEPair - A starting and ending column, used for Run Length Encoding.

RLEMap Functions

Unary Commands

  1. Create an RLEMap

    Member function : RLEMap() (constructor) Arguments: none Effects: The constructor creates the new object. The RLEMap is empty and memory is not allocated for pixel information until readMap() is called on the object. fImageWidth, fImageLength, and fStat are initialized to 0,0 and EMPTY respectively Return Value: Pointer to the new RLEMap object

  2. Delete a RLEMap

    Member function : ~RLEMap() Arguments: none Effects: The destructor releases all memory holding pixel

    information. Invoked on delete operation. Return Value: none

  3. Copy a RLEMap

    Member Function: RLEMap * copy(Point ul=NOPNT, Point lr=NOPNT) Arguments: ul is the upper left point at which the copy

    should start. lr is the lower left point at which the copy should start. Effects: If the default values of ul=NOPNT and lr=NOPNT are

    used, the entire map is copied. The copy function allocates a new RLEMap as well as identical bit information memory. Then it copies size, status and bit information from the server to the new object. If a upper left point and lower right point are specified, a submap is copied and size information adjusted accordingly. Return Value: Returns a pointer to the new object.

Selector Functions

      7. Determine Length of a RLEMap
         Member Function: int imageLength()
         Arguments: None
         Effect - This Selector returns the number of vertical
            bits in map either as read from the TIFF file or
            determined by the submap operation (see copy operation).

      8. Determine Width of a RLEMap
         Member Function: int imageWidth()
         Arguments: None
         Return Value - This Selector returns  the number of horizontal
         bits in either as read from the TIFF file or
         determined by the submap operation (see binary operations).

Selector/Modifier functions

      9. Determine/Set  the status of a RLEMap
         Member Function: MapStatus & status()
         Arguments: None
         Effect: As a modifier allows client to set the status of a
            RLEMap.
         Return Value:  As a selector returns the current status of the
            RLEMap.
         Comment:  A client may need to reset the status to VALID after
            an I/O error.

I/O Functions
      3. Read a RLEMap from a TIFF file.
         Member function : MapStatus readMap(Char * filename)
         Arguments: filename is name of the TIFF file to open.
         Constraints: filename must be the name of a TIFF formatted file
         to ensure a proper read .
         Effects: Opens filename and reads TIFF file into RLEMap,
            recording size and pixel information for access by other
            member functions.   On exit from this function the TIFF file
            is closed.If the RLEMap is not empty, memory containing old
            pixel information is released before the new read.
         Return Value: Returns a value of type MapStatus.  MapStatus
            is an enumerated type.  Potential return values are VALID,
            OPENERROR, READERROR

      4. Write a RLEMap to TIFF file.
                (This is Probably not needed, but left here for now)
         Member Function:  MapStatus writeMap(char * filename)
         Arguments: The name of the TIFF file to write to.
         Effects: Opens filename and writes RLEMap to filename in TIFF
            format.
         Return Value: Returns a value of type MapStatus.  Potential
            return values are VALID, OPENERROR, WRITEERROR

Data Access and low level manipulation functions

      5. Clear a RLEMap.
         Member function: MapStatus clear(Color clr)
         Arguments: clr is of enumerated type Color which can take on
            the values BLACK or WHITE.
         Effect: RLEMap is cleared to all one color.  If an error occurs
            the status of the map is set to OTHERERROR and returned
         Return Value: Status of RLEMap

      6. Set a single bit in a RLEMap to a specified color.
         Member function: MapStatus setBit(Point point, Color clr)
         Arguments: X and Y values of point are the row and column to
            change clr is BLACK or WHITE
         Effect: Changes the color of BMPA bit in position (X,Y) to
            the color clr. If an error occurs changes status to
            OTHERERROR
         Return Value: Status of RLEMap

      7. Read and return a single bit from a RLEMap.
         Member function: Color readBit(Point point)
         Arguments:point contains the column and row of the bit to read.
         Constraints: point.x() < imageWidth() and
                      point.y() < imageLength
         Return Value:  Returns color of bit.

High Level Access and manipulation

       8. Determine angle which the document must be rotated so that
          rows are orhthoganal to document edge.
          Member function: Angle skewAngle();
          Arguments: none
          Constraints:  Image must be rotated at no more than a 10
          degree angle in either direction to ensure accurate result.
         Return Value:  Angle at which the document must be rotated
           in a counterclockwise direction before OCR can be performed.
           Angle is just an alias for float.
         Comment: No rotation is actually performed in this function

        9. Rotate the map a specified angle.
           Member function: MapStatus rotateMap(Angle angl)
           Arguments: Angle to rotate map
           Constraints: -10.0 < angl < 10.0
           Effects: RLE representation is actually changed to
             represent rotated map.
           Return Value: Status of map after operation
      10. Display on Standard output  // Just for testing
         Member function :
            MapStatus display(Point point, int termWidth,
                         int termLength, int scaleDownX, int scaleDownY)
         Arguments:  point's X and Y values indicate the row and column
            at which to start the print. termWidth and termlength are
            the number of columns and rows on stdout.scaleDown is a
            scale down factor.  If scaleDownX = i, every ith pixel in
            row will print.  If scaleDownY = j, every jth pixel in a
            column will print.
         Constraints: point.X >=0, point.Y>=0
            0 < scaleDownX <imageWidth
            0 < termWidth <= # columns on screen
            0 < termLength <= # rows on screen
            0 < scaleDownX < imageLength.
            0 < scaleDownY < imageLength.
         Effects: An X will print on stout for each non-white pixel.
            If scaleDown factors are 1, the rectangle bounded by
            (X,Y) and (X+termWidth,Y+termlength) will print.
            For  scaledown > 1 the rectangle bounded by (X,Y)
            (X + (termwidth * scaleDownX),
            Y + (termlength * scaleDownY)) will print.

       10. Extract a Bitmap representation for TCL/TK
           Member function: BitMap * ExtractComp(Point ul, Point lr,
                                        int scaleDownx, int scaleDowny)
           Arguments: ul and lr bound the region to be extracted.
                 scaleDownx and scaledowny are as described for display.
           Constraints: ul and lr must be within map range.
                 Scaledown factors constraints as described for display.
           Effect: Allocate a BitMap for TCL/TK use.  Copy designated
              contents.
           Return Value: Pointer to new BitMap

Binary operations

The binary operations allow for comparison of two bit maps. Operator overloading is used for these functions to simplify use. The operators for RLEMaps overload the operators for bitwise logical operations. We may not need these but leave them in for now, unchanged from assignment 1.

  1. Logical IOR (Inclusive or)

    Member function: MapStatus operator |= (LowMemMap & BMPB)

    MapStatus operator |= (SpeedMap & BMPB) Arguments: BMPB is a reference to the Map with which the server

    object (BMPA) will be compared. Constraints: BMPA and BMPB are of equal size and are VALID. Effect: Each white bit BMPA(i,j) will be set to BMPB(i,j).

    The result will be a logical or. If an error occurs in the operation the status of BMPA will be set to OTHERERROR Return Value: The status value for BMPA. Comment: BMPA will be changed with the use of this operator.

    Sample use: BMPA |= BMPB

  2. Logical XOR (Exclusive or)

    Member function: MapStatus operator ^= (LowMemMap & BMPB) Member function: MapStatus operator ^= (SpeedMap & BMPB) Comparable to IOR described above but performs XOR operation.

  3. Logical AND

    Member function: MapStatus operator &= (LowMemMap & BMPB) Member function: MapStatus operator &= (SpeedMap & BMPB) Comparable to IOR described above but performs AND operation.

  4. Test two RLEMaps for equality

    Member function: Boolean operator == (LowMemMap & BMPB)

    Boolean operator == (SpeedMap & BMPB) Arguments: BMPB is a reference to the Map with which the server

    object (BMPA) will be compared. Constraints: BMPA and BMPB are VALID. Effect: The function compares sizes and then compares bit

    information of BMPA and BMPB. Return Value: False if BMPA does not have the same size and/or

    content as BMPB. Comment: Sample use if (BMPA == BMPB) ...

Page Functions - additional functionality of a page.

  1. Extract row component Information

    Member function: MapStatus extractRows(); Arguments: none Constraints: Map must already be deskewed Effect: Sets fnumrows. Allocates array for fRowcomponents

    and creates a list of components for each row. Each component is given a property vector as described in functional requirements section. Return Value: Status of Page after operation

  2. Read learned characters from file for OCR Comparison

    Member function: MapStatus readLearnedChars(char *filename); Arguments: name of file where characters are stored Constraints: filename must have been previously created

    by writeLearnedChars. Return Value: Status of Page after operation

  3. Write learned characters to file.

    Member function writeLearnedChars(char *filename) Arguments: name of file where characters are to be stored Effect: Writes learned character information to file Return Value: Status of Page after operation

  4. Perform Character Recognition on a row of components

    Member function: CompPtrs recognize(int row); Arguments: row is the row number to analyze Constraints: 0 < row < fnumRows Effect: For each component C in row, finds an existing character group G such that distance(C,G) < threshold. the asciiId for component c is set to match that of the character group. The character group is modified to point to the component and the character group property vector is modified to average in the new component. The addresses of any Groups that could not be recognized are appended to a list for return Return Value: Return list of new (unrecognized) Character

    groups.

Component Functions

To be filled in later

APPENDIX B
Non Functional requirements specification ??? Need to do some math on memory requirements

Refer to .h files
page.h
pagesup.h
rlemap.h
bitmap.h

Hardware

Database requirements

Index