A Look Inside the Data: The OCR Process and Tables in CWD

In this post, I describe how I converted a table from the California Water Documents digital collection into an Excel spreadsheet using the OCR (Optical Character Recognition) software ABBYY FineReader.

Table Extraction from the CCDL

The first step in the OCR process was to identify a table in the water documents collection. The file I worked with was Report of Sacramento-San Joaquin water supervision for year 1938. The first table in these reports tends to be at the end of Chapter One.

Chapter I Report of Sacramento-San Joaquin water supervision for year 1938.

By clicking the “Expand” button (shown by a red arrow in the image above), I was given a full-screen, downloadable view of the chapter.

I scrolled through the document to find Table 1. I clicked the button indicated by the red arrow to download the page from the digital collection.

Once I had the table in a PDF, it was time to use ABBYY FineReader to OCR it and convert it into an Excel workbook.

ABBYY FineReader

ABBYY FineReader is excellent OCR software. I hope that detailing this process makes it easy enough for anyone to use.

First, I opened the PDF with the program’s OCR Editor.

The left window displayed the original PDF document. The right window showed all of the text and data recognized from the PDF. A third, narrow window at the bottom of the screen provided a close-up view for manual editing.

After ABBYY recognized the document, I needed to verify the recognized text. In the right window, the software had underlined in red the characters it considered questionable. I clicked the “Verify” button circled in red in the image above.

The screen that popped up is similar to the spell check window in Microsoft Word. To correct the “Low-Confidence Characters,” I either selected one of the suggestions, edited the text in the window itself, or skipped the character and adjusted it manually later. In the example above, the software recognized 1924 as 1524. I simply corrected the number in the window and hit “Confirm.”
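A misread like 1524 for 1924 can also be caught after export with a simple range check. The snippet below is only a minimal sketch, not part of ABBYY: the function name, the sample values, and the year bounds (1850 to 1938) are assumptions chosen for illustration, and it is meant for a column already known to hold years.

```python
import re

def flag_suspect_years(cells, earliest=1850, latest=1938):
    """Return (index, value) pairs for four-digit values outside the expected year range.

    The default bounds are illustrative assumptions, not values from the report.
    """
    suspects = []
    for i, cell in enumerate(cells):
        text = str(cell).strip()
        # Only consider cells that are exactly four digits, i.e. year-like values.
        if re.fullmatch(r"\d{4}", text) and not (earliest <= int(text) <= latest):
            suspects.append((i, text))
    return suspects

# Hypothetical year column containing one OCR misread.
years = ["1923", "1524", "1925", "1938"]
print(flag_suspect_years(years))  # [(1, '1524')]
```

A check like this only points to cells worth a second look. The original page remains the reference for the correct value.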

After checking every “Low-Confidence Character,” my screen looked like the screenshot below.

I made sure Excel was selected as the export format and proceeded to save the table as a workbook.

Export to Excel

Here is what the finished product looked like. I made sure the cells were all filled in correctly, there were no spelling errors in the text, and the overall layout matched the table in the original document.
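I did this check by eye, but for larger tables part of it could be scripted. The following is a rough sketch that assumes the workbook has been saved from Excel as CSV; the function name and the sample rows are made up for illustration and do not come from the 1938 report.

```python
import csv
import io

def find_empty_cells(rows):
    """Return 1-based (row, column) positions of blank cells."""
    empty = []
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            if value is None or str(value).strip() == "":
                empty.append((r, c))
    return empty

# Hypothetical CSV content standing in for an exported table.
sample = "Year,Station,Flow\n1937,A,120\n1938,,135\n"
rows = list(csv.reader(io.StringIO(sample)))
print(find_empty_cells(rows))  # [(3, 2)]
```

Some cells are blank in the original tables too, so each flagged position still needs to be compared against the scanned page before it is treated as an OCR error.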

And that’s it! I made sure the Excel spreadsheet was saved and uploaded it to our cloud file management software.

Conclusion

There is much more to ABBYY than a single blog post can cover, but I hope I have done a sufficient job of highlighting the basics. There is nothing more satisfying than exporting an Excel file and gazing upon a glorious, organized table that is easy to share, apply formulas to, and use in research. I look forward to keeping up with this work and converting many more water document tables into handy Excel workbooks.