Published July 13, 2011 | Version v1 | https://doi.org/10.59350/sjd8b-2jc06

Correcting OCR using hOCR in Firefox

  • 1. University of Glasgow

Quick post on a little tool I came across, moz-hocr-edit. This Firefox add-on lets you proofread Optical Character Recognition (OCR) output. Given my interest in OCR and the Biodiversity Heritage Library (BHL), I decided to take it for a spin.

moz-hocr-edit works with hOCR, a format for representing the output of OCR software that is used by tools such as OCRopus (you can see the public specification for hOCR here). Basically it's a microformat: ordinary HTML in which class and title attributes carry the OCR information, such as the bounding box of each line. Given some hOCR, moz-hocr-edit enables you to edit the OCR output line by line.
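To make the format a little more concrete, here is a minimal Python sketch that pulls the lines out of an hOCR fragment. It assumes each line is an element with class `ocr_line` whose `title` attribute holds a `bbox x0 y0 x1 y1` property with integer coordinates. The sample markup and the names `SAMPLE_HOCR`, `HocrLineParser`, `parse_bbox` and `extract_lines` are made up for illustration; none of this is how moz-hocr-edit itself reads hOCR.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment for illustration (not taken from the demo page).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
  <span class='ocr_line' title='bbox 100 200 900 240'>Case 3368</span>
  <span class='ocr_line' title='bbox 100 250 900 290'>Eatoniella Dall, 1876</span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # Properties in the title attribute are separated by semicolons.
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) pairs for every element with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None   # tag name of the line element currently open
        self._depth = 0    # nesting depth of same-named tags inside the line
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Already inside a line: only count nested tags of the same name.
            if tag == self._tag:
                self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._tag = tag
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if self._depth and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, "".join(self._text).strip()))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def extract_lines(hocr):
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE_HOCR):
        print(bbox, text)
```

Because the text and its coordinates sit side by side in ordinary HTML, a proofreading tool can show the matching region of the page image next to each line while it is being edited.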

Demo
I've created a simple demo based upon Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation. For the demo to work you will need to use the Firefox web browser with the moz-hocr-edit add-on installed.

  1. Go to http://dl.dropbox.com/u/639486/hocr/80780.html
  2. You will see a simple HTML representation of the OCR text from "Case 3368 Eatoniella Dall, 1876 and EATONIELLIDAE Ponder, 1965 (Mollusca, Gastropoda): proposed conservation". I created this HTML from the original ABBYY FineReader XML from the Internet Archive.
  3. In the bottom right-hand corner of the Firefox browser window you should see hOCR. Click on it and select "Edit this hOCR document":
    Statusbar
  4. Firefox will open a new tab that will look something like this:
    Screenshot
  5. You can now edit individual lines of text, and see your edits applied to the HTML below.
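
Step 2 mentions generating the HTML from ABBYY FineReader XML. The script actually used for the demo isn't shown here, so the following is only a rough Python sketch of what such a conversion might look like. It assumes the XML has `page` elements with `width` and `height` attributes, `line` elements with `l`, `t`, `r`, `b` attributes, and one `charParams` element per character; the sample XML and the names `SAMPLE_ABBYY`, `local_name` and `abbyy_to_hocr` are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, heavily trimmed ABBYY-style XML for illustration only.
SAMPLE_ABBYY = """
<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml">
  <page width="1000" height="1500">
    <block blockType="Text" l="100" t="200" r="160" b="240">
      <text><par>
        <line baseline="238" l="100" t="200" r="160" b="240">
          <formatting lang="EnglishUnitedStates">
            <charParams l="100" t="200" r="120" b="240">E</charParams>
            <charParams l="120" t="200" r="140" b="240">a</charParams>
            <charParams l="140" t="200" r="160" b="240">t</charParams>
          </formatting>
        </line>
      </par></text>
    </block>
  </page>
</document>
"""


def local_name(element):
    """Strip any '{namespace}' prefix from an ElementTree tag name."""
    return element.tag.rsplit("}", 1)[-1]


def abbyy_to_hocr(xml_text):
    """Emit one ocr_page div per page and one ocr_line span per line."""
    root = ET.fromstring(xml_text)
    out = []
    for page in root.iter():
        if local_name(page) != "page":
            continue
        out.append(
            "<div class='ocr_page' title='bbox 0 0 {} {}'>".format(
                page.get("width", "0"), page.get("height", "0")
            )
        )
        for line in page.iter():
            if local_name(line) != "line":
                continue
            # Each charParams element is assumed to hold a single character.
            text = "".join(
                c.text or "" for c in line.iter() if local_name(c) == "charParams"
            )
            # hOCR bbox order is left, top, right, bottom.
            bbox = " ".join(line.get(k, "0") for k in ("l", "t", "r", "b"))
            out.append(
                "  <span class='ocr_line' title='bbox {}'>{}</span>".format(
                    bbox, escape(text)
                )
            )
        out.append("</div>")
    return "\n".join(out)


if __name__ == "__main__":
    print(abbyy_to_hocr(SAMPLE_ABBYY))
```

A real conversion would also need the HTML head, links to the page images and probably word-level boxes, all of which are left out here.
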
moz-hocr-edit is a neat little tool. With appropriate web server settings (and, as the tool's author Jim Garrison suggests, autoversioning) it could be the basis of a great tool for correcting OCR errors in BHL.

Additional details

Identifiers

UUID
3fed0f99-e255-4fa9-b7d9-095df2eac89f
GUID
tag:blogger.com,1999:blog-16081779.post-2425236243751210928
URL
https://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html

Dates

Issued
2011-07-13T14:12:00
Updated
2011-07-13T14:17:09