DECA-287: genpdf omits large sections of human-readable text when generating Type 3 or 4 PDF

Metadata

Source
DECA-287
Type
Bug
Priority
Major
Status
Open
Resolution
N/A
Assignee
N/A
Reporter
Jonathan Hung
Created
2012-06-27T15:51:37.201-0400
Updated
2013-01-27T12:05:45.948-0500
Versions
  1. 0.5
  2. 0.6
  3. 0.7
Fixed Versions
  1. Future
Component
  1. genpdf

Description

For some images, large sections of text are omitted when generating a Type 3 or Type 4 PDF. Typically, the top few lines of text are missing.

To reproduce, run the following on the relevant image:
./decapod-genpdf.py -d test-t4 -t 4 -p test-t4.PDF filename.png/jpeg
./decapod-genpdf.py -d test-t3 -t 3 -p test-t3.PDF filename.png/jpeg
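
To run both export types over one image in a single step, the two commands above could be wrapped in a small script. This is only a sketch and not part of Decapod: the helper names `build_command` and `run_both` are hypothetical, and it assumes `decapod-genpdf.py` sits in the current directory and takes the `-d`, `-t` and `-p` options exactly as shown above.

```python
import subprocess


def build_command(image, pdf_type):
    """Return the genpdf argument list for one export type (hypothetical helper)."""
    name = "test-t%d" % pdf_type
    return [
        "./decapod-genpdf.py",
        "-d", name,            # same -d value as in the commands above
        "-t", str(pdf_type),   # export type, 3 or 4
        "-p", name + ".PDF",   # same -p value as in the commands above
        image,
    ]


def run_both(image):
    """Run the Type 4 and then the Type 3 export, in the order listed above."""
    for pdf_type in (4, 3):
        subprocess.call(build_command(image, pdf_type))
```

For example, `run_both("2-1-1.jpg")` would issue both commands for the first problematic image.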

The following two images reproduce this error:
2-1-1.jpg
faithful-to-the-book-page-4-copy.jpeg (see the attached PDF for the results of a Type 3 export)

The following two images do not produce this error (despite being somewhat similar):
4-1-01-grey.jpg
Image_0016-grey.png

Format and colour do not appear to play a role: colour and TIFF versions of the problematic images exhibit the same behaviour.

Comments

  • tamir@tamirhassan.com commented 2013-01-27T12:03:52.545-0500

    The cause is that the line-finding stage of layout analysis failed, so the lines were not found and could not be used for further processing.

    I tried it with the current version and got a much better result: only the page number at the top is missing.

    Ideally, all content not recognized as text lines would be included as part of a background image.

  • tamir@tamirhassan.com commented 2013-01-27T12:05:08.641-0500

    (This comment relates to the file test-t3.pdf.) This is the output that I got when running genpdf on the same PDF (t3). Only the page number at the top (not recognized as text?) is missing. Tamir
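
The first comment suggests that content not recognized as text lines should ideally end up in a background image. A minimal sketch of that idea follows. It is not genpdf's actual implementation: the function name `background_layer`, the list-of-rows image representation and the end-exclusive `(x0, y0, x1, y1)` box format are all assumptions made for illustration.

```python
def background_layer(page, line_boxes, blank=255):
    """Return a copy of `page` with every recognized text-line box blanked.

    page       -- greyscale image as a list of rows of pixel values (assumed)
    line_boxes -- end-exclusive (x0, y0, x1, y1) boxes from line finding (assumed)

    Anything line finding missed, such as a page number, stays in the result.
    """
    background = [list(row) for row in page]   # copy; the input is left untouched
    height = len(background)
    width = len(background[0]) if height else 0
    for x0, y0, x1, y1 in line_boxes:
        # Clamp each box to the page so an oversized box cannot raise IndexError.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                background[y][x] = blank
    return background
```

With this approach, a region that fails line finding is not dropped from the output. It remains visible in the background layer, although it would not be available as text.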