The curious case of the disappearing PDF text

30 minute read

A case of disappearing text in an auto-generated PDF document led me to dig deeper and learn more about ReportLab: a PDF generation library. Here’s the story of what happened, with an excursion into how to build basic PDF documents in ReportLab.

Vanish in a puff of smoke — Image credits: Federica Zambon

A user of an app that we developed ages ago at work noticed a weird thing: when text in part of an automatically generated PDF document got too large, the text disappeared from the document entirely. This is a journey of debugging (as well as a journey of discovery) in which I work out what the bug was, why it appeared, and what the final fix was.

It all began with a bug

The application in question is a Django app that we still maintain, part of which produces automatically generated PDF output (via the ReportLab library) from various pieces of information stored in the backend database. A user contacted us and asked why text in one particular field (a field for comments and additional information) was missing. We investigated and worked out that when the text in the field was too long, the text within the relevant section in the generated PDF disappeared. This struck me as weird, because usually when some text is too big for a page it often just extends beyond its given margins: disappearing completely was new to me.

But where to begin? And how could we work out what the underlying problem was? Trying to hone in on the problem in the production code wasn’t an option as a lot of the document construction had been abstracted into various methods, so isolating the problem there turned out to be difficult. Also, since I hadn’t written this code (and the original author doesn’t wasn’t available to ask) I wasn’t very familiar with how the code went about its job.

Therefore it was back to basics: I needed to build a basic document using ReportLab which very roughly approximated the document the production code is trying to build. Then I needed to see if I could get the bug to show itself in the example document. The idea being that by reducing the scope of the problem, the bug would be easier to reproduce and therefore a solution would be more likely to present itself.

We need to know where we’re going before we can go there

Before we start simplifying the problem, let’s first get an idea of what the production code is trying to do. The document being generated has the following layout from top to bottom:

a header block containing the document title and a logo image
two columns of information; one column containing text, the other containing images
a block of text containing comments and additional information

Graphically, it looks like this:

Full document layout

However that’s way too complicated to try to reproduce (at least, not just yet). Squinting a bit and considering the document layout from a higher level, we can think of this document layout as having just three blocks, each containing only text. Visually, that looks something like this:

Rough document layout

Ok, that’s a much easier to work with! Let’s build this example document now.

A very basic document

After investigating the production code in more detail, I found that it uses a Canvas object to describe the page as a whole, and then breaks up the document into parts using Frame objects (e.g. for top, middle and bottom rows), each containing Paragraph or Image objects depending upon the exact content to be displayed. We’ll therefore build a basic document along these lines: a Canvas containing three Frame objects, each containing a Paragraph. If you’re interested in the details about these objects, have a look at the ReportLab user guide.

However, before we get started, it’s a good idea to make a directory to contain our work and create a Python virtual environment in there so that we can install any third party libraries we want without needing to install them system wide:

$ mkdir disappearing-reportlab-text && cd disappearing-reportlab-text
$ virtualenv --python=/usr/bin/python3 venv
$ source venv/bin/activate
$ pip install reportlab

We can now construct a very simple document containing a single Frame with this code:

from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import Paragraph, Frame
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import cm


def render():
    canvas = Canvas("basic-canvas-doc.pdf", pagesize=A4)

    frame = Frame(0, 0, 20*cm, 5*cm, showBoundary=True)
    frame.addFromList([Paragraph("A basic canvas document")], canvas)

    canvas.save()


if __name__ == "__main__":
    render()

Here’s the output:

Very basic canvas-based document

Although this doesn’t represent the document we’re trying to (roughly) reproduce (for instance, the frame is displayed at the bottom of the page; the rest of the page is white), it does get the ball rolling.

What’s happening in this code? Well, this is just a basic script where we define a render() function. This function is called only when the script is run directly, this is what the check

if __name__ == "__main__":

does. The render() function constructs a Canvas with the name of the file to be generated and specifies the page size to use (we use A4 here to be explicit). Then a Frame is defined which contains the text (via a Paragraph object) “A basic canvas document”.

Frame objects are fairly low-level objects and hence need to be specified very explicitly: the first two arguments define the lower left coordinate position of the frame within the canvas, the next two arguments are the width and height of the frame respectively (in this case I’ve just set them to be 20cm and 5cm explicitly by using the cm unit imported from the reportlab.lib.units package). The showBoundary option draws a box around the frame to aid in debugging when positioning frames within a document. We then add the paragraph to the frame and specify the canvas object that the frame will use as a reference for placing its contents. Finally we call save() on the canvas object which writes the document to a file with the name we gave to the canvas object when we constructed it.

A closer approximation

There are few issues here that I’m not happy with (apart from the fact that there’s only one frame in the document):

the frame is at the very bottom of the page,
the page isn’t completely filled with text,
the frame isn’t aligned with any margins in the document,
and the text is way too short to cause any strange bugs to appear by being too long.

Nevertheless, it’s put us on the right path, so let’s fix these issues and get a closer approximation to the document we’re trying to represent.

Let’s define some margins for the document, say 1cm for the left and right margins respectively and 2cm for the top and bottom margins respectively:

LEFT_MARGIN = 1*cm
RIGHT_MARGIN = 1*cm
TOP_MARGIN = 2*cm
BOTTOM_MARGIN = 2*cm

I’ve used all capitals for these parameters because they represent constants and hence should be treated differently to variables (which are usually lower or mixed-case in Python).

From the margins and knowing that we’re using A4 paper, we can work out the width and height of the text:

PAGE_WIDTH, PAGE_HEIGHT = A4
TEXT_WIDTH = PAGE_WIDTH - LEFT_MARGIN - RIGHT_MARGIN
TEXT_HEIGHT = PAGE_HEIGHT - TOP_MARGIN - BOTTOM_MARGIN

Again, these are constants and hence we use all capitals for their names.

Now that we have this information, we can more easily position frames within the canvas and hence the document as a whole.

To save us some work in trying to think up some random text to add to each frame, let’s use the lorem-text package, so that we can automatically generate long-ish blind text to fill the frames with content (which will be a header, a bit of vertical space (via a Spacer object), and some more blind text).

We can install the lorem-text package with pip:

$ pip install lorem-text

and can use it in code to automatically generate text blocks like so:

from lorem_text import lorem


lorem.paragraph()    # create a single paragraph of blind text
lorem.paragraphs(3)  # create given number of paragraphs of blind text

where the .paragraph() method just creates a single paragraph; if one wants to make multiple paragraphs in one go, one can pass a number to the paragraphs() method.

Making the changes described above gives us this code:

from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import Paragraph, Spacer, Frame
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm, cm

from lorem_text import lorem

LEFT_MARGIN = 1*cm
RIGHT_MARGIN = 1*cm
TOP_MARGIN = 2*cm
BOTTOM_MARGIN = 2*cm
PAGE_WIDTH, PAGE_HEIGHT = A4
TEXT_WIDTH = PAGE_WIDTH - LEFT_MARGIN - RIGHT_MARGIN
TEXT_HEIGHT = PAGE_HEIGHT - TOP_MARGIN - BOTTOM_MARGIN

HEADSKIP = Spacer(1, 2*mm)


def render():
    canvas = Canvas("rough-canvas-doc.pdf", pagesize=A4)

    top_of_text = TEXT_HEIGHT + BOTTOM_MARGIN
    frame = Frame(
        LEFT_MARGIN, top_of_text - 0.2*TEXT_HEIGHT,
        TEXT_WIDTH, 0.2*TEXT_HEIGHT,
        showBoundary=True)
    frame.addFromList([
        Paragraph("One lorem"),
        HEADSKIP,
        Paragraph(lorem.paragraph())], canvas)

    frame = Frame(
        LEFT_MARGIN, top_of_text - 0.7*TEXT_HEIGHT,
        TEXT_WIDTH, 0.5*TEXT_HEIGHT,
        showBoundary=True)
    frame.addFromList([
        Paragraph("Multiple lorems"),
        HEADSKIP,
        Paragraph(lorem.paragraphs(4))], canvas)

    frame = Frame(
        LEFT_MARGIN, top_of_text - 1.0*TEXT_HEIGHT,
        TEXT_WIDTH, 0.3*TEXT_HEIGHT,
        showBoundary=True)
    frame.addFromList([
        Paragraph("One lorem"),
        HEADSKIP,
        Paragraph(lorem.paragraph())], canvas)

    canvas.save()


if __name__ == "__main__":
    render()

which then generates this document:

A rough approximation of the production canvas-based document

We can see that the code got a lot harder to read and understand very quickly; especially because we have to position the frames explicitly.¹ For instance, we need to know that the frames are positioned relative to the bottom left corner of the page (in this case the canvas and the page are the same thing). Therefore, each frame needs to be shifted to the right by the size of the left margin and the width of a frame needs to be set so that its right edge doesn’t encroach into the right margin.

The natural way to lay out the page would be to place place elements from the top of the page downwards, thus we need to work out where the top of text would be, and hence we calculate the position of the top of the text and work down from there. We can then work out where the bottom of the frame should be by subtracting the fraction of the text height that we’re allocating to the given frame. Also, each frame needs to “know” how much space has been used up by each previous frame so that the correct proportion of the text height can be subtracted from the top_of_text value in order to get the correct bottom coordinate of the respective frame.

Not only is the code harder to understand, it’s also not easy to explain in words what the code is doing. Further, this code structure only supports creation of a single-page document; text can’t easily flow onto further pages if it becomes too large for the current page.

Nevertheless, this structure reflects the production code’s internals fairly well. The fact that the code is hard to understand and explain is a signal to us that something needs to be improved here.

Note that this isn’t a criticism of the original developer’s work: there are probably very good reasons as to why the code was developed that way (if I remember correctly, one reason was that it was easier to test the preliminary output from a Canvas object than it was to check the generated PDF). Note that our perspective at this point in time is significantly different to that available when the code was written. Much has been learned in between times and time pressure will have been a factor, hence getting something working (even if it only supports creation of a single page document) is better than the perfect solution which is never delivered.

We’re now in a position to reproduce the bug we’re currently seeing in production.

Reproducing the bug

We can reproduce the bug by making the text in the final frame much too large (the final frame in the production document showed the problem of the disappearing text), e.g.:

    frame.addFromList([
        Paragraph("Too many lorems"),
        HEADSKIP,
        Paragraph(lorem.paragraphs(10))], canvas)

which then gives us:

from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import Paragraph, Spacer, Frame
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm, cm

from lorem_text import lorem

LEFT_MARGIN = 1*cm
RIGHT_MARGIN = 1*cm
TOP_MARGIN = 2*cm
BOTTOM_MARGIN = 2*cm
PAGE_WIDTH, PAGE_HEIGHT = A4
TEXT_WIDTH = PAGE_WIDTH - LEFT_MARGIN - RIGHT_MARGIN
TEXT_HEIGHT = PAGE_HEIGHT - TOP_MARGIN - BOTTOM_MARGIN

HEADSKIP = Spacer(1, 2*mm)


def render():
    canvas = Canvas("overfull-frame-in-canvas-doc.pdf", pagesize=A4)

    top_of_text = TEXT_HEIGHT + BOTTOM_MARGIN
    frame = Frame(
        LEFT_MARGIN, top_of_text - 0.2*TEXT_HEIGHT,
        TEXT_WIDTH, 0.2*TEXT_HEIGHT,
        showBoundary=True)
    frame.addFromList([
        Paragraph("One lorem"),
        HEADSKIP,
        Paragraph(lorem.paragraph())], canvas)

    frame = Frame(
        LEFT_MARGIN, top_of_text - 0.7*TEXT_HEIGHT,
        TEXT_WIDTH, 0.5*TEXT_HEIGHT,
        showBoundary=True)
    frame.addFromList([
        Paragraph("Multiple lorems"),
        HEADSKIP,
        Paragraph(lorem.paragraphs(4))], canvas)

    frame = Frame(
        LEFT_MARGIN, top_of_text - 1.0*TEXT_HEIGHT,
        TEXT_WIDTH, 0.3*TEXT_HEIGHT,
        showBoundary=True)
    frame.addFromList([
        Paragraph("Too many lorems"),
        HEADSKIP,
        Paragraph(lorem.paragraphs(10))], canvas)

    canvas.save()


if __name__ == "__main__":
    render()

And, in a puff of smoke, the text in the last frame has vanished! This has reproduced the bug!

An overfull frame in a canvas-based document showing the bug

Although this is what I’ve been hinting at all along, this confirms what my suspicions were from playing around with the comment text box in the production system: over-filling a frame with text causes the text to disappear from the generated PDF output. It’s odd that ReportLab doesn’t issue a warning or an error when this happens, but it’s good to know that we’ve isolated the problem.

I love it when a bug gets isolated

Understanding bugs can create new requirements

Now we’re confronted with the next problem: how do we handle this situation? We know that we have a fixed page layout, however do the users of this application always expect a single-page document, or is it ok to generate a multi-page document? After contacting the main stakeholder, we found out that it’s acceptable to let text spill onto a second page as long as it is clear that the text on the second page carries on from the first.

This new requirement means we need to completely rethink how the document is structured: we need to move away from hard-coded sizes to objects which are able to grow dynamically and which allow text to flow across multiple pages if necessary. We also need to be able to handle single- and multi-page documents from the same code.

Bugs are opportunities

This bug is not only a great opportunity to learn something (for instance, I’ve never worked with ReportLab before) but also for me to sit back and think more carefully about the application and how parts of it could be structured so as to avoid issues cropping up in the future. For instance, how could we avoid assuming that the output will always be a single page? What else is lurking in there that could be a risk for future changes?

We need to step back/zoom out from the task at hand (i.e. bug fixing) and look at the code at larger scales. It can be instructive to take a moment and consider the code and the system in which it is contained, even if one doesn’t actively make any changes; it’s helpful just to load everything into your head and take a new perspective to see how things could look differently and how one could reduce future risk.

A short excursion into building documents with ReportLab

Before the bug can be fixed in the production code, it’s instructive to take a quick excursion into how to build a PDF document from scratch using ReportLab. This way, we’ll have a much better idea of what the code and its generic structure should look like before it’s actually changed “for real”.

ReportLab has a framework called PLATYPUS (see the ReportLab documentation in Chapter 5 which one can use to create documents. The name is derived from “Page Layout and Typography Using Scripts”. Paraphrasing the documentation: within the PLATYPUS framework one builds a PDF document by using a DocumentTemplate as the main container, which then contains one or more PageTemplates, each of which contains one or more Flowable objects (such as Paragraph, Table, or Image). To create a PDF document, one simply calls the build() method of the instantiated document object.

Remember the full document layout I mentioned at the beginning?

Full document layout

Well, let’s slowly build an example document with this layout in mind, but this time using DocumentTemplate and Flowable objects.

For simple documents, one can use the SimpleDocTemplate class provided by the ReportLab library because it already provides its own PageTemplate and Frame setup so we don’t have to. A document is created via the PLATYPUS layout engine from a list of Flowable objects passed to the SimpleDocTemplate instance. The ReportLab docs have a nice way of doing this: the variable containing the list of flowables is called story, which makes sense, because a document “tells a story” even if it is automatically generated from some collection of database entries.

Here’s probably the simplest code one can use to create a PDF document:

from reportlab.platypus import SimpleDocTemplate, Paragraph

def render():
    story = [Paragraph("Hello, world!")]

    doc = SimpleDocTemplate("basic-doc-template-doc.pdf")
    doc.build(story)

if __name__ == "__main__":
    render()

To add text to the document we simply pass a list containing a Paragraph of text to the SimpleDocTemplate. What’s also good here is how nicely this reads. In particular, the code

    doc.build(story)

basically says “build the document from this story” (or equivalently “document: build this story”); it reads almost like an English sentence and makes the code much easier to understand at a glance.

Text spills onto subsequent pages

Now, just to make sure that creating a document from a “story” of paragraphs does flow automatically from one page onto another (as we expect it does), let’s take the basic example above and extend it to use multiple paragraphs of lorem ipsum text, separated by a bit of vertical whitespace to visually separate the blocks of text.

from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.units import mm

from lorem_text import lorem


def render():
    story = []
    story.append(Paragraph(lorem.paragraphs(3)))
    story.append(Spacer(1, 5*mm))
    story.append(Paragraph(lorem.paragraphs(3)))
    story.append(Spacer(1, 5*mm))
    story.append(Paragraph(lorem.paragraphs(3)))
    story.append(Spacer(1, 5*mm))
    story.append(Paragraph(lorem.paragraphs(3)))

    doc = SimpleDocTemplate("multi-page-doc-template-doc.pdf")
    doc.build(story)


if __name__ == "__main__":
    render()

Running this code we find that the text in the generated document flows naturally from the first page onto the second as we intend.

Text flows naturally from first page to second page

We now have the building blocks required to reproduce the full original document without being restricted to a single page.

Finally building the full document

Because the full document is split into three main blocks, let’s build these separately and then stitch everything together at the end.

Building the document title block

Remember that the document title block is an element with the document title text set to the left, and a logo image set to the right of the text:

Document title block

To build this, we’re going to create a table with two columns and one row, the left-hand column will contain a Paragraph with the title text and the right-hand column will contain an Image with a sample logo image. Here’s the code:

from reportlab.platypus import SimpleDocTemplate, Paragraph, Image, Table
from reportlab.lib.units import mm


def render():
    logo_filename = "sample-logo.png"
    logo = Image(logo_filename, width=20*mm, height=10*mm)

    title_text = Paragraph("Document title")

    title = [Table([[title_text, logo]])]

    doc = SimpleDocTemplate("document-title-block.pdf")
    doc.build(title)


if __name__ == "__main__":
    render()

Even though they are hard coded, the dimensions of the logo image are sufficient for our purposes here. Note that a Table takes a 2-D array of strings or Flowable objects as its data input, therefore to get a table with one row and n columns, we need to pass a 1xn array into the Table constructor, i.e.:

[['a', 'b', 'c', 'd']]

Similarly, to get a table with n rows and one column, we pass an nx1 array:

[['a'], ['b'], ['c'], ['d']]

Running this script we get the following output:

Document title block without styles

Which is a start, but it’s not styled the way we’d like it to be: the text should be larger, the logo should be aligned to the right-hand margin, and everything should be vertically distributed around the middle of the objects’ respective heights. We need to use styles to get things to look right.

A very brief introduction to ReportLab styles

ReportLab defines a set of base styles which one can access by calling the getSampleStyleSheet function defined in the styles package:

from reportlab.lib.styles import getSampleStyleSheet

I prefer to import this as getBaseStyleSheet as this name seems to better represent what this function returns: namely the base stylesheet defined by the ReportLab library:

from reportlab.lib.styles import getSampleStyleSheet as getBaseStyleSheet

Calling the .list() method on the stylesheet returns the available styles in all their glory:

stylesheet = getBaseStyleSheet()
stylesheet.list()

Running this code one sees that there are definitions for paragraphs, headings, unordered lists and many more typical textual elements.

Using ReportLab styles for the title text

The output of stylesheet.list() contains a Title style, so it seems logical to use that for our title text. Looking at the output for the Title style very carefully, we see that its alignment property is set to 1:

Title title
    name = Title
    parent = <ParagraphStyle 'Normal'>
    alignment = 1
<snip>

What does the value of 1 mean for the alignment? Searching through the docs we find this text:

There are four possible values of alignment, defined as constants in the module reportlab.lib.enums. These are TA_LEFT, TA_CENTER or TA_CENTRE, TA_RIGHT and TA_JUSTIFY, with values of 0, 1, 2 and 4 respectively. These do exactly what you would expect.

In other words, alignment = 1 means that the text is centred, which isn’t what we want: we’d like the title text to be left-justified, hence we import the TA_LEFT enum from the enums package and set the alignment parameter of the imported Title style to TA_LEFT, in other words, we extend the code with this import:

from reportlab.lib.enums import TA_LEFT

and these settings:

stylesheet = getBaseStyleSheet()
title_style = stylesheet["Title"]
title_style.alignment = TA_LEFT

title_text = Paragraph("Document title", title_style)

which builds the title text the way we’d like. Note that we can select which of the various styles we want to use by using the style’s name as a key to the stylesheet data structure.

Styling a table with ReportLab

When styling a table, things are a bit more tricky, however a lot more flexible, because one can style individual table cells if one wants to. A table style is defined as a list of tuples defining the formatting commands to use for the given cells. For instance, to right-justify the cell in the second column of the first row, we use a tuple like this:

('ALIGN', (1, 0), (1, 0), 'RIGHT')

where the formatting command is ALIGN.

The next two tuples define the cell (or range of cells) to format, by–confusingly–using column-row ordering for the tuples, rather than row-column ordering which one might expect. This is, however mentioned in the documentation:

The coordinates are given as (column, row) which follows the spreadsheet ‘A1’ model, but not the more natural (for mathematicians) ‘RC’ ordering.

Finally, we specify the alignment setting to use, in this case RIGHT.

Since we also want to vertically centre the elements in the document title block, we also use the VALIGN command like so:

('VALIGN', (0, 0), (-1, -1), 'MIDDLE')

where we have defined a range of cells from the upper-left cell (0, 0) to the “last” cell in each dimension (-1, -1) by using the Python convention of specifying the last element of a list by using negative indices. The vertical alignment is then set to MIDDLE so that the objects are nicely vertically distributed around their centre lines.

from reportlab.platypus import SimpleDocTemplate, Paragraph, Image, Table
from reportlab.lib.styles import getSampleStyleSheet as getBaseStyleSheet
from reportlab.lib.units import mm
from reportlab.lib.enums import TA_LEFT


def render():
    logo_filename = "sample-logo.png"
    logo = Image(logo_filename, width=20*mm, height=10*mm)

    stylesheet = getBaseStyleSheet()
    title_style = stylesheet["Title"]
    title_style.alignment = TA_LEFT

    title_text = Paragraph("Document title", title_style)

    table_style = [
        ('ALIGN', (1, 0), (1, 0), 'RIGHT'),
        ('VALIGN', (0, 0), (-1, -1), 'MIDDLE')
        ]
    title = [Table([[title_text, logo]], style=table_style)]

    doc = SimpleDocTemplate("document-title-block.pdf")
    doc.build(title)


if __name__ == "__main__":
    render()

which gives this output:

Document title block with styles

and is much closer to the styling we want for this part of the document.

Let’s now turn our attention to the middle part of the document, which, as we’ll see, requires a bit more thought to get the layout right.

Building the info and image boxes

As a reminder, the section of the document that we want to build looks like this:

Info and image box block

Naively, (and after having read the documentation) one could assume that since we want a two-column layout here, that the BalancedColumns class would be the best thing to use. Unfortunately, this isn’t the right thing to use in this case because we want the info boxes to always be in the left-hand column and we want the image boxes to always be in the right-hand column. Using the BalancedColumns class would make elements flow from one column into the other in order to balance things out, which is probably what one wants in most situations, however (just to be different!) we don’t want that here: the columns have to be unbalanced and we don’t want any excess from one column spilling over into the other column automatically.

To get this layout right, I’m going to use a single-column Table just for the info boxes, and a single-column Table just for the image boxes. Then I’m going to embed these single-column tables into a single-row, two-column Table so that we can get the desired effect. Embedding tables within a table allows the vertical content to stay within its given column and hence allows us to have unbalanced columns in the main table.

The left-hand column ends up being quite simple, it’s just a single-column table of Paragraph objects:

    left_table_style = [
        ('BOX', (0, 0), (-1, -1), 2, red),  # highlight table border
        ]
    left_column = Table(
        [
            [Paragraph(lorem.paragraph())],
            [Paragraph(lorem.paragraph())],
            [Paragraph(lorem.paragraph())],
        ], style=left_table_style)

where I’ve added a red border to highlight the table’s extent. Note that the red colour is imported from the colors package:

from reportlab.lib.colors import red

For the right-hand column, we load an image, scale it to be a bit under half the width of the document (so that it fits nicely within the column) and then add it twice to a single-column Table via an Image object:²

    doc = SimpleDocTemplate("info-and-image-box-block.pdf")

    image_fname = "colosseum-rome-2004.jpg"
    img_width, img_height = PILImage.open(image_fname).size
    scaled_img_width = doc.width/2.2
    scaled_img_height = img_height*scaled_img_width/img_width

    right_table_style = [
        ('BOX', (0, 0), (-1, -1), 2, red),  # highlight table border
        ]
    right_column = Table(
        [
            [Image(image_fname,
                   width=scaled_img_width, height=scaled_img_height)],
            [Image(image_fname,
                   width=scaled_img_width, height=scaled_img_height)],
        ], style=right_table_style)

where I’ve again added a red border to highlight this table’s extent and I’ve had to instantiate the doc object earlier so that I can get access to its width parameter.

Now we just combine these two tables in a Table with one row and two columns and build the document:

    table_style = [
        ('BOX', (0, 0), (-1, -1), 2, red),  # highlight table border
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        ('ALIGN', (1, 0), (-1, -1), 'CENTER')
        ]
    columns = [Table([[left_column, right_column]], style=table_style)]

    doc.build(columns)

where all elements in this table (namely the two sub-tables) are aligned vertically to the top of the encapsulating table, and the right-hand column is aligned in the centre which looks a bit nicer than the default left-justification.

Putting this all together we get:

from reportlab.platypus import SimpleDocTemplate, Paragraph, \
    Image, Table
from reportlab.lib.colors import red

from PIL import Image as PILImage

from lorem_text import lorem


def render():
    doc = SimpleDocTemplate("info-and-image-box-block.pdf")

    left_table_style = [
        ('BOX', (0, 0), (-1, -1), 2, red),  # highlight table border
        ]
    left_column = Table(
        [
            [Paragraph(lorem.paragraph())],
            [Paragraph(lorem.paragraph())],
            [Paragraph(lorem.paragraph())],
        ], style=left_table_style)

    image_fname = "colosseum-rome-2004.jpg"
    img_width, img_height = PILImage.open(image_fname).size
    scaled_img_width = doc.width/2.2
    scaled_img_height = img_height*scaled_img_width/img_width

    right_table_style = [
        ('BOX', (0, 0), (-1, -1), 2, red),  # highlight table border
        ]
    right_column = Table(
        [
            [Image(image_fname,
                   width=scaled_img_width, height=scaled_img_height)],
            [Image(image_fname,
                   width=scaled_img_width, height=scaled_img_height)],
        ], style=right_table_style)

    table_style = [
        ('BOX', (0, 0), (-1, -1), 2, red),  # highlight table border
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        ('ALIGN', (1, 0), (-1, -1), 'CENTER')
        ]
    columns = [Table([[left_column, right_column]], style=table_style)]

    doc.build(columns)


if __name__ == "__main__":
    render()

which looks like this:

Info and image box block with extents marked in red

Note that this output can be cleaned up (i.e. the red borders can be removed) by removing the BOX table formatting commands.

Building the comments block

Creating the comments block is by far the easiest part about constructing this document:

from reportlab.platypus import SimpleDocTemplate, Paragraph

from lorem_text import lorem

def render():
    comments = [Paragraph(lorem.paragraphs(4))]

    doc = SimpleDocTemplate("comments-block.pdf")
    doc.build(comments)

if __name__ == "__main__":
    render()

which, as one would expect, gives this output:

Comments block

We now just need to stitch all these pieces together in a single document.

Building the full document

Using the code from the previous three sections, we can create a story which contains all of the objects we want to add to the document as well as their styling. The main task of laying out the code on the page is handled by the PLATYPUS framework. We remove the red borders from the tables in the info and image box block and reduce the length of the comments block slightly. Making these changes gives us the following code for the full document:

from reportlab.platypus import SimpleDocTemplate, Paragraph, Image, Table
from reportlab.lib.styles import getSampleStyleSheet as getBaseStyleSheet
from reportlab.lib.units import mm
from reportlab.lib.enums import TA_LEFT

from PIL import Image as PILImage
from lorem_text import lorem


def render():
    doc = SimpleDocTemplate("full-document.pdf")

    # document title block
    logo_filename = "sample-logo.png"
    logo = Image(logo_filename, width=20*mm, height=10*mm)

    stylesheet = getBaseStyleSheet()
    title_style = stylesheet["Title"]
    title_style.alignment = TA_LEFT

    title_text = Paragraph("Document title", title_style)

    table_style = [
        ('ALIGN', (1, 0), (1, 0), 'RIGHT'),
        ('VALIGN', (0, 0), (-1, -1), 'MIDDLE')
        ]
    title = Table([[title_text, logo]], style=table_style)

    # info and image box block
    left_column = Table(
        [
            [Paragraph(lorem.paragraph())],
            [Paragraph(lorem.paragraph())],
            [Paragraph(lorem.paragraph())],
        ])

    image_fname = "colosseum-rome-2004.jpg"
    img_width, img_height = PILImage.open(image_fname).size
    scaled_img_width = doc.width/2.2
    scaled_img_height = img_height*scaled_img_width/img_width

    right_column = Table(
        [
            [Image(image_fname,
                   width=scaled_img_width, height=scaled_img_height)],
            [Image(image_fname,
                   width=scaled_img_width, height=scaled_img_height)],
        ])

    table_style = [
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        ('ALIGN', (1, 0), (-1, -1), 'CENTER')
        ]
    columns = Table([[left_column, right_column]], style=table_style)

    # comments block
    comments = Paragraph(lorem.paragraphs(2))

    story = [
        title,
        columns,
        comments
    ]

    doc.build(story)


if __name__ == "__main__":
    render()

which, when run, gives this output:

Full document output

Playing around with the amount of text provided by the lorem package, we can see that the comments block text naturally and automatically flows onto a subsequent page should the amount of content on the first page be too large.

We did it!

We did it! Woohoo!

Fixing the bug

Now comes the tricky part: fixing the bug in the production code.

Fortunately, we have lots of tests (at the time of writing, 259 tests with 84% total code coverage and 97% coverage in the affected module), therefore we can be fairly confident that any changes we make won’t randomly break something else.

First, we’ll need to add a failing test which reproduces the problem; this will catch any problems should a regression occur in the future. Then, we’ll update the production code to fix the issue and consequently should see the test pass.

Writing a failing test to expose the problem

In order to avoid leaking internal information about the system, I’m just going to describe the test I ended up creating in words:

set up the object used to generate the PDF document.
ensure that the text in the comment block is too big to fit into its assigned space (by repeating the string This text is visible in the generated PDF. many times.
assert that the string should appear in the output PDF (when compression is turned off, the string will appear as text in the raw PDF file).

As expected, this test fails. I also checked that the test passes when the text in the comment block isn’t too big for its assigned space, that way we know that the test is working correctly.

Now that we know we’ve got the right test, we can comment it out temporarily and then refactor the current implementation to replace a collection of Frame objects with a SimpleDocTemplate; in other words (paraphrasing Kent Beck):

Make the change easy, then make the easy change.

Refactoring the implementation to allow for more flexibility

The original implementation had a special method which extracted the PDF page content before it was saved to file so that the content (which is basically a string at this point) could be tested. IIRC, the reason for this method existing was because (at the time) the output PDF couldn’t be read as a string and hence we couldn’t check that the expected text elements appeared in the output. While working on this bug fix I noticed that PDF compression was the reason that one couldn’t use the saved PDF file to check for expected text elements.

A bit of testing showed that (in the case for the documents produced here) compression didn’t bring much of a win (compressed and uncompressed files were ~400kB and ~500kB respectively), so by importing the rl_config module and setting pageCompression to zero:

from reportlab import rl_config

rl_config.pageCompression = 0

meant that we can now, theoretically, tap into the saved PDF output to check for the presence of expected text elements.

Although it’s a bit of an anti-pattern to have a method on a class just for testing, I decided to follow this pattern and add an extra method to access the PDF output built via SimpleDocTemplate before it was saved to file. This allowed me to reproduce the assertions in the currently-available test code (as well as maintain the general “shape” of the test code) so that I could build a “shadow” version of the PDF output using SimpleDocTemplate, however without affecting the production code. This means that one could have taken any commit and pushed it to production and the code would still have used the original PDF generation implementation even though the new implementation was only partially available.

This might sound like extra work (and it is), however it means that each commit passes the test suite (which then allows us to use git bisect automatically in the future if necessary) and means that each commit is potentially releasable, which is of great help in a continuous integration environment.

I extended each test so that it tested both the original and the new implementation which in turn helped guide making the new implementation. The new code was thus built incrementally and checked against an already-available test suite to ensure the desired functionality was available in both the new implementation as well as the original implementation. Having already made a prototype implementation as discussed in previous sections helped a lot in this process because the problem space had already been well explored.

Out with the old; in with the new

Once the new implementation was in place–and once the page formatting was fine-tuned to better match the original implementation–it was possible to simply delete the old implementation and the tests of the old implementation and things “just worked”!

In the end I was also able to remove the special function used to access the PDF output for testing because the final render() method was able to be used directly, since page compression was found to be no longer necessary. Therefore, an anti-pattern “smell” could be removed from the code as well! Win-win!

So, after squashing a few commits that were doing the same task and reorganising things a bit, it turned out to take 41 commits to refactor the code and make this bug fix.

Rolling out to production

I contacted the main stakeholder to get feedback as to whether or not the output from the new implementation is acceptable. In particular, the page layout changed slightly from the original implementation as there is less extra vertical whitespace present and I needed to confirm that such changes (which look ok to me) are ok for people using the system; after all, they’re the ones who are most affected by any changes.

They were happy with the new output and everything has now been rolled out on the production system. Job done! Yay!

That was fun!

Dumbledore - that was fun

Well, that was fun! Not only was the problem solved for the user, but I also got to learn something at the same time.

I hope that this story was helpful and even interesting! If you have any comments, questions or feedback, feel free to ping me on Mastodon or drop me a line via email.

Also, I’m sure there’s a classicist screaming somewhere at my use of the word “lorems”, but I digress. ↩
I use two images here in the right-hand column to simulate the situation in the production environment more closely. ↩

Support

If you liked this post and want to see more like this, please buy me a coffee!

Share on

Twitter Facebook LinkedIn Reddit Hacker News Mastodon

Paul Cochrane

The curious case of the disappearing PDF text

It all began with a bug

We need to know where we’re going before we can go there

A very basic document

A closer approximation

Reproducing the bug

Understanding bugs can create new requirements

Bugs are opportunities

A short excursion into building documents with ReportLab

Text spills onto subsequent pages

Finally building the full document

Building the document title block

A very brief introduction to ReportLab styles

Using ReportLab styles for the title text

Styling a table with ReportLab

Building the info and image boxes

Building the comments block

Building the full document

Fixing the bug

Writing a failing test to expose the problem

Refactoring the implementation to allow for more flexibility

Out with the old; in with the new

Rolling out to production

That was fun!

Support

Share on

You may also enjoy

German Perl/Raku Workshop Frankfurt 2024

Sailplane glide distance

Starting `ssh-agent` in Windows PowerShell

Letting mere mortals run Windows PowerShell scripts

Paul Cochrane

It all began with a bug

We need to know where we’re going before we can go there

A very basic document

A closer approximation

Reproducing the bug

Understanding bugs can create new requirements

Bugs are opportunities

A short excursion into building documents with ReportLab

Text spills onto subsequent pages

Finally building the full document

Building the document title block

A very brief introduction to ReportLab styles

Using ReportLab styles for the title text

Styling a table with ReportLab

Building the info and image boxes

Building the comments block

Building the full document

Fixing the bug

Writing a failing test to expose the problem

Refactoring the implementation to allow for more flexibility

Out with the old; in with the new

Rolling out to production

That was fun!

Support

Share on

You may also enjoy

German Perl/Raku Workshop Frankfurt 2024

Sailplane glide distance

Starting ssh-agent in Windows PowerShell

Letting mere mortals run Windows PowerShell scripts

Starting `ssh-agent` in Windows PowerShell