Serverless PDF Export

When we're in a browser, we can easily save a page as a PDF using Print + Save as PDF. However, sometimes we need a programmatic way to export a web page to PDF automatically. From a terminal or during local development, Puppeteer handles this easily because the functionality is built in. But Puppeteer depends on the availability of a web browser, which isn't so simple within a serverless function, such as a Firebase or Google Cloud function.

The steps involved are:

Install the Browser
Get a reference to the browser
Load the web page
Export to PDF
Save the PDF to the Storage Bucket

When trying to get this to work, I ran into a few issues. First, there's no browser natively available within the cloud function environment on GCP. Second, when I tried to install the browser at deploy time, it was either installed only on the system performing the deploy (not in the target environment) or added to the deploy bundle to the tune of over 250 MiB, which is far too large for a typical serverless bundle.

Because of this, I chose to perform the browser installation as part of the main function execution. Initially, I assumed that installing the browser would be extremely time-consuming, but in practice I found it added only a couple of seconds to the cold start time of the function. That's too slow for most purposes, but manageable in the context of converting a page to PDF.

Setup

Start by installing Puppeteer.

npm install puppeteer --save

Puppeteer Service

Create a Puppeteer service to install a browser and retrieve a reference to it.

import { Browser, BrowserPlatform, install, InstalledBrowser } from '@puppeteer/browsers'
import * as puppeteer from 'puppeteer'

export class PuppeteerService {
  // Cached so repeated calls on the same instance install only once
  private installedBrowser: InstalledBrowser | null = null

  // Downloads Chrome into the cache directory on first use
  async installBrowser(): Promise<InstalledBrowser> {
    if (!this.installedBrowser) {
      this.installedBrowser = await install({
        browser: Browser.CHROME,
        buildId: '146.0.7680.177',
        platform: BrowserPlatform.LINUX,
        cacheDir: '/tmp/puppeteer-cache',
      })
    }
    return this.installedBrowser
  }

  // Launches a headless browser from the installed executable
  async getHeadlessBrowser(): Promise<puppeteer.Browser> {
    const { executablePath } = await this.installBrowser()
    return puppeteer.launch({
      executablePath,
      args: [ '--no-sandbox', '--disable-setuid-sandbox' ],
      headless: true,
    })
  }
}

The installBrowser() function makes it possible to request browser installation at any time we want, and the getHeadlessBrowser() function returns a reference to a headless browser instance. To install the browser, you must specify a build ID that is actually available for download, so this is a good value to parameterize, and keeping it current becomes an ongoing maintenance task. The cacheDir I've provided works in Cloud Functions but might need to be customized for your environment.
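As a minimal sketch of that parameterization, the build ID and cache directory could be read from environment variables with the hard-coded values as fallbacks. The variable names PDF_CHROME_BUILD_ID and PDF_BROWSER_CACHE_DIR and the resolveBrowserConfig helper are hypothetical, made up for this illustration; they aren't part of Puppeteer or Firebase.

```typescript
// Hypothetical helper: the environment variable names below are illustrative
// assumptions, not settings that Puppeteer or Firebase read on their own.
interface BrowserInstallConfig {
  buildId: string
  cacheDir: string
}

// Defaults mirror the values hard-coded in the service above.
const DEFAULT_BUILD_ID = '146.0.7680.177'
const DEFAULT_CACHE_DIR = '/tmp/puppeteer-cache'

function resolveBrowserConfig(
  env: Record<string, string | undefined>
): BrowserInstallConfig {
  // Blank or missing values fall back to the defaults.
  const buildId = env['PDF_CHROME_BUILD_ID']?.trim() || DEFAULT_BUILD_ID
  const cacheDir = env['PDF_BROWSER_CACHE_DIR']?.trim() || DEFAULT_CACHE_DIR
  return { buildId, cacheDir }
}

// Possible usage inside installBrowser(), assuming a Node.js runtime:
//   const { buildId, cacheDir } = resolveBrowserConfig(process.env)
//   await install({ browser: Browser.CHROME, platform: BrowserPlatform.LINUX, buildId, cacheDir })
```

With something like this in place, bumping the Chrome build becomes a configuration change rather than a code change.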

The full path to the installed browser's executable must be passed to the launch(...) method, along with the parameters needed for headless use. The term "headless" here means "without rendering the UI". This matters both because it's faster and because many server environments lack the UI libraries that would be needed to render a browser window.
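Note that the service remembers the installation on the instance, so the cache only helps when the same instance is reused. One option, sketched here under that assumption, is to share a single installation promise at module scope so that only the first request on a cold instance pays the install cost. The once helper and the fakeInstall stand-in are hypothetical; in real code the task would wrap the install(...) call.

```typescript
// Generic helper: runs `task` at most once and hands the same promise to
// every caller. If the task fails, the cached promise is cleared so that a
// later call can retry.
function once<T>(task: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | null = null
  return () => {
    if (!cached) {
      cached = task().catch(error => {
        cached = null
        throw error
      })
    }
    return cached
  }
}

// Stand-in for the real install(...) call; it counts how often it runs.
let installCount = 0
const fakeInstall = async (): Promise<string> => {
  installCount += 1
  return '/tmp/puppeteer-cache/chrome'
}

// Declared at module scope, so it would be shared by all requests that are
// handled by the same function instance.
const installBrowserOnce = once(fakeInstall)

// Possible usage in a request handler:
//   const executablePath = await installBrowserOnce()
```

Concurrent callers receive the same pending promise, so two requests arriving together would not trigger two downloads.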

Function

The code below sets up an onCall function called createPDF, which takes a URL parameter and writes a PDF file to the requesting user's downloads folder in the Storage bucket, using the Puppeteer page.pdf(...) function to do the heavy lifting. Here I've specified 2 CPUs and 2 GiB of memory for good measure. You may also find that you need to increase the timeout, depending on the load time of the website(s) you're exporting PDFs from.

import * as fs from 'fs'
import { tmpdir } from 'os'
import { join } from 'path'
import { pipeline } from 'stream/promises'
import { onCall } from 'firebase-functions/v2/https'

// handleHttpError and getStorageBucket are project helpers (not shown here)

export const createPDF = onCall<{ url: string }>(
  { cpu: 2, memory: '2GiB' },
  async request => {
    const uid = request.auth?.uid
    const { url } = request.data

    if (!uid) handleHttpError([ 'unauthenticated', 'Please log in.' ])

    // Set up the browser
    const puppeteerService = new PuppeteerService()
    const browser = await puppeteerService.getHeadlessBrowser()

    try {
      const page = await browser.newPage()

      // Load the page
      await page.goto(url, { waitUntil: 'domcontentloaded' })

      // Export the PDF to the function's temp directory
      const title = await page.title()
      const path = join(tmpdir(), `${title}.pdf`)
      await page.pdf({
        path,
        format: 'Letter',
        printBackground: true,
      })

      // Set up the Storage bucket file, replacing any previous export
      const bucket = (await getStorageBucket()).bucket()
      const file = bucket.file(`downloads/${uid}/${title}.pdf`)
      const [ exists ] = await file.exists()
      if (exists) {
        await file.delete()
      }

      // Copy the PDF to the Storage bucket
      const inStream = fs.createReadStream(path)
      const outStream = file.createWriteStream()
      await pipeline(inStream, outStream)
    } finally {
      // Close the browser, even if an earlier step failed
      await browser.close()
    }
  }
)

The PDF is first written to a temp directory on the function's local filesystem, and then the stream pipeline(read, write) function pipes that file into the Storage bucket file. Finally, don't forget to clean up by closing the browser.
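One detail the function leaves open is that the page title is used directly as a file name, and titles can contain characters such as / that change the meaning of a path. A minimal sketch of one way to handle that follows; the toSafeFileName helper, its 100-character limit, and the 'export' fallback are assumptions made for illustration, not part of the original function.

```typescript
// Hypothetical helper: replaces characters that are awkward in file paths or
// object names with a dash, collapses runs of whitespace, limits the length,
// and falls back to a default name when nothing usable is left.
function toSafeFileName(title: string, fallback = 'export'): string {
  const cleaned = title
    .replace(/[\/\\:*?"<>|\x00-\x1f]/g, '-')
    .replace(/\s+/g, ' ')
    .trim()
    .slice(0, 100)
    .trim()
  return cleaned.length > 0 ? cleaned : fallback
}

// Possible usage in createPDF:
//   const title = toSafeFileName(await page.title())
```

The same cleaned title would then be used for both the temp file path and the Storage bucket file name.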

Summary

The simplicity of the code masks the fact that working out how to do this, and testing it in a live serverless environment, took some effort. If this helps you solve a problem on your project, please reach out and let me know!
