How to Extract the Text from PDFs Using Python and the Google Cloud Vision API



This winter, I discovered that Wellesley College, where I am currently a senior studying Media Arts and Sciences, has an archive of over a hundred years’ worth of course catalogues, admissions guidelines, and yearly bulletins. I was immediately electrified by the fascinating data that could be drawn from these documents, but the first step would have to be converting them to text, since few analytical methods can be run on scans of old, browned PDFs.

Thus began my search for a way to quickly and effectively run OCR on a large volume of PDF files while retaining as much formatting and accuracy as possible. After trying several methods, I found that the Google Cloud Vision API yielded by far the best results of any of the publicly available OCR tools I tried. As I could not find a single, comprehensive guide to using it for simple OCR applications, I decided to write this one, so that anyone with a little programming knowledge can put this tool to use.

What You Will Need to Follow These Instructions

  • An installation of Python 3 and pip on your computer

  • A text editor for editing code — I use Visual Studio Code

  • A way to run Python programs on your computer, such as a terminal or command prompt

  • You will also need a payment method to enter into your Google Cloud account, although you will not need to spend any money to complete this tutorial. A debit card, credit card, or Google Wallet account will do.
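
If you are not sure whether the first requirement in the list above is met, a short script along these lines can check it. This is only a minimal sketch: the function name check_prerequisites is made up for illustration, and it only tests whether the interpreter running it is Python 3 and whether the pip module can be found by that interpreter. It does not check anything related to Google Cloud.

```python
import importlib.util
import sys


def check_prerequisites():
    """Report whether this interpreter is Python 3 and whether pip is importable.

    The function name is hypothetical; it is not part of any library.
    """
    return {
        # True when the running interpreter is Python 3 or newer.
        "python3": sys.version_info.major >= 3,
        # True when the pip package is visible to this interpreter.
        "pip": importlib.util.find_spec("pip") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

Save it under any file name you like and run it the same way you would run any other Python program. If pip is reported as missing, it needs to be installed for that interpreter before continuing.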