Searching Inside PDFs from the Terminal Prompt

I have reason, today, to want to search the Military Law Review. If you know which issue contains the info you’re looking for, then you can just jump right in.

When do we ever know that? There’s no search-inside feature. So we’ll build one ourselves. After a bit of futzing, you can see that all of the pdfs are available in this one directory:

https://www.loc.gov/rr/frd/Military_Law/Military_Law_Review/pdf-files/

so

$ wget https://www.loc.gov/rr/frd/Military_Law/Military_Law_Review/pdf-files/ -A .pdf

should just download them all directly. But it doesn’t. However, you can copy the page’s source html into a text editor, and with a bit of regex you end up with a file containing just the direct urls of the pdfs, one per line. Pass that file to wget as -i urls.txt, and you end up with a corpus of materials.
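
If you would rather not do the regex step by hand, here is a rough sketch of the same idea in the shell. It assumes you saved the directory listing as listing.html (a made-up name), that the links appear as href="something.pdf" attributes, and that any relative link is a bare filename sitting directly under the directory url. extract_pdf_urls is a hypothetical helper written for this post, not something wget provides.

```shell
# Hypothetical helper: print one absolute pdf url per line.
#   $1 = saved html of the directory listing (assumed name: listing.html)
#   $2 = base url that relative links are resolved against (must end in /)
extract_pdf_urls() {
    # Pull out every href="...pdf" attribute, case-insensitively.
    grep -oiE 'href="[^"]+\.pdf"' "$1" |
        # Strip the leading href=" and the trailing quote.
        sed -E 's/^[^"]*"//; s/"$//' |
        while IFS= read -r href; do
            case "$href" in
                # Already absolute: keep as is.
                http://*|https://*) printf '%s\n' "$href" ;;
                # Assumed to be a bare filename: prefix the base url.
                *) printf '%s%s\n' "$2" "$href" ;;
            esac
        done | sort -u
}

# Possible usage (not run here, since it needs the saved page and the network):
#   extract_pdf_urls listing.html https://www.loc.gov/rr/frd/Military_Law/Military_Law_Review/pdf-files/ > urls.txt
#   wget -i urls.txt
```

If the page turns out to use root-relative links (starting with /), the prefixing branch would need adjusting, so eyeball urls.txt before handing it to wget.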

How do we search inside? This question on Stack Overflow will help us out. But it requires pdftotext to be installed. Sigh. Always dependencies! So, following this, here we go.

On the command line (with Anaconda installed):

conda create -n envname python=3.7
conda activate envname
conda config --add channels conda-forge
conda install poppler

The pdfs are in a folder called ‘MLR’ on my machine. From one level up:

$ find MLR -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "trophies"' sh {} \;

Et voilà!
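
The one-liner re-runs pdftotext on every file for every search. If you plan to search more than once, a possible variation is to convert each pdf to text one time and grep the text copies afterwards. This is only a sketch: index_pdfs and search_pdfs are names invented here, and it assumes pdftotext is on your PATH and that none of the file names contain newlines.

```shell
# Hypothetical helper: write a .txt copy next to every pdf under directory $1.
# Files that already have a .txt copy are skipped, so it is safe to re-run.
index_pdfs() {
    find "$1" -name '*.pdf' | while IFS= read -r pdf; do
        txt="${pdf%.pdf}.txt"
        [ -e "$txt" ] || pdftotext "$pdf" "$txt"
    done
}

# Hypothetical helper: list the text copies under $1 that mention $2.
# -r recurses, -i ignores case, -l prints only the matching file names.
search_pdfs() {
    grep -ril --include='*.txt' -- "$2" "$1"
}

# Possible usage, from one level up:
#   index_pdfs MLR
#   search_pdfs MLR trophies
```

The trade-off is disk space for speed: the text copies sit alongside the pdfs, and each new search term only costs a grep.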