Tropy – OCR – Notes Workflow?

Here’s the ideal:

  • I take photos of the (printed) documents I want with my phone camera
  • the photos save to Google Drive
  • a Tropy project reads those photos from Google Drive
  • I use Tesseract to OCR the photos
  • the resulting text is added as a note to each document in Tropy

Ideally, that’d all happen automatically. So far, here’s what I can do:

  • take photos with the camera
  • find the photos on the camera, upload them to Google Drive
  • in Tropy, I import the photos from that folder
  • in RStudio, I run a batch OCR script that uses Tesseract
  • I manually add the resulting text into the notes field in Tropy
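
One way to take some of the sting out of that last manual step might be to gather all the OCR output into a single table, so the text for each image sits in one place ready to paste into Tropy. This is only a sketch in base R: `collect_ocr_notes` is a made-up helper name, and it assumes the `-ocr.txt` naming that the batch script below uses. It does not talk to Tropy at all.

```r
# Hypothetical helper (not part of Tropy or tesseract): read every
# "<image>-ocr.txt" file in a folder into one data frame.
collect_ocr_notes <- function(dir) {
  txt_files <- list.files(path = dir, pattern = "-ocr\\.txt$", full.names = TRUE)
  data.frame(
    # strip the suffix to recover the image file name, e.g. "photo.jpg"
    image = sub("-ocr\\.txt$", "", basename(txt_files)),
    # collapse each file's lines back into a single string
    note = vapply(
      txt_files,
      function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}
```

Calling `collect_ocr_notes("/path/to/images")` after a batch run should give one row per OCR'd image, with the image name in one column and its text in the other.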

For reference, here’s my batch OCR script:

library("tesseract")
library("magick")
library("magrittr")

# load 'em up
dest <- "/path/to/images"
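# match only names ending in .jpg/.jpeg; a bare "jpg" pattern would also
# pick up the "photo.jpg-ocr.txt" files this script writes
myfiles <- list.files(path = dest, pattern = "\\.jpe?g$", ignore.case = TRUE, full.names = TRUE)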

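# improve the images
# ocr 'em
# write the output to text file

invisible(lapply(myfiles, function(i) {
  text <- image_read(i) %>%
    image_resize("3000x") %>%                              # scale to 3000 px wide
    image_convert(type = "Grayscale") %>%                  # drop the colour
    image_trim(fuzz = 40) %>%                              # crop away the border
    image_write(format = "png", density = "300x300") %>%   # raw PNG bytes, 300 dpi
    tesseract::ocr()                                       # default English engine

  # e.g. photo.jpg -> photo.jpg-ocr.txt, saved next to the image
  outfile <- paste0(i, "-ocr.txt")
  cat(text, file = outfile, sep = "\n")
}))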
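
Since new photos keep landing in the same folder, it could help to OCR only the images that don't have a text file yet, rather than redoing the whole batch every time. A minimal sketch, assuming the `-ocr.txt` naming from the script above; `pending_images` is a hypothetical helper name, not something from tesseract or magick.

```r
# Hypothetical helper: keep only the images with no "-ocr.txt" file beside them.
pending_images <- function(files) {
  files[!file.exists(paste0(files, "-ocr.txt"))]
}

# Possible use, just before the lapply() call:
# myfiles <- pending_images(myfiles)
```

With that filter in place, re-running the script after an import should only touch the new photos.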