Here’s the ideal:
- I take photos of the (printed) documents I want with my phone's camera
- photos save to Google Drive
- Tropy project reads those photos from Google Drive
- I use Tesseract to OCR those documents
- The result is added as a note to each document in Tropy
Ideally, that’d all happen automatically. So far, here’s what I can do:
- take photos with the camera
- find the photos on the camera, upload them to Google Drive
- in Tropy, I import the photos from that folder
- in RStudio, I run a batch OCR script that uses Tesseract
- I manually add the resulting text into the notes field in Tropy
For reference, here’s my batch OCR script:
library("tesseract")
library("magick")
library("magrittr")

# load 'em up
dest <- "/path/to/images"
# match only names ending in .jpg, so a rerun skips the -ocr.txt files written below
myfiles <- list.files(path = dest, pattern = "\\.jpg$", full.names = TRUE)

# improve the images
# ocr 'em
# write the output to text file
lapply(myfiles, function(i){
  text <- image_read(i) %>%
    image_resize("3000x") %>%
    image_convert(type = 'Grayscale') %>%
    image_trim(fuzz = 40) %>%
    image_write(format = 'png', density = '300x300') %>%
    tesseract::ocr()

  outfile <- paste(i,"-ocr.txt",sep="")
  cat(text, file=outfile, sep="\n")
})
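The script above leaves one `-ocr.txt` file beside each image, which still means opening every file by hand before pasting into Tropy. One way to ease that step might be to gather the text files into a single table. This is only a sketch in base R: `collect_ocr` is a hypothetical helper name of my own, and it assumes the `<image>-ocr.txt` naming that the script produces. It does nothing on the Tropy side.

```r
# Gather the "<image>-ocr.txt" files written by the batch script into one
# data frame with one row per image. collect_ocr is a hypothetical helper,
# not part of tesseract, magick, or Tropy.
collect_ocr <- function(dir) {
  txt_files <- list.files(path = dir, pattern = "-ocr\\.txt$", full.names = TRUE)
  data.frame(
    # strip the suffix to recover the original image file name
    image = sub("-ocr\\.txt$", "", basename(txt_files)),
    # read each text file back in as a single string
    text = vapply(
      txt_files,
      function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
      character(1),
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}
```

Calling `collect_ocr(dest)` after the batch run should give an `image` column and a `text` column, which could then be saved with `write.csv()` and kept open next to Tropy while filling in the notes.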