MATLAB ships with the Apache PDFBox Java library, which allows importing and rendering PDF files. Use the following MATLAB function PDFtoImg() to import a scanned PDF and save each page as a separate PNG file:
function images = PDFtoImg(pdfFile)
import org.apache.pdfbox.*
import java.io.*
filename = fullfile(pwd, pdfFile);
jFile = File(filename);
document = pdmodel.PDDocument.load(jFile);
pdfRenderer = rendering.PDFRenderer(document);
count = document.getNumberOfPages();
images = [];
for ii = 1:count
    bim = pdfRenderer.renderImageWithDPI(ii-1, 300, rendering.ImageType.RGB);
    images = [images (filename + "-" + "Page" + ii + ".png")];
    tools.imageio.ImageIOUtil.writeImage(bim, filename + "-" + "Page" + ii + ".png", 300);
end
document.close()
The input variable "pdfFile" must be a string or a character array that names a PDF file in the current folder, for example "example.pdf".
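A call might look like the sketch below. It assumes that PDFtoImg.m is on the MATLAB path and that a file named "example.pdf" (a placeholder name) sits in the current folder, since the function builds the path with fullfile(pwd, pdfFile).

```matlab
% Assumption: "example.pdf" is a placeholder; substitute your own PDF
% located in the current folder (pwd).
images = PDFtoImg("example.pdf");      % string input
% images = PDFtoImg('example.pdf');    % a character array works the same way

% For a PDF with at least one page, images is a string array holding
% the full path of each PNG file that was written.
fprintf("Wrote %d page image(s)\n", numel(images));
disp(images(:));
```

The returned paths can be passed directly to functions such as imread for further processing.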
Notes:
1. The function splits the input PDF into one image per page and writes the images next to the PDF in the current folder. For example, if “example.pdf” contains 13 pages, it produces 13 images named “example.pdf-Page1.png” through “example.pdf-Page13.png”.
2. For subsequent OCR tasks, it is important to render the PDF pages at 300 dpi or higher resolution:
>> bim = pdfRenderer.renderImageWithDPI(ii-1, 300, rendering.ImageType.RGB);
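As a rough sketch of the OCR step mentioned in note 2, the PNG files can be read back and passed to the ocr function. This assumes the Computer Vision Toolbox is installed and again uses "example.pdf" as a placeholder file name; the variable names pageText, page and result are illustrative only.

```matlab
% Assumptions: Computer Vision Toolbox is available (for ocr), and
% "example.pdf" is a placeholder for a scanned PDF in the current folder.
images = PDFtoImg("example.pdf");

pageText = strings(1, numel(images));      % one entry per page
for k = 1:numel(images)
    page = imread(images(k));              % load the PNG rendered for page k
    result = ocr(page);                    % ocr returns an ocrText object
    pageText(k) = string(result.Text);     % recognized text of the page
end
```

Each element of pageText then holds the text recognized on the corresponding page. Recognition quality depends on the scan, so the rendering resolution may need to be raised above 300 dpi for small print.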