Contextures had a nice post last week on converting a PDF to an Excel file using pdftoexcelconverter.net. Talk about a URL that says what it does. Jeff Weir commented that he’s used the OCR capabilities of OneNote to do the same thing. I thought a test was in order.
The original Excel file that was printed to PDF using CutePDF
I followed Debra’s instructions and got this result from pdftoexcelconverter.net:
It converted the font to Times New Roman and ignored borders, bold, merged cells, underlines, and italics. For pure data, though, it did a pretty good job.
Next, I dragged and dropped the PDF into OneNote and chose the “Print Out” option. I right-clicked on the image, chose “Copy Text…”, and pasted into Excel.
Yikes. Font conversion is the least of my worries here.
Finally, I dragged and dropped the PDF into Google Docs and chose the options to convert it to Google Docs format and to OCR it. Google Docs converts PDFs to Documents (not spreadsheets), so I wasn’t very hopeful. I didn’t see a way to convert the Doc to a spreadsheet, so I saved it as HTML and then opened it in Excel.
The whole table is one cell. My conclusion is that OCRing tables is hard.
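When everything lands in one cell like this, one possible salvage is to split the text back into rows and columns outside of Excel. The following is only a rough sketch, not something I ran against the files above. It assumes the OCR output keeps one table row per line and separates columns with a tab or a run of two or more spaces, which real OCR output may not do. The function name `split_ocr_text` and the sample text are made up for illustration.

```python
import re

def split_ocr_text(text):
    """Split a blob of OCR'd text into a list of rows, each a list of cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        # Assumption: a column break is a tab or a run of 2+ spaces/tabs.
        # Single spaces are kept, so "Unit Price" stays in one cell.
        rows.append(re.split(r"[\t ]{2,}|\t", line))
    return rows

# Hypothetical sample, standing in for the text pasted into one cell
sample = "Name  Qty  Unit Price\nWidget  2  3.50\n\nGadget\t10\t1.25"
rows = split_ocr_text(sample)
for row in rows:
    print(row)
```

Rows that come back with a different number of cells than the header are a sign that the spacing assumption failed for that line, and those would still need fixing by hand.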