PPRuNe Forums - View Single Post - extracting text from a PDF document

31st August 2013 | 06:59

#6

cattletruck

Joined: Apr 1998

Posts: 4

Likes: 1

From: Mesopotamos

Another mixture success story.

Googling around, I discovered that Adobe Acrobat Pro has built-in OCR. I have Adobe CS3 but haven't installed everything, and I couldn't be bothered checking whether I had it.

So I downloaded and installed the first free converter I saw, but the unlicensed version would only OCR the first 3 pages. Not a problem, as Adobe Illustrator lets you save individual pages of a PDF document in PDF format. So it was a case of chopping the long PDF into single pages and pumping them through the converter one at a time.
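For anyone wanting to plan the chopping rather than do it by eye, here is a rough sketch of working out the page ranges. It only computes which pages go in which piece; it does not touch the PDF itself. The function name page_batches and the limit parameter are my own invention, and it assumes the 3-page limit applies per file, which I have not verified for any particular converter. With limit=1 you get the single-page split described above.

```python
def page_batches(total_pages, limit=3):
    """Return (first, last) 1-based page ranges, each covering at most `limit` pages.

    `limit` is an assumed per-file page cap of the converter (hypothetical).
    """
    if total_pages < 0 or limit < 1:
        raise ValueError("total_pages must be >= 0 and limit must be >= 1")
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]


if __name__ == "__main__":
    # Example: a 10-page document cut into pieces of up to 3 pages each.
    for first, last in page_batches(10):
        print(f"pages {first}-{last}")
```

Each range can then be saved as its own PDF with whatever tool you have to hand and fed to the converter in order.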

The end result is not perfect, which I was expecting. You lose all document formatting, and some characters were regularly misread: capital M often became |\/|, d often became cl, and 7 often became 9. Most of it was there, though, and with the original PDF as reference I spent an hour and a half getting the MS-Word copy into shape.
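Some of that cleanup could probably be semi-automated. The sketch below is only an illustration built on the three misreads listed above, not what I actually did. It repairs |\/| automatically, since that sequence never appears in normal text, and merely flags words containing "cl" or "9", because those may be genuine ("clear", a real 9) and need checking against the original. The names repair and flag_suspects are made up for this example.

```python
import re

# Safe repair: this glyph sequence does not occur in ordinary prose.
SAFE_FIXES = {"|\\/|": "M"}

# Ambiguous patterns: "cl" may be a misread "d" or a genuine "cl",
# and "9" may be a misread "7" or a genuine 9. Flag them, never auto-fix.
SUSPECT = re.compile(r"cl|9")


def repair(text):
    """Apply only the substitutions that cannot damage correct text."""
    for bad, good in SAFE_FIXES.items():
        text = text.replace(bad, good)
    return text


def flag_suspects(text):
    """Return the whitespace-separated words that contain an ambiguous pattern."""
    return [word for word in text.split() if SUSPECT.search(word)]


if __name__ == "__main__":
    raw = "|\\/|aintain 9000 ft until clear of clouds, then clescend"
    cleaned = repair(raw)
    print(cleaned)
    print("check against the original PDF:", flag_suspects(cleaned))
```

The flagged list will contain plenty of false alarms, which is the point: a human still has to compare each one with the source document.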

The new MS-Word copy is for my own personal reference, which is why I will be annotating it, but I wouldn't trust this OCR process to produce important information. It's just too risky: a 7 quietly turning into a 9 is easy to miss.

As a final footnote on this subject, most OCR tools cannot read landscape documents; they just produce gobbledygook. Adobe Illustrator saved the day.

Many thanks, mixture.
