extracting text from a PDF document


cattletruck
30th Aug 2013, 09:33
I have a rather lengthy PDF document (legal document) where each page is an image.

I want to quote numerous statements from this PDF document but cannot cut and paste the text out of it. I can only select an image of the text, which will only make my task harder when pasting the selections into an MS-Word document.

Does anyone know of a smart tool out there that can extract the text out of this kind of PDF document?

mixture
30th Aug 2013, 09:37
PDFs thankfully can be locked to prevent people like you copy pasting the contents... :E

Anyway, giving you the benefit of the doubt that your intentions are legitimate, Google the term "OCR" or the expansion of the acronym.... "optical character recognition"

cattletruck
30th Aug 2013, 10:11
It's just waffling witness statements that were faxed then scanned into a PDF document. All I want to do is annotate it in a printer friendly way.

Seems like there are many ways to 'OCR' a PDF on google. Thanks mixture.

mixture
30th Aug 2013, 11:39
Faxed and scanned, eh? OCR results might be interesting, but you never know. Try a few different engines.

Good luck.

Peter47
30th Aug 2013, 18:52
Probably the best thing is to try out a free trial of various PDF extraction packages, but you will probably end up having to pay.

Able2Extract Professional includes OCR (but I'm not certain if the free trial includes this). It works quite well.

Able2Extract, Adobe and Nuance PDF Reader all have free trial periods.

Be warned though: OCR is only as good as the original scan (I've tried extracting old timetables and it's not 100% accurate).

Good luck!

cattletruck
31st Aug 2013, 06:59
Another mixture success story.

Googling around, I discovered that Adobe Acrobat Pro has a built-in OCR reader. I have Adobe CS3 but haven't installed everything, and couldn't be bothered searching to see if I had it.

So I downloaded and installed the first free converter I saw, but the unlicensed version would only OCR the first 3 pages. Not a problem as Adobe Illustrator allows you to save individual pages of a PDF document in PDF format. So it was a case of chopping up the long PDF document into single pages and pumping them through this converter.
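
For anyone who would rather script the chopping-up step than do it by hand in Illustrator, here is a minimal Python sketch. It assumes the third-party pypdf package is installed; the function names, the output file naming and the default chunk size of 3 (matching the 3-page limit mentioned above) are illustrative choices, not anything the converter requires.

```python
from pathlib import Path


def page_chunks(total_pages, chunk_size=3):
    """Return (start, stop) page-index pairs, stop exclusive, covering every page."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src, out_dir, chunk_size=3):
    """Write src out as several smaller PDFs of at most chunk_size pages each."""
    # Assumption: pypdf is available (pip install pypdf). It is imported here
    # so that page_chunks() can still be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start, stop in page_chunks(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. pages_001-003.pdf
        target = out_dir / f"pages_{start + 1:03d}-{stop:03d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Calling split_pdf("statements.pdf", "chunks") (both names hypothetical) would leave a folder of small PDFs to feed through the converter one at a time; chunk_size=1 reproduces the single-page approach described above.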

The end result is not perfect and I was expecting that. You lose all document formatting; capital M often became |\/|, d often became cl, and 7 often became 9. But most of it was there, and with the original PDF as reference I spent an hour and a half getting the MS-Word copy into shape.
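
Part of that clean-up can be scripted. The sketch below is only an illustration built on the three error patterns listed above: the |\/| garble is replaced outright because it does not occur in normal text, while the cl-for-d swap is only suggested when the repaired word appears in a word list you supply (the vocabulary argument and both function names are hypothetical). A 7 read as a 9 still looks like a valid number, so digits have to be checked by eye against the original PDF.

```python
import re

# Substitutions that are safe to apply everywhere, because the garbled form
# does not appear in ordinary English text.
SAFE_FIXES = {"|\\/|": "M"}


def fix_safe_errors(text):
    """Apply the unambiguous substitutions to the whole text."""
    for wrong, right in SAFE_FIXES.items():
        text = text.replace(wrong, right)
    return text


def suggest_cl_fixes(text, vocabulary):
    """Return {ocr_word: suggestion} for words where 'cl' was probably 'd'.

    A word is only flagged if it is NOT in the vocabulary as it stands but
    IS in the vocabulary once every 'cl' is turned into 'd'.
    """
    suggestions = {}
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if "cl" not in lower or lower in vocabulary:
            continue  # no 'cl', or a real word such as "clearly"
        candidate = lower.replace("cl", "d")
        if candidate in vocabulary:
            suggestions[word] = candidate
    return suggestions
```

The suggestions are returned rather than applied so that each one can be confirmed against the scan before the Word copy is changed.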

The new MS-Word copy is for my own personal reference, hence the annotations I will be making to it, but I wouldn't trust this OCR process to produce important information - it's just too risky for a number of reasons.

As a final footnote on this subject, most OCR tools cannot read landscape documents and produce gobbledygook instead. Adobe Illustrator saved the day.

Many thanks mixture. :ok:

mixture
31st Aug 2013, 08:14
Pleasure. Glad to hear it all worked out.

Adobe CS is a great bag of tools. Not cheap, but don't know what I'd do without it !

As for landscape documents, I think corporate type tools generally have some image orientation capabilities which might be lacking in cheaper implementations of OCR. But then I've never had much occasion to test landscape docs on either sort of tool.
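
One way a tool could work out orientation on its own, sketched here in Python purely as an illustration: run the OCR at each quarter turn and keep the rotation whose output contains the highest share of real words. The ocr_at_angle callable is hypothetical (it stands in for "rotate the page by this many degrees and OCR it" using whatever engine you have), and the vocabulary is any set of lowercase words you trust.

```python
import re


def word_hit_ratio(text, vocabulary):
    """Fraction of alphabetic words in text that appear in the vocabulary."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


def best_rotation(ocr_at_angle, vocabulary, angles=(0, 90, 180, 270)):
    """Try each rotation and return (best_angle, scores_by_angle).

    ocr_at_angle is a caller-supplied function: it takes an angle in degrees
    and returns the OCR text for the page rotated by that angle.
    """
    scores = {
        angle: word_hit_ratio(ocr_at_angle(angle), vocabulary)
        for angle in angles
    }
    # On a tie, the earliest angle in `angles` wins.
    return max(scores, key=scores.get), scores
```

This costs four OCR passes per page, so it only makes sense for pages where the first pass has clearly produced gobbledygook.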

OverRun
2nd Sep 2013, 12:10
I use quite a few pdf documents. I experimented for several years with the various free/low cost pdf programmes and had some success. I could extract pages, add pages, combine documents, and print documents to pdf. But then I needed to work with plans (in pdf format) and the ability of the full-price Adobe Acrobat Pro to measure up plans became important to me.

I had to get Pro, but I couldn’t come at paying the full price (which was horrendous a few years ago but at least it has come down to just being bl**dy expensive today). So I looked on eBay and found a cheap original which was a couple of versions old; installed it and it worked perfectly. Actually it worked too perfectly because I can’t do without it now. The support for my full Pro version ended a while back and I got a new computer; same story – needed a new copy, looked on eBay and found a genuine non-student earlier version of Pro a couple of years old. Then bought a laptop – had to buy another copy. Ouch. But I am hooked on Pro, and if you use a few pdfs or need to measure up, it is the way to go.

PS - I am using version X, and if the text cannot be copied easily, it has a "recognise text" tool [OCR] for exactly your need.
PPS - It would be jolly bad form to use Advanced PDF Password Recovery (Pro) to crack and remove any password protection.

Heathrow Harry
2nd Sep 2013, 17:06
I've used APDFR from ElcomSoft for years with some success to "break" PDFs

ElcomSoft: password recovery, forensics, system and security software (http://www.elcomsoft.com)

They're a Russian outfit but it's quality stuff and not very expensive

Feline
2nd Sep 2013, 20:57
I had a similar problem with a legal document (it happened to be a Company Memorandum of Association) which I had to update and which was only available in PDF format.

I opened the PDF in Adobe Acrobat version XI, clicked on "File", then "Save As", and saved the document as a text file.

This only seems to work with some documents (maybe those without any security settings) and it only saves "pure" text (i.e. any images do not appear, so it might not work on scanned fax images). The formatting is often a dog's breakfast. But it did save me quite a lot of tedious re-typing, albeit at the cost of some editing effort.

FWIW - your mileage might differ!

JeroNi
24th Oct 2013, 14:00
Hey, I have also been searching for a way to legally extract text from a PDF...
So far I've only found PDFtoWord Pro (http://pdftoword.pro/), but the text ends up kind of jumbled up...

Does anyone know a good extractor / converter to Word or whatever?

thanks a lot :)

OverRun
25th Oct 2013, 07:32
FWIW there are some .pdfs which seemingly cannot be copied or OCR'd or somehow moved into Word, as they just look like a jumbled mess afterwards. Almost as if the chosen font is nonsense. I have tried various fonts to get around that but without success.

pdf versions of ICAO Annex 14 seem to be one of them. :ugh:

cattletruck
25th Oct 2013, 10:01
I've noticed that PDFs saved as an image optimised for the internet rather than for printing produce very blurry text. I doubt any free OCR would have a chance of reading that. I also get the feeling that the same OCRs get thrown off course if there are pictures present in the text.

If that is the case then the text in the image needs to be sharpened up and the pictures removed before passing it over to the OCR to do its thing. You can screen capture a PDF page and edit it in Gimp/Photoshop - sadly one page at a time.

Once you have the PDF page saved in image format then there are more free OCR tools to choose from. Just be aware that all text formatting will be lost, and the OCR converter is at best about 95% right.
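
If you want to put a number on how right a given OCR tool is for your own documents, one rough approach is to hand-correct a page and compare it with the raw OCR output. The Python sketch below uses only the standard library's difflib; the function name is hypothetical and the score is an approximation (the share of characters from the corrected text that the OCR output reproduces in the same order), not a formal OCR benchmark.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text, reference_text):
    """Fraction of reference characters that the OCR output reproduced in order."""
    if not reference_text:
        # Nothing to recognise: perfect only if the OCR also produced nothing.
        return 1.0 if not ocr_text else 0.0
    # autojunk=False stops difflib from ignoring very common characters
    # in long texts, which would understate the match.
    matcher = SequenceMatcher(None, reference_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference_text)


if __name__ == "__main__":
    raw = "He saicl he arrived on the 9th"
    corrected = "He said he arrived on the 7th"
    print(f"{character_accuracy(raw, corrected):.1%}")
```

Scoring one or two representative pages this way before converting a long document gives a feel for how much manual correction to expect.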