
extracting text from a PDF document

 
Old 30th Aug 2013, 09:33
  #1 (permalink)  
Thread Starter
 
Join Date: Apr 1998
Location: Mesopotamos
Posts: 5
Likes: 0
Received 0 Likes on 0 Posts

I have a rather lengthy PDF document (legal document) where each page is an image.

I want to quote numerous statements from this PDF document but cannot cut and paste the text out of it. I can only select an image of the text, which will only make my task harder when pasting the selections into an MS-Word document.

Does anyone know of a smart tool out there that can extract the text out of this kind of PDF document?
cattletruck is offline  
Old 30th Aug 2013, 09:37
  #2 (permalink)  
 
Join Date: Aug 2002
Location: Earth
Posts: 3,663
Likes: 0
Received 0 Likes on 0 Posts
PDFs thankfully can be locked to prevent people like you copy pasting the contents...

Anyway, giving you the benefit of the doubt that your intentions are legitimate, Google the term "OCR" or the expansion of the acronym, "optical character recognition".

Last edited by mixture; 30th Aug 2013 at 09:39.
mixture is offline  
Old 30th Aug 2013, 10:11
  #3 (permalink)  
Thread Starter
 
Join Date: Apr 1998
Location: Mesopotamos
Posts: 5
Likes: 0
Received 0 Likes on 0 Posts
It's just waffling witness statements that were faxed then scanned into a PDF document. All I want to do is annotate it in a printer-friendly way.

Seems like there are many ways to 'OCR' a PDF on Google. Thanks mixture.
cattletruck is offline  
Old 30th Aug 2013, 11:39
  #4 (permalink)  
 
Join Date: Aug 2002
Location: Earth
Posts: 3,663
Likes: 0
Received 0 Likes on 0 Posts
Faxed and scanned, eh? OCR results might be interesting, but you never know; try a few different engines.

Good luck.
mixture is offline  
Old 30th Aug 2013, 18:52
  #5 (permalink)  
 
Join Date: Sep 2007
Location: London
Posts: 581
Likes: 0
Received 3 Likes on 3 Posts
Probably the best thing is to try out free trials of various PDF extraction packages, but you will probably end up having to pay.

Able2Extract Professional includes OCR (but I'm not certain whether the free trial includes this). It works quite well.

Adobe and Nuance PDF Reader also have free trial periods.

Be warned, though: OCR is only as good as the original scan. I've tried extracting old timetables and it's not 100% accurate.

Good luck!
Peter47 is offline  
Old 31st Aug 2013, 06:59
  #6 (permalink)  
Thread Starter
 
Join Date: Apr 1998
Location: Mesopotamos
Posts: 5
Likes: 0
Received 0 Likes on 0 Posts
Another mixture success story.

Googling around, I discovered that Adobe Acrobat Pro has a built-in OCR reader. I have Adobe CS3 but haven't installed everything and couldn't be bothered checking whether I had it.

So I downloaded and installed the first free converter I saw, but the unlicensed version would only OCR the first 3 pages. Not a problem as Adobe Illustrator allows you to save individual pages of a PDF document in PDF format. So it was a case of chopping up the long PDF document into single pages and pumping them through this converter.
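
The chopping-up step is just page-range arithmetic. A minimal sketch, assuming a converter that accepts at most a fixed number of pages per file (3 for the trial version mentioned above). The name `page_batches` is made up for illustration, and the function only works out the ranges; it does not touch the PDF itself.

```python
def page_batches(total_pages, limit=3):
    """Return 1-based inclusive (first, last) page ranges of at most `limit` pages each."""
    if total_pages < 0 or limit < 1:
        raise ValueError("total_pages must be >= 0 and limit must be >= 1")
    # Step through the document `limit` pages at a time; the last range may be shorter.
    return [
        (first, min(first + limit - 1, total_pages))
        for first in range(1, total_pages + 1, limit)
    ]


# A 10-page document with a 3-page limit needs four pieces.
print(page_batches(10))
# Single pages, as done with Illustrator: limit=1.
print(page_batches(4, limit=1))
```

Each range can then be saved as its own file with whatever PDF tool is to hand and fed through the converter separately.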

The end result is not perfect and I was expecting that. You lose all document formatting, Capital M often became |\/|, d often became cl, 7 often became 9, but most of it was there, and with the original PDF as reference I spent an hour and a half getting the MS-Word copy into shape.
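
Some of that clean-up can be partly scripted. The following is only a rough sketch using the three substitutions listed above; the function name `repair_ocr` and the tiny word list are hypothetical stand-ins, and a real run would need a proper dictionary. The `|\/|` case is safe to replace blindly, `cl` is only changed to `d` when that turns a non-word into a known word, and anything containing a 7 or 9 is merely flagged, since the text alone cannot say which digit was on the page.

```python
import re

# Hypothetical word list for illustration; a real one would come from a dictionary file.
KNOWN_WORDS = {"said", "made", "did", "clear", "close", "he", "it", "on"}


def repair_ocr(text, known_words=KNOWN_WORDS):
    """Return (repaired_text, tokens_to_check_by_hand)."""
    # "|\/|" never occurs in normal prose, so turning it back into "M" is safe.
    text = text.replace("|\\/|", "M")

    def fix_word(match):
        word = match.group(0)
        if "cl" in word and word.lower() not in known_words:
            candidate = word.replace("cl", "d")
            # Only accept the swap if it produces a word we recognise.
            if candidate.lower() in known_words:
                return candidate
        return word

    text = re.sub(r"[A-Za-z]+", fix_word, text)

    # 7 and 9 cannot be told apart from the text alone: flag them for a manual check.
    review = re.findall(r"\S*[79]\S*", text)
    return text, review


fixed, review = repair_ocr("|\\/|r Smith saicl he macle it clear on 17 |\\/|ay.")
print(fixed)
print(review)
```

The flagged tokens still have to be compared against the original PDF by eye.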

The new MS-Word copy is for my own personal reference, hence the annotations I will be making to it, but I wouldn't trust this OCR process to produce important information - it's just too risky for a number of reasons.

As a final footnote on this subject, most OCR tools cannot read landscape documents and produce gobbledygook. Adobe Illustrator saved the day.

Many thanks mixture.
cattletruck is offline  
Old 31st Aug 2013, 08:14
  #7 (permalink)  
 
Join Date: Aug 2002
Location: Earth
Posts: 3,663
Likes: 0
Received 0 Likes on 0 Posts
Pleasure. Glad to hear it all worked out.

Adobe CS is a great bag of tools. Not cheap, but I don't know what I'd do without it!

As for landscape documents, I think corporate type tools generally have some image orientation capabilities which might be lacking in cheaper implementations of OCR. But then I've never had much occasion to test landscape docs on either sort of tool.
mixture is offline  
Old 2nd Sep 2013, 12:10
  #8 (permalink)  
Prof. Airport Engineer
 
Join Date: Oct 2000
Location: Australia (mostly)
Posts: 726
Likes: 0
Received 0 Likes on 0 Posts
I use quite a few pdf documents. I experimented for several years with the various free/low cost pdf programmes and had some success. I could extract pages, add pages, combine documents, and print documents to pdf. But then I needed to work with plans (in pdf format) and the ability of the full-price Adobe Acrobat Pro to measure up plans became important to me.

I had to get Pro, but I couldn’t come at paying the full price (which was horrendous a few years ago but at least it has come down to just being bl**dy expensive today). So I looked on eBay and found a cheap original which was a couple of versions old; installed it and it worked perfectly. Actually it worked too perfectly because I can’t do without it now. The support for my full Pro version ended a while back and I got a new computer; same story: need a new copy, look on eBay and find a genuine non-student earlier version of Pro a couple of years old. Then I bought a laptop and had to buy another copy. Ouch. But I am hooked on Pro, and if you use a few PDFs or need to measure up, it is the way to go.

PS - I am using version X, and if the text cannot be copied easily, it has a "recognise text" tool [OCR] for exactly your need.
PPS - It would be jolly bad form to use Advanced PDF Password Recovery (Pro) to crack and remove any password protection.

Last edited by OverRun; 2nd Sep 2013 at 12:14.
OverRun is offline  
Old 2nd Sep 2013, 17:06
  #9 (permalink)  
 
Join Date: Apr 2010
Location: London
Posts: 7,072
Likes: 0
Received 0 Likes on 0 Posts
I've used APDFR from ElcomSoft for years with some success to "break" PDFs.

Password recovery, forensic, forensics, system and security software from ElcomSoft : recover or reset lost or forgotten password, remove protection, unlock system

They're a Russian outfit but it's quality stuff and not very expensive
Heathrow Harry is offline  
Old 2nd Sep 2013, 20:57
  #10 (permalink)  
 
Join Date: Sep 1999
Location: Deepest Dark Afrika
Posts: 175
Likes: 0
Received 0 Likes on 0 Posts
Sometimes works ...

I had a similar problem with some legal documents (happened to be a Company Memorandum of Association) which I had to update and was only available in PDF format.

Opened the PDF in Adobe Acrobat version XI - then click on "File", click on "Save As" and save the document as a text file.

This only seems to work with some documents (maybe those without any security settings) and it only saves "pure" text (i.e. any images do not appear, so it might not work on scanned fax images). The formatting is often a dog's breakfast. But it did save me quite a lot of tedious re-typing, albeit at the cost of some editing effort.

FWIW - your mileage might differ!
Feline is offline  
Old 24th Oct 2013, 14:00
  #11 (permalink)  
 
Join Date: Oct 2013
Location: Austria
Age: 36
Posts: 1
Likes: 0
Received 0 Likes on 0 Posts
hey i have also been searching for a way to legally extract text from a pdf...
so far ive only found Create Word document | PDFtoWord Pro but the text ends up kind of jumbled up...

does anyone know a good extractor / converter to word or whatever?

thanks a lot
JeroNi is offline  
Old 25th Oct 2013, 07:32
  #12 (permalink)  
Prof. Airport Engineer
 
Join Date: Oct 2000
Location: Australia (mostly)
Posts: 726
Likes: 0
Received 0 Likes on 0 Posts
FWIW there are some .pdfs which seemingly cannot be copied or OCR'd or somehow moved into Word, as they just look like a jumbled mess afterwards. Almost as if the chosen font is nonsense. I have tried various fonts to get around that but without success.

PDF versions of ICAO Annex 14 seem to be among them.
OverRun is offline  
Old 25th Oct 2013, 10:01
  #13 (permalink)  
Thread Starter
 
Join Date: Apr 1998
Location: Mesopotamos
Posts: 5
Likes: 0
Received 0 Likes on 0 Posts
I've noticed that PDFs saved as images optimised for the internet rather than for printing produce very blurry text. I doubt any free OCR would have a chance of reading that. I also get the feeling that the same OCR tools get thrown off course if there are pictures mixed in with the text.

If that is the case then the text in the image needs to be sharpened up and the pictures removed before passing it over to the OCR to do its thing. You can screen capture a PDF page and edit it in Gimp/Photoshop - sadly one page at a time.
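
One common clean-up before OCR is to threshold the page to pure black and white. Note this is a different operation from sharpening in Gimp/Photoshop, just another way of giving the OCR cleaner input. A minimal sketch, assuming the page has already been loaded as rows of grey values from 0 (black) to 255 (white); loading the image is not shown, and the cut-off of 128 is an arbitrary starting point that blurry scans will probably need adjusted.

```python
def binarize(gray_rows, threshold=128):
    """Map every grey value to 0 (black) or 255 (white) around `threshold`."""
    return [
        [0 if value < threshold else 255 for value in row]
        for row in gray_rows
    ]


# Faint grey text (values around 100) on an off-white background (around 220).
page = [
    [220, 100, 220],
    [215, 110, 225],
]
print(binarize(page))
```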

Once you have the PDF page saved in image format then there are more free OCR tools to choose from. Just be aware that all text formatting will be lost, and the OCR converter is at best about 95% right.
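
A figure like "95% right" can be checked on a sample page by comparing the OCR output against a hand-corrected copy. The sketch below uses Python's standard `difflib`; the function name `char_accuracy` is made up for illustration, and the score is a rough matched-character ratio rather than a formal character error rate.

```python
from difflib import SequenceMatcher


def char_accuracy(ocr_text, reference_text):
    """Fraction of reference characters that the OCR output reproduces in order."""
    if not reference_text:
        return 1.0 if not ocr_text else 0.0
    # autojunk=False stops difflib from ignoring very frequent characters in long texts.
    matcher = SequenceMatcher(None, reference_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference_text)


reference = "The witness said he made the call."
ocr_output = "The witness saicl he macle the call."
print(f"{char_accuracy(ocr_output, reference):.0%}")
```

Running it on one or two hand-checked pages gives a feel for how much proofreading the rest of the document will need.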
cattletruck is offline  


Copyright © 2024 MH Sub I, LLC dba Internet Brands. All rights reserved. Use of this site indicates your consent to the Terms of Use.