PDF to TXT conversion


ChristiaanJ
26th Apr 2009, 17:52
Most current-day .PDF files have been created from computer-readable files (.DOC, .RTF, etc.), and most of the latest Acrobat Readers let you select and copy text (and even images) from such files, to quote elsewhere, for instance.

Some older .PDF files are simply PDF-compressed copies of scans (hence bit maps) of old documents. On those, the "select and copy text" function of Acrobat Reader no longer works, even though the documents are text and even though that function seems to be based on some kind of OCR.
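
A rough way to tell the two kinds of file apart without opening a reader is to look for font resources in the raw bytes. What follows is only a sketch under stated assumptions: `looks_image_only` is a hypothetical helper name, and the heuristic assumes the resource dictionaries are stored uncompressed, which is common in older files but not guaranteed. A scan that carries a hidden OCR text layer will also contain fonts and so will be reported as text.

```python
import re


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Guess whether a PDF holds only page scans and no selectable text.

    Heuristic (an assumption, not a real PDF parser): drawing text needs
    a font resource, so a file that has image XObjects but no /Font entry
    is probably nothing but scanned pages.
    """
    has_font = re.search(rb"/Font\b", pdf_bytes) is not None
    has_image = re.search(rb"/Subtype\s*/Image\b", pdf_bytes) is not None
    return has_image and not has_font
```

To try it, read the file in binary mode and pass the bytes in, e.g. `looks_image_only(open("report.pdf", "rb").read())`, where `report.pdf` is a placeholder name. A result of `True` suggests that only an OCR pass will get text out of the file.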

Does anybody here know anything about tools to extract a text file from such an ancient .PDF file?
(Short, obviously, of printing, scanning, using an OCR program, and cleaning up afterwards.)

CJ

Saab Dastard
26th Apr 2009, 18:45
Omnipage (pro) does support importing directly from PDF files, so you could bypass the printing and scanning bit!

I'm sure that other OCR programs do so also.

I've seen older versions - e.g. 14 instead of 15 - at a fraction of the current version price.

SD

ChristiaanJ
26th Apr 2009, 19:55
SD,
Thanks already!
Any links to a cheap or free download that might do the job?
I don't mind paying sumpin', but not £500 for what will be a once-only.

Found one "converter", but that turned 36Mb into 1.3Gb, so obviously not what I wanted.

CJ

Saab Dastard
26th Apr 2009, 21:31
As I said, a previous version (http://www.serif.com/serifExtra/TemplatePages/Product/97/61/1.htm) can be yours for a LOT less!!

I just picked that as an example - I've never bought from them, so it's not a recommendation or endorsement.

You could also try ebay.

Try a google on "Free OCR software" - you might pick up something useful there.

SD

hellsbrink
26th Apr 2009, 21:38
Trying to remember what I used before, but this says it will do the job for you and is FREE. Can't test it myself because I'm running Linux at the mo (because Winblows decided to bork itself. Again)

Free PDF Text Extractor: Convert PDF file to plain text file. [A-PDF.com] (http://www.a-pdf.com/text/)

cdtaylor_nats
26th Apr 2009, 22:06
This looks like it will do the job

Download Free OCR - Freeware Download List (http://www.freewarezoom.com/archives/free-ocr)

Jofm5
26th Apr 2009, 23:55
If you have windows vista.............


Print the copy-protected PDF document to the logical printer provided by Vista called "Microsoft XPS Document Writer" - this will then prompt you for a file name and save location. Once it has completed, you can double-click on the document, which will open in Internet Explorer, and copy the text with ease.

Saab Dastard
27th Apr 2009, 10:45
JOF,

That's neat - and you can get the XPS writer for Windows XP too, as an add-on.

SD

ChristiaanJ
27th Apr 2009, 15:32
JOF,
I Googled "XPS Document Writer", and the most popular downloads seem to be those for the XPS Document Writer Removal Tool !!!
Nuff said, maybe?

SD,
Thanks for the link to Omnipage 14, I couldn't find it. Since mine's a one-off job I'll still try for a freebie first, but otherwise that's a reasonable price. Omnipage is known well enough.

cdtaylor_nats
I looked at FreeOCR, but when you click "Download" you get an announcement of a new version, but no way to download it yet.

hellsbrink
A-PDF states specifically that it does not convert text-as-images PDF files.

Any other offers?

BTW, the "target" is no secret...
The BEA Habsheim report in French (http://www.bea.aero/docspa/1988/f-kc880626/pdf/f-kc880626.pdf)
Warning, it's about 38Mb.
So if anybody can turn it into .DOC, .RTF, .TXT, or whatever.. yes please (even if the formatting gets lost and it's not faultless; it's meant to be used as a raw base for a translation).

CJ

hellsbrink
27th Apr 2009, 17:12
Christiaan

This Advanced PDF to Word Converter Free - Free software downloads and reviews - CNET Download.com (http://download.cnet.com/Advanced-PDF-to-Word-Converter-Free/3000-10743_4-10900586.html?tag=mncol) has just converted that pdf to a doc file.

You can't edit the outputted (is that a word? who cares) file, but you can open it in M$ Office, etc.

Will keep a-hunting and find the program I used before because it was gratis too and let you edit the converted file.

ChristiaanJ
27th Apr 2009, 19:30
hellsbrink,
Downloaded your suggestion and let it loose on the PDF file.
As you say, it produces a .DOC file, that you can open with Word.

Unfortunately.... the result is a 74 page .DOC file (the right number of pages) but with all the text pages as ..... 74 pages of graphics.

So I'm no further ahead.....

Will keep a-hunting and find the program I used before because it was gratis too and let you edit the converted file.

Yes, please, because that's what I'm looking for.

CJ

green granite
27th Apr 2009, 20:02
Adobe acrobat pro allows you to export a pdf as a .txt file which is fully editable but whether it works on old pdf's I've no idea.

taxydual
27th Apr 2009, 20:02
Would this do it?

PDF Ripper - Convert PDF to Word, PDF to RTF, PDF to HTML, PDF to Text, PDF to TXT (http://www.pdfpdf.com/pdfconverter.html)

ChristiaanJ
27th Apr 2009, 20:27
Adobe acrobat pro allows you to export a pdf as a .txt file which is fully editable but whether it works on old pdf's I've no idea.

Acrobat Reader 6.0 (my favourite) lets you do that too - but only if the original file was text-based. The moment the text itself is image-based (raw scan converted to .pdf) it no longer works.

Would this do it?
PDF Ripper - Convert PDF to Word, PDF to RTF, PDF to HTML, PDF to Text, PDF to TXT

I'll look, but again it sounds very much like it will do only PDFs that are text-based...

I just tried it, no joy...

CJ

Jhieminga
27th Apr 2009, 20:54
ChristiaanJ, it looks as if the PDF is based on graphics as has been stated before, in that case you will not be able to do anything but use an OCR program. I have done this before with Omnipage and that works quite well but it is not a one-click fix (mind you, there is no one-click solution for this!). If you're still stuck with this next week I could try to find the time to run it through my Omnipage version.

ChristiaanJ
27th Apr 2009, 21:41
ChristiaanJ, it looks as if the PDF is based on graphics

That's exactly the problem.

...in that case you will not be able to do anything but use an OCR program.

I totally agree.
I just vaguely hoped there already was a free program about somewhere that would do OCR directly on a graphics-based PDF, even if less-than-perfect...

I have done this before with Omnipage and that works quite well but it is not a one-click fix (mind you, there is no one-click solution for this!).

My main question about Omnipage is really whether it will open the PDF and OCR it, or whether I will have to print all, scan all and OCR it. If I can use Omnipage to open the PDF, click and click and click, and end up with at least a .TXT file at the other end, I'll get it.

If you're still stuck with this next week I could try to find the time to run it through my Omnipage version.

If you would, I'd be most grateful.
As I said, a basic text version, even with all the OCR "typos", would be a real help.

CJ

Saab Dastard
27th Apr 2009, 21:53
If I can use Omnipage to open the PDF, click and click and click, and end up with at least a .TXT file at the other end, I'll get it.

Yes, you can. Omnipage has the ability to OCR from a file, and it can import a multitude of file types, including PDF.

However, not all versions of Omnipage support all the file types - the "SE" versions that are shipped "free" with a lot of scanners, for example, don't support PDF.

The "pro" versions, like the one that I linked to - Omnipage 14 - do open and OCR PDF files directly.

I don't understand your problem with the link I gave you - it opens right up on the item.

SD

ChristiaanJ
27th Apr 2009, 22:07
I don't understand your problem with the link I gave you - it opens right up on the item.

SD, the link is fine. :ok: and thanks again! My only problem is the $40 price tag... for something I'm unlikely to use more than once. But it's probably the answer, unless Jhieminga can come up with a crude text version of the original file.

CJ

Saab Dastard
27th Apr 2009, 22:17
Thanks for the link to Omnipage 14, I couldn't find it

Ahhh - I think I misunderstood what you wrote previously. All is clear now!

SD

treadigraph
27th Apr 2009, 22:26
I'm running it through Acrobat Writer's OCR at the moment to see what happens - if that doesn't work, I have some kit at work that might do, can do it in the morning. Have had some good results with it in the past.

The print quality doesn't look too great though, so might be a lot of errors.

Saab Dastard
27th Apr 2009, 23:02
Top man, treaders!

Long time no see, btw - keeping well I trust?

SD

Jofm5
28th Apr 2009, 04:00
JOF,
I Googled "XPS Document Writer", and the most popular downloads seem to be those for the XPS Document Writer Removal Tool !!!
Nuff said, maybe?



Christian, not sure why someone would want to remove it, as it is simply a printer driver, but alas some people will - it's basically Microsoft's attempt to encroach upon PDF territory by making an XML-based open document format.

That's pretty much irrelevant here, though, as I missed (as it seems others did) that your documents were bitmaps. Taking a closer look I noticed the mention of scans, but I missed it whilst skim reading. So the XPS route will not help with those - but for any other PDF where Acrobat is disabling cut/paste/text selection it works well. Doubt it will ever replace Acrobat, but getting around Acrobat is the only use I've found for it so far lol.

treadigraph
28th Apr 2009, 08:08
Christiaan

Ah well, the Acrobat Writer route didn't work too well, but Microsoft Imaging seems to have done it, though there seem to be a fair few "typos" and it needs a fair bit of tidying up. I think it's the quality of the original image I'm afraid.

I've done the first 22 pages, which seem to be the germane part of the report, but I can have a crack at the tables, etc. if you like. It's in Word format.

PM me your email and I'll send it to you.

Hi SD, good thank you mate, hope you are too!

Cheers

Treadders

Nightrider
28th Apr 2009, 09:13
Had my fair share of trials and, due to a requirement to convert a lot of docs which in turn were filled with unusual formatting etc., I ended up with Solid Converter (http://www.soliddocuments.com/products.htm?product=SolidConverterPDF). Does a smashing job and is available on a trial basis with just a few limitations.

hellsbrink
28th Apr 2009, 12:30
Christiaan

If you find something to convert the PDF to a TIFF file, you CAN use SimpleOCR (http://www.simpleocr.com/) to convert said TIFF into a DOC file.
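
The PDF-to-TIFF-to-OCR route can also be scripted. The sketch below is not something anyone in this thread has tested: it assumes Ghostscript (`gs`) and the Tesseract command-line program are installed and on the PATH, neither of which is mentioned above, and it assumes the French language data (`fra`) is available to Tesseract. The function names and the `page-%03d.tif` naming pattern are made up for illustration. On Windows the Ghostscript executable usually has a different name, so the first element of the command would need changing.

```python
import os
import shutil
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path: str, out_dir: str, dpi: int = 300) -> list[str]:
    """Build a Ghostscript call that renders each page to a 1-bit G4 TIFF."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dQUIET",
        "-sDEVICE=tiffg4",                        # black-and-white fax-style TIFF
        f"-r{dpi}",                               # rendering resolution in dpi
        f"-sOutputFile={out_dir}/page-%03d.tif",  # one numbered file per page
        pdf_path,
    ]


def tesseract_command(tiff_path: str, lang: str = "fra") -> list[str]:
    """Build a Tesseract call; it writes <output base>.txt beside the TIFF."""
    out_base = os.path.splitext(tiff_path)[0]
    return ["tesseract", tiff_path, out_base, "-l", lang]


def pdf_to_text(pdf_path: str, out_dir: str, lang: str = "fra") -> str:
    """Render every page, OCR each one, and return the joined text."""
    if not (shutil.which("gs") and shutil.which("tesseract")):
        raise RuntimeError("Ghostscript and Tesseract must both be on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir), check=True)
    pages = []
    for tiff in sorted(Path(out_dir).glob("page-*.tif")):
        subprocess.run(tesseract_command(str(tiff), lang), check=True)
        pages.append(tiff.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

The two command builders are kept separate from the function that runs them so the command lines can be inspected, or pasted into a terminal by hand, before anything is executed. Expect the output to need cleaning up, since the result depends heavily on the quality of the scan.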

ChristiaanJ
28th Apr 2009, 16:55
treadigraph,

Check your e-mail.
Whatever you did, it did the job.
.... there seem to be a fair few "typos"... I think it's the quality of the original image I'm afraid.

There were far fewer than I expected, and in the context most are blindingly obvious as OCR artefacts, like "vci" instead of "vol". As I said, the original is pretty cruddy.
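
Artefacts of that kind can be partly cleaned up automatically by swapping look-alike characters and checking the result against a word list. This is only a sketch: the confusion table is a small illustrative guess (c/o, i/l/1, 0/O), not a measured one, the function names are hypothetical, and the vocabulary has to be supplied by whoever runs it.

```python
import re
from itertools import product

# Characters an OCR engine may mistake for one another (illustrative only).
CONFUSIONS = {
    "c": "co", "o": "oc",
    "i": "il1", "l": "li1", "1": "1li",
    "0": "0O", "O": "O0",
}


def candidates(word):
    """Yield every spelling reachable by swapping confusable characters."""
    options = [CONFUSIONS.get(ch, ch) for ch in word]
    for combo in product(*options):
        yield "".join(combo)


def correct_word(word, vocabulary):
    """Return a known spelling of `word` if one exists, else `word` as is."""
    if word.lower() in vocabulary or len(word) > 12:
        return word  # already known, or too long to enumerate cheaply
    for cand in candidates(word):
        if cand.lower() in vocabulary:
            return cand
    return word


def correct_text(text, vocabulary):
    """Apply correct_word to every alphanumeric token in `text`."""
    return re.sub(r"[A-Za-z0-9]+",
                  lambda m: correct_word(m.group(0), vocabulary), text)
```

With a vocabulary of `{"le", "vol", "af"}`, `correct_text("le vci AF 296", ...)` gives back `le vol AF 296`. Words that match nothing in the list are left alone, so a human still has to read the result, and the token pattern would need extending for accented French letters.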

Looks as if the ball is now back squarely in my camp to produce the translation!

Nightrider,
I'll download it and see what gives, in case I need this sort of thing again.

hellsbrink,
That would have meant finding an "ancient PDF - to - TIFF" converter. Is there still such a beast?

ChristiaanJ
28th Apr 2009, 19:40
MANY THANKS TO ALL OF YOU!

One 'Pan Pan Pan', and all of you have come up on the frequency to help me out.

That's PPRuNE for me....

I'll try to pay all of you back with a half-way decent translation of the Habsheim report.
I'll probably do a first draft by skipping all the information that's not directly relevant to the accident itself (such as the details of the subsequent evacuation, that can go into a second draft).

Again, thanks, friends !

Christian