PDF to TXT conversion
Thread Starter
Join Date: Jan 2005
Location: France
Posts: 2,315
Most current-day .PDF files have been created from computer-readable files (.DOC, .RTF, etc.), and most recent versions of Acrobat Reader let you select and copy text (and even images) from such files, for instance to quote elsewhere.
Some older .PDF files are simply PDF-compressed copies of scans (hence bit maps) of old documents, and the "select and copy text" function of Acrobat Reader no longer works, even though the documents are text, and the Acrobat Reader "select and copy text" function seems to be based on some kind of OCR.
Does anybody here know anything about tools to extract a text file from such an ancient .PDF file?
(Short, obviously, of printing, scanning, using an OCR program, and cleaning up afterwards.)
CJ
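A rough way to tell the two kinds of PDF apart before trying any converter is to look for font resources in the file. Text selection generally works from text stored in the file rather than from OCR, which is why it fails on pure scans. The sketch below is only a heuristic and not how Acrobat Reader decides anything: it assumes the PDF's dictionaries are stored uncompressed (common in older files; newer ones may pack them into object streams, in which case it reports "unknown"), and the name `classify_pdf_bytes` is made up for illustration.

```python
import re

# Dictionary entries that mark font resources and raster images in a PDF.
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess whether a PDF holds selectable text or only page scans."""
    has_font = FONT_RE.search(data) is not None
    has_image = IMAGE_RE.search(data) is not None
    if has_font:
        return "has-fonts"   # text is probably selectable
    if has_image:
        return "image-only"  # page scans only; OCR would be needed
    return "unknown"         # e.g. dictionaries hidden in compressed streams
```

To try it on a real file, read the file in binary mode and pass the bytes in, e.g. `classify_pdf_bytes(open("report.pdf", "rb").read())`, where `report.pdf` is a placeholder name. A scan that already carries a hidden OCR text layer will also report "has-fonts".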
Spoon PPRuNerist & Mad Inistrator
Omnipage (pro) does support importing directly from PDF files, so you could bypass the printing and scanning bit!
I'm sure that other OCR programs do so also.
I've seen older versions - e.g. 14 instead of 15 - at a fraction of the current version price.
SD
Thread Starter
SD,
Thanks already!
Any links to a cheap or free download that might do the job?
I don't mind paying sumpin', but not £500 for what will be a once-only.
Found one "converter", but that turned 36Mb into 1.3Gb, so obviously not what I wanted.
CJ
Spoon PPRuNerist & Mad Inistrator
As I said, a previous version can be yours for a LOT less!!
I just picked that as an example - I've never bought from them, so it's not a recommendation or endorsement.
You could also try ebay.
Try a google on "Free OCR software" - you might pick up something useful there.
SD
Join Date: Jan 2008
Location: The Land of Beer and Chocolate
Age: 56
Posts: 798
Trying to remember what I used before, but this says it will do the job for you and is FREE. Can't test it myself because I'm running Linux at the mo (because Winblows decided to bork itself. Again)
Free PDF Text Extractor: Convert PDF file to plain text file. [A-PDF.com]
Join Date: Feb 2003
Location: Scotland
Posts: 144
Join Date: Jan 2008
Location: LONDON
Age: 51
Posts: 525
If you have Windows Vista...
Print the copy-protected PDF document to the logical printer provided by Vista called "Microsoft XPS Document Writer" - this will then prompt you for a file name and save location. After it's completed, you can double-click the document, which will open in Internet Explorer, and copy the text with ease.
Thread Starter
JOF,
I Googled "XPS Document Writer", and the most popular downloads seem to be those for the XPS Document Writer Removal Tool !!!
Nuff said, maybe?
SD,
Thanks for the link to Omnipage 14, I couldn't find it. Since mine's a one-off job I'll still try for a freebie first, but otherwise that's a reasonable price. Omnipage is known well enough.
cdtaylor_nats
I looked at FreeOCR, but when you click "Download" you get an announcement of a new version, but no way to download it yet.
hellsbrink
A-PDF states specifically that it does not convert text-as-images PDF files.
Any other offers?
BTW, the "target" is no secret...
The BEA Habsheim report in French
Warning, it's about 38Mb.
So if anybody can turn it into .DOC, .RTF, .TXT, or whatever.. yes please (even if the formatting gets lost and it's not faultless; it's meant to be used as a raw base for a translation).
CJ
Join Date: Jan 2008
Location: The Land of Beer and Chocolate
Age: 56
Posts: 798
Christiaan
This, "Advanced PDF to Word Converter Free" (from CNET Download.com), has just converted that PDF to a .doc file.
You can't edit the outputted (is that a word? who cares) file, but you can open it in M$ Office, etc.
Will keep a-hunting and find the program I used before because it was gratis too and let you edit the converted file.
Thread Starter
hellsbrink,
Downloaded your suggestion and let it loose on the PDF file.
As you say, it produces a .DOC file, that you can open with Word.
Unfortunately.... the result is a 74 page .DOC file (the right number of pages) but with all the text pages as ..... 74 pages of graphics.
So I'm no further ahead.....
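Originally Posted by hellsbrink
Will keep a-hunting and find the program I used before because it was gratis too and let you edit the converted file.
Yes, please, because that's what I'm looking for.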
CJ
More bang for your buck
Join Date: Nov 2005
Location: land of the clanger
Age: 82
Posts: 3,512
Adobe acrobat pro allows you to export a pdf as a .txt file which is fully editable but whether it works on old pdf's I've no idea.
Join Date: Apr 2008
Location: Well, Lincolnshire
Age: 69
Posts: 1,101
Thread Starter
Originally Posted by green granite
Adobe acrobat pro allows you to export a pdf as a .txt file which is fully editable but whether it works on old pdf's I've no idea.
Originally Posted by taxydual
Would this do it?
PDF Ripper - Convert PDF to Word, PDF to RTF, PDF to HTML, PDF to Text, PDF to TXT
I just tried it, no joy...
CJ
Last edited by ChristiaanJ; 27th Apr 2009 at 20:35. Reason: Added comment
ChristiaanJ, it looks as if the PDF is based on graphics, as has been stated before. In that case you will not be able to do anything but use an OCR program. I have done this before with Omnipage and that works quite well, but it is not a one-click fix (mind you, there is no one-click solution for this!). If you're still stuck with this next week, I could try to find the time to run it through my Omnipage version.
Thread Starter
Originally Posted by Jhieminga
ChristiaanJ, it looks as if the PDF is based on graphics
...in that case you will not be able to do anything but use an OCR program.
I just vaguely hoped there already was a free program about somewhere that would do OCR directly on a graphics-based PDF, even if less-than-perfect...
I have done this before with Omnipage and that works quite well but it is not a one-click fix (mind you, there is no one-click solution for this!).
If you're still stuck with this next week I could try to find the time to run it through my Omnipage version.
As I said, a basic text version, even with all the OCR "typos", would be a real help.
CJ
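For what it's worth, one free route for OCR on a graphics-based PDF is to render the pages to images and feed them to an open-source OCR engine. The sketch below assumes two command-line tools that nobody in this thread has mentioned or tried on the Habsheim report: `pdftoppm` (from Poppler) and `tesseract` with its French language data installed. The helper names, the `ocr_pages` work directory and the file names are placeholders. Expect plenty of OCR typos in the output, especially with poor print quality.

```python
import subprocess
from pathlib import Path


def rasterise_cmd(pdf: str, prefix: str, dpi: int = 300) -> list[str]:
    """Command that renders each PDF page to <prefix>-N.png via pdftoppm."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def ocr_cmd(image: str, out_base: str, lang: str = "fra") -> list[str]:
    """Command that makes tesseract write recognised text to <out_base>.txt."""
    return ["tesseract", image, out_base, "-l", lang]


def pdf_to_text(pdf: str, workdir: str = "ocr_pages") -> str:
    """OCR every page of a scanned PDF and return the joined text."""
    work = Path(workdir)
    work.mkdir(exist_ok=True)
    # Step 1: one PNG per page (page-01.png, page-02.png, ...).
    subprocess.run(rasterise_cmd(pdf, str(work / "page")), check=True)
    pages = []
    # Step 2: OCR each page image in page order and collect the text.
    for png in sorted(work.glob("page-*.png")):
        base = png.with_suffix("")
        subprocess.run(ocr_cmd(str(png), str(base)), check=True)
        pages.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n\n".join(pages)
```

Something like `Path("report.txt").write_text(pdf_to_text("report.pdf"), encoding="utf-8")` would then give a plain .TXT file, assuming both tools are on the PATH.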
Spoon PPRuNerist & Mad Inistrator
If I can use Omnipage to open the PDF, click and click and click, and end up with at least a .TXT file at the other end, I'll get it.
However, not all versions of Omnipage support all the file types - the "SE" versions that are shipped "free" with a lot of scanners, for example, don't support PDF.
The "pro" versions, like the one that I linked to - Omnipage 14 - do open and OCR PDF files directly.
I don't understand your problem with the link I gave you - it opens right up on the item.
SD
Thread Starter
Originally Posted by Saab Dastard
I don't understand your problem with the link I gave you - it opens right up on the item.
CJ
Spoon PPRuNerist & Mad Inistrator
Thanks for the link to Omnipage 14, I couldn't find it
SD
Gnome de PPRuNe
Join Date: Jan 2002
Location: Too close to Croydon for comfort
Age: 60
Posts: 12,627
I'm running it through Acrobat Writer's OCR at the moment to see what happens - if that doesn't work, I have some kit at work that might do, can do it in the morning. Have had some good results with it in the past.
The print quality doesn't look too great though, so might be a lot of errors.
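Since the result is only meant as a raw base for a translation, part of the cleaning-up can be automated. The sketch below is just one guess at what that cleanup might look like, not something anyone in the thread used: it rejoins words hyphenated across line ends and merges wrapped lines into paragraphs. The name `tidy_ocr_text` is made up, and it will wrongly glue genuine hyphenated words (French "peut-être", say) that happen to break at a line end. It does nothing about misrecognised characters.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Rejoin hyphenated line breaks and unwrap lines within paragraphs."""
    text = raw.replace("\r\n", "\n")
    # "conver-\nsion" -> "conversion"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; single newlines are just wrapping.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```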