Making a book. [Archive]

tubby linton

5th May 2012, 17:31

I have a book the pages of which exist on my pc as jpegs of the original printed book pages. Would anybody be able to suggest something to convert the jpegs into an e-book?

Milo Minderbinder

5th May 2012, 18:32

It's messy, and success depends on the quality of those images.

In a nutshell, you are going to have to OCR each page image to convert it into text. Each page will then need correcting, as OCR technology is not 100% accurate - by a long way.
You then need to paste the corrected text files into one large text file, and then output it as a PDF, a Kindle file or some other format.
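
The "paste into one large text file" step can be scripted rather than done by hand. This is only a minimal sketch: the function name merge_pages and the page_001.txt style of file naming are assumptions made for illustration, not something any of the programs in this thread produce by default.

```python
from pathlib import Path


def merge_pages(page_dir, output_file):
    """Join every corrected page_*.txt file in page_dir into one text file."""
    # Sorting by name puts page_001.txt before page_002.txt. This relies on
    # zero-padded numbering, which is an assumption about the file names.
    pages = sorted(Path(page_dir).glob("page_*.txt"))
    # Trim stray whitespace around each page and separate pages with a blank line.
    text = "\n\n".join(p.read_text(encoding="utf-8").strip() for p in pages)
    Path(output_file).write_text(text + "\n", encoding="utf-8")
    return len(pages)
```

Calling merge_pages("corrected", "book.txt") would write book.txt and return the number of pages it joined.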

So, three questions:
1) How good are the jpg files?
2) How many pages are there?
3) What format do you want the final output to be? PDF? EPUB? Kindle/Mobi? Something else?

green granite

5th May 2012, 18:36

You can turn them into a PDF file, but you'll need to get a program to do it; I think Foxit do a reasonably priced one.
The other option is to use an OCR program to turn the pictures into text and create an e-book from that.

edit: Milo beat me to it.

Milo Minderbinder

5th May 2012, 18:48

If the object is simply to create a PDF file, then PDFCreator is a simple open-source program:
PDFCreator | Free Business & Enterprise software downloads at SourceForge.net (http://sourceforge.net/projects/pdfcreator/)

However, all that will do with the jpg files is convert the image format from jpg to PDF. Nothing will directly convert a jpg file into an editable / searchable text-based PDF. To do that you will have to OCR it first and then convert.

PDFCreator will create a searchable PDF once the OCR has been done

If you want to use Amazon's Kindle, then start here at Amazon's online publishing site
https://kdp.amazon.com/self-publishing/signin
or find their downloadable program at
http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621

If you want another format then look at Calibre
calibre - E-book management (http://calibre-ebook.com/)

But your first problem is extracting the text from those image files
If they are only small it could be easier to retype them
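
Once the text exists, the Calibre route mentioned above can also be driven from a script, since Calibre ships a command-line converter called ebook-convert. The sketch below only builds the command; the helper name calibre_command and the file names are made up for illustration, and it assumes Calibre is installed with ebook-convert on the PATH.

```python
def calibre_command(text_file, ebook_file, title=None):
    """Build an ebook-convert command line as a list of arguments.

    ebook-convert picks the output format from the extension of ebook_file,
    for example .epub or .mobi.
    """
    command = ["ebook-convert", text_file, ebook_file]
    if title:
        # --title sets the title stored in the e-book's metadata.
        command += ["--title", title]
    return command
```

The resulting list can be handed to subprocess.run(command, check=True) to perform the conversion.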

tubby linton

5th May 2012, 19:06

There are 300+ images, each showing two pages from the book. Can you recommend some OCR software?
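
With two book pages per image, it may help to cut each image in half before OCR so that every file holds a single page. The sketch below only works out the two crop rectangles. It assumes the spine runs down the exact middle of the image, and the gutter parameter is a hypothetical allowance for the shadow near the spine, so real scans may need different numbers.

```python
def split_spread(width, height, gutter=0):
    """Return (left_page, right_page) crop boxes for a two-page image.

    Each box is (left, upper, right, lower) in pixels. gutter is the number
    of pixels trimmed on each side of the centre line (an assumed value).
    """
    middle = width // 2
    left_page = (0, 0, middle - gutter, height)
    right_page = (middle + gutter, 0, width, height)
    return left_page, right_page
```

The boxes use the same (left, upper, right, lower) order that the crop method of the third-party Pillow imaging library accepts, so each one could be passed to Image.crop to produce the single-page images.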

Milo Minderbinder

5th May 2012, 19:16

ABBYY FineReader is easily the most accurate I've ever used - but that's a very limited sample! However, it does have a good reputation.
Old or cut-down versions are often supplied free with scanners.
OCR software for text recognition OCR PDF features - ABBYY FineReader (http://finereader.abbyy.com/)

If you already have a scanner, you'll probably find you already have some bundled OCR software

edit
PS - something I just remembered
The open-source "Tesseract" program also had a good reputation, though I've never used it
https://code.google.com/p/tesseract-ocr/
Another Google project

tubby linton

5th May 2012, 20:23

I have an HP PSC. Will I have to print the jpegs and then manually scan them, or can the task be performed within the software?

PPRuNeUser0171

5th May 2012, 20:39

PrimoPDF will do it.

Install it as a printer driver then print all of the JPG files into that 'printer' and job done.

Milo Minderbinder

5th May 2012, 20:58

As far as I know, PrimoPDF does not have an OCR element, so all it would do is convert the jpg image to a PDF image - NOT a PDF with embedded text.
So while you would have a PDF output file, you would not be able to index or search it.
If that is all you want to do, then you may as well use PDFCreator or one of the other free PDF programs.
Yes, it would create a PDF file (or series of files), but for use as an e-book those files would be functionally useless.

Milo Minderbinder

5th May 2012, 21:02

Tubby

You can set the software to OCR the existing image file. You don't have to print and rescan.
There will be OCR software with that HP scanner, though exactly which I don't know; it varies with the age of the scanner. It may even be possible to batch-process several files, though trying to do 300 pages at once will overload the RAM by a long way.
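
One way round the memory problem is to feed the files through in small groups rather than all 300 at once. A minimal sketch of that idea follows; the helper name batches and the batch size used in the note below are arbitrary choices for illustration, not settings of any particular OCR program.

```python
def batches(items, size):
    """Yield consecutive slices of items, each holding at most size entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With a list of 300 file names and a size of 25, this gives 12 groups that can be loaded and OCRed one after another.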

bnt

5th May 2012, 22:59

One free option is the Tesseract OCR (http://code.google.com/p/tesseract-ocr/) program, which is now maintained by Google. I just tried it on a scanned letter, and it works very well. It is a command line program, so it's not the easiest to use, but it can be faster when there are a lot of pages.
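
Because Tesseract is run as "tesseract imagename outputbase", looping over a folder of jpegs is straightforward to script. The following is a rough sketch, not a tested workflow for these particular scans: the function names and folder names are invented for illustration, and it assumes the tesseract program is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, out_dir):
    """Build the command line that OCRs one image into out_dir/<name>.txt."""
    image_path = Path(image_path)
    # Tesseract appends ".txt" itself, so the output name has no extension.
    out_base = Path(out_dir) / image_path.stem
    return ["tesseract", str(image_path), str(out_base)]


def ocr_folder(image_dir, out_dir):
    """Run Tesseract on every .jpg in image_dir, one image at a time."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(image_dir).glob("*.jpg")):
        # check=True stops the loop if Tesseract reports an error.
        subprocess.run(tesseract_command(image, out_dir), check=True)
```

Calling ocr_folder("scans", "text") would leave one .txt file per image in the text folder, ready for proofreading.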

Saab Dastard

5th May 2012, 23:04

bnt - see post #6 above ;)

SD

Milo Minderbinder

6th May 2012, 10:07

There is a graphical front end for Tesseract - see Freeocr Scanning OCR Software - OCR PDF Document Scanner Software (http://www.paperfile.net/)

There are other plugins for it listed at https://code.google.com/p/tesseract-ocr/wiki/AddOns#GUI
One that looks interesting for this project is at https://code.google.com/p/ocrivist/ - but it's for Linux only