Converting PDF to Excel (XLS) data file

Reply Subscribe

Thread Tools

Search this Thread

12th February 2013 | 11:39

#1 (permalink)

seacue

Thread Starter

Joined: Nov 2002

Posts: 1,630

Likes: 0

From: 39N 77W

Converting PDF to Excel (XLS) data file

I have a data file with 6 columns which is in PDF format. I would like to remove the PDF encoding/restrictions and put it in an XLS spreadsheet.

If necessary, once in text form, I think that I can handle the columns and get it into the spreadsheet.

I find a number of downloadable programs which purport to eliminate the PDF restrictions.

Please suggest a suitable / safe one which works on Windows XP systems.

seacue

12th February 2013 | 12:18

#2 (permalink)

BOAC

Per Ardua ad Astraeus

Joined: Mar 2000

Posts: 18,575

Likes: 4

From: UK

Zamzar makes a reasonable fist of it, but as with all these file conversions, formatting may suffer.

12th February 2013 | 16:16

#3 (permalink)

CelticRambler

Joined: Jan 2011

Posts: 192

Likes: 0

From: France

Foxit PDF viewer (free) allows you to view any PDF page in text format. A simple copy to the clip-board and paste into Excel or OpenOffice usually retains the table formatting quite accurately (depending on how complex was the original table).

12th February 2013 | 16:17

#4 (permalink)

Milo Minderbinder

Joined: Jan 2012

Posts: 2,173

Likes: 0

From: .

Depending on whether the PDF is formatted as image only, or image plus text you may find the best route is simply to run it through some OCR software
You probably have something that came with your printer/scanner - Omnipage or similar.
The supposed special programs to extract text from PDFs are often scamware product (Foxit is OK as long as you decline the toolbar...)

Last edited by Milo Minderbinder; 12th February 2013 at 16:19.

12th February 2013 | 18:57

#5 (permalink)

seacue

Thread Starter

Joined: Nov 2002

Posts: 1,630

Likes: 0

From: 39N 77W

I can copy and paste from the screen into OO Calc ... but there are 28 pages in the PDF file, 750 lines of data. I was hoping for something that would do it in one swell foop. Agreed that page headings would be a problem. Maybe 28 copy and pastes is the easiest way. Not much fixup afterwards.

seacue

13th February 2013 | 20:12

#6 (permalink)

KiloB

Joined: Aug 2007

Posts: 271

Likes: 46

From: Wilds of Warwickshire

I would try "Open Office". PDFs are their Product and the Spreadsheet included can Save as Excel etc. it will depend on how the PDF was originally formatted though. The whole package is Freeware, but very good.
Good luck!

KB

14th February 2013 | 03:17

#7 (permalink)

seacue

Thread Starter

Joined: Nov 2002

Posts: 1,630

Likes: 0

From: 39N 77W

KiloB,

I don't seem to understand what you wrote. In fact, I generally use OpenOffice - I just specified my question in terms of Excel because that's what more people know. I can easily go from spreadsheet to PDF, but don't see how that helps me with the reverse process.

My present plan is to approach the process as follows:

1) Convert PDF to plain text using the Adobe Reader's "save as text", which separates all columns by spaces. But one, and only one, of the data fields has imbedded spaces.

2) Write a C program to process the plain-text result of (1) and insert TABS instead of spaces to separate the "columns". The data makes this look practical. I estimate this to only require 2 or 3 dozen lines of source code. Suppressing the page-header information would be the largest effort.

3) Import the output of the C program to a spreadsheet using TAB as the column separator.

In the immediate term, I imported the 28-page latest PDF to a spreadsheet by copy and paste, so the pressure is off on having long-term automation right away.

Thank you,

seacue

14th February 2013 | 08:49

#8 (permalink)

cattletruck

Joined: Apr 1998

Posts: 4

Likes: 1

From: Mesopotamos

I did something similar for someone who received PDF data files from guvmint once a month and wanted it loaded into Excel for a mail merge. Initially he just wanted to extract the data from the PDF file, what he got after a few iterations was a menu system with 5 steps that did the whole shebang for him. He still calls me once or twice a year to thank me for making such a frustrating and inefficient activity into a sequence of easy menu options.

This is what I did before I added the smarts like file selection and auto launching mail merge.

1) Convert the PDF file into XML format.
pdftoxml.exe -noImage -noImageInline step1.pdf step1.xml

2) Clean up the data. I used a stylesheet to just select the xml elements I was interested in and filter out the stuff I wasn't.
xsltproc -o step2.xml step2script.xslt step1.xml

3) Run Excel with the xml file as an argument.
excel.exe step2.xml

The files pdftoxml and xsltproc are freeware and easy to find on the internet.

HOWEVER: The hard bit was writing the XSLT stylesheet, in my case the PDF data was structured in a complicated way meaning the resulting XML file was just as bad (and being guvmint I had inconsistent data fields that needed attention). Google XSLT Stylesheets and see if you are comfortable with learning it before embarking down this path.

14th February 2013 | 08:51

#9 (permalink)

Saab Dastard

Administrator

Joined: Mar 2001

Aviation Qualifications: PPL

Posts: 8,121

Likes: 686

From: Twickenham, home of rugby

There's no need to write a C program to convert test to tabs. Word will do it much more quickly with find and replace.

If there is more than one space separating two columns, then you can replace 2 spaces with a tab, saving a step later.

When I have to do this kind of thing, with multiple spaces between columns, I usually replace 3 spaces with 2 until there's only 2 spaces left, then replace the 2 with a tab.

SD

14th February 2013 | 09:51

#10 (permalink)

seacue

Thread Starter

Joined: Nov 2002

Posts: 1,630

Likes: 0

From: 39N 77W

Thank you Mr. Dastard,

I followed your suggestion (using pipe | instead of TAB). Unfortunately the text file produced by the Adobe Reader has single spaces instead of double in about 10% of the records. But only between the same two columns. There are even double spaces within columns with a double-name. Rather careless data entry. The rest of the double spaces are correct. Not hard to fix in a program (C), but it would required a lot of fixup in the "simple" approach.

It now looks as though I can fix these problems with a global find & replace on the text file.

I feel really stupid for not thinking of that approach. I was trying to use the double-spaces as the delimiter in the spreadsheet import utility.

Thank you,

seacue

Last edited by seacue; 14th February 2013 at 09:57.

14th February 2013 | 10:42

#11 (permalink)

Saab Dastard

Administrator

Joined: Mar 2001

Aviation Qualifications: PPL

Posts: 8,121

Likes: 686

From: Twickenham, home of rugby

Seacue,

I find that one can do quite complex things with creative use of search & replace.

For example, if there's leading space(s) on a line to be removed, replacing [Paragraph mark space(s)] with [Paragraph mark] will eliminate them (except for line 1, but that's not too taxing to do by hand!).

For Windows users, the GNU text tools are very useful for more complex text manipulation.

SD

30th July 2014 | 12:46

#12 (permalink)

ExGrunt

Joined: Mar 2003

Posts: 285

Likes: 0

From: England

Hi Seacue,

As an alternative 'in excel' approach to solving the data field with spaces import, this is a technique I have used:

EG:
Cell B2 has: The
Cell C2 has: cat
Cell D2 has: sat
Cell E2 has: on
Cell F2 has: the
Cell G2 has: mat

Insert a column before column B (All the other refs will move one letter to the right) then in the new cell B2 enter the following concatenation formula:

=C2&" "&D2&" "&E2&" "&F2&" "&G2&" "&H2

Copy that down for as many rows of data as you have:

B2 will now read 'The cat sat on the mat', but will be a formula. So select the whole column B, then copy (Ctrl-c) and Paste, Special, Values (Alt-e, s, v).

You can then delete columns C-H.

HTH

EG

30th July 2014 | 13:38

#13 (permalink)

CISTRS

Joined: Mar 2004

Aviation Qualifications: PPL

Posts: 347

Likes: 49

From: Hong Kong SAR

Try this:

Converts pdf to different formats including excel and word.

30th July 2014 | 14:26

#14 (permalink)

mixture

Joined: Aug 2002

Posts: 3,663

Likes: 0

From: Earth

Write a c program for basic text manipulation ? What are you ? Retired? Unemployed ? Talk about a sledge hammer to crack a nut and reinventing the wheel !

There is no reason to even contemplate writing a c program to manipulate text unless you are doing something really really specialist and niche ... There are hundreds of perfectly good tools out there , and if you are a Linux or Mac user, all the best ones are built in (no doubt even windows powershell has some built in functions).

Also... between 1) PDF -> TXT and 2) Write your C Program.... you appear to have left out 1b) Data Validation....

GIGO as they say.... I've seen what these PDF to TXT conversion engines can do ....suffice to say I wouldn't want to be the one relying on the output of your "step 2)" if you're not going to be implementing any validation or QA !

Last edited by mixture; 30th July 2014 at 21:48.

1st August 2014 | 23:01

#15 (permalink)

Keef

Official PPRuNe Chaplain

Joined: Apr 2001

Posts: 3,498

Likes: 0

From: Witnesham, Suffolk

Didn't this ship sail in February 2013?

I do a fair amount of extracting content from PDFs (historical stuff, research, that sort of thing). I tried some "paid for" solutions which worked (sort-of) but didn't do the whole job. These days, I "copy and paste" into a very old text editor and work upwards from there. I doubt that method would please technical purists, but it gets large amounts of data out of old PDFs and into Excel spreadsheets with relative ease.