PDA

View Full Version : Converting PDF to Excel (XLS) data file


seacue
12th Feb 2013, 11:39
I have a data file with 6 columns which is in PDF format. I would like to remove the PDF encoding/restrictions and put it in an XLS spreadsheet.

If necessary, once in text form, I think that I can handle the columns and get it into the spreadsheet.

I find a number of downloadable programs which purport to eliminate the PDF restrictions.

Please suggest a suitable / safe one which works on Windows XP systems.

seacue

BOAC
12th Feb 2013, 12:18
Zamzar makes a reasonable fist of it, but as with all these file conversions, formatting may suffer.

CelticRambler
12th Feb 2013, 16:16
Foxit PDF viewer (free) allows you to view any PDF page in text format. A simple copy to the clip-board and paste into Excel or OpenOffice usually retains the table formatting quite accurately (depending on how complex was the original table).

Milo Minderbinder
12th Feb 2013, 16:17
Depending on whether the PDF is formatted as image only, or image plus text you may find the best route is simply to run it through some OCR software
You probably have something that came with your printer/scanner - Omnipage or similar.
The supposed special programs to extract text from PDFs are often scamware product (Foxit is OK as long as you decline the toolbar...)

seacue
12th Feb 2013, 18:57
I can copy and paste from the screen into OO Calc ... but there are 28 pages in the PDF file, 750 lines of data. I was hoping for something that would do it in one swell foop. Agreed that page headings would be a problem. Maybe 28 copy and pastes is the easiest way. Not much fixup afterwards.

seacue

KiloB
13th Feb 2013, 20:12
I would try "Open Office". PDFs are their Product and the Spreadsheet included can Save as Excel etc. it will depend on how the PDF was originally formatted though. The whole package is Freeware, but very good.
Good luck!

KB

seacue
14th Feb 2013, 03:17
KiloB,

I don't seem to understand what you wrote. In fact, I generally use OpenOffice - I just specified my question in terms of Excel because that's what more people know. I can easily go from spreadsheet to PDF, but don't see how that helps me with the reverse process.

My present plan is to approach the process as follows:

1) Convert PDF to plain text using the Adobe Reader's "save as text", which separates all columns by spaces. But one, and only one, of the data fields has imbedded spaces.

2) Write a C program to process the plain-text result of (1) and insert TABS instead of spaces to separate the "columns". The data makes this look practical. I estimate this to only require 2 or 3 dozen lines of source code. Suppressing the page-header information would be the largest effort.

3) Import the output of the C program to a spreadsheet using TAB as the column separator.

In the immediate term, I imported the 28-page latest PDF to a spreadsheet by copy and paste, so the pressure is off on having long-term automation right away.

Thank you,

seacue

cattletruck
14th Feb 2013, 08:49
I did something similar for someone who received PDF data files from guvmint once a month and wanted it loaded into Excel for a mail merge. Initially he just wanted to extract the data from the PDF file, what he got after a few iterations was a menu system with 5 steps that did the whole shebang for him. He still calls me once or twice a year to thank me for making such a frustrating and inefficient activity into a sequence of easy menu options.

This is what I did before I added the smarts like file selection and auto launching mail merge.

1) Convert the PDF file into XML format.
pdftoxml.exe -noImage -noImageInline step1.pdf step1.xml

2) Clean up the data. I used a stylesheet to just select the xml elements I was interested in and filter out the stuff I wasn't.
xsltproc -o step2.xml step2script.xslt step1.xml

3) Run Excel with the xml file as an argument.
excel.exe step2.xml

The files pdftoxml and xsltproc are freeware and easy to find on the internet.

HOWEVER: The hard bit was writing the XSLT stylesheet, in my case the PDF data was structured in a complicated way meaning the resulting XML file was just as bad (and being guvmint I had inconsistent data fields that needed attention). Google XSLT Stylesheets and see if you are comfortable with learning it before embarking down this path.

Saab Dastard
14th Feb 2013, 08:51
There's no need to write a C program to convert test to tabs. Word will do it much more quickly with find and replace.

If there is more than one space separating two columns, then you can replace 2 spaces with a tab, saving a step later.

When I have to do this kind of thing, with multiple spaces between columns, I usually replace 3 spaces with 2 until there's only 2 spaces left, then replace the 2 with a tab.

SD

seacue
14th Feb 2013, 09:51
Thank you Mr. Dastard,

I followed your suggestion (using pipe | instead of TAB). Unfortunately the text file produced by the Adobe Reader has single spaces instead of double in about 10% of the records. But only between the same two columns. There are even double spaces within columns with a double-name. Rather careless data entry. The rest of the double spaces are correct. Not hard to fix in a program (C), but it would required a lot of fixup in the "simple" approach.

It now looks as though I can fix these problems with a global find & replace on the text file.

I feel really stupid for not thinking of that approach. I was trying to use the double-spaces as the delimiter in the spreadsheet import utility.

Thank you,

seacue

Saab Dastard
14th Feb 2013, 10:42
Seacue,

I find that one can do quite complex things with creative use of search & replace.

For example, if there's leading space(s) on a line to be removed, replacing [Paragraph mark space(s)] with [Paragraph mark] will eliminate them (except for line 1, but that's not too taxing to do by hand!).

For Windows users, the GNU text tools are very useful for more complex text manipulation.

SD

ExGrunt
30th Jul 2014, 12:46
Hi Seacue,

As an alternative 'in excel' approach to solving the data field with spaces import, this is a technique I have used:

EG:
Cell B2 has: The
Cell C2 has: cat
Cell D2 has: sat
Cell E2 has: on
Cell F2 has: the
Cell G2 has: mat

Insert a column before column B (All the other refs will move one letter to the right) then in the new cell B2 enter the following concatenation formula:

=C2&" "&D2&" "&E2&" "&F2&" "&G2&" "&H2

Copy that down for as many rows of data as you have:

B2 will now read 'The cat sat on the mat', but will be a formula. So select the whole column B, then copy (Ctrl-c) and Paste, Special, Values (Alt-e, s, v).

You can then delete columns C-H.

HTH

EG

CISTRS
30th Jul 2014, 13:38
Try this:

http://i1227.photobucket.com/albums/ee422/pimlican/PDFConverter_zps0d3f8d46.gif

Converts pdf to different formats including excel and word.

mixture
30th Jul 2014, 14:26
Write a c program for basic text manipulation ? What are you ? Retired? Unemployed ? Talk about a sledge hammer to crack a nut and reinventing the wheel !

There is no reason to even contemplate writing a c program to manipulate text unless you are doing something really really specialist and niche ... There are hundreds of perfectly good tools out there , and if you are a Linux or Mac user, all the best ones are built in (no doubt even windows powershell has some built in functions).

Also... between 1) PDF -> TXT and 2) Write your C Program.... you appear to have left out 1b) Data Validation....

GIGO as they say.... I've seen what these PDF to TXT conversion engines can do ....suffice to say I wouldn't want to be the one relying on the output of your "step 2)" if you're not going to be implementing any validation or QA !

Keef
1st Aug 2014, 23:01
Didn't this ship sail in February 2013?

I do a fair amount of extracting content from PDFs (historical stuff, research, that sort of thing). I tried some "paid for" solutions which worked (sort-of) but didn't do the whole job. These days, I "copy and paste" into a very old text editor and work upwards from there. I doubt that method would please technical purists, but it gets large amounts of data out of old PDFs and into Excel spreadsheets with relative ease.

alexmartin11
5th Aug 2014, 06:32
Try to use this pdftoexcelonline (https://www.pdftoexcelonline.com/en/). It's useful to convert pdf to any microsoft office format.