Go Back  PPRuNe Forums > Misc. Forums > Computer/Internet Issues & Troubleshooting
Reload this Page >

Converting PDF to Excel (XLS) data file

Wikiposts
Search
Computer/Internet Issues & Troubleshooting Anyone with questions about the terribly complex world of computers or the internet should try here. NOT FOR REPORTING ISSUES WITH PPRuNe FORUMS! Please use the subforum "PPRuNe Problems or Queries."

Converting PDF to Excel (XLS) data file

Thread Tools
 
Search this Thread
 
Old 12th Feb 2013, 11:39
  #1 (permalink)  
Thread Starter
 
Join Date: Nov 2002
Location: 39N 77W
Posts: 1,630
Likes: 0
Received 0 Likes on 0 Posts
Converting PDF to Excel (XLS) data file

I have a data file with 6 columns which is in PDF format. I would like to remove the PDF encoding/restrictions and put it in an XLS spreadsheet.

If necessary, once in text form, I think that I can handle the columns and get it into the spreadsheet.

I find a number of downloadable programs which purport to eliminate the PDF restrictions.

Please suggest a suitable / safe one which works on Windows XP systems.

seacue
seacue is offline  
Old 12th Feb 2013, 12:18
  #2 (permalink)  
Per Ardua ad Astraeus
 
Join Date: Mar 2000
Location: UK
Posts: 18,579
Likes: 0
Received 0 Likes on 0 Posts
Zamzar makes a reasonable fist of it, but as with all these file conversions, formatting may suffer.
BOAC is offline  
Old 12th Feb 2013, 16:16
  #3 (permalink)  
 
Join Date: Jan 2011
Location: France
Posts: 191
Likes: 0
Received 0 Likes on 0 Posts
Foxit PDF viewer (free) allows you to view any PDF page in text format. A simple copy to the clip-board and paste into Excel or OpenOffice usually retains the table formatting quite accurately (depending on how complex was the original table).
CelticRambler is offline  
Old 12th Feb 2013, 16:17
  #4 (permalink)  
 
Join Date: Jan 2012
Location: .
Posts: 2,173
Likes: 0
Received 0 Likes on 0 Posts
Depending on whether the PDF is formatted as image only, or image plus text you may find the best route is simply to run it through some OCR software
You probably have something that came with your printer/scanner - Omnipage or similar.
The supposed special programs to extract text from PDFs are often scamware product (Foxit is OK as long as you decline the toolbar...)

Last edited by Milo Minderbinder; 12th Feb 2013 at 16:19.
Milo Minderbinder is offline  
Old 12th Feb 2013, 18:57
  #5 (permalink)  
Thread Starter
 
Join Date: Nov 2002
Location: 39N 77W
Posts: 1,630
Likes: 0
Received 0 Likes on 0 Posts
I can copy and paste from the screen into OO Calc ... but there are 28 pages in the PDF file, 750 lines of data. I was hoping for something that would do it in one swell foop. Agreed that page headings would be a problem. Maybe 28 copy and pastes is the easiest way. Not much fixup afterwards.

seacue
seacue is offline  
Old 13th Feb 2013, 20:12
  #6 (permalink)  
 
Join Date: Aug 2007
Location: Wilds of Warwickshire
Posts: 240
Received 8 Likes on 6 Posts
I would try "Open Office". PDFs are their Product and the Spreadsheet included can Save as Excel etc. it will depend on how the PDF was originally formatted though. The whole package is Freeware, but very good.
Good luck!

KB
KiloB is offline  
Old 14th Feb 2013, 03:17
  #7 (permalink)  
Thread Starter
 
Join Date: Nov 2002
Location: 39N 77W
Posts: 1,630
Likes: 0
Received 0 Likes on 0 Posts
KiloB,

I don't seem to understand what you wrote. In fact, I generally use OpenOffice - I just specified my question in terms of Excel because that's what more people know. I can easily go from spreadsheet to PDF, but don't see how that helps me with the reverse process.

My present plan is to approach the process as follows:

1) Convert PDF to plain text using the Adobe Reader's "save as text", which separates all columns by spaces. But one, and only one, of the data fields has imbedded spaces.

2) Write a C program to process the plain-text result of (1) and insert TABS instead of spaces to separate the "columns". The data makes this look practical. I estimate this to only require 2 or 3 dozen lines of source code. Suppressing the page-header information would be the largest effort.

3) Import the output of the C program to a spreadsheet using TAB as the column separator.

In the immediate term, I imported the 28-page latest PDF to a spreadsheet by copy and paste, so the pressure is off on having long-term automation right away.

Thank you,

seacue
seacue is offline  
Old 14th Feb 2013, 08:49
  #8 (permalink)  
 
Join Date: Apr 1998
Location: Mesopotamos
Posts: 5
Likes: 0
Received 0 Likes on 0 Posts
I did something similar for someone who received PDF data files from guvmint once a month and wanted it loaded into Excel for a mail merge. Initially he just wanted to extract the data from the PDF file, what he got after a few iterations was a menu system with 5 steps that did the whole shebang for him. He still calls me once or twice a year to thank me for making such a frustrating and inefficient activity into a sequence of easy menu options.

This is what I did before I added the smarts like file selection and auto launching mail merge.

1) Convert the PDF file into XML format.
pdftoxml.exe -noImage -noImageInline step1.pdf step1.xml

2) Clean up the data. I used a stylesheet to just select the xml elements I was interested in and filter out the stuff I wasn't.
xsltproc -o step2.xml step2script.xslt step1.xml

3) Run Excel with the xml file as an argument.
excel.exe step2.xml

The files pdftoxml and xsltproc are freeware and easy to find on the internet.

HOWEVER: The hard bit was writing the XSLT stylesheet, in my case the PDF data was structured in a complicated way meaning the resulting XML file was just as bad (and being guvmint I had inconsistent data fields that needed attention). Google XSLT Stylesheets and see if you are comfortable with learning it before embarking down this path.
cattletruck is offline  
Old 14th Feb 2013, 08:51
  #9 (permalink)  
Spoon PPRuNerist & Mad Inistrator
 
Join Date: Sep 2003
Location: Twickenham, home of rugby
Posts: 7,387
Received 244 Likes on 162 Posts
There's no need to write a C program to convert test to tabs. Word will do it much more quickly with find and replace.

If there is more than one space separating two columns, then you can replace 2 spaces with a tab, saving a step later.

When I have to do this kind of thing, with multiple spaces between columns, I usually replace 3 spaces with 2 until there's only 2 spaces left, then replace the 2 with a tab.

SD
Saab Dastard is offline  
Old 14th Feb 2013, 09:51
  #10 (permalink)  
Thread Starter
 
Join Date: Nov 2002
Location: 39N 77W
Posts: 1,630
Likes: 0
Received 0 Likes on 0 Posts
Thank you Mr. Dastard,

I followed your suggestion (using pipe | instead of TAB). Unfortunately the text file produced by the Adobe Reader has single spaces instead of double in about 10% of the records. But only between the same two columns. There are even double spaces within columns with a double-name. Rather careless data entry. The rest of the double spaces are correct. Not hard to fix in a program (C), but it would required a lot of fixup in the "simple" approach.

It now looks as though I can fix these problems with a global find & replace on the text file.

I feel really stupid for not thinking of that approach. I was trying to use the double-spaces as the delimiter in the spreadsheet import utility.

Thank you,

seacue

Last edited by seacue; 14th Feb 2013 at 09:57.
seacue is offline  
Old 14th Feb 2013, 10:42
  #11 (permalink)  
Spoon PPRuNerist & Mad Inistrator
 
Join Date: Sep 2003
Location: Twickenham, home of rugby
Posts: 7,387
Received 244 Likes on 162 Posts
Seacue,

I find that one can do quite complex things with creative use of search & replace.

For example, if there's leading space(s) on a line to be removed, replacing [Paragraph mark space(s)] with [Paragraph mark] will eliminate them (except for line 1, but that's not too taxing to do by hand!).

For Windows users, the GNU text tools are very useful for more complex text manipulation.

SD
Saab Dastard is offline  
Old 30th Jul 2014, 12:46
  #12 (permalink)  
 
Join Date: Mar 2003
Location: England
Posts: 286
Likes: 0
Received 0 Likes on 0 Posts
Hi Seacue,

As an alternative 'in excel' approach to solving the data field with spaces import, this is a technique I have used:

EG:
Cell B2 has: The
Cell C2 has: cat
Cell D2 has: sat
Cell E2 has: on
Cell F2 has: the
Cell G2 has: mat

Insert a column before column B (All the other refs will move one letter to the right) then in the new cell B2 enter the following concatenation formula:

=C2&" "&D2&" "&E2&" "&F2&" "&G2&" "&H2

Copy that down for as many rows of data as you have:

B2 will now read 'The cat sat on the mat', but will be a formula. So select the whole column B, then copy (Ctrl-c) and Paste, Special, Values (Alt-e, s, v).

You can then delete columns C-H.

HTH

EG
ExGrunt is offline  
Old 30th Jul 2014, 13:38
  #13 (permalink)  
 
Join Date: Mar 2004
Location: Hong Kong SAR
Age: 80
Posts: 321
Received 26 Likes on 9 Posts
Try this:



Converts pdf to different formats including excel and word.
CISTRS is offline  
Old 30th Jul 2014, 14:26
  #14 (permalink)  
 
Join Date: Aug 2002
Location: Earth
Posts: 3,663
Likes: 0
Received 0 Likes on 0 Posts
Write a c program for basic text manipulation ? What are you ? Retired? Unemployed ? Talk about a sledge hammer to crack a nut and reinventing the wheel !

There is no reason to even contemplate writing a c program to manipulate text unless you are doing something really really specialist and niche ... There are hundreds of perfectly good tools out there , and if you are a Linux or Mac user, all the best ones are built in (no doubt even windows powershell has some built in functions).

Also... between 1) PDF -> TXT and 2) Write your C Program.... you appear to have left out 1b) Data Validation....

GIGO as they say.... I've seen what these PDF to TXT conversion engines can do ....suffice to say I wouldn't want to be the one relying on the output of your "step 2)" if you're not going to be implementing any validation or QA !

Last edited by mixture; 30th Jul 2014 at 21:48.
mixture is offline  
Old 1st Aug 2014, 23:01
  #15 (permalink)  

Official PPRuNe Chaplain
 
Join Date: Apr 2001
Location: Witnesham, Suffolk
Age: 80
Posts: 3,498
Likes: 0
Received 0 Likes on 0 Posts
Didn't this ship sail in February 2013?

I do a fair amount of extracting content from PDFs (historical stuff, research, that sort of thing). I tried some "paid for" solutions which worked (sort-of) but didn't do the whole job. These days, I "copy and paste" into a very old text editor and work upwards from there. I doubt that method would please technical purists, but it gets large amounts of data out of old PDFs and into Excel spreadsheets with relative ease.
Keef is offline  
Old 5th Aug 2014, 06:32
  #16 (permalink)  
 
Join Date: Jul 2014
Location: Sydney
Posts: 12
Likes: 0
Received 0 Likes on 0 Posts
Try to use this pdftoexcelonline. It's useful to convert pdf to any microsoft office format.
alexmartin11 is offline  

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are Off
Pingbacks are Off
Refbacks are Off



Contact Us - Archive - Advertising - Cookie Policy - Privacy Statement - Terms of Service

Copyright © 2024 MH Sub I, LLC dba Internet Brands. All rights reserved. Use of this site indicates your consent to the Terms of Use.