
Use of curl or wget


davidjohnson6
15th May 2009, 02:08
Anyone out there tried using cURL (or wget / pavuk, for that matter) on airline websites?

If so, any hints about how to get it to work?
I've managed to get cURL to download a homepage and follow links, I've checked the cookie handling, and I can send a convincing User-Agent string to pass myself off as Firefox 3.0.6, but I'm having trouble posting urlencoded form data (yes, I've gone through the list of form fields) and getting a satisfactory result.

I'm not so ambitious as to try to build PHP scripts around it - I'd just like a single (rather long) command-line call to cURL to work first!
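
To give an idea, the shape of the command I have in mind is roughly the following - the URL and form field names here are made-up placeholders rather than any airline's actual form, and would need to be replaced with whatever the real booking form uses:

# -L follows redirects, -A sets the User-Agent, -b/-c read and write the cookie jar,
# -e sets the Referer, and --data-urlencode both urlencodes the fields and implies a POST
curl -L \
     -A "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6" \
     -b cookies.txt -c cookies.txt \
     -e "http://www.example-airline.com/" \
     --data-urlencode "origin=STN" \
     --data-urlencode "destination=DUB" \
     --data-urlencode "departureDate=2009-06-15" \
     --data-urlencode "adults=1" \
     "http://www.example-airline.com/booking/availability"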

Comments and suggestions welcome

bnt
15th May 2009, 11:21
So, does that mean you're still pursuing the screen-scraping (http://www.pprune.org/computer-internet-issues-troubleshooting/366394-automated-fare-query-engine.html#post4795825) thing? You're pretty much at the mercy of the site and its developers, so if standard methods don't work reliably, you may want to ask yourself why. You ought to be aware that airlines are actively fighting this, e.g. Ryanair (http://www.telegraph.co.uk/travel/travelnews/2510867/Ryanair-takes-action-against-website-screen-scraping.html), FlyBE (http://www.breakingtravelnews.com/article/20080808133034569), and possibly more. I don't agree with anything that overloads the airline websites and makes them harder for people to use. FlyBE are actually going to open up an XML API (http://www.flybe.com/news/0902/3065.htm) to 3rd parties, because of what screen-scrapers are doing to the site.

davidjohnson6
15th May 2009, 21:48
I completely understand, from the point of view of a web services manager, the need to keep a site up and running 99.9% (or better) of the time - I've learnt from experience that 8 pm is a bad time to do any kind of heavy manual searching of airline websites, while at 3 am a website is rather more responsive. I do not intend to make things even worse at a time when a webserver is already struggling.

However, I expect the company with that website to act in a reasonably transparent way. If a company wishes to trumpet continuously how cheap its products are, I expect to be able to search through its inventory reasonably easily to find these cheap products, instead of finding that the product I'm interested in is cheap for only 5 arbitrary days out of a 5-month period (along with an exhortation to keep returning to the website to search out cheap products). Easyjet, as an example, is one of the LCCs that has made a reasonably good job of coming up with a website that meets these expectations. My ire is directed at websites that are less consumer-friendly. Presenting a website that is frankly difficult to search was acceptable in 1999; companies should be able to do better now.

An example of the importance of search is buying train tickets in the UK. Thetrainline.com had a website that was frankly awkward to use if one was price-sensitive rather than time-sensitive in choosing the train to book. Gner.com came up with a website that made it *much* easier to see the prices of trains at different times, giving me considerable encouragement to purchase from their website instead.

If companies can't or won't produce a website that is easy to search, or if they act in a non-transparent manner over pricing, then it might be worth considering building my own tools to search the website!

Saab Dastard
15th May 2009, 22:15
Interesting that you should mention thetrainline.com - the website (and subsequently the call centre application) actually used screen-scraping itself in the early days, to read the availability for a journey from the venerable CRS (reservations service) that was hosted by (then) SEMA, but dated back to the BRBS (British Rail Business Systems) days.

It (the CRS) should have been superseded by now by the new Reservations Service, so I imagine there's now an API for handling these requests from thetrainline.com (and other sites)!

SD

bnt
16th May 2009, 00:35
I think the problem I have with this is related to my experiences on the internet with "middlemen" who insert themselves between businesses and their customers, skimming a little off every transaction. You see it if you try to search Google for a product or service, because these "agents" use "search engine optimisation" (a.k.a. gaming the search engines). Customers with less experience in interpreting the results get fooled into dealing with those "bottom-feeders", instead of going directly to the business in question and paying less on average.

I know you say it's just a personal project, not a business, and I believe you, but it still leaves a bad taste in my mouth. Any half-awake webmaster can spot large numbers of queries from a single address, and take some kind of action in response, and then where will you be? It costs them to run their web services, after all, and they don't do it for the benefit of just any old third party: just for themselves, authorised agents (sometimes), and paying customers.

If airlines make it difficult for people to find bargains, they're either shooting themselves in the foot, or (more likely) the bargains were just a limited "loss leader" to attract customers, so you won't be able to find them anyway. I don't mean to pour cold water on your ideas, but you're a bit behind the curve on this, with the airlines actively fighting screen-scraping "agencies" who try to make money off the backs of their services.

davidjohnson6
16th May 2009, 23:44
bnt - I very much echo your sentiments on those who design websites in a way that deliberately tries to raise their Google PageRank score more than is reasonable.

When booking a room in a hotel, I prefer to book directly with the hotel. Not only does it avoid the risk of communication screwups (has he paid the bill in advance or hasn't he?), but, particularly for small family-owned hotels, it means the family doesn't have to pay a commission. The problem I had was that when searching Google for a hotel, I ended up getting huge numbers of generic hotel booking sites with a high page rank, while the hotel's own website would end up somewhere far down the search ranking, necessitating a long and painful trawl through various websites to see which was the genuine hotel website rather than a company claiming to be the local expert for hotel booking. Google has now changed this for hotels (presumably hitting the middlemen where it hurts), although the problem still remains for restaurants.

I fully expect an airline to give plenty of bluster... but in terms of action, I'm a little puzzled as to what they can effectively do. I know that some airlines impose a 'your session has been locked for 20 seconds' screen now and again, but this is not a major deterrent. I have a dynamic IP - continuously blocking new IP addresses doesn't seem (at least to me) terribly effective against a screen scraper who keeps turning up with a new IP, and it also blocks out other, unrelated potential customers. Further, I won't be using this data for commercial or non-personal purposes, and will not be posting it on another website or supplying it to a 3rd party - I'm not stupid or daring enough to encourage an airline to sue me. The only gain I make will be personal access to the 'loss leader' fares that the airline has tried so hard to obscure and bury.

LH2
17th May 2009, 12:59
am having trouble posting urlencoded form data

What kind of trouble? A common mistake is not to set the appropriate Content-Type HTTP header, although at least curl does that automatically.

One approach to getting it working would be to interact manually with the site while using Firefox's Live HTTP Headers extension (which, in spite of the name, actually captures whole POST requests), or just fire up Wireshark.
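
Once you can see the exact request the browser sends, you can try reproducing it more or less literally with curl. As a rough sketch (the URL, Referer and form data below are placeholders - substitute whatever the capture shows):

# Replay a captured POST as literally as possible (all values are placeholders)
# Note: curl adds the urlencoded Content-Type automatically when you use --data;
# it's set explicitly here only to mirror the captured request
curl -A "Mozilla/5.0 (...) Firefox/3.0.6" \
     -e "http://www.example-airline.com/booking/search" \
     -b cookies.txt -c cookies.txt \
     -H "Content-Type: application/x-www-form-urlencoded" \
     --data "origin=STN&destination=DUB&departureDate=2009-06-15" \
     "http://www.example-airline.com/booking/availability"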

Chances are that if the site is actively trying to stop crawlers it will have some sort of mechanism in its query/response process to make life difficult.
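
A common example of that sort of mechanism is a hidden form field carrying a per-session token, which has to be read out of the search page and echoed back in the POST. A rough sketch of handling that with curl and sed - the field name "sessionToken" and the URLs are invented for illustration, so check the actual hidden inputs in the form:

# Fetch the search page, keep the cookies, and pull out a hidden token
TOKEN=$(curl -s -b cookies.txt -c cookies.txt \
        "http://www.example-airline.com/booking/search" \
        | sed -n 's/.*name="sessionToken" value="\([^"]*\)".*/\1/p')

# Echo the token back alongside the visible form fields
curl -b cookies.txt -c cookies.txt \
     --data-urlencode "sessionToken=$TOKEN" \
     --data-urlencode "origin=STN" \
     --data-urlencode "destination=DUB" \
     "http://www.example-airline.com/booking/availability"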

In any event, you'd be better off asking in a computer forum to start with.