PDA

View Full Version : Automated fare query engine


davidjohnson6
17th Mar 2009, 16:13
I'm not sure airlines will be happy about this but...

I find some airline websites a pain-in-the-a**e to use when trying to scan different dates and routes for flights I might be interested in buying. I know there are websites like skyscanner.net but they don't really do what I'm looking for. Furthermore, I've found that the prices quoted on Skyscanner can often be quite unreliable, even when marked as being from just 18 hours ago.

I'm pondering trying to write a bit of code that allows one to simulate the webform post action to the airline website, which would then return some HTML that I can save to my hard disk and parse automatically later.

I know the recommended approach is to screen-scrape rather than parse HTML / XML, but at the moment, I'm interested in how to repeatedly simulate the form POST action (with different dates, for example).

Is anyone able to point me in the right direction as to how to go about coding this up ?

bnt
18th Mar 2009, 00:12
I don't know how screen-scraping would be preferable to XML - since XML is already structured and needs little parsing.

I think the deciding factor will be finding a site to provide that kind of output on demand. Another way of saying that is that you want a data provider with an API (Application Programming Interface), which means that they provide data and query methods in a form directly usable by any application you write - usually XML.

The "open" example I'm familiar with is the Twitter API (http://apiwiki.twitter.com/), but when it comes to air fares, I had to search for an example: the AirXML API (http://www.airkiosk.com/news_item_28.php?item=1) by AirKiosk. As expected, access is limited to travel agents, not members of the general public.
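To illustrate why structured XML needs so little parsing compared with scraped HTML, here's a minimal Python sketch against a made-up fare feed. The element and attribute names are invented for illustration; they are not AirXML's real schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML fare feed -- element and attribute names are
# invented, not AirXML's real schema.
xml_doc = """<fares>
  <fare origin="LHR" dest="EDI" date="2009-04-01" price="45.00"/>
  <fare origin="LHR" dest="EDI" date="2009-04-02" price="52.50"/>
</fares>"""

# With XML, the structure is already explicit: no screen-scraping,
# just walk the element tree and read attributes.
root = ET.fromstring(xml_doc)
fares = [(f.get("date"), f.get("price")) for f in root.findall("fare")]
for day, price in fares:
    print(day, price)
```

Compare that to guessing which table cell of a rendered page holds the price.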

davidjohnson6
18th Mar 2009, 00:54
bnt - many thanks for your comment.

I was thinking of writing an app in something like C#, using the HttpWebRequest class to simulate the action of a user filling in a form and pressing the 'Submit' button. I've already done some sniffing of network packets - I can understand some of the data but not all of it. The HTML that comes back can then be handled as a stream, either saved to disk for later analysis, or parsed on-the-fly to extract the relevant bits of data (flight times and prices) and the results sent to a SQL Server database.
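For what it's worth, the loop-over-dates part of that plan can be sketched independently of C#. Here's a minimal Python illustration of building one urlencoded form-POST body per departure date; the field names (origin, dest, depDate) and date format are invented for illustration, since the real ones would come from sniffing the actual form as described above.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

def build_post_bodies(origin, dest, start, days):
    """Return one urlencoded POST body per departure date.

    Field names here are hypothetical -- the real names must be taken
    from the airline's own search form.
    """
    bodies = []
    for offset in range(days):
        dep = start + timedelta(days=offset)
        params = {
            "origin": origin,
            "dest": dest,
            "depDate": dep.strftime("%d/%m/%Y"),
        }
        bodies.append(urlencode(params))
    return bodies

# Each body would then be sent as the payload of an HTTP POST
# (e.g. via urllib.request in Python, or HttpWebRequest in C#).
bodies = build_post_bodies("LHR", "EDI", date(2009, 4, 1), 3)
for b in bodies:
    print(b)
```

The equivalent in C# is just the same dictionary-to-query-string step before writing to the HttpWebRequest stream.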

The point about screen scraping is that airlines have a habit of changing their websites in small ways from time to time. Humans are usually quite good at handling such changes, but they often necessitate a complete rewrite of the code that posts data to and parses data from the airline website, which could make a screen-scrape process less maintenance overall.

Any more thoughts would of course be welcome

Jofm5
18th Mar 2009, 03:00
My advice is stop now.

Yes, you can simulate a person making web requests with the classes C# provides, but you will only ever be jumping on the boat after it has already sailed.

As bnt has already pointed out, you're going to be at a disadvantage from the start in trying to screen-scrape: when you scrape, you will be scraping old news, and at best you will be scraping values that have already been marked up by the vendor you're scraping from.

If you are going to try and enter such a competitive market (I suggest you don't, because you are displaying a clear lack of understanding of how to gain entry into it), then you need to get in before others have added their markup.

There is no money for old rope in adding more onto what is already a very slim margin, as anyone searching the net will probably find your source, so it would be time ill-invested.

If you really want to approach it in a serious manner, then you need to negotiate the same terms as the Expedias of this world, unless you can find a niche market they are not interested in.

I personally would say you should walk away unless you are doing one of two things: 1) offering a clear-cut service nobody else offers, or 2) offering it at a substantial cost saving. Seeing as the majority of people in the marketplace are trying to do both, you really need an outstanding product to succeed. It's not impossible, but not far from it.

bnt
18th Mar 2009, 12:24
I have a suspicion that travel agents deliberately alter their sites from time to time, to make life hard for screen-scrapers. It's within their prerogative: why should they make life easier for "bottom-feeders" who want to make money off their hard work? I would not encourage any attempt to make money that way.

You mentioned skyscanner.net, and their reliability: are they screen-scraping? I don't see any IATA IDs on their site, and they call themselves a "search engine", not a travel agent, so that's what I suspect they're doing. It is often possible to reverse-engineer the queries websites use, if you examine the web pages and see what they're doing, even if it's in JavaScript.

I see that Skyscanner is offering an API (http://api.skyscanner.net/api/ajax/documentation.html), which you might like to play with for learning purposes. Their method requires the webpage to load a single JavaScript file, which provides functions that you then call (in the page) to get data or a map back, i.e. it responds to manual queries. If I had a need to try this, I would code some PHP or Python on the server, and examine their JavaScript to see what they're actually doing: the POST or GET queries.

One last little thing: web pages returned by a server are often divided internally into DIV sections, so it's possible to isolate a relevant portion of a larger page by splitting off that DIV. Again, that's a programming thing e.g. some would use JavaScript on the client browser, while I would prefer PHP or Python on the server.
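As a rough illustration of that DIV-splitting idea, here's a short Python sketch using the standard-library HTMLParser to pull out just the contents of one div by its id. The sample HTML and the id "results" are made up for the example.

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collect the text inside <div id=target>, including nested divs."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # nesting depth once inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1            # a div nested inside the target
            elif dict(attrs).get("id") == self.target_id:
                self.depth = 1             # entered the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data.strip())

# Invented sample page: one navigation div we ignore, one results div we keep.
html = ('<html><body><div id="nav">menu</div>'
        '<div id="results"><div>LHR-EDI 09:00 £45</div></div></body></html>')
p = DivExtractor("results")
p.feed(html)
text = " ".join(c for c in p.chunks if c)
print(text)
```

The same approach works on a saved page from disk; only the div id and the surrounding clutter change.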

davidjohnson6
18th Mar 2009, 21:46
Jofm5 - many thanks for your thoughts. I understand you're trying to be helpful, but I have absolutely no intention of trying to compete with the Expedias, Opodos, Skyscanners and Kayaks of this world.

Skyscanner, Kayak and others have teams of developers, with substantial investment from venture capital firms and are way ahead of what I could ever come up with. Further, I don't have the inclination to risk letters from lawyers from the more aggressive airlines because I've been selling their data on for commercial purposes.

I'm just one guy who sometimes finds it difficult to get the information I want from websites of specific airlines in a manner convenient to me and who is up for a challenge that also involves learning more about web-programming, as well as things like Javascript / ASP.

Jofm5
19th Mar 2009, 03:08
OK, I think I got the wrong end of the stick - My impression was that you were wanting to screen scrape with a view to having your own offering.

From what I understand now, you just wish to compare results for a given query across multiple sites, which requires an HTTP POST to supply the parameters.

Performing an HTTP POST from C# is quite easy; just Google "http post c#" and you will find numerous examples. Unfortunately, it's not going to be so easy to simulate a user posting from code.

The big problem you will encounter is that not all the session information used in the query will be contained within the posted variables (quite often deliberately, to thwart exactly what you're trying to do). Session information may be stored in session variables, cookies and hidden controls on the form.

The hidden controls on the form are easy, as they are parseable: so long as you provide the content of the hidden form elements in your POST as well as the visible ones, it will work. Session variables and cookies are much harder: if it is an ASP website, some may be stored server-side, and others you just won't have access to unless you know the specific cookie information to interrogate and manipulate locally.
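To make the hidden-controls point concrete, here's a small Python sketch that collects the hidden input fields from a form and merges them into the POST body alongside the visible values. The form HTML is invented; the __VIEWSTATE name mimics a typical ASP.NET page but the value is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value", "")

# Invented sample form: one hidden ASP.NET-style field, one visible field.
form_html = ('<form><input type="hidden" name="__VIEWSTATE" value="abc123"/>'
             '<input type="text" name="origin"/></form>')
p = HiddenFieldParser()
p.feed(form_html)

# Merge the scraped hidden fields with the visible values before posting back.
post = dict(p.fields, origin="LHR", dest="EDI")
encoded = urlencode(post)
print(encoded)
```

Cookies are the harder half: in Python you would pair this with http.cookiejar so the server's session cookie from the first GET is echoed back on the POST, and server-side session state remains out of reach entirely.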

This is not to say it cannot be done; all you need to do is emulate the functionality of a browser. But you are getting into the territory of manipulating raw HTTP yourself, which will probably render what you are trying to do more trouble than it is worth.


There is no real easy answer to what you want to do, other than either trial and error or investigating whether a published API (Web Service) is available. This being such a highly competitive market, I would not be surprised if, even where sites use XHTML rather than straight HTML, they change the tags to scupper bots scraping them.

Sorry for not being optimistic; it's a cool challenge to write for, but I think most sites will be trying to make your job as hard as possible.