Date last modified: Sun May 27 2012 7:26 AM
This page is a record of my adventures with what is sometimes known as web scraping. Confusingly, this goes by many names such as web/form acquisition/capture, web/internet/data harvesting/mining, or screen scraping.
What it is about, as far as I am concerned, is extracting data from a website by repeatedly and automatically filling in a data request web page (usually some sort of form) and recording the information contained in the web page that comes back; for instance, recording the date and price of a flight so that a complete dataset can be built of all flights between certain destinations. It also covers automatically sending information to a website to perform an action that is normally performed by manually entering data on a web page; an example would be sending an SMS message by automatically filling in a web form, where the text to be sent has been supplied by another application, which can by this means send SMS messages automatically.
There are a number of software packages that claim to do this in a more-or-less automated way, and also a number of tools which assist a programmer in writing a script - see the links below. I've come up with my own approach which is discussed below.
In general, web scraping is quite a complicated process and will be performed by a script (in my case, written for bash) that is specific to the website that is being scraped. The reason for this is that websites usually perform this sort of data processing in unique ways and so any solution has to be tailor-made.
A bespoke solution is likely to use a number of utilities, all of which are very powerful and none of which is particularly easy for the beginner. You will need to use a Linux command shell (if you only have a Windows computer then you can install Cygwin to provide one), to understand how regular expressions work (excellent information can be found here), and to become familiar with each utility.
A common pattern might be:
- Run Wireshark with a suitable filter and then point your browser to the source website. Carry out a typical action to retrieve an example dataset: for example, fill in / select the dates and airports for an example flight and click on the 'submit' button to bring up the page of flight information. Now use Wireshark to analyze the data that was sent back by your browser to the website in order to request the dataset. Typically this will consist of a cookie and some 'POST' data; your script will need to do the same in order to generate return pages.
- Write a bash script. First, this retrieves (using curl) a 'starting' web page which might present the form which a user would normally complete manually. This page will often provide a cookie which is easily saved with curl.
- Begin a loop using curl to submit data back to the website in the specific format required (which you previously worked out by analyzing the output from Wireshark); this imitates the behaviour of a browser when the user has filled in the data on the first web page and then clicks 'submit'. Usually you use curl to send back the cookie that was previously saved, as this maintains the session state.
- The website responds with a web page and you extract from this the data that you want - let us say the times, flight numbers and prices, and save them in a database (or a text file). Extraction is usually done with one or more of grep, awk and sed.
- Now loop again for (say) the next date, until you reach the last (say) date.
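The steps above might be sketched in bash roughly as follows. This is only an illustration under stated assumptions: the URL https://example.com/flights, the form field names (from, to, date) and the HTML row layout matched by extract_flights are all hypothetical placeholders, and a real site will need its own values worked out from the captured request.

```shell
#!/bin/bash
# Sketch of the fetch / submit / extract loop. All URLs, field names and
# HTML patterns below are hypothetical placeholders, not a real site.

BASE_URL="https://example.com/flights"   # hypothetical
COOKIE_JAR="$(mktemp)"                   # file in which curl keeps cookies

# Step 1: fetch the starting page, saving any cookies it sets.
fetch_start_page() {
  curl --silent --cookie-jar "$COOKIE_JAR" --output /dev/null "$BASE_URL"
}

# Step 2: submit the form as the browser would (a POST), sending the
# saved cookies back so that the session state is maintained.
submit_query() {   # usage: submit_query DATE
  curl --silent --cookie "$COOKIE_JAR" --cookie-jar "$COOKIE_JAR" \
       --data-urlencode "from=LHR" \
       --data-urlencode "to=JFK" \
       --data-urlencode "date=$1" \
       "$BASE_URL/search"
}

# Step 3: pull flight number, time and price out of the returned HTML.
# Assumes rows like: <tr><td>BA117</td><td>08:25</td><td>412.00</td></tr>
extract_flights() {
  grep -o '<tr><td>[^<]*</td><td>[^<]*</td><td>[^<]*</td></tr>' |
    sed -e 's#<tr><td>##' -e 's#</td></tr>##' -e 's#</td><td>#,#g'
}

# Step 4: loop over the dates of interest, appending rows to a text file
# with the query date prefixed to each row.
scrape_dates() {   # usage: scrape_dates OUTFILE DATE...
  local outfile=$1; shift
  fetch_start_page
  for d in "$@"; do
    submit_query "$d" | extract_flights |
      while IFS= read -r row; do printf '%s,%s\n' "$d" "$row"; done >> "$outfile"
    sleep 2   # pause between requests to avoid hammering the server
  done
}
```

With those placeholders replaced, a run might look like `scrape_dates flights.csv 27/05/2012 28/05/2012`. Keeping the extraction in its own function means it can be tried out on a saved copy of a result page before any requests are sent.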
Most scraping exercises will follow this general pattern but, whether because of your specific requirements or because of the idiosyncrasies of the source website, they are likely to have some unique aspects. Varying date formats, extraneous required fields ('sid' is a common one), and more complex looping parameters (including nested loops where you want to vary more than one parameter) are a few of the complications that arise.
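As a rough illustration of two of these complications, the sketch below generates dates in whatever format a site expects and uses nested loops to vary more than one parameter. It assumes GNU date (for the -d option); the airport codes, field names and the dd/mm/yyyy format are example values only.

```shell
#!/bin/bash
# Sketch: generate the loop parameters for a scrape. Assumes GNU date
# (for the -d option); date formats and airport codes are examples only.

# Print COUNT consecutive dates starting at START (YYYY-MM-DD), rendered
# in FORMAT (a date format string such as %d/%m/%Y).
date_range() {   # usage: date_range START COUNT FORMAT
  local start=$1 count=$2 format=$3 i=0
  while [ "$i" -lt "$count" ]; do
    # -u keeps the arithmetic in UTC so daylight-saving changes cannot
    # shift a date.
    date -u -d "$start +$i days" "+$format"
    i=$((i + 1))
  done
}

# Nested loops: every origin/destination pair for every date, printed
# as the POST data string that would be sent for that query.
query_list() {   # usage: query_list START COUNT
  local from to d
  for from in LHR LGW; do
    for to in JFK BOS; do
      for d in $(date_range "$1" "$2" %d/%m/%Y); do
        echo "from=$from&to=$to&date=$d"
      done
    done
  done
}
```

For example, `date_range 2012-01-30 3 %d/%m/%Y` steps across the month boundary, and `query_list 2012-05-27 2` prints one line per combination of origin, destination and date.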
For a working example of this type of approach see my get-vt-cdrs.sh script.
- Screen-Scraper - a complete web-scraping GUI tool which has a free Basic version.
- Yahoo Pipes - a GUI tool to aggregate, manipulate, and mashup content from around the web, with a number of add-on modules to add functionality. But it cannot handle cookies or POST requests.
- Wireshark - an open source network protocol analyzer which runs under Windows, Linux and OS X. Very helpful for finding out what information your browser is sending back to a web page when it is looking up information, so that you can then imitate this behaviour with curl. After capturing data using filter 'tcp port http', go to 'Analyze' / 'Follow TCP Stream'.
- Outwit Hub - an add-on for Firefox which can perform data capture and some manipulation/reformatting thereof, but only for incoming data.
- XPather - an add-on for Firefox which is an XPath generator, editor, inspector and simple extraction tool.
- Scrapy - a scraping and web crawling framework written in Python - currently (August 2009) under very active development.
- Web Scraping Proxy - a little like Wireshark but tailored for Perl programmers.
- curl (for Linux or Cygwin) - can retrieve web pages and save cookies, and can also send data back to websites. A less powerful alternative which can be sufficient in some situations is wget (a comparison page between the two can be found here).
- grep (for Linux or Cygwin) - very powerful tool for extracting data from files (i.e. downloaded web pages) based upon regular expressions.
- awk (for Linux or Cygwin) - another powerful tool for extracting or reformatting data, it reads the source data as a series of records broken up into fields (the standard record separator being a newline and the standard field separator being whitespace, but both of these can be varied).
- sed (for Linux or Cygwin) - another powerful tool for manipulating and extracting data, especially with the 's' subcommand.
- bash (for Linux or Cygwin) - a Linux command line shell and scripting language.
- WebHarvy - WebHarvy can automatically scrape data (text & images) from web pages and save the scraped content in different formats. Single user license $99 (at January 2012).
- Screen-Scraper - a 'complete' web-scraping GUI tool which comes in a free Basic version as well as pay-for Professional ($549) and Enterprise ($2799) flavours (prices at January 2012).
- Helium Scraper - a commercial but inexpensive product ($80 single-user - January 2012) with free 10-day trial. Has a neat GUI for setting up and then extracting data.
- Visual Web Ripper - I haven't investigated this commercial product which costs $299 (January 2012) single-user.
- Automation Anywhere - a generic automation package but among its uses is web data extraction. Free trial version does not allow information to be saved, full program costs $695 (January 2012). They claim 'unparalleled service support - an extended team at your disposal'.
- Mozenda - this is probably the most ambitious attempt to create a user-friendly GUI web-scraping tool i.e. one that can be used by non-techies. You pay for a periodic license and then if you use more than the number of pages included for this period you pay additional fees - so it could get very expensive in practice. It is currently (January 2012) $99 per month including 5000 page downloads (there are other options too). You can sign up for a fully-operational free 14 day trial.
- Gogybot (for Windows .NET) - a library with specific functions for web-scraping, has a free trial version or compiled version for $199 (June 2011).
- Web scraping tutorial - introduction to web-scraping using php and the simplehtmldom library.
- The Data Mine - information about data mining generally (only some of which relates to web scraping)
- HTML Screen Scraping: A How-To Document - based on using urllib and sgrep to retrieve web pages and extract data within a python (Quixote) environment. As I don't have either of these tools I have not explored it further.
- Free'n'Easy Windows File Server - using Devil-Linux with Samba for network storage
- TimeDicer - Onsite/offsite data backup for Windows (uses rdiff-backup)
- Finding a 4D Backup Solution
- Free Virtualisation Solutions - about virtual machines (VMs)
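As a small illustration of the awk and sed entries in the tools list above, the sketch below shows awk splitting records into fields with a non-default separator and sed's 's' subcommand tidying up a value. The comma-separated "date,flight,time,price" layout and the function names are assumptions made up for the example.

```shell
#!/bin/bash
# Sketch: awk splits each record (by default a line) into fields;
# -F changes the field separator from whitespace to something else.
# The "date,flight,time,price" input layout is an assumed example.

# Print the date and price of rows cheaper than the given limit.
cheap_flights() {   # usage: cheap_flights LIMIT < file.csv
  # Adding 0 forces a numeric rather than a string comparison.
  awk -F',' -v limit="$1" '$4 + 0 < limit + 0 { print $1, $4 }'
}

# sed's 's' subcommand: strip a leading currency symbol and any
# thousands separators, so that a price can be treated as a number.
clean_price() {
  sed -e 's/^[^0-9]*//' -e 's/,//g'
}
```

Typical use would be `cheap_flights 400 < flights.csv`, or piping a scraped price through `clean_price` before storing it.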
Here is a selection of some (other) programs I have written, most of which run from the command line (CLI), are freely available and can be obtained by clicking on the links. Dependencies are shown and while in most cases written for a conventional Linux server, they should run even on a Raspberry Pi, and many can run under Windows using Cygwin. Email me if you have problems or questions, or if you think I could help with a programming requirement.
- TimeDicer - Onsite/offsite data backup for Windows (uses rdiff-backup) [GNU/Linux & MS Windows©: 2008-12]
- rdiff-backup-install - GNU/Linux script to install rdiff-backup. [GNU/Linux: 2012]
- rdiffweb-install - GNU/Linux script to install rdiffWeb, fixing various bugs that otherwise prevent it working correctly. [GNU/Linux: 2011-13]
- rdiff-backup-regress - GNU/Linux script to regress a rdiff-backup archive. [GNU/Linux: 2012-13]
Kernel, Boot and Device Utilities
- remove-kernel - Lists the installed GNU/Linux kernels in a Debian-based distro (e.g. Ubuntu), and can be used to remove an unwanted kernel and related packages, updating grub appropriately. (Ubuntu Tweak can do the same but remove-kernel.sh is a command-line script so does not require GUI.) [GNU/Linux-Debian/Ubuntu: 2010-12]
- disk-wiper - GNU/Linux script to wipe a disk drive comprehensively, especially one connected by USB, and also check it for bad blocks. For use on a surplus drive before passing to a third party. [GNU/Linux: 2011]
- lvm-usage - GNU/Linux script primarily for a machine where LVM (Logical Volume Management) is used (though runs happily without LVM). Checks available space and shows how space is used; run as cron job to warn if usage is above a set percentage. [GNU/Linux-Debian/Ubuntu: 2012-13]
- lvm-delete-snapshot - GNU/Linux script for a machine where LVM (Logical Volume Management) is used. Removes a snapshot that has been left over by another process. [GNU/Linux-Debian/Ubuntu: 2012-13]
Dellmont/Vodafone/Asterisk - VoIP and Phone Account Utilities
- dellmont-credit-checker - GNU/Linux script to check credit balance on Dellmont/Finarea/Betamax portals such as voicetrading.com and voipdiscount.com. [GNU/Linux: 2008-13]
- get-vt-cdrs - GNU/Linux script to download CDRs (call detail records) from Dellmont’s voicetrading.com. [GNU/Linux: 2010-13]
- saynoto0870 - For people in UK, a GNU/Linux script which performs automated lookup of the www.saynoto0870.com database, finding cheap or free geographic number replacements for expensive non-geographic (087* or 084*) numbers. [GNU/Linux: 2012]
- vodafone-compile-bills - GNU/Linux script which reprocesses downloaded call record 'csv' files from vodafone.co.uk so that they can be easily analysed via spreadsheet - including analysis of bundled minutes which even Vodafone do not seem able to perform! [GNU/Linux: 2012]
- kill-hung-call - GNU/Linux bash script for a system running asterisk (open source IP PBX) to check for, and if necessary terminate, long-running (i.e. hung) calls which are using an outside (FXO) line [GNU/Linux: 2012-13]
- sleepwalker - Windows© program which can be run from a remote machine to 'wake up' a Windows© machine behind a router, wait for it to start and then initiate Remote Desktop session. [MS Windows©: 2008-13]
- wake-thru-vigor - GNU/Linux script to wake up a machine behind a remote Draytek Vigor router. [GNU/Linux: 2012]
- huawei-fix-wol - GNU/Linux and Windows (Cygwin) script to fix Huawei HG532 router (e.g. from TalkTalk) so you can send WOL/wakeonlan packets from internet to a machine behind the router. [GNU/Linux & MS Windows©: 2012-13]
- form-extractor - GNU/Linux script to extract form tags from a web page or downloaded file. [GNU/Linux: 2012-13]
- websites-live-checker - GNU/Linux program to test webpages (including password-protected) or machines to check they are live; use as a cron job for your own websites, for hardware presenting a webpage, or for any machines with a presence on your local LAN or on the internet. [GNU/Linux: 2009-2013]
- dutree - GNU/Linux program to show a tree-style list of files and directories at the specified location and greater than the specified size (default 1GB). [GNU/Linux: 2012]
- man2text - GNU/Linux one-liner program to convert man page output to straightforward text. [GNU/Linux: 2012]
- Accounts - Multi-business multi-currency accounting software, uses Access [MS Windows©: 1996-2013]
- Rents Program - Residential lettings/landlord front office program, with many special features for UK market [MS Windows©: 1991-2013]
Thanks for a great article and for mentioning screen-scraper. I work at screen-scraper.com and I put together a list of screen-scraping software like our own.
Feel free to add/edit as you like.
You can upload an example of a web crawler?
It has an online API for web-scraping so that you can parse websites only with CURL.
If you have any questions or comments, please email Dominic firstname.lastname@example.org.