Web Scraping How To

Date last modified: Tue Dec 11 2018 5:35 PM

About This Page

This page is a record of my adventures with what is sometimes known as web scraping. Confusingly, this goes by many names such as web/form acquisition/capture, web/internet/data harvesting/mining, or screen scraping.

What it is about, as far as I am concerned, is two things. First, extracting data from a website by repeatedly and automatically filling in a data-request web page (usually some sort of form) and recording the information contained in the web page that comes back; for instance, recording the date and price of a flight so that a complete dataset can be built of all flights between certain destinations. Second, automatically sending information to a website to perform an action that is normally performed by entering data on a web page manually; an example would be sending an SMS message by automatically filling in a web form, where the text to be sent is supplied by another application, which can by this means send SMS messages automatically.

There are a number of software packages that claim to do this in a more-or-less automated way, and also a number of tools which assist a programmer in writing a script - see the links below. I've come up with my own approach which is discussed below.

Bespoke Web-Scraping: an Outline Example

In general, web scraping is quite a complicated process and will be performed by a script (in my case, written for bash) that is specific to the website that is being scraped. The reason for this is that websites usually perform this sort of data processing in unique ways and so any solution has to be tailor-made.

A bespoke solution is likely to use a number of utilities, all of which are very powerful and none of which is particularly easy for the beginner. You will need to use a Linux command shell (if you only have a Windows computer then you can install Cygwin to provide one), to understand how regular expressions work (excellent information can be found here), and to become familiar with each utility.

A common pattern might be:

  1. Run Wireshark with a suitable filter and then point your browser at the source website. Carry out a typical action to retrieve an example dataset: for example, fill in / select the dates and airports for an example flight and click the 'submit' button to bring up the page of flight information. Now use Wireshark to analyze the data that your browser sent to the website in order to request the dataset. Typically this will consist of a cookie and some 'POST' data; your script will need to send the same in order to retrieve the result pages.
  2. Write a bash script. First, this retrieves (using curl) a 'starting' web page which might present the form which a user would normally complete manually. This page will often provide a cookie which is easily saved with curl.
  3. Begin a loop, using curl to submit data back to the website in the specific format required (which you previously worked out by analyzing the output from Wireshark); this imitates the behaviour of a browser when the user has filled in the data on the first web page and then clicked 'submit'. Usually you use curl to send back the cookie that was previously saved, as this maintains the session state.
  4. The website responds with a web page and you extract from this the data that you want - let us say the times, flight numbers and prices, and save them in a database (or a text file). Extraction is usually done with one or more of grep, awk and sed; you can make it easier with my form-extractor.
  5. Now loop again for (say) the next date, until you reach the last (say) date.
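Steps 2 to 5 above can be sketched in bash roughly as follows. The URL, the form field names ('origin', 'dest', 'depart') and the HTML layout of the result page are all invented for illustration; a real script would substitute the values you discovered with Wireshark. Only the extraction step is demonstrated live (on a sample row), since the fetching steps need a real site.

```shell
#!/bin/bash
# Sketch of steps 2-5. The URL, form field names and HTML layout
# are hypothetical; substitute what Wireshark showed you.

BASE='https://flights.example.com'   # hypothetical site
JAR=$(mktemp)                        # cookie jar, written/re-sent by curl

# Step 2: fetch the starting page, saving any session cookie (-c).
get_start_page() {
    curl -s -c "$JAR" "$BASE/search" > /dev/null
}

# Step 3: submit the form data for one date, sending the saved
# cookie back (-b) to maintain the session.
fetch_results() {   # $1 = departure date in the site's format
    curl -s -b "$JAR" -c "$JAR" \
         --data-urlencode 'origin=LHR' \
         --data-urlencode 'dest=JFK' \
         --data-urlencode "depart=$1" \
         "$BASE/results"
}

# Step 4: extract time, flight number and price from each row,
# assuming result rows like <td class="time">...</td> and so on.
extract_flights() {
    sed -n 's/.*class="time">\([^<]*\)<.*class="flight">\([^<]*\)<.*class="price">\([^<]*\)<.*/\1,\2,\3/p'
}

# Offline demonstration of the extraction step on a sample row:
sample='<tr><td class="time">07:45</td><td class="flight">BA117</td><td class="price">189.99</td></tr>'
echo "$sample" | extract_flights     # prints: 07:45,BA117,189.99

# Step 5: the loop over dates (needs the real site, so commented out):
# get_start_page
# for d in 2018-12-01 2018-12-02 2018-12-03; do
#     fetch_results "$d" | extract_flights >> flights.csv
# done
```

Note that the same cookie-jar file is passed to every curl call with -b and -c; this is what keeps the session alive between requests.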

Most scraping exercises will follow this general pattern but, whether because of your specific requirements or because of the idiosyncrasies of the source website, they are likely to have some unique aspects. Varying date formats, extraneous required fields ('sid' is a common one), and more complex looping parameters (including nested loops where you want to vary more than one parameter) are a few of the complications that arise.
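As a small illustration of the looping complications, here is a sketch of a nested loop that steps day by day through a date range for more than one route, reformatting each date into a DD/MM/YYYY form for the site. The routes and the target date format are invented for the example, and it assumes GNU date for the '-d' date arithmetic.

```shell
#!/bin/bash
# Nested loop varying two parameters: route (outer) and date (inner).
# The DD/MM/YYYY target format is a hypothetical example of a site
# wanting dates in a different format from the one you loop in.
# Requires GNU date.

start='2018-12-01'
end='2018-12-05'

for route in 'LHR:JFK' 'LHR:BOS'; do        # outer loop: routes
    origin=${route%:*}
    dest=${route#*:}
    d=$start
    while [ "$(date -d "$d" +%s)" -le "$(date -d "$end" +%s)" ]; do
        site_date=$(date -d "$d" +%d/%m/%Y) # reformat for the site
        echo "would fetch $origin -> $dest on $site_date"
        d=$(date -d "$d + 1 day" +%Y-%m-%d) # step to the next day
    done
done
```

In a real script the echo line would be replaced by the curl submission and extraction described above.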

For a working example of this type of approach see my script.

Free Webpage Analysis Tools

Free Command Line (CLI) Data Submission/Extraction Tools

Non-Free Web Scrapers

More Information About Web-Scraping


I have provided this software free gratis and for nothing. If you would like to thank me with a contribution, please let me know and I will send you a link. Thank you!

My Other Sites

My Programs

Here is a selection of some (other) programs I have written. Most run from the command line (CLI), are freely available and can be obtained by clicking on the links. Dependencies are shown; while most were written for a conventional Linux server, they should also run on a Raspberry Pi, and many can run under Windows using Windows Subsystem for Linux (WSL) or Cygwin. Email me if you have problems or questions, or if you think I could help with a programming requirement.

Backup Utilities

Debian/Ubuntu kernel and LVM Utilities

Dellmont / Three - VoIP and Mobile Phone Account Utilities

Miscellaneous Programs


If you have a comment or question, please email me, thank you.

Richard 12 Jun 2011, 08:34
Hey, nice article about web scraping, but i was not found anything about Gogybot, scraping library for Microsoft .Net. This is very easy to use commercial product with free trial.
Dominic 12 Jun 2011, 09:06
Thanks Richard, I have added some text about it...
Scott Wilson 20 Jul 2012, 00:16

Thanks for a great article and for mentioning screen-scraper. I work at and I put together a list of screen-scraping software like our own.

Feel free to add/edit as you like.
Alexis 13 Sep 2012, 03:18
Excelent web site, thanks you!!!.

You can upload an example of a web crawler?
Dwayne 17 Sep 2012, 06:09
Nice article! I know a free web service that might interest you;

It has an online API for web-scraping so that you can parse websites only with CURL.
Uday 07 Sep 2013, 03:15
The link to your example script is broken ?
Dominic 07 Sep 2013, 13:53
The link to works fine for me Uday, what are you seeing? Maybe it is blocked by your browser settings because it is a shell program?
O. 04 Feb 2015, 04:06
any-dl: downloader-tool, initially written for downloading files from video-sites, is flexible enough for web-scraping.
paulblack 15 Jun 2016, 05:00
I've been using one called Octoparse to help me collect data from the web. It's a client-side software that turns websites into structured tables of data without having to use code. Pretty easy to use because it's for non-programmers.
But you have to know a little bit about web page formats and X path.
Jake 03 Mar 2018, 16:26
Thanks for the good article about web scraping, you could add new cloud based web scraping and data extraction platform Diggernaut

If you have any questions or comments, please email Dominic