Scraping tables from web pages

Released: Sat, 03 May 2014 21:36:00 +0000

One useful feature that we recently added to Salstat is the ability to scrape web pages for tables. These can be previewed and then imported.

Say we see a great web page and we want to analyse the data ourselves. Let's consider this page about the classification and nomenclature of influenza.

My normal workflow would be to copy and paste the table and, failing that, to save the page and scrape it with an HTML parser. It's possible to do this fairly automatically, but it requires some setting up. Wouldn't it be so much easier if I could just copy the URL and have all that work done for me?

Salstat can download this page for you and preview all of its HTML tables: press CTRL-U (or select 'Scrape URL...' from the File menu) and paste the page's URL into the dialog that appears. Then and there, up pops a dialog previewing the tables from that web page.

You can see that each HTML table is shown as a tab. This means you can select the correct one. We have to do this because Salstat is friendly but not smart enough to know whether a table is being used for data or for some other purpose (e.g., layout). Here, it shows the data just as on the web page. Clicking on the 'Import' button imports the data like this:
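To give a feel for what finding every table on a page involves, here is a minimal sketch using only Python's standard library. It is not Salstat's actual code: the names `TableExtractor`, `extract_tables` and `fetch_html` are made up for illustration, the sample HTML is invented, and nested tables are not handled.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []     # finished tables, in document order
        self._table = None   # rows of the table currently being read
        self._row = None     # cells of the row currently being read
        self._cell = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def extract_tables(html):
    """Return all tables found in an HTML string (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def fetch_html(url):
    """Download a page as text (hypothetical helper, not called below)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Invented sample: one layout table followed by one data table.
SAMPLE = """
<table><tr><td>Home</td><td>About</td></tr></table>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
"""

tables = extract_tables(SAMPLE)
```

Both tables come back, layout table included, which is why a person still has to pick the right tab: nothing in the markup says which table holds the data.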

The table's headings now become variable names and the data have been entered for you. No fuss, just paste the URL and select the table you want. You can do further analysis directly upon the data from this point. We want to help people not just create data sets from tables but also insert tables into an existing data set. This would help automate the aggregation of multiple data tables.
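The import step, where the first row supplies variable names and the remaining rows supply the data, could be sketched along these lines. Again this is an illustration under assumptions rather than Salstat's implementation: `table_to_variables` and `to_number` are hypothetical names, and the choices to pad short rows with `None` and to convert numeric-looking text to floats are ours.

```python
def to_number(value):
    """Convert numeric-looking text to float; leave anything else unchanged."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return value


def table_to_variables(rows):
    """Turn a table (list of rows) into a dict of variable name -> column values.

    The first row is treated as the headings. Blank headings get a
    placeholder name and repeated headings get a numeric suffix.
    """
    if not rows:
        return {}
    header, *body = rows

    names = []
    for position, heading in enumerate(header, start=1):
        base = heading.strip() or f"var{position}"
        name, suffix = base, 2
        while name in names:
            name = f"{base}_{suffix}"
            suffix += 1
        names.append(name)

    columns = {name: [] for name in names}
    for row in body:
        # Pad short rows with None; cells beyond the headings are ignored.
        padded = list(row) + [None] * (len(names) - len(row))
        for name, value in zip(names, padded):
            columns[name].append(to_number(value))
    return columns


# Invented example rows, shaped like the output of an HTML table scraper.
rows = [["name", "score", "score"], ["a", "1.5", "2"], ["b"]]
variables = table_to_variables(rows)
```

Storing the data column by column means each variable can be handed straight to an analysis routine.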

So far, it's a fairly basic feature that only works with well-structured HTML tables. Most non-standard tables will not import properly. We aim for Salstat to handle these superbly in the future.

Salstat is an open source project fostered by Thought Into Design Ltd
Thought Into Design Ltd is registered in England and Wales (Companies House number 7367421)