When web pages are properly formatted, you can easily use Google spreadsheets to import (“scrape”) their data. Here’s how.
I wanted to rank National Cancer Institute research expenditures by cancer type; on the website, the data are presented alphabetically.
How to scape data into your spreadsheet for analysis
Log in to Google Drive and pick create new spreadsheet.
In cell A1 (upper-left hand corner), paste this:
Wait a moment, and you should see something that looks like this:
Now that you have the data, you can make the comparisons – and charts – that interest you.
Be sure to freeze the header row before you begin sorting data:
- Highlight the row (click on the row number, in this case, it’s 1)
- Tap “View” and then select “Freeze”
- I chose “2 rows” so that the total budget would remain fixed at the top
How do you scrape data off a page that you’re interested in?
You need three things:
- The page URL
- The HTML container that holds the data (in this example, “table”)
- And a number reflecting which table you want (in this example, it’s the first table)
To identify the HTML container, you can use the “inspect element” feature of your browser (Chrome, Safari*) by right-clicking the page near the title of the data. In well-formatted HTML, the container will be a table.
To identify which table (or div) contains the information I’m looking for, I copy
- the page source HTML;
- paste it into BBEdit;
- then search for the HTML tag (in this case, I used <table — I don’t close the tag on search because sometimes there are CSS rules included in the tag space).
Here are additional resources:
- If you want to use your data at Tableau Public
- If you want to scrape “data” that isn’t formatted as data
- If you want to scrape social media statistics
How to enable “inspect element” in Safari
- Go to Safari > Preferences > Advanced
- Ensure that Show Develop menu in menu bar is enabled
- You now have a new menu tab and an advanced set of tools at your disposal.
- Inspect element by right-clicking in the page
- Move up or down the page to see the element you’re interested in.