Advanced: Create a simple scraper in Google Sheets using Xpath
This article comes from the Silk Data Handbook. It explains how to use a fairly simple way of scraping data using the
ImportXML formula with XPath query language. Once you ended up with a suitable spreadsheet, you can import the spreadsheet into your Silk site.
One of the most useful things when it comes to harvesting data from the web is learning how to use XPath expressions. XPath is “a query language for selecting nodes from an XML“. Knowing how to use XPath to parse webpages will allow endless scraping possibilities, and help you to built a spreadsheet from scratch. You can read more here and here.
For less complex scrapers, you’ll find that you don’t really need a deep knowledge of XPath to use it properly. For example, you can easily use XPath to construct web crawlers that turn thousands of pages into a structured spreadsheet.
You can use the importXML function to create an automatically updated Silk with Google Sheet sync!
Google Spreadsheets: importXML
Let’s say I have a list of cities, and i want to pull in basic, structured, information about each from Wikipedia. This is how I would start like
So to fill the image column, in cell C2, type:
The drag it down for all the other cities.
"//td/a/img/@src" is the XPath expression to query the url (stored in B2) and return the results contained in that specified path. To understand the exact path of what you need, you can view the source of a webpage and figure it out yourself. Or you can find the object you want (in this case the image), right click on it and select “Inspect Element". You will then view the page source. Find the url of image you want, right click and select “Copy XPath". You will now have saved on your clipboard the XPath needed to retrieve the image.
You can repeat this to fill out all the other columns. For example, the first paragraph describing a city is accessible through:
=JOIN(" ",importXML(B2, "//div/p"))