September 1, 2015 3:07 pm

How to use the Web Scraper Chrome Extension part 2

This is the second part of the Using the Web Scraper Chrome Extension series; please read the first part to see how to perform basic content scraping. After the content has been scraped and the popup closes, click on the Sitemap tab -> Export data as CSV.

[Image: Using the Web Scraper Chrome Extension 6]

Next, click on Download now and choose where to save the .csv file:

[Image: Using the Web Scraper Chrome Extension 7]

The .csv file should look something like this:

[Image: Using the Web Scraper Chrome Extension 8]

It contains only one column, called gif (the id of our selector), and many rows, each containing the URL of a gif.

Create a simple MySQL table called awesomegifs that matches the structure of our .csv file. It needs only two columns: a url column of type VARCHAR that will contain the URL of each gif, and an auto-incrementing id column that serves as the primary key. Here is the raw SQL for it:
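A sketch of that table follows. The VARCHAR length of 255 is an assumption; adjust it if your URLs are longer:

```sql
-- Two columns: url first (filled from the .csv), id last (auto-incremented).
CREATE TABLE awesomegifs (
    url VARCHAR(255) NOT NULL,
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (id)
);
```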

It is important that id is the last column, because values from the .csv file are assigned to the table's columns in the order in which they appear in the file. In our case, the single gif column in the .csv file will be assigned to the first column of our table, url. The second column, id, will not be assigned any value from the .csv file and will auto-increment instead.

Now, execute the following SQL command, substituting the proper path to your .csv file:
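A sketch of that command, assuming the file lives at /path/to/awesomegifs.csv (replace this example path with the actual location of your file) and that the first line of the file is the header row produced by Web Scraper:

```sql
-- Load the scraped URLs; with no column list, the single CSV field
-- goes into the table's first column (url), and id auto-increments.
LOAD DATA INFILE '/path/to/awesomegifs.csv'
INTO TABLE awesomegifs
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row ("gif")
```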

And, voila, we have 758 rows inserted.

[Image: Using the Web Scraper Chrome Extension 9]

The image below shows how our table looks after the .csv file has been loaded:

[Image: Using the Web Scraper Chrome Extension 10]

Finally, you have your first website scraped with the Web Scraper extension.

Let us examine a more complicated website now. Say we want to scrape all jokes from the laughfactory.com website. We create a new sitemap with the name laughfactory and the starting URL http://www.laughfactory.com/jokes/.

[Image: Using the Web Scraper Chrome Extension 11]

Then, we create a new selector of type Link with the Multiple checkbox checked. This tells Web Scraper to go over each link that matches the selector on the page http://www.laughfactory.com/jokes/. Our CSS selector is .left-nav a, which makes Web Scraper go through each link (each category of jokes) on that page.

[Image: Using the Web Scraper Chrome Extension 12]

Then, we click on the newly created categories selector and create a nested selector (a selector within the categories selector) called pagination. This instructs Web Scraper to first open each category and then follow all of its pagination links:

[Image: Using the Web Scraper Chrome Extension 13]

Our pagination selector is .pagination li a, which points to each link in the pagination container; it is again of type Link with Multiple checked. We check Multiple because there is more than one link in the pagination container, and we choose the Link type because we want Web Scraper to follow each of those links.

Finally, we want to extract all jokes on all pages within all categories. Once Web Scraper has followed a category and landed on one of its pages of jokes, we need another nested selector (located at _root / categories / pagination). We create a Text selector with Multiple checked, because there is more than one joke on a page. Our CSS selector in this case is .joke3-pop-box p.

[Image: Using the Web Scraper Chrome Extension 14]

As with the previous example, you now click on Scrape, wait, export the data as a .csv file, and load it into a MySQL table; then it is ready for use in your applications.

Here is how a simple MySQL table looks (in terms of structure and content) with the resulting data loaded.

[Image: Using the Web Scraper Chrome Extension 15]
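Such a table can be sketched along the same lines as the awesomegifs one; the table name jokes and the TEXT column type are assumptions, since jokes can easily exceed a VARCHAR's practical length:

```sql
-- Same layout as before: the scraped text first, the auto-incremented id last.
CREATE TABLE jokes (
    joke TEXT NOT NULL,
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (id)
);
```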

What are you waiting for? Go scrape the web!

Author Ivan Dimov

Ivan is a student of IT, a freelance web designer/developer, and a tech writer. He deals with both front-end and back-end development. Whenever he is not in front of an Internet-enabled device, he is probably reading a book or traveling. You can find more about him at http://www.dimoff.biz, on Facebook, and on Twitter.

