Web scraping is a very useful mechanism for extracting data from websites, or automating actions on them. Normally we would use urllib or requests for this, but things start to fail when a website uses JavaScript to render the page rather than serving static HTML. For many websites the information is available in static HTML, but for others it is loaded dynamically through JavaScript (e.g. from AJAX calls). The reason may be that the information is constantly changing, or it may be to prevent web scraping! Either way, you need more advanced techniques to scrape the information – this is where the selenium library can help.
What is web scraping?
In terms of terminology, web scraping, also known as web harvesting or web data extraction, is data scraping used for extracting data from websites. A web scraping script may access the URL directly using HTTP requests, or it may simulate a web browser. The second approach is exactly how selenium works – it simulates a web browser. The big advantage of simulating a browser is that the website is fully rendered – whether it uses JavaScript or static HTML.
What is selenium?
According to the official Selenium web page, it is a suite of tools for automating web browsers. The project is a member of the Software Freedom Conservancy. Selenium consists of three projects, each providing different functionality; if you are interested in them, visit the official website. The scope of this blog is limited to the Selenium WebDriver project.
When should you use selenium?
Selenium gives us the tools to perform web scraping, but when should it be used? Generally, you can use selenium in the following scenarios:
- When the data is loaded dynamically – for example Twitter. What you see in “view source” is different from what you see on the page. (The reason is that “view source” just shows the static HTML. If you want to see under the covers of a dynamic website, right click and choose “inspect element” instead.)
- When you need to perform an interactive action in order to display the data on screen – a classic example is infinite scrolling. For some websites, you need to scroll to the bottom of the page before more entries will show. What happens behind the scenes is that when you scroll to the bottom, JavaScript code calls the server to load more records on screen (a sketch of how to automate this follows below).
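For the infinite-scrolling case, a minimal sketch (assuming driver is an already-initialized selenium webdriver; the pause length and scroll count are arbitrary values you would tune per site):

import time

for _ in range(3):  # scroll count is an arbitrary assumption
    # Scroll to the bottom so the page's JavaScript loads more records
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the new records time to load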
So why not use selenium all the time? It is a bit slower than using requests and urllib, because selenium runs a full browser, including all the overhead that brings with it. There are also a few extra setup steps required to use selenium, as you can see below.
Once you have the data extracted, you can still use similar approaches to process it (e.g. using tools such as BeautifulSoup).
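For instance, a sketch (assuming BeautifulSoup is installed and a selenium driver is already open on a page) of handing the rendered HTML over for parsing:

from bs4 import BeautifulSoup

# driver.page_source holds the fully rendered HTML after JavaScript has run
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.string)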
Pre-requisites for using selenium
Step 1: Install selenium library
Before starting with the web scraping sample, ensure that all requirements have been met. Selenium requires pip or pip3 to be installed; if you don’t have it, you can follow the official guide to install it for your operating system.
Once pip is installed, you can proceed with the installation of selenium with the following command:
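pip install selenium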
Alternatively, you can download the PyPI source archive (selenium-x.x.x.tar.gz) and install it using setup.py:
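python setup.py install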
Step 2: Install web driver
Selenium drives an actual browser. It won’t reuse your normal browser installation directly; instead it uses a “driver”, a separate executable that launches and controls the browser. Selenium supports multiple web browsers, so you may choose which one to use (read on).
Selenium WebDriver refers to both the language bindings and the implementations of the individual browser controlling code. This is commonly referred to as just a web driver.
The web driver needs to be downloaded, and then it can either be added to the PATH environment variable or initialized with a string containing the path to the downloaded executable. Environment variables are out of the scope of this blog, so we are going to use the second option.
From here to the end, the Firefox web driver (geckodriver) is going to be used. Each browser has its own driver, and you are free to choose any of them, but Firefox is recommended if you want to follow this blog.
Download the driver to an accessible folder; your script will refer to it by path.
You can follow our guide on how to install the web driver here.
A Simple Selenium Example in Python
Ok, we’re all set. To begin with, let’s start with a quick example to ensure things are all working. Our first example will involve collecting a website title. Assuming selenium is already installed in your environment, just import webdriver from selenium in a python file, as shown in the following.
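Here is a minimal sketch of this first example, using the Selenium 3 style API that this post follows; the driver path is a placeholder to replace with your own:

from selenium import webdriver

# Path to the downloaded Firefox driver (geckodriver) – a placeholder
driver = webdriver.Firefox(executable_path="/path/to/geckodriver")
driver.get("https://www.google.com")
print(driver.title)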
Running the code above will open a Firefox window, which looks a little different from a normal one, and at the end it prints the title of the website to the console – in this case, it is collecting the title from ‘Google’.
Note that this was run in the foreground so that you can see what is happening. We now have to manually close the Firefox window that was opened; it was intentionally left open so you could see that the web driver navigates just like a human would. But now that we know that, we can add driver.quit() at the end of our code so the window is closed automatically after the job is done. The code now looks like this:
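from selenium import webdriver

driver = webdriver.Firefox(executable_path="/path/to/geckodriver")  # placeholder path
driver.get("https://www.google.com")
print(driver.title)
driver.quit()  # close the browser window once the job is done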
Now the sample will open the Firefox web driver, do its job and then close the window. With this little and simple example, we are ready to go deeper and learn from a more complex sample.
How To Run Selenium in background
In case you are running your environment from a console only, or through PuTTY or another terminal, you may not have access to a GUI. Also, in an automated environment, you will certainly want to run selenium without the browser popping up – i.e. in silent or headless mode. This is where you can add “options” with the “--headless” argument at the start, as shown below.
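A sketch of that headless setup for the Firefox driver:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")  # run the browser without a visible window
driver = webdriver.Firefox(executable_path="/path/to/geckodriver", options=options)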
The remaining examples will be run with the browser visible so that you can see what is happening, but you can add the snippet above to run them headless.
Example of Scraping a Dynamic Website in Python With Selenium
Up to here, we have figured out how to scrape data from a static website; with a little bit of time and patience you are now able to collect data from static websites. Let’s now dive a little deeper into the topic and build a script to extract data from a webpage which is dynamically loaded.
Imagine that you were asked to collect a list of YouTube videos regarding “Selenium”. With that information, we know that we are going to gather data from YouTube, and that we need the search results for “Selenium” – but these results will be dynamic and will change all the time.
The first approach is to replicate what we did with Google, but now with YouTube, so a new file needs to be created: yt-scraper.py
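A sketch of the starting point for yt-scraper.py, with the same placeholder driver path as before:

from selenium import webdriver

driver = webdriver.Firefox(executable_path="/path/to/geckodriver")  # placeholder path
driver.get("https://www.youtube.com")
print(driver.title)
driver.quit()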
Now we have the YouTube title printed, but we are about to add some magic to the code. Our next step is to find the search box and fill it with the word we are looking for, “Selenium”, simulating a person typing it into the search. This is done by using the Keys class:
from selenium.webdriver.common.keys import Keys
The driver.quit() line is going to be commented out temporarily so we are able to see what the script does. Putting it all together, yt-scraper.py now looks like this:
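A sketch of the full script at this point, assuming the Selenium 3 style API used throughout this post and a placeholder driver path:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox(executable_path="/path/to/geckodriver")  # placeholder path
driver.get("https://www.youtube.com")
print(driver.title)

# Locate the search box by its id, type the query and hit Enter
search_box = driver.find_element_by_xpath('//input[@id="search"]')
search_box.send_keys("Selenium")
search_box.send_keys(Keys.ENTER)

# driver.quit()  # commented out temporarily so we can watch the browser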
The YouTube page shows a list of videos from the search, as expected!
As you might have noticed, a new function has been called, named find_element_by_xpath, which could be confusing at the moment as it takes a strange-looking string. Let’s learn a little bit about XPath to understand it better.
What is XPath?
XPath is a query language used for navigating the structure of a document. It provides a syntax for finding any element on a web page using a path expression over the DOM, and it can be used with both HTML and XML documents.
In the example above we used ‘//input[@id="search"]’. This finds all <input> elements which have an attribute called “id” whose value is “search”. If you “inspect element” on the search box on YouTube, you can see there’s a tag <input id="search" … >. That’s exactly the element we’re selecting with this XPath.
There is a great variety of ways to find elements within a website; here is the full list, which is recommended reading if you want to master web scraping. A few of the common locator methods are sketched below.
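As a quick illustration, a few of the Selenium 3 style locator methods (the class name and selectors below are hypothetical examples, not elements of a real page):

driver.find_element_by_id("search")                    # by the id attribute
driver.find_element_by_class_name("video-card")        # hypothetical class name
driver.find_element_by_css_selector("input#search")    # CSS selector
driver.find_element_by_xpath('//input[@id="search"]')  # XPath expression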
Looping Through Elements with Selenium
Now that XPath has been explained, we can move on to the next step: listing videos. So far we have code that opens https://youtube.com, types the word “Selenium” into the search box and hits the Enter key so the search is performed by YouTube’s engine, resulting in a bunch of videos related to Selenium. Let’s now list them.
Firstly, right click and “inspect element” on the video section and find the element which is the start of a video entry. You can see that it’s a <div> tag with id="dismissable".

We want to grab the title, so within the video, find the tag that covers the title. Again, right click on the title and “inspect element” – here you can see an element with id="video-title". Within this tag, you can see the text of the title.
One last thing: remember that we are working over the internet, so sometimes we need to wait for the data to be available. In this case, we are going to wait 5 seconds after the search is performed and then retrieve the data we are looking for. Keep in mind that the results could vary due to internet speed and device performance.
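Here is a sketch of the listing step, to be appended to the script above; it assumes the two ids identified above and uses a fixed 5 second wait:

import time

time.sleep(5)  # wait for the search results to load

videos = driver.find_elements_by_xpath('//div[@id="dismissable"]')  # one per video entry
print("Videos found:", len(videos))

for video in videos:
    # The title text lives in a descendant element with id="video-title"
    print(video.find_element_by_xpath('.//*[@id="video-title"]').text)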
Once the code is executed you will see a list printed to the console containing the videos collected from YouTube: it first prints the website title, then tells us how many videos were collected, and finally lists those videos.
Waiting 5 seconds works, but then you have to adjust it for each internet speed. There’s another mechanism you can use, which is to wait for the actual element to be loaded – you can use this with a try/except block instead.
So instead of the time.sleep(5), you can replace that line with:
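A sketch using selenium’s built-in explicit waits together with the try/except block mentioned above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    # Wait up to 5 seconds for at least one video entry to appear
    videos = WebDriverWait(driver, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[@id="dismissable"]'))
    )
except TimeoutException:
    print("Videos did not load within 5 seconds")
    driver.quit()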
This will wait up to a maximum of 5 seconds for the videos to load; otherwise it will raise a timeout.
Conclusion
With Selenium you are able to perform an endless number of tasks, from automation to automated testing – the sky is the limit. You have learned how to scrape data from static and dynamic websites, and how to perform actions like sending keys such as “Enter”. As a next step, you can also look at BeautifulSoup to extract and search the data you collect.
Web scraping, or data extraction from internet sites, can be done with various tools and methods. The most complex sites to scrape are the ones that look for suspicious behavior and non-human patterns. Therefore, the tools we use for scraping must simulate human behavior as much as possible.
The tools that are developed and can simulate human behavior are testing tools, and one of the most commonly used and well-known tools as such is Selenium.
In our previous blog post in this series we talked about puppeteer and how it is used for web scraping. In this article we will focus on another library called Selenium.
What is Selenium?
Selenium is a framework for web testing that allows simulating various browsers and was initially made for testing front-end components and websites. As you can probably guess, whatever one would like to test, another would like to scrape. And in the case of Selenium, this is a perfect library for scraping. Or is it?
A Brief History
Selenium was first developed by Jason Huggins in 2004 as an internal tool for a company he worked for. Since then it has evolved a lot, but the concept has remained the same: a framework that simulates (or, in truth, really operates) a web browser.
How does selenium work?
Basically, Selenium is a library that can control an automated version of Google Chrome, Firefox, Safari, Vivaldi, etc. You can use it in an automated process and imitate a normal user’s behavior. If for example, you would like to check your competitor’s top 10 ranking products daily, you would write a piece of code that will open a Chrome window (automatically, using Selenium), surf to your competitor’s storefront or search results on Amazon and scrape the data of their leading products.
Selenium is composed of several components that make the testing (or scraping) process possible:
Selenium IDE:
This is a real IDE for testing. It is a Chrome extension or Firefox add-on that allows recording, editing and debugging tests. This component is less useful for scraping, since scraping is usually done through an API.
Selenium Client API:
This is the API that allows us to communicate with Selenium from our own code. There are client libraries for different programming languages, so scraping can be done in JavaScript, Java, C#, R, Python, and Ruby. It is continuously supported and improved by a strong community.
Selenium WebDriver:
This is the component of Selenium that actually plays the browser’s part in the scraping. The driver is just like a remote control that connects to a specific browser (like in anything else, each remote control is designed to control a specific browser), and through it, we can control the browser and tell it what to do.
Selenium Grid:
This is a tool that allows us to run multiple Selenium instances on remote machines and distribute work across them.
Since our previous article talked about Puppeteer and praised it for being the best tool for web scraping, let’s examine the differences and similarities between Selenium and Puppeteer.
What is the difference between Selenium and Puppeteer?
This is a very common question and the distinction is very important. Both of these libraries are very similar in concept, but there are a few key points to consider. Puppeteer’s main disadvantage is that it is limited to JavaScript, since the API Google publishes supports JavaScript only. And since it is a library written by Google, it supports only the Chrome browser. If you prefer to write your code in a language other than JavaScript, or it is important to your company to use a web browser other than Google Chrome, I would consider using Selenium.
For scraping purposes, the fact that Puppeteer supports only Chrome really does not matter in my opinion. It matters more for testing usages when you would want to test your app or website on different browsers.
The library limitations and the fact that you need to know JavaScript in order to use it might be a bit more limiting, though I believe it’s always good to learn new skills and programming languages, and controlling Puppeteer with JavaScript code is a great place to start.
Should I choose Puppeteer or Selenium for web scraping?
The advantages of Puppeteer over Selenium are immense, and on top of them stands the fact that Puppeteer is faster than Selenium. If you are planning a high scale scraping operation, that may be a point to consider.
In addition, Puppeteer has more options that are vital for scraping, such as network interception. This is another great advantage over Selenium that allows you to handle each network request and response that is generated while loading a page and log them, stop them, or generate more of them.
This allows you to intercept requests and block certain resources such as images, CSS, or JavaScript files, reducing your response time and the traffic you use – both very important when scraping at a large scale.
If you are planning on scraping for business purposes – whether to build a comparison website, a business intelligence tool, or any other business-oriented product – we would suggest you use an API solution.
To sum it up, here are the main pros and cons of Selenium and Puppeteer for web scraping.
Pros of selenium for web scraping:
Works with many programming languages.
It can be used with many different browsers and platforms.
Can manually record tests/small scrape operations.
It is powered by a great community.
Cons of selenium for web scraping:
Slower than Puppeteer.
It has less control over the way the scraping is done and has less advanced features.
Pros of puppeteer for web scraping:
Faster than other libraries.
It is easier to use – no need to install any external driver.
It has more control, allows more options like Network Interception.
Cons of puppeteer for web scraping:
The main drawback of Puppeteer is that it currently works only with JavaScript.
To conclude, since most of the advantages of Selenium over Puppeteer are mainly centered around testing, for scraping I would definitely recommend giving Puppeteer a try. Even at the price of learning JavaScript.
I know Puppeteer isn’t for everyone – you might already be into Selenium, want to write a Python scraper, or just not like Puppeteer for any reason. That’s ok, and in this case let’s see how to use Selenium for web scraping.
How is Selenium used for web scraping?
Web scraping using Selenium is rather straightforward. I don’t want to go into a specific language (We will add more language-specific tutorials in the future), but the steps are very similar in every language. Feel free to use the official Selenium Documentation for in-depth details. The first and most important step is to install the Selenium web driver component.
Install Selenium in Javascript:
npm install selenium-webdriver
Install selenium in Python:
pip install selenium
Then you can start using the library according to the documentation.
3 Best practices for web scraping with Selenium
Scraping with Selenium is rather straightforward. First, you need to get the HTML of the div, component or page you are scraping. This is done by navigating to that page using the web driver and then using a selector to extract the data you need, as in the sketch below.
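A minimal sketch of that flow – the URL and the CSS selector are hypothetical placeholders:

from selenium import webdriver

driver = webdriver.Firefox(executable_path="/path/to/geckodriver")  # placeholder path
driver.get("https://example.com/products")  # hypothetical page
for item in driver.find_elements_by_css_selector("div.product-name"):  # hypothetical selector
    print(item.text)
driver.quit()

There are 3 key points you should notice though: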
1. Use a good proxy server with IP rotation
This is the pitfall you can most easily fall into if you are a programmer just starting your scraping journey – if you do not use a good web proxy service, you will get blocked. All modern sites including Google, Amazon, Airbnb, eBay, and many others use advanced anti-scraping mechanisms. Some put more effort into it than others, but all of them will start blocking you very quickly if you don’t use a proxy service and change your address every X requests. What is X? It depends on the website, but it usually varies between 5 and 20.
Once you have your proxy in place and have a rotation mechanism for the IPs you use, the number of times you will be blocked by websites will be reduced to around zero.
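As an illustration only, here is a sketch of routing the Firefox driver through an HTTP proxy; the host and port are placeholders for whatever your proxy service provides:

from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)  # 1 = manual proxy configuration
profile.set_preference("network.proxy.http", "proxy.example.com")  # placeholder host
profile.set_preference("network.proxy.http_port", 8080)            # placeholder port
profile.set_preference("network.proxy.ssl", "proxy.example.com")
profile.set_preference("network.proxy.ssl_port", 8080)
driver = webdriver.Firefox(executable_path="/path/to/geckodriver", firefox_profile=profile)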
2. Find the sweet spot in your crawling rate
Crawling too fast is a big problem and will cause you to hit a lot of anti-scraping defenses on the websites you are scraping. Of course, you want this to be as fast as possible, but crawling too fast will get you blocked, while crawling too slowly makes the job painfully slow.
Try to play with it: start with one request per second and increase from there. Make sure that for this test you are using a different IP address every time and that you are trying to fetch different objects from the website you are testing.
3. Use user-agent rotation
When sending a request for a webpage or an object, the browser sends a header called User-Agent. It is necessary to change this every few requests, and always send a “legitimate” User-Agent (one that conceals the fact that this is a headless browser). Read more about User-Agents.
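With the Firefox driver, one way to override the header is through a profile preference (a sketch – the User-Agent string is just an example of a common desktop browser value, and should be rotated):

from selenium import webdriver

profile = webdriver.FirefoxProfile()
# Override the default User-Agent; rotate this value every few requests
profile.set_preference(
    "general.useragent.override",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
)
driver = webdriver.Firefox(executable_path="/path/to/geckodriver", firefox_profile=profile)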
In the next posts we will continue to discuss the challenges you may face when scraping online information, and the alternatives available. You are also more than welcome to check our recent article about the differences between in-house scraping and web scraping APIs.