In the previous post, we discussed how to start web scraping with the requests and lxml libraries, and we also summarized two limitations of this approach:
- Time & effort required to chain all the requests for some complicated operations such as user authentication
To solve these two issues, I recommended using the selenium package. If you have checked this post, you may still remember that we can use selenium to simulate human actions such as opening a URL in the browser or triggering a button click on the web page.
In this post, I will demonstrate how to use selenium to automatically log in to a Twitter account, then view and post tweets. The same approach can be used for your web scraping project.
In order to use selenium to launch a browser, you will need to download a web driver for the browser you are using. You can check all the supported browsers as well as the download links from here.
For the code examples below, I will use Chrome version 86 and download the driver that supports this version. For simplicity, I will save chromedriver.exe in my current code directory.
Besides the driver file, you will also need to install selenium in your working environment. Below is the pip command to install the latest version:
pip install --upgrade selenium
Let’s also import all the modules at the beginning of our code. Explanation will be given later where these modules are used:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as ec
With the above ready, let’s dive into our code example.
Log in to a Twitter account with Selenium
Similar to human behavior in the browser, Selenium does not allow you to interact with invisible elements, and you will encounter an ElementNotVisibleException when trying to access an element that is not fully loaded or not in view. So a good practice is to maximize your browser window, so that most of the information you need is visible and interactable.
To maximize the browser upon launching, you can set --start-maximized in the Chrome options as per below:
chromeOptions = Options()
chromeOptions.add_argument("--start-maximized")
(You can also launch the browser first and later call the maximize_window function to maximize it)
These Chrome options shall be passed into the web driver constructor when it is initialized. We also need to specify the path of the driver exe file; in our case, it’s under the current directory.
driver = webdriver.Chrome(executable_path="chromedriver.exe", options=chromeOptions)
With the above code, a new Chrome browser will be launched. The web driver object has a get method which accepts a URL parameter and opens it in the browser. The code below will open the Twitter login page in your browser:
tweeter_url = "https://twitter.com/login"
driver.get(tweeter_url)
As many factors affect how fast the web page can be fully loaded, you may need to add some delays at certain steps to make sure that the current action has completed successfully before moving to the next step.
In Selenium, there are two types of waiting approaches: implicit wait and explicit wait. An implicit wait instructs the web driver to keep polling the DOM for up to a set maximum time when locating an element, while an explicit wait checks a specific condition, such as the presence or visibility of an element, periodically until the condition is met or the maximum waiting time is reached. As an implicit wait applies to the entire lifecycle of the web driver, the explicit wait is relatively more flexible. Let’s define our explicit wait with a maximum of 10 seconds in our example:
wait = WebDriverWait(driver, 10)
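To make the explicit wait less of a black box, below is a minimal pure-Python sketch of the polling idea behind it. It is not Selenium's actual implementation; wait_until, timeout and poll_interval are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # The condition is met, so hand back whatever it returned.
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Sleep briefly before checking again instead of spinning.
        time.sleep(poll_interval)


# Illustration only: a condition that becomes true on its third call.
attempts = []

def ready_on_third_call():
    attempts.append(1)
    return len(attempts) >= 3

wait_until(ready_on_third_call, timeout=2, poll_interval=0.01)
print(f"condition met after {len(attempts)} checks")
```

The point of the sketch is that the wait returns as soon as the condition holds, so a 10 second maximum does not mean every step takes 10 seconds.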
Now, we shall follow what we discussed in the previous post to find a unique identifier for the login username and password fields. By inspecting the web page HTML, you can easily find the name attribute of the username and password fields. Below is the screenshot of the HTML structure for the username field:
To locate the username element, we can look it up by its name attribute with By.NAME. Let’s also use the explicit wait, so that the lookup only returns once the element is loaded and visible on the page:
username_input = wait.until(ec.visibility_of_element_located((By.NAME, "session[username_or_email]")))
Once we have located the username input field, we can send our login ID to this field with the send_keys function as per below:
username_input.send_keys(username)
Note: you will need to replace the username/password variables with your Twitter login credentials
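Rather than typing your credentials directly into the script, one option is to read them from environment variables. The sketch below shows one way to do that; load_credentials is a hypothetical helper, and TWITTER_USERNAME and TWITTER_PASSWORD are variable names I made up for this example.

```python
import os


def load_credentials(environ=os.environ):
    """Read the login ID and password from environment variables."""
    # TWITTER_USERNAME and TWITTER_PASSWORD are hypothetical names; use whatever
    # names you have actually set in your own environment.
    try:
        return environ["TWITTER_USERNAME"], environ["TWITTER_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"environment variable {missing.args[0]} is not set"
        ) from None


# In the login script you would then write:
# username, password = load_credentials()
```

This keeps the password out of your source file, which matters if the code is ever shared or committed to a repository.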
Similarly, we can locate our password field by its name and send in our password:
password_input = wait.until(ec.visibility_of_element_located((By.NAME, "session[password]")))
password_input.send_keys(password)
Once we have successfully set the values into these two fields, we can simulate the button click on the login button:
- Firstly, we shall locate the login button by its attribute data-testid='LoginForm_Login_Button'
- Then call the WebElement click function to simulate how a user clicks on the button
With the below code, you shall be able to log in to your Twitter account and view the tweets on your home screen:
login_button = wait.until(ec.visibility_of_element_located((By.XPATH, "//div[@data-testid='LoginForm_Login_Button']")))
login_button.click()
To showcase how to interact with your web page like a normal user, let’s move to the next example and search Twitter posts with some keywords.
Search Twitter posts by keywords
As before, we shall first locate our search input box by its data-testid attribute as per below:
search_input = wait.until(ec.visibility_of_element_located((By.XPATH, "//div/input[@data-testid='SearchBox_Search_Input']")))
As a normal user, I can key in some keywords in the search box and hit ENTER to search. We can do the same from Selenium via the send_keys function. Let’s first clear the input box and then send the keyword “ethereum” together with an ENTER key:
search_input.clear()
search_input.send_keys("ethereum" + Keys.ENTER)
Upon receiving the ENTER key event, you shall see the search results loading on the page. The next step is to extract the Twitter posts from the search results.
Below is the sample code that extracts all the text from the tweets and prints it as output:
tweet_divs = driver.find_elements_by_xpath("//div[@data-testid='tweet']")
for div in tweet_divs:
    spans = div.find_elements_by_xpath(".//div/span")
    tweets = ''.join([span.text for span in spans])
    print(tweets)
You shall see the output similar to below:
With these plain text results, you may use some text processing tools to further analyze what people are discussing around this keyword.
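As a small illustration of such text processing, the sketch below counts the most frequent words across a list of tweet texts using only the standard library. top_words and its parameters are hypothetical names, and the sample strings are made-up placeholders rather than real tweets; in practice you would pass in the text collected by the loop above.

```python
import re
from collections import Counter


def top_words(texts, n=5, min_length=4):
    """Return the n most frequent words across texts, ignoring case and short words."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and split it into simple word tokens.
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Skip very short words, which are mostly filler such as "the" or "are".
        counts.update(word for word in words if len(word) >= min_length)
    return counts.most_common(n)


# Made-up placeholder strings, not real tweets.
sample_texts = [
    "Ethereum gas fees are high today",
    "Thinking about ethereum staking",
    "Gas fees again...",
]
print(top_words(sample_texts, n=2))
```

The length filter is a crude stand-in for a proper stop-word list; a dedicated text processing library would give better results on real data.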
Automatically post new tweets
Since we are able to search within Twitter, we shall also be able to post a new tweet with Selenium.
Let’s first locate the below text area by the data-testid attribute:
Below is the code to locate the span of the text area via its ancestor div:
tweet_text_span = driver.find_element_by_xpath("//div[@data-testid='tweetTextarea_0']/div/div/div/span")
Then we can send whatever text we want to tweet:
tweet_text_span.send_keys("Do you know we can tweet with selenium?")
Once the text is written into the span, the tweet button will be enabled. You can locate the button and click to submit the post:
tweet_button = wait.until(ec.visibility_of_element_located((By.XPATH, "//div[@data-testid='tweetButtonInline']")))
tweet_button.click()
Upon submission, you shall see a new post added to your timeline as per below:
Move invisible elements into view
There are always cases where you need to scroll up and down or left and right to view some information on the web page. You will also need to make sure your elements are in view before you can perform any operation such as getting their attributes or clicking them. Below is an example that scrolls the “Who to follow” section into view with JavaScript:
who_to_follow = driver.find_element_by_xpath("//div/span[text() = 'Who to follow']")
driver.execute_script("arguments[0].scrollIntoView(true);", who_to_follow)
Hide your browser with headless mode
When you use Selenium for some automation or scraping job, you may not wish to see the web pages jumping around in front of you. To keep it running quietly in the background, you can set the headless argument in the Chrome options before initializing the web driver:
chromeOptions.add_argument("--headless")
With this parameter, we will not see the browser launched and everything will run quietly in the background. It is a good idea to test your code properly before you enable headless mode.
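One way to make that testing easier is to keep headless mode behind a switch, so you can watch the browser while developing and hide it later without editing the code. Below is a minimal sketch of that idea; chrome_arguments is a hypothetical helper and HEADLESS is an environment variable name I made up for this example.

```python
import os


def chrome_arguments(headless):
    """Return the Chrome command-line switches to pass to add_argument."""
    arguments = ["--start-maximized"]
    if headless:
        # Only hide the browser when explicitly asked to.
        arguments.append("--headless")
    return arguments


# HEADLESS is a hypothetical environment variable; headless stays off unless it is "1".
headless = os.environ.get("HEADLESS") == "1"
for argument in chrome_arguments(headless):
    # In the real script this line would be: chromeOptions.add_argument(argument)
    print(argument)
```

With this in place, running the script normally shows the browser, and setting HEADLESS=1 before running it switches to the background mode.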