Automation


Plot Route on Google Maps with Python

Introduction

You probably use Google Maps a lot in your daily life, such as locating a popular restaurant, checking the distance from one place to another, or finding the nearest driving route and shortest travelling time. If you have been handed a big set of location data that you need to check and validate on a map, and then find the optimal routes between the points, you have probably wondered how this can be done programmatically in Python. In this article, we will explore how all of this can be achieved with the Python Google Maps APIs.

Prerequisites

To be able to use the Google Maps APIs, you will need to create a Google Cloud account and set up a billing method. If you are a newly signed-up user, you get $300 of credit to try out the various Google APIs. You can follow the official docs to set up a new project, from which you will get an API key for using its services.

You will also need to install the GoogleMaps package in your working environment, below is the pip command to install the package:

pip install -U googlemaps

Now let’s import the package at the beginning of our code, and initialize the service with your API Key:

import googlemaps

gmaps = googlemaps.Client(key='YOUR API KEY')

If you did not get any errors up to this step, you are all set. Let’s start exploring the various function calls you can make with this package.

Get Geolocation of the Addresses

Often you need to look up a location by address, postal code or even the name of a store/company. This can be done via the geocode function, which is similar to searching for a place in Google Maps from your browser. The function returns a list of possible locations with detailed address info such as the formatted address, country, region, street, lat/lng etc.

Below are a few examples of the address info you can pass to this API call:

#short form of address, such as country + postal code
geocode_result = gmaps.geocode('singapore 018956')

#full address
geocode_result = gmaps.geocode("10 Bayfront Ave, Singapore 018956")

#a place name
geocode_result = gmaps.geocode("zhongshan park")

#Chinese characters
geocode_result = gmaps.geocode('滨海湾花园')

#place name/restaurant name
geocode_result = gmaps.geocode('jumbo seafood east coast')

print(geocode_result[0]["formatted_address"]) 
print(geocode_result[0]["geometry"]["location"]["lat"]) 
print(geocode_result[0]["geometry"]["location"]["lng"])

Depending on how complete the information you supplied is, you may get sample output as per below:

1206 ECP, #01-07/08 East Coast Seafood Centre, Singapore 449883
1.3051669,
103.930673

Reverse Geocoding

You can also use the latitude and longitude to look up the address information; this is called reverse geocoding. For instance, the below will give you the human-readable address for the given lat/lng:

reverse_geocode_result = gmaps.reverse_geocode((1.3550021,103.7084641))

print(reverse_geocode_result[0]["formatted_address"])
#'87 Farrer Dr, Singapore 259287'

You may get multiple addresses for the same lat/lng, as Google returns the list of addresses closest to the given coordinates.
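For example, a quick way to inspect all the candidate addresses returned for this coordinate is to loop through the result list:

#print every candidate address Google returned for this lat/lng
for candidate in reverse_geocode_result:
    print(candidate["formatted_address"])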

Check the Distance Between Locations

To check the distance between multiple locations, you can use the Google Distance Matrix API, where you supply multiple origins and destinations, and Google returns the travelling distance as well as the travelling time between each pair. E.g.:

from datetime import datetime, timedelta

gmaps.distance_matrix(origins=geocode_result[0]['formatted_address'], 
                      destinations=reverse_geocode_result[0]["formatted_address"], 
                      departure_time=datetime.now() + timedelta(minutes=10))

In the above example, we set the departure time to 10 minutes from the current time for Google to calculate the travelling time. The departure time cannot be in the past.

The return JSON object would include the travelling distance and time based on the current traffic condition (default transport mode is driving):

{'destination_addresses': ['87 Farrer Dr, Singapore 259287'],
 'origin_addresses': ['1206 ECP, #01-07/08 East Coast Seafood Centre, Singapore 449883'],
 'rows': [{'elements': [{'distance': {'text': '22.2 km', 'value': 22219},
     'duration': {'text': '24 mins', 'value': 1442},
     'duration_in_traffic': {'text': '22 mins', 'value': 1328},
     'status': 'OK'}]}],
 'status': 'OK'}
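To consume this response in code, a minimal sketch (assuming you store the return value in a variable, say matrix) could look like the below:

matrix = gmaps.distance_matrix(origins=geocode_result[0]['formatted_address'],
                               destinations=reverse_geocode_result[0]["formatted_address"],
                               departure_time=datetime.now() + timedelta(minutes=10))

#each origin/destination pair appears as one element under rows
element = matrix["rows"][0]["elements"][0]
if element["status"] == "OK":
    print(element["distance"]["text"])
    print(element["duration_in_traffic"]["text"])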

Get Directions Between Locations

One of the most frequent uses of Google Maps is to check the directions from one place to another. You can use the Directions API to get the route info as per below:

directions_result = gmaps.directions(geocode_result[0]['formatted_address'],
                                     reverse_geocode_result[0]["formatted_address"],
                                     mode="transit",
                                     arrival_time=datetime.now() + timedelta(minutes=0.5))

The Directions API returns detailed routing information based on the travelling mode you have chosen.

In the above example, we have specified the mode as “transit”, so Google Maps will suggest a route that uses public transport wherever possible to reach your final destination. If the requested arrival time is simply infeasible, the suggested route may not meet it.

Below is the sample return with detailed routing directions:

[{'bounds': {'northeast': {'lat': 1.3229677, 'lng': 103.9314612},
   'southwest': {'lat': 1.2925606, 'lng': 103.8056495}},
  'copyrights': 'Map data ©2021 Google',
  'legs': [{'arrival_time': {'text': '9:16pm',
     'time_zone': 'Asia/Singapore',
     'value': 1611321373},
    'departure_time': {'text': '7:59pm',
     'time_zone': 'Asia/Singapore',
     'value': 1611316750},
    'distance': {'text': '18.0 km', 'value': 17992},
    'duration': {'text': '1 hour 17 mins', 'value': 4623},
    'end_address': '87 Farrer Dr, Singapore 259287',
    'end_location': {'lat': 1.3132547, 'lng': 103.8070619},
    'start_address': '1206 ECP, #01-07/08 East Coast Seafood Centre, Singapore 449883',

...

{'distance': {'text': '0.1 km', 'value': 106},
        'duration': {'text': '1 min', 'value': 76},
        'end_location': {'lat': 1.305934, 'lng': 103.9306822},
        'html_instructions': 'Turn <b>left</b>',
        'maneuver': 'turn-left',
        'polyline': {'points': 'q{}Fk_jyRM?KAIAE@a@DQBGDEBGDIF_@LQD'},
        'start_location': {'lat': 1.3050507, 'lng': 103.9309369},
        'travel_mode': 'WALKING'},
...
}]
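As a small illustration of consuming this structure (assuming the standard routes > legs > steps nesting shown partially above), you could print the plain-text instruction of every step in the first suggested route:

import re

#strip the HTML tags from each step's html_instructions and print the plain text
for step in directions_result[0]["legs"][0]["steps"]:
    print(re.sub(r"<[^>]+>", "", step["html_instructions"]))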

You can also supply the waypoints parameter in order to route through multiple locations between your origin and destination. For instance, if you want to go for a one-day tour in Singapore, you can provide a list of attractions and let Google optimize the route for you with the optimize_waypoints=True parameter. Sample code as per below:

waypoints = ["Chinatown Buddha Tooth Relic Temple", 
"Sentosa Island, Singapore", 
"National Gallery Singapore", 
"Botanic Garden, Singapore",
"Boat Quay @ Bonham Street, Singapore 049782"]

results = gmaps.directions(origin = "Fort Canning Park, Singapore",
                           destination = "Raffles Hotel, Singapore",
                           waypoints = waypoints,
                           optimize_waypoints = True,
                           departure_time = datetime.now() + timedelta(hours=1))

for i, leg in enumerate(results[0]["legs"]):
    print("Stop:" + str(i),
        leg["start_address"], 
        "==> ",
        leg["end_address"], 
        "distance: ",  
        leg["distance"]["value"], 
        "traveling Time: ",
        leg["duration"]["value"]
    )

To get a good result, you will need to make sure all the locations you’ve provided can be geocoded by Google. Below is the output:

[Output: the optimized list of stops with the distance and travelling time for each leg]
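The optimized visiting order is also available in the response itself; the waypoint_order field lists the indexes of your original waypoints in the optimized sequence, so a quick sketch to read it would be:

#indexes of the supplied waypoints rearranged into the optimized visiting order
optimized_order = results[0]["waypoint_order"]
print([waypoints[i] for i in optimized_order])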

Plot Route on Google Maps

To visualize your route or location on a map, you can make use of the Maps Static API. It allows you to plot your locations as markers on the map and draw the path between each location.

To get started, we shall define the markers for our locations. We can specify the color, size and label attributes for each location in a “|” separated string. As the label only allows a single character from {A-Z, 0-9}, we will just use A-Z to indicate the sequence of each location. E.g.:

locations = ["Fort Canning Park, Singapore",
          "Chinatown Buddha Tooth Relic Temple", 
          "Sentosa Island, Singapore", 
          "National Gallery Singapore", 
          "Boat Quay @ Bonham Street, Singapore 049782",
          "Botanic Garden, Singapore",
          "Raffles Hotel, Singapore"]

markers = ["color:blue|size:mid|label:" + chr(65+i) + "|" 
                   + r for i, r in enumerate(locations)]

In the static_map function, you can specify the scale, size and zoom parameters to define the pixel dimensions of the output and whether the map should show details down to city level or to individual buildings.

The format and maptype parameters are used to specify the output image format and what type of maps you want to use.

Lastly, we can specify the path parameter to connect the different locations together. Similar to how we define the markers, we can specify the attributes in a “|” separated string. Below is the sample code:

result_map = gmaps.static_map(
                 center=locations[0],
                 scale=2, 
                 zoom=12,
                 size=[640, 640], 
                 format="jpg", 
                 maptype="roadmap",
                 markers=markers,
                 path="color:0x0000ff|weight:2|" + "|".join(locations))

You can then save the returned result into a .jpg file as per below:

with open("driving_route_map.jpg", "wb") as img:
    for chunk in result_map:
        img.write(chunk)

You shall see something similar to the below:

[Image: static map with the seven locations plotted and connected in the supplied order]

The routing sequence follows the order of the locations you supplied, so you can use the directions function to return an optimal route and then plot it with static_map.

A few things to note: the Static Maps API only supports small image sizes (up to 1280×1280), and you will have to plot more points if you want to draw a nicer driving route. Our code above only uses 7 location points, so the straight lines between them make no sense for driving. If we use the location points suggested by Google from the directions function, the result looks much better. E.g.:

marker_points = []
waypoints = []

#extract the location points from the previous directions function

for leg in results[0]["legs"]:
    leg_start_loc = leg["start_location"]
    marker_points.append(f'{leg_start_loc["lat"]},{leg_start_loc["lng"]}')
    for step in leg["steps"]:
        end_loc = step["end_location"]
        waypoints.append(f'{end_loc["lat"]},{end_loc["lng"]}')
last_stop = results[0]["legs"][-1]["end_location"]
marker_points.append(f'{last_stop["lat"]},{last_stop["lng"]}')
        
markers = [ "color:blue|size:mid|label:" + chr(65+i) + "|" 
           + r for i, r in enumerate(marker_points)]
result_map = gmaps.static_map(
                 center = waypoints[0],
                 scale=2, 
                 zoom=13,
                 size=[640, 640], 
                 format="jpg", 
                 maptype="roadmap",
                 markers=markers,
                 path="color:0x0000ff|weight:2|" + "|".join(waypoints))

We extract all the location points from the directions result and pass them to the path parameter to draw the connecting lines. The output is similar to the below, which makes much more sense for driving:

[Image: static map with the driving route drawn along the roads]

Conclusion

In this article, we have reviewed a few Google Maps APIs which will hopefully help you with a feasibility study or even a real project, for example cleansing dirty addresses in a large data set, calculating the distance/travelling time, or getting the optimal route between multiple locations. If your objective is to generate a dynamic map, you will probably have to go for the Maps JavaScript API, which lets you display an interactive map on a web page; even so, you may find these Python APIs more efficient for processing your raw data than doing it in JavaScript code.

(last updated on 8-May-2021)


Create Animated Charts In Python

Introduction

If you have been working as a data analyst or data scientist for some time, you probably already know how to use matplotlib to visualize and present data in various charts. The matplotlib library provides an animation module to generate dynamic charts that make your data more engaging; however, it still takes a few steps to format your data and to initialize and update it in the charts. In this article, I will demonstrate another Python library – pandas-alive – which allows you to generate animated charts directly from pandas data without any format conversion.

Prerequisites

You can install this library via pip command as per below if you do not have it in your working environment yet:

pip install pandas-alive

It will also install its dependencies such as pandas, pillow and numpy etc.

For demonstration in our later examples, let’s grab some sample covid-19 data from the internet; you can download it from here.

Before we start, we shall import all the necessary modules and do a preview of our sample data:

import pandas as pd
import pandas_alive

df_covid = pd.read_excel("covid-19 sample data.xlsx")

The data we will be working on would be something similar to the below:

[Image: preview of the sample covid-19 data]

Now with all above ready, let’s dive into the code examples.

Generate animated bar chart race

A bar chart is the most straightforward way to present data, and it can be drawn horizontally or vertically. Let’s do some minor formatting on our data, pivoting it so that the date becomes the index and each location becomes a column:

df_covid = df_covid.pivot(index="date", columns="location", values="total_cases").fillna(0)

To create an animated bar chart horizontally, you can simply call plot_animated as per below:

df_covid.plot_animated("covid-19-h-bar.gif", period_fmt="%Y-%m", title="Covid-19 Cases")

The plot_animated function has default parameters kind=”race” and orientation = “h”, hence the output gif would be generated as per below:

[Animated chart: horizontal bar chart race of covid-19 cases]

You can change the default values of these two parameters to generate a vertical bar chart race:

df_covid.plot_animated("covid-19-v-bar.gif", 
                     period_fmt="%Y-%m", 
                     title="Covid-19 Cases", 
                     orientation='v')

The output chart would be something similar to below:

[Animated chart: vertical bar chart race of covid-19 cases]

Generate animated line chart

To create an animated line chart, you just need to change the parameter kind = “line” as per below:

df_covid.plot_animated("covid-19-line.gif",
                     title="Covid-19 Cases",
                     kind='line',
                     period_fmt="%Y-%m",
                     period_label={
                         'x':0.25,
                         'y':0.9,
                         'family': 'sans-serif',
                         'color':  'darkred'
                    })

There are some other parameters, such as period_label to control the format and position of the period label, or n_visible to constrain how many records are shown on the chart. The output chart would be as per the below:

[Animated chart: line chart of covid-19 cases]

Generate animated pie chart

Similar to other charts, you can create a simple pie chart with below parameters:

df_covid.plot_animated(filename='covid-19-pie-chart.gif',
                     kind="pie",                     
                     rotatelabels=True,
                     tick_label_size=5,
                     dpi=300,
                     period_fmt="%Y-%m",
                     )

You can also pass other Axes.pie parameters to define the pie chart behavior. The output from the above code would be:

[Animated chart: pie chart of covid-19 cases]

Generate scatter chart

Generating a scatter chart or bubble chart is slightly more complicated than the other charts, and for our sample data it does not make much sense to visualize it this way. E.g.:

df_covid.plot_animated(filename='covid-19-scatter-chart.gif',
                     kind="scatter",
                     period_label={'x':0.05,'y':0.9},
                     steps_per_period=5
                    )

You shall see the output is similar to the line chart:

[Animated chart: scatter chart of covid-19 cases]

Conclusion

Pandas-Alive provides very convenient ways to generate all sorts of animated charts from a pandas DataFrame, with underlying support from the matplotlib library. It accepts most of the parameters you already use in matplotlib, so you don’t have to learn a lot of new things before applying it to your charts.

There are many more features beyond the basics I have covered above, such as supplying custom figures, generating GeoSpatial charts or combining multiple animated charts in one view. You can check more examples on its project page.
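As a rough sketch of that last idea, pandas-alive exposes an animate_multiple_plots helper which combines several animations into one file; the exact parameters may vary between versions, so treat this as an illustration rather than a definitive recipe:

import pandas_alive

#build the individual animations without writing them to file yet
animated_bar = df_covid.plot_animated(period_fmt="%Y-%m")
animated_line = df_covid.plot_animated(kind='line', period_fmt="%Y-%m")

#combine both animations into a single output file
pandas_alive.animate_multiple_plots("covid-19-combined.gif", [animated_bar, animated_line])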

 


Python – Visualize Google Trends Data in Word Cloud

Christmas is just around the corner; the snowfall, beautiful festive lights and joyful songs from last year are still floating in your mind. But this year things are unusual due to Covid-19: many celebration events have been cancelled or suspended, and people are advised to avoid gatherings and stay at home as much as possible. Although staying at home has become the new norm, there is still a way to find out what people are thinking about during this festive season, since most of us search Google many times a day. With a few lines of Python code, we will be able to extract and visualize the data from Google Trends.

Let’s dive into the code examples.

Python to get Google trends data

To get the search trends from Google, we will need to use a Python package – pytrends. It is not an official API for Google Trends, but it provides a convenient way to automatically download Google Trends data, the same as what we can do manually from the Google Trends website.

You can use the pip command to install the package:

pip install --upgrade pytrends

And import the necessary modules at the beginning of our code:

from pytrends.request import TrendReq

To use it, we initiate the request object by providing the language for searching as well as the time zone information. For instance, below I am specifying English as the language and a time zone offset of -480, which is UTC+8. The default value for this offset is 360 (CST), so you can roughly see how the offset is calculated relative to the UTC time zone.

pytrend = TrendReq(hl='en-US', tz=-480)

To get the search trends for a particular keyword, we shall specify it in a keyword list. For example, we use “christmas” to see what people have searched for on Google related to this keyword. There are a few more parameters you need to specify in the build_payload function in order to narrow down the results:

cat – The category you are interested in; you can see the full list here.

timeframe – The date range when the searches happened. You can specify the range as the past X hours/days/months/years (the list of available options can be seen on the Google Trends web page) or even a specific start date and end date. For our case, we use “now 7-d” for the past 7 days.

geo – The geolocation, which can be a two-character country code, or left empty to see global results.

gprop – The source, which you can leave empty for web search; other options are images, news, youtube, or froogle.

Let’s build up our query as per below:

kw_list = ["christmas"]
pytrend.build_payload(kw_list, cat=0, timeframe='now 7-d', geo='SG', gprop='')

With all these criteria, we can check what related topics people in Singapore searched for on Google. The related_queries function gives you a dictionary of both top and rising queries related to the keyword:

trends = pytrend.related_queries()

If you examine the trends variable, you shall see something similar to below:

[Image: the dictionary returned by related_queries]

The dictionary contains both the “top” and “rising” results as pandas DataFrame objects, and you can access the top queries as per below:

df_sg = trends["christmas"]["top"]

Examining the first few records in df_sg, you can see that people in Singapore are still in a celebratory mood, as most of the records are related to greetings, light shows or gifts etc.

[Image: top related queries for Singapore]

On the other hand, let’s also take a look at the search trends for the UK, since it has just announced some new travel restrictions.

pytrend.build_payload(kw_list, cat=0, timeframe='now 7-d', geo='GB', gprop='')
trends = pytrend.related_queries()
df_gb = trends["christmas"]["top"]

Examining the df_gb variable, you can see that some people have started worrying about the new rules and restrictions for this Christmas, although the majority of the search results are still around the festive celebration.

[Image: top related queries for the UK]

Visualize the results in word cloud

Since we have all the keywords and popularity values that people used for searching, the most straightforward way to visualize them is a word cloud. To do so, we will need another Python package – wordcloud – which is a pure Python library for generating word cloud images. You will also need some supporting packages such as PIL and numpy for manipulating the images.

You can use pip command to install these packages if you do not have them yet:

pip install --upgrade wordcloud
pip install --upgrade Pillow
pip install --upgrade numpy

Let’s import all the necessary modules into our code:

from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from PIL import Image
import os
import numpy as np

From the previous section, we already have the search keywords in a DataFrame. wordcloud supports both a text string and word frequencies; for simplicity, let’s convert only the keywords into a space-separated string and ignore the value (popularity) column.

text = ' '.join(df_sg["query"].to_list())

As all the keywords contain “christmas”, we shall filter this word out before generating the word cloud. The wordcloud package has a set of predefined stop words to be excluded, and you can append more words to exclude as per the below:

stopwords = set(STOPWORDS) 
stopwords.add("christmas")

Now let’s use the featured image of this post as the background for generating the word cloud. We shall load it as a 3-dimensional array to be used as the background mask later:

bg_mask = np.array(Image.open(os.path.join(os.getcwd(), "christmas tree.jpg")))

With all this ready, we can initiate a word cloud object with the below parameters. The names of the parameters are quite self-explanatory, so I will not go through them one by one. You can check the official documentation here.

wc = WordCloud(
    width = 600, 
    height = 1000,    
    background_color = 'white',
    colormap = 'rainbow',
    mask = bg_mask,
    stopwords = stopwords,
    max_words = 1000,
    max_font_size = 150,
    min_font_size = 15,
    contour_width = 2, 
    contour_color = 'dodgerblue'
)

Then we can supply our words to the generate_from_text function which will process the text and generate the image. Next we can save the output into an image file as per below code:

wc.generate_from_text(text)
wc.to_file("SG_christmas_cloud.jpg")

When opening the output image file, you shall see something like the below. Isn’t that cool?

[Image: word cloud generated from the Singapore search results]

Similarly, when you pass in the UK search results and generate the word cloud, you will see that “covid” and “rules” are among the top concerns of people in the UK.

[Image: word cloud generated from the UK search results]

Note: since we are passing in a text string, the word frequency is based on how many times each word is repeated rather than its popularity value from Google.
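If you would rather size the words by the Google popularity values instead of by repetition, a minimal sketch using the same df_sg and wc objects from above could use generate_from_frequencies (note that this function does not apply the stopwords list, so we strip “christmas” ourselves):

#build a {word: weight} mapping from the query text and its popularity value
freqs = {}
for _, row in df_sg.iterrows():
    word = row["query"].replace("christmas", "").strip()
    if word:
        freqs[word] = row["value"]

wc.generate_from_frequencies(freqs)
wc.to_file("SG_christmas_cloud_by_popularity.jpg")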

Conclusion

In this article, we have discussed how to use pytrends to automatically get the Google search data for a particular keyword and then use wordcloud to visualize the information. It only covers the basic usage of these two packages; you can check their documentation to see what else they provide. One thing to note is that pytrends uses scraping techniques to get the data from Google Trends, so it may break whenever there is a structural change in the way Google makes the requests or sends the responses, which means frequent code upgrades are required from the project team. By the way, they are looking for maintainers, just in case you are interested.

 


Automate Your Tweets with Selenium

Introduction

In the previous post, we discussed how to start web scraping with the requests and lxml libraries, and we also summarized two limitations of this approach:

  • Time & effort required to chain all the requests for some complicated operations such as user authentication
  • Triggering a button click or calling JavaScript code is not possible from the HTML response

To solve these two issues, I recommended the selenium package. If you have read that post, you may remember that we can use selenium to simulate human actions such as opening a URL in a browser or triggering a button click on a web page.

In this post, I will demonstrate how to use selenium to automatically log in to a Twitter account, and view and post tweets; the same approach can be used for your web scraping projects.

Prerequisites

In order to use selenium to launch a browser, you will need to download a web driver for the browser you are using. You can check all the supported browsers as well as the download links here.

For the code examples below, I will use Chrome version 86 and download the driver that supports this version. For simplicity, I will save chromedriver.exe into my current code directory.

Besides the driver file, you will also need to install selenium in your working environment. Below is the pip command for installation of the latest version:

pip install --upgrade selenium

Let’s also import all the modules at the beginning of our code. Explanation will be given later where these modules are used:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as ec

With the above ready, let’s dive into our code example.

Log in to a Twitter account with Selenium

Similar to human behavior in a browser, Selenium does not allow you to interact with invisible elements; you will encounter an ElementNotVisibleException when trying to access an element that is not fully loaded or not in view. So the best practice is to always maximize your browser window, so that most of the information you need is visible and interactable.

To maximize the browser upon launching, you can set --start-maximized in the Chrome options as per below:

chromeOptions = Options()
chromeOptions.add_argument("--start-maximized")

(You can also launch the browser first and later call the maximize_window function to maximize it)

These Chrome options shall be passed into the web driver constructor when it is initialized. We also need to specify the path of the driver exe file; in our case, it is in the current directory.

driver = webdriver.Chrome(executable_path="chromedriver.exe", options=chromeOptions)

With the above code, a new Chrome browser will be launched. The web driver object has a get method which accepts a URL parameter and opens it in the browser. The below will open the Twitter login page in your browser:

tweeter_url = "https://twitter.com/login"
driver.get(tweeter_url)

As many factors affect how fast a web page fully loads, you may need to add some delays at certain steps to make sure the current action has completed successfully before moving to the next step.

In Selenium, there are two types of waits: implicit and explicit. An implicit wait instructs the web driver to wait up to a certain amount of time when polling the DOM, while an explicit wait checks the presence/visibility of an element periodically until the condition is met or the maximum waiting time is reached. As an implicit wait applies to the entire lifecycle of the web driver, the explicit wait is relatively more flexible. Let’s define an explicit wait with a maximum of 10 seconds for our example:

wait = WebDriverWait(driver, 10)
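For comparison, an implicit wait is a single global setting on the driver; a minimal sketch would be:

#implicit wait: the driver polls the DOM for up to 10 seconds whenever it locates an element
driver.implicitly_wait(10)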

Now, we shall follow what we discussed in the previous post to find unique identifiers for the login username and password fields. By inspecting the web page HTML, you can easily find the name attribute of the username and password fields. Below is a screenshot of the HTML structure of the username field:

[Image: HTML structure of the username input field]

To locate the username element, we can use its name attribute. Let’s also use the explicit wait so that we only proceed once the element is fully loaded and visible on the page:

username_input = wait.until(ec.visibility_of_element_located((By.NAME, "session[username_or_email]")))

Once we located the username input field, we can send our login ID to this field with send_keys function as per below:

username_input.send_keys(username)

Note: you will need to replace the username/password variables with your own Twitter login credentials

Similarly, we can locate our password field by its name and send in our password:

password_input = wait.until(ec.visibility_of_element_located((By.NAME, "session[password]")))
password_input.send_keys(password)

Once we have successfully set the values of these two fields, we can simulate a click on the login button:

  • First, locate the login button by its attribute data-testid='LoginForm_Login_Button'
  • Then call the WebElement click function to simulate how a user clicks the button

With the below code, you shall be able to log in to your Twitter account and view the tweets on your home screen:

login_button = wait.until(ec.visibility_of_element_located((By.XPATH, "//div[@data-testid='LoginForm_Login_Button']")))
login_button.click()

To showcase how to interact with a web page like a normal user, let’s move to the next example and search Twitter posts by keyword.

Search Twitter posts by keywords

Same as previously, we shall first locate our search input box by its data-testid attribute as per below:

search_input = wait.until(ec.visibility_of_element_located((By.XPATH, 
"//div/input[@data-testid='SearchBox_Search_Input']")))

As a normal user, I would key some keywords into the search box and hit ENTER to search. We can do the same from Selenium via the send_keys function. Let’s first clear the input box and then send the keyword “ethereum” together with an ENTER key:

search_input.clear()
search_input.send_keys("ethereum" + Keys.ENTER)

Upon receiving the ENTER key event, you shall see the search results loading on the page. The next step is to extract the tweets from the search results.

Below is sample code that extracts all the text from the tweets and prints it as output:

tweet_divs = driver.find_elements_by_xpath("//div[@data-testid='tweet']")
for div in tweet_divs:
    spans = div.find_elements_by_xpath(".//div/span")
    tweets = ''.join([span.text for span in spans])
    print(tweets)

You shall see the output similar to below:

[Image: sample output of the extracted tweet text]

With these plain text results, you can use some text processing tools to further analyze what people are discussing around this keyword.

Automatically post new tweets

Since we are able to search within Twitter, we shall also be able to post a new tweet with Selenium.

Let’s first locate the below text area by the data-testid attribute:

[Image: the tweet text area on the home page]

Below is the code to locate the span of the text area via its ancestor div:

tweet_text_span = driver.find_element_by_xpath("//div[@data-testid='tweetTextarea_0']/div/div/div/span")

Then we can send whatever text we want to tweet:

tweet_text_span.send_keys("Do you know we can tweet with selenium?")

Once the text is written into the span, the tweet button will be enabled. You can locate the button and click to submit the post:

tweet_button = wait.until(ec.visibility_of_element_located((By.XPATH, 
                                                           "//div[@data-testid='tweetButtonInline']")))
tweet_button.click()

Upon submission, you shall see a new post added to your timeline as per below:

[Image: the new tweet posted on the timeline]

Move invisible element into visible view

There are always cases where you need to scroll up/down or left/right to view some information on a web page. You also need to make sure your elements are in view before you can perform any operation on them, such as getting their attributes or clicking them.

To move the elements into the view, you can execute some JavaScript code to scroll to the element as per below:

who_to_follow = driver.find_element_by_xpath("//div/span[text() = 'Who to follow']")
driver.execute_script("arguments[0].scrollIntoView(true);", who_to_follow)

Hide your browser with headless mode

When you use Selenium for an automation or scraping job, you may not wish to see web pages jumping around in front of you. To make it run quietly in the background, you can set the headless argument in the Chrome options before initializing the web driver:

chromeOptions.add_argument('--headless')

With this argument, the browser will not be shown and everything runs quietly in the background. It is a good idea to test your code properly before enabling headless mode.

Conclusion

In this article, we have demonstrated how to use Selenium to automatically log in to a Twitter account, and read or post tweets. We have also reviewed how to trigger JavaScript code with the Selenium web driver and run everything completely in the background. In a real project, you may not want to use this approach to scrape a website like Twitter, since it already provides developer accounts with full API access; this article is more to showcase the capabilities of the Selenium package.

With Selenium, dealing with complicated operations such as user authentication becomes much simpler, as everything is performed like a normal browser user, and it also provides action chains to support all sorts of mouse movements such as hover-over or drag-and-drop. You should consider using it in your automation or web scraping project if your target website relies heavily on front-end JavaScript code.
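As a small illustration of those action chains, a hover-over could look roughly like the below; the element locator here is just a hypothetical example:

from selenium.webdriver import ActionChains

#hover the mouse over an element to reveal content that only appears on mouse-over
element = driver.find_element_by_xpath("//div[@data-testid='SideNav_NewTweet_Button']")
ActionChains(driver).move_to_element(element).perform()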


Web Scraping From Scratch With 3 Simple Steps

Introduction

Web scraping or crawling refers to the technique of extracting information from a website and transforming it into structured data for later analysis. There are generally a few reasons you may need to implement a web scraping script to automate the data collection process:

  • There isn’t any public API available for you to get data from the source sites
  • The information is updated from time to time (such as an exchange rate), so you cannot manage it manually
  • The final data you need is piecemeal from multiple sites; and so on

Before you decide to implement a scraping script, you should also check the terms of use of the data you are going to scrape to be sure you are not violating them. Some sites do not allow scraping robots. This article is intended for educational purposes, to help you understand the overall process of web scraping, so we will assume you already know the implications of web scraping and the possible legal issues around how the data is used.

Scraping a website can sometimes be difficult, depending on how the target website is designed and where the data resides, but generally you can split the process into 3 steps. Let’s walk through them one by one.

Understand the structure of your target website

As the first step, you shall take a quick look at your target website to see how the front end interacts with the backend, and how the data is populated to the web page. To keep our example simple, let’s assume user authentication is not required and our target is to extract the price change for the top 20 cryptocurrencies from coindesk for further analysis.

The first thing we shall do is to understand how this information is organized on the website. Below is the screenshot of the data presented on the web page:

[Image: the top 20 cryptocurrencies table on coindesk]

In the Chrome browser, if you right-click on the web page to inspect the HTML elements, you shall see that the entire data table is under <section class="cex-table">…</section>. You can verify this by hovering your mouse over this element; you will see a light blue overlay on the data table as per below:

[Image: the data table highlighted when inspecting the element in Chrome]

Next, you may want to inspect each text field on the page to further understand how the table header and records are arranged. For instance, when you check the “Asset” text field, you would see the below HTML structure:

<section class="cex-table">
	<section class="thead">
		<div>...</div>
		<div class="tr-wrapper">
			<div class="tr-left">
				<div class="tr">
					<div>...</div>
					<div style="flex:7" class="th">
						<span class="cell">
						<i class="sorting-icon">
						</i>
						<span class="cell-text">Asset</span>
						</span>
					</div>
				</div>
			</div>
		</div>
		...
	</section>
</section>

And similarly you can find the structure of the first row in the table body as per below:

<section class="tbody">
	<section class="tr-section">
		<a href="/price/bitcoin">
			<div class="tr-wrapper">
				<div class="tr-left">
					<div class="tr">
						<div style="flex:2" class="td">
							<span class="cell cell-rank">
							<strong>01</strong>
							</span>
						</div>
						<div style="flex:7" class="td">
							<span class="cell cell-asset">
							<img>...</img>
							<strong class="cell-asset-title">Bitcoin</strong>
							<span class="cell-asset-iso">BTC</span>
							</span>
						</div>
					</div>
				</div>
			</div>
		</a>
	</section>
</section>

You may notice that the majority of these HTML elements do not have an id or name attribute as a unique identifier, but the style sheet (“class” attribute) is quite consistent for the same row of data. So in this case, we shall use the style sheet class as a reference to find our data elements.

Locate and parse the target data element with XPath

With the initial understanding on HTML structure of our target website, we shall start to find a way to locate the data elements programmatically.

For this demonstration, we will use the requests and lxml libraries to send the HTTP requests and parse the results. There are other packages for parsing the DOM such as beautifulsoup, but personally I find XPath expressions more straightforward for locating an element, although the syntax may not be as intuitive as beautifulsoup’s.

Below is the pip command if you do not have these two packages installed:

pip install requests
pip install lxml

Let’s import the packages and send a GET request to our target URL:

import requests
from lxml import html

target_url = "https://www.coindesk.com/coindesk20"
result = requests.get(target_url)

Our target URL does not require any parameters, in case you need to pass in parameters, you can pass via the params argument as per below:

payload = {"q" : "bitcoin", "s" : "relevant"}
result = requests.get("https://www.coindesk.com/search", params=payload)

The result is a response object which has a status_code attribute to indicate whether a correct response has been returned from the target website. To simplify the code, let’s assume we can always get a correct response, with the returned HTML available as a string from the text attribute.
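In a real script you would want to guard against a failed request before parsing; a minimal check could be:

#stop early if the server did not return HTTP 200
if result.status_code != 200:
    raise RuntimeError("request failed with status %s" % result.status_code)

#or simply let requests raise an HTTPError for any 4xx/5xx response
result.raise_for_status()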

We then pass our HTML string to lxml and use it to parse the DOM tree as per below:

tree = html.fromstring(result.text)

Now we come to the most important step, we will need to use XPath syntax to locate the data elements we want and extract the data out.

Since the id or name attributes are not available for these elements, we will need to use the style sheet to locate our data elements. To locate the table header, we need to perform the below:

  • Find the section tag with style sheet class as “cex-table” from the entire DOM
  • Find its child section node with style sheet class “thead”
  • Further find its child div node with style sheet class “tr-wrapper”

Below is how the syntax looks like in XPath:

table_header = tree.xpath("//section[@class='cex-table']/section[@class='thead']/div[@class='tr-wrapper']")

It will scan through the entire DOM tree to find any element matching this structure and return a list of matched nodes.

If everything goes well, the table_header list should contain only 1 element, which is the div with the “tr-wrapper” style sheet class. If it returns multiple nodes, you may need to recheck your path expression to see how to fine-tune it so that you get only the unique node you need.

From the wrapper div, there are still a few levels before we reach the nodes with the text. But you may notice that all the header fields we need are under span tags with the style name “cell-text”. So we can locate all these span tags by CSS class and extract their text with the text() function. Below is how it looks in an XPath expression:

headers = table_header[0].xpath(".//span[@class='cell']/span[@class='cell-text']/text()")

Note that “.” means to start from the current node, and “//” indicates that the following path expression is a relative path.

If you examine the headers now, you can see all the column headers are extracted into a list as per below:

['Asset',
 'Price',
 'Market Cap',
 'Total Exchange Volume',
 'Returns (24h)',
 'Total Supply',
 'Category',
 'Value Proposition',
 'Consensus Mechanism']

Let’s continue to the table body. Following the same logic, we shall be able to locate the sections with “tr-section” using the below syntax:

table_body = tree.xpath("//section[@class='cex-table']/section[@class='tbody']/section[@class='tr-section']")

This means we have collected all the row nodes in the table body. We can now loop through the rows to get the elements. We will use the style sheet to locate our elements, but the “Asset” column actually contains a few child nodes with different style sheet classes, so we need to handle it separately from the rest of the columns. Below is the code to extract the data row by row and add it into a records list:

records = []
for row in table_body:    
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    spans = row.xpath(".//div[contains(@class,'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    row_data = ranks + tokens + assets + rest_cols
    records.append(row_data)

Note that we are using “contains” in order to match nodes with classes like “cell cell-rank”, and text_content() to extract all the text from the current node and its child nodes.

Occasionally you may find that the number of columns extracted does not tally with the original column headers because some header columns are merged or hidden, such as the ranking and token ticker columns above. So let’s also give them the column names “Rank” and “Token”:

column_header = ["Rank", "Token"] + headers

Save the scraping result

With both the header and data ready, we can easily load the data into pandas as per below:

import pandas as pd
df = pd.DataFrame(records, columns=column_header)

You can see the result below in a pandas DataFrame, which looks pretty good except that some formatting is needed to convert the amounts into a proper number format.

[Image: the scraped data loaded into a pandas DataFrame]
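As a minimal sketch of that clean-up (the exact symbols depend on what was scraped; here we assume currency-style strings such as “$19,423.50” and percentages in the columns below), you could strip the symbols before converting the values to numbers:

#strip "$", "%" and thousands separators, then convert the columns to numeric values
for col in ["Price", "Market Cap", "Total Exchange Volume", "Returns (24h)"]:
    df[col] = pd.to_numeric(
        df[col].str.replace(r"[$,%]", "", regex=True).str.strip(),
        errors="coerce")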

Or you can also write the scraped data into a csv file with the csv module:

import csv
with open("token_price.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(column_header)
    for row in records:
        writer.writerow(row)

Limitations & Constraints

In a real scraping project, you may encounter scenarios more complicated than directly getting the data from a single GET request, so it is better to understand the constraints/limitations of the approach described above.

  • Going through the authentication process can be time-consuming with requests

If your target website requires authentication before you can retrieve the data, you may need to create a session and send multiple POST/GET requests to the server in order to get authorized. Depending on how complicated the authentication process is, you will need to understand which parameters to supply and how the requests are chained together. This can take some time and effort.

  • You cannot trigger JavaScript code to get your data

If the response from your target website returns JavaScript code to populate the data, or you need to trigger some JavaScript function in order to have the data populated on the web page, you may find that the requests package simply will not work.

For both scenarios, you may consider using selenium, which I have mentioned in one of my past posts. It has a headless mode where you can simulate user actions such as keying in credentials or clicking buttons without actually showing the browser, and you can also execute JavaScript code to interact with the web page. The downside is that you will have to periodically upgrade the driver file to match your browser’s version.

Conclusion

In this article, we have walked through a very basic example of scraping data with the requests and lxml packages, and we have also discussed a few limitations where you may start looking at alternatives such as selenium or even the scrapy framework for more complicated scenarios. No matter which libraries you choose, the fundamentals remain the same. Hope this article gives you some hints on how to start your web scraping journey.