Python

create animated charts and gif in python with pandas-alive

Create Animated Charts In Python

Introduction

If you are working as a data analyst or data scientist for some time, you may have already known how to use matplotlib to visualize and present data in various charts. The matplotlib library provides an animation module to generate dynamic charts to make your data more engaging, however it still takes you a few steps to format your data, and initialize and update the data into the charts. In this article, I will demonstrate you another Python library – pandas-alive which allows you to generate animated charts directly from pandas data without any format conversion.

Prerequisites

You can install this library via pip command as per below if you do not have it in your working environment yet:

pip install pandas-alive

It will also install its dependencies such as pandas, pillow and numpy etc.

For demonstration of our later examples, let’s grab some sample covid-19 data from internet, you can download it from here.

Before we start, we shall import all the necessary modules and do a preview of our sample data:

import pandas as pd
import pandas-alive

df_covid = pd.read_excel("covid-19 sample data.xlsx")

The data we will be working on would be something similar to the below:

create animated charts and gif in python with pandas-alive

Now with all above ready, let’s dive into the code examples.

Generate animated bar chart race

Bar chart is the most straightforward way to present data, it can be drawn in horizontal or vertical manner. Let’s do a minor formatting on our data so that we can use date as horizontal or vertical axis to present the data.

df_covid = df_covid.pivot(index="date", columns="location", values="total_cases").fillna(0)

To create an animated bar chart horizontally, you can simply call plot_animated as per below:

df_covid.plot_animated("covid-19-h-bar.gif", period_fmt="%Y-%m", title="Covid-19 Cases")

The plot_animated function has default parameters kind=”race” and orientation = “h”, hence the output gif would be generated as per below:

create animated charts and gif in python with pandas-alive

You can change the default values of these two parameters to generate a vertical bar chart race:

df_covid.plot_animated("covid-19-v-bar.gif", 
                     period_fmt="%Y-%m", 
                     title="Covid-19 Cases", 
                     orientation='v')

The output chart would be something similar to below:

create animated charts and gif in python with pandas-alive

 

Generate animated line chart

To create an animated line chart, you just need to change the parameter kind = “line” as per below:

df_covid.plot_animated("covid-19-line.gif",
                     title="Covid-19 Cases",
                     kind='line',
                     period_fmt="%Y-%m",
                     period_label={
                         'x':0.25,
                         'y':0.9,
                         'family': 'sans-serif',
                         'color':  'darkred'
                    })

There are some other parameters such as period_label to control the format of the label, or n_visible to constrain how many records to be shown on the chart. The output chart would be as per the below:

create animated charts and gif in python with pandas-alive

 

Generate animated pie chart

Similar to other charts, you can create a simple pie chart with below parameters:

df_covid.plot_animated(filename='covid-19-pie-chart.gif',
                     kind="pie",                     
                     rotatelabels=True,
                     tick_label_size=5,
                     dpi=300,
                     period_fmt="%Y-%m",
                     )

You can also use other Axes.Pie parameters to define the pie chart behavior. The output from above code would be:

 

 

create animated charts and gif in python with pandas-alive

 

 

Generate scatter chart

Generate scatter chart or bubble chart is slightly complicated than other charts, but for our sample data, it does not make much sense to visualize it in this type of charts. E.g.

df_covid.plot_animated(filename='covid-19-scatter-chart.gif',
                     kind="scatter",
                     period_label={'x':0.05,'y':0.9},
                     steps_per_period=5
                    )

You shall see the output is similar to the line chart:

create animated charts and gif in python with pandas-alive

 

Conclusion

Pandas-Alive provides very convenient ways to generate all sorts of animated charts from pandas data frame with the underlying support from the Matplotlib library. It accepts most of the parameters you used in matplotlib, so you don’t have to learn a lot of new things before applying it for your charts.

There are many more features beyond the basis I have covered in above, such as supplying custom figures, generating GeoSpatial charts or combining multiple animated charts in one view. You can check more examples from its project page.

 

generate password with python random and python secrets

You Might Have Misused Python Random

Introduction

Python random module provides a convenient way for generating pseudo-random numbers in case you need some unpredictable results for your application such as the computer games, a lucky draw system or even the shuffling logic for your music player. Since it provides various functions to generate results in “unpredictable” manner, developers attempted to use this feature to produce random password or authentication token for security purpose without understanding of it’s fundamental implementation. In this article, we will be discussing how the Python random module has been misunderstood and misused for the scenarios which it shall not be used.

Basic usage of Python random module

Let’s take a look at some basic usage of this module. You can use it to generate random integers, float numbers or bytes as per below:

#Generate a random integer between 1 to 10
random.randint(1,10) 
#2

#generate a random floating point number between 0 to 1
random.random()
#0.3103975786510934

#Generate random number between 1 to 2 in uniform distribution
random.uniform(1, 2) 
#1.9530600469459607 

#Generate random number between 1 to 100, with step as 2
random.randrange(1, 100, 2) 
#43 

#Generate random bytes, available in Python version 3.9
random.randbytes(8)
#b'v\xf7\xb2v`\xc8U]'
#Generate a integer within 8 bits
random.getrandbits(8)
#68

And shuffling or sampling the items in a sequence can be achieved easily with below:

slangs = ["Boomer", "Cap", "Basic", "Retweet", "Fit", "Fr", "Clout"]
random.shuffle(slangs)
#['Fit', 'Basic', 'Fr', 'Clout', 'Cap', 'Retweet', 'Boomer']

random.sample(slangs, k=5)
['Fit', 'Fr', 'Clout', 'Retweet', 'Basic']

You can also use the choice function to choose a random option from a given sequence, for instance:

random.choice(["Boomer", "Cap", "Basic", "Retweet", "Fit", "Fr", "Clout"])
#'Retweet'

With this function, It’s easy to implement a lucky draw logic where all the participants’ name are passed in and a winner is selected randomly which seems to be the perfect use case of it. The problem comes when developers try to use this function to generate  password or security related tokens which they believe it’s random enough and resistant to predication. But it is indeed wrong.

Why the random numbers generated are not random enough?

To answer this question, we will need to take a further look at the implementation of this Python random module. This module uses Mersenne Twister algorithm as the core generator which is designed for modelling and simulation purpose rather than security or cryptography. Some study show it’s not difficult to reconstruct the internal state of the MT to predict the outcome and it can be attacked through this MT randomness.

Actually random module does provide a SystemRandom class which uses the sources from operation system to generate random numbers without relying on the software state and the result is not reproducible. For instance, depends on the actual implementation, some OS uses the atmospheric noise or the exact time of the key presses as the source for generating unpredictable result which is more suitable for cryptography.

You can use it in the same way as the random class except the state is not available:

sys_random = random.SystemRandom()

sys_random.random()
#0.04883551877802528

sys_random.randint(1, 100)
#72

sys_random.choice(["Boomer", "Cap", "Basic", "Retweet", "Fit", "Fr", "Clout"])
#'Clout'

Unfortunately, this class has been overlooked for many years and some developers continue using the functions from random module for generating password or security tokens which exposed a lot of security vulnerability. To address these concerns, an enhancement proposal raised to add a new secrets module for some common security-related functions, so that people won’t mix up it with the random module.

Python secrets module

Let’s take a look at the functions in secrets module:

#Generate random integer between 0 to 50
secrets.randbelow(50)
#2

#Generate random integer with 8 bits
secrets.randbits(8)
#125

#Generate random bytes
secrets.token_bytes(20)
#b'\x8a\xb3\xa92)S:\xf2\xac\x90\xaf\xb1\xb3Q\xc5\xfe\x80\xdb\xc2`'

#Generate random string in hexadecimal with default 32 bytes
secrets.token_hex()
#'cf4f964edf810ca7f6ad6b533038fdcf538c73bd59e11d3340a838e7fd88fdf9'

#Generate URL-safe text string
secrets.token_urlsafe(10)
#z9zQQsxcq6SRug

import string
secrets.choice(string.ascii_letters + string.digits + string.punctuation)
#'k'

The functions are pretty similar to what we have seen in random module, just that results are generated from os.urandom function which meant to be unpredictable enough for cryptographic applications.

Implementing a strong password generator with Python secrets

With all above said, let’s implement a strong password generator with the secrets module. Assuming we have below requirements on our password:

  • Length between 8 to 16
  • At least 1 lowercase
  • At least 1 uppercase
  • At least 1 number
  • At least 1 special characters among !#$%&@_~

We can use the choice function to randomly choose 1 character each time for a random iteration between 8 to 16 times, then test if all the requirements are met for the generated password. Below the sample code:

def generate_strong_password():
    special_characters = '!#$%&@_~'
    password_choices = string.ascii_letters + string.digits + special_characters
    while True:
        password = ''.join(secrets.choice(password_choices) for _ in range(random.randint(8, 16)))
        if (any(c.islower() for c in password)
                and any(c.isupper() for c in password)
                and sum(special_characters.find(c) > -1 for c in password) == 1
                and any(c.isdigit() for c in password)):
            break
    return password

To check the randomness, you can try the below:

[generate_strong_password() for _ in range(10)]
# sample output
['C9A&PbswcLMpJ',
 'DHrbuzZU&io7Io',
 'B1zx4YqKUS@01m',
 'Q!F8tsZJsw',
 '9QVF8oASt_jceq2',
 '2~EKErrHAc9jvbyZ',
 '8ceU_XZbYdqv8b',
 'h8oS@tZ6',
 '3w@OX33tw12J6',
 'FX~mkO0aNatqp']

Conclusion

In this article, we have reviewed through the most commonly used functions in Python random module for generating pseudo-random results for modelling and simulation. Due to misuse of random module for security related applications from the earlier Python developers, Python has introduced a new secrets module to focus on password, authentication or security related tokens. We have also demonstrated an example on how to use secrets module to implement a strong password generator. Although you can generate a strong password with the combination of the alphanumeric and special characters, you shall never store your password in plaintext or simple encrypted format, a good practice is to always use libraries like hashlib or bcrypt to salt and hash it before storing.

python visualize google trends data with word cloud

Python – Visualize Google Trends Data in Word Cloud

Christmas is just around the corner, the snowfall, beautiful festive lights and joyful songs from the last year still floating in your mind. But this year, things are getting unusual due to the Covid-19. A lot of celebration events are cancelled or suspended and people are advised to avoid gathering and stay at home as much as possible. Although staying at home became new norm, there is still a way that we can get to know what people are thinking about during this festive season since nowadays most of us search a lot from Google every day. With a few lines of Python code, we will be able to extract and visualize the data from Google Trends.

Let’s dive into the code examples.

Python to get Google trends data

To get the search trends from Google, we will need to use a Python package – pytrends. It’s not an official API for Google trends but It provides a convenience way to automatically download Google trends data same as what we can do manually from Google Trends website.

You can use the pip command to install the package:

pip install --upgrade pytrends

And import the necessary modules at the beginning of our code:

from pytrends.request import TrendReq

To use it, we can initiate the request object by providing the language for searching as well as the time zone information. For instance, I am specifying English as the language and time zone offset as -480 which is UCT + 8 in the below. The default value for this time zone offset is 360 (CST), so you can roughly see how this offset is calculated based on the UCT time zone.

pytrend = TrendReq(hl='en-US', tz=-480)

To get the search trends for a particular keyword, we shall specify it in a keyword list. For example, we use “christmas” to see what people have searched in Google related to this keyword. There are a few more parameters you need to specify in the build_payload function in order to narrow down the results:

cat – The category you are interested in, you can see the full list here.

timeframe – The date range when the search happened. You can specify the range as past X hours/days/months/years (the list of available options you can see from Google Trends web page) or even a specific start date and end date. For our case, we use “now 7-d” for the past 7 days.

geo –  The geolocation which can be two characters country code or leave it empty to see the results from globally

gprop – The source which you can leave it as empty for web search, other options can be images, news, youtube, or froogle

Let’s build up our query as per below:

kw_list = ["christmas"]
pytrend.build_payload(kw_list, cat=0, timeframe='now 7-d', geo='SG', gprop='')

With all these criteria, we can check what are the related topics people searched in Google from Singapore. The related_queries function will give you a dictionary of both top & rising queries related to the keywords:

trends = pytrend.related_queries()

If you examine the trends variable, you shall see something similar to below:

python visualize google trends data with word cloud

 

The dictionary consists of results for both “top” and “rising” results in pandas dataframe objects, and you can access the top queries as per below:

df_sg = trends["christmas"]["top"]

Examine the first a few records in df_sg, you can see that people in Singapore are still in celebration mood as most of records are related to greetings, light shows or gifts etc.

python visualize google trends data with word cloud

On the other hand, let’s also take a look at the search trends for UK since It has just announced some new restrictions on travelling recently.

pytrend.build_payload(kw_list, cat=0, timeframe='now 7-d', geo='GB', gprop='')
trends = pytrend.related_queries()
df_gb = trends["christmas"]["top"]

Examining the df_gb variable, you can see some people started worrying about the new rules and restrictions for this Christmas although majority of the searching results are still around of the festival celebration.

python visualize google trends data with word cloud

 

Visualize the results in word cloud

Since we have all the keywords and popularity that people used for search, the most straightforward to visualize them would be using word cloud to generate a picture. To do so, we will need use another python package – wordcloud which is a pure Python library for generating word cloud image. And you also need to use some supporting packages like PIL and numpy for manipulating the images.

You can use pip command to install these packages if you do not have them yet:

pip install --upgrade wordcloud
pip install Pillow==2.2.2
pip install --upgrade numpy

Let’s import all the necessary modules into our code:

from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from PIL import Image
import os
import numpy as np

From previous section, we have already got the search keywords in dataframe. wordcloud supports both text string and words frequencies, for simplicity, let’s convert only keywords into a space separated string and forget about the value (popularity).

text = ' '.join(df_sg["query"].to_list())

And as all the keywords contain “christmas”, we shall filter out this word before generating the word cloud. In wordcloud package, it has a list of predefined words to be excluded,  and you can append more words to be excluded as per the below:

stopwords = set(STOPWORDS) 
stopwords.add("christmas")

Now let’s use this featured image as our background for generating word cloud. We shall load it as a 3-demensional array as the background mask for later use:

bg_mask = np.array(Image.open(os.path.join(os.getcwd(), "christmas tree.jpg")))

With all these ready, we can initiate a word cloud object with below parameters. The name of the parameters are quite self-explanatory, so I will not go through them one by one. You can check the official document from here.

wc = WordCloud(
    width = 600, 
    height = 1000,    
    background_color = 'white',
    colormap = 'rainbow',
    mask = bg_mask,
    stopwords = stopwords,
    max_words = 1000,
    max_font_size = 150,
    min_font_size = 15,
    contour_width = 2, 
    contour_color = 'dodgerblue'
)

Then we can supply our words to the generate_from_text function which will process the text and generate the image. Next we can save the output into an image file as per below code:

wc.generate_from_text(text)
wc.to_file("SG_christmas_cloud.jpg")

When opening the output image file, you shall see something like the below. Isn’t that cool?

python visualize google trends data with word cloud

Similarly, when you pass the UK searching result and generate the word cloud, you would see “covid” and “rules” are most concerned by UK people.

python visualize google trends data with word cloud

Note: since we are passing through a text string, the frequency is based on how many times the words repeated rather than the popularity from Google.

Conclusion

In this article, we have discussed how to use pytrends to automatically get the Google search data for any particular keyword and then use wordcloud to visualize the information. It only covers some basic usage of these two packages, you may check further on their documents to understand what else are provided in these packages. One thing to take note is that pytrends is using some scrapping techniques to get the data from Google Trends, it may break when there is any structural change in the way that Google makes the requests or sends the response. So frequent code upgrade is required by the project team. By the way, they are looking for maintainers, just in case you are interested.

 

web scraping with python requests and lxml

Web Scraping From Scratch With 3 Simple Steps

Introduction

Web scraping or crawling refers to the technique to extract the information from a website and transform into structured data for later analysis. There are generally a few reasons that you may need to implement a web scraping scripts to automate the data collection process:

  • There isn’t any public API available for you to get data from the source sites
  • The information is updated from time to time, such as the exchange rate, you cannot manage it in a manual way
  • The final data you need is piecemeal from multiple sites; and so on

Before you decide to implement a scraping script, you will also need to check to be sure that you are not violating the term of use for the data you are going to scrape. Some sites are against the scraping robot. This article is intended for education purpose to help you to understand the overall processes of web scraping, so we will assume you already know the implication of the web scraping and possible legal issues on how the data is used.

Scraping a website sometimes can be difficult depends on how the target website is designed and where the data is resided. But generally you can split the process into 3 steps. Let’s walk through them one by one.

Understand the structure of your target website

As the first step, you shall take a quick look at your target website to see how the front end interacts with the backend, and how the data is populated to the web page. To keep our example simple, let’s assume user authentication is not required and our target is to extract the price change for the top 20 cryptocurrencies from coindesk for further analysis.

The first thing we shall do is to understand how this information is organized on the website. Below is the screenshot of the data presented on the web page:

web scraping with python requests and lxml

In Chrome browser, if you right click on the web page to inspect the HTML elements, you shall see that the entire data table is under <section class=”cex-table”>…</section>. You can verify this by hovering your mouse to this element, you would see there is a light blue overlay on the data table as per below:web scraping in python with requests and lxml

Next, you may want to inspect each text field on the page to further understand how the table header and records are arranged. For instance, when you check the “Asset” text field, you would see the below HTML structure:

<section class="cex-table">
	<section class="thead">
		<div>...</div>
		<div class="tr-wrapper">
			<div class="tr-left">
				<div class="tr">
					<div>...</div>
					<div style="flex:7" class="th">
						<span class="cell">
						<i class="sorting-icon">
						</i>
						<span class="cell-text">Asset</span>
						</span>
					</div>
				</div>
			</div>
		</div>
		...
	</section>
</section>

And similarly you can find the structure of the first row in the table body as per below:

<section class="tbody">
	<section class="tr-section">
		<a href="/price/bitcoin">
			<div class="tr-wrapper">
				<div class="tr-left">
					<div class="tr">
						<div style="flex:2" class="td">
							<span class="cell cell-rank">
							<strong>01</strong>
							</span>
						</div>
						<div style="flex:7" class="td">
							<span class="cell cell-asset">
							<img>...</img>
							<strong class="cell-asset-title">Bitcoin</strong>
							<span class="cell-asset-iso">BTC</span>
							</span>
						</div>
					</div>
				</div>
			</div>
		</a>
	</section>
</section>

You may notice that majority of these HTML elements does not have a id or name attribute as the unique identifier, but the style sheet (“class” attribute) is quite consistent for the same row of data. So in this case, we shall consider to use the style sheet as a reference to find our data elements.

Locate and parse the target data element with XPath

With the initial understanding on HTML structure of our target website, we shall start to find a way to locate the data elements programmatically.

For this demonstration, we will use requests and lxml libraries to send the http requests and parse the results. There are other package for parsing DOM such as beautifulsoup, but personally I find using XPath expression is more straightforward when locating an element although the syntax may not as intuitive as the way beautifulsoup does.

Below is the pip command if you do not have these two packages installed:

pip install requests
pip install lxml

Let’s import the packages and send a GET request to our target URL:

import requests
from lxml import html

target_url = "https://www.coindesk.com/coindesk20"
result = requests.get(target_url)

Our target URL does not require any parameters, in case you need to pass in parameters, you can pass via the params argument as per below:

payload = {"q" : "bitcoin", "s" : "relevant"}
result = requests.get("https://www.coindesk.com/search", params=payload)

The result is a response object which has a status_code attribute to indicate if correct response has been returned from the target website. To simplify the code, let’s assume we can always get the correct response with the return HTML in string format from the text attribute.

We then pass our HTML string to lxml and use it to parse the DOM tree as per below:

tree = html.fromstring(result.text)

Now we come to the most important step, we will need to use XPath syntax to locate the data elements we want and extract the data out.

Since the id or name attributes are not available for these elements, we will need to use the style sheet to locate our data elements. To locate the table header, we need to perform the below:

  • Find the section tag with style sheet class as “cex-table” from the entire DOM
  • Find its child section node with style sheet class as “thead
  • Further find its child div node with style sheet as “tr-wrapper

Below is how the syntax looks like in XPath:

table_header = tree.xpath("//section[@class='cex-table']/section[@class='thead']/div[@class='tr-wrapper']")

It will scan through the entire DOM tree to find if any element matches this structure and return a list of nodes matched.

If everything goes well, the table_header list should only contain 1 element which is the div with “tr-wrapper” style sheet. Sometimes if it returns multiple nodes, you may need recheck your path expression to see how you can fine-tune it to get only the unique node that you need.

From the wrapper div, there are still a few levels before we can reach to the node with the text. But you may notice that all the data fields we need are under the span tag which has a style name “cell-text“. So we can actually locate all these span tags with CSS class and extract its text with text() function. Below is how it works in XPath expression:

headers = table_header[0].xpath(".//span[@class='cell']/span[@class='cell-text']/text()")

Note that “.” means to start from current node, and “//” is to indicate the following path expression is relative path

If you examine the headers now, you can see all the column headers are extracted into a list as per below:

['Asset',
 'Price',
 'Market Cap',
 'Total Exchange Volume',
 'Returns (24h)',
 'Total Supply',
 'Category',
 'Value Proposition',
 'Consensus Mechanism']

Let’s continue to move the table body. Following the same logic, we shall be able to locate to the section with “tr-section” in below syntax:

table_body = tree.xpath("//section[@class='cex-table']/section[@class='tbody']/section[@class='tr-section']")

This means that we have already collected all the nodes for rows in the table body. We can now loop through the rows to get the elements. We will use the style sheet to locate our elements, but for the “Asset” column, it actually contains a few child nodes with different style sheet, so we need to handle them separately from the rest of the columns. Below is the code to extract the data row by row and add it into a record list:

records = []
for row in table_body:    
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    spans = row.xpath(".//div[contains(@class,'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    row_data = ranks + tokens + assets + rest_cols
    records.append(row_data)

Note that we are using “contains” in order to match the node with class like cell cell-rank“, and use text_content() to extract all the text from its current nodes and child nodes.

Occasionally you may find that the number of columns we extracted does not tally with the original column header due to header column merged or hidden, such as our above ranking and token ticker column. So let’s also give them column name as “Rank” and “Token”:

column_header = ["Rank", "Token"] + headers

Save the scraping result

With both the header and data ready, we can easily load the data into pandas as per below:

import pandas as pd
df = pd.DataFrame(records, columns=column_header)

You can see the below result in pandas dataframe, which looks pretty good except some formatting to be done to convert all the amount into proper number format.

web scraping to get cryptocurrency price

Or you can also write the scrapped data into a csv file with the csv module:

import csv
with open("token_price.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(column_header)
    for row in records:
        writer.writerow(row)

Limitations & Constraints

In your real scraping project, you may encounter more complicated scenarios rather than directly getting the data from a GET request. So it’s better to understand how are the constraints/limitations for our above mentioned approach.

  • Go through the authentication process can be time-consuming with requests

If your target website requires authentication before you can retrieve the data, you may need to create a session and send multiple POST/GET requests to the server in order to get yourself authorized. Depends on how complicated the authentication process is, you will need to understand what are the parameters to be supplied and how the requests are chained together. This process may take some time and effort.

  • You cannot trigger JavaScript code to get your data

If the response from your target website returns some JavaScript code to populate the data, or you need to trigger some JavaScript function in order to have the data populated on the web page, you may find requests package simply would not work.

For both scenarios, you may consider to use selenium which I have mentioned in one of my past post. It has a headless mode where you can simulate user’s action such as key in user credentials or click buttons without actually showing the browser, and you can also execute JavaScript code to interact with the web page. The downside is that you will have to periodically upgrade your driver file to match with the browser’s version.

Conclusion

In this article, we have reviewed through a very basic example to scrape data with requests and lxml packages, and we have also discussed a few limitations where you may start looking for alternatives such as selenium or even the scrapy framework in case you have more complicated scenarios to be handled. No matter which libraries you choose to use, the fundamental remains the same. Hope this article gives you some hints on how to start your web scraping journey.

 

gspread read and write google sheet

Read and write Google Sheet with 5 lines of Python code

Introduction

Google Sheet is a very powerful tool in terms of collaboration, it allows multiple users to work on the same rows of data simultaneously. It also provides fine-grained APIs in various programming languages for your application to connect and interact with Google Sheet. Sometimes when you just need some simple operations like reading/writing data from a sheet, you may wonder if there is any higher level APIs that can complete these simple tasks easily. The short answer is yes. In this article, we will be discussing how can we read/write Google Sheet in 5 lines of Python code.

Prerequisites

As the prerequisite, you will need to have a Google service account in order for you to go through the Google cloud service authentication for your API calls. You can follow the guide from here for a free account setup. Once you have completed all the steps, you shall have a JSON file similar to below which contains your private key for accessing the Google cloud services. You may rename it to “client_secret.json” for our later use.

{
  "type": "service_account",
  "project_id": "new_project",
  "private_key_id": "xxxxxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\xxxxx\n-----END PRIVATE KEY-----\n",
  "client_email": "xxx@developer.gserviceaccount.com",
  "client_id": "xxx",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxx%40developer.gserviceaccount.com"
}

From this JSON file, you can also find the email address for your newly created service account, if you need to access your existing Google Sheet files, you will need to grant access of your files to this email address.

Note: There is a limit of 100 requests per every 100 seconds for the free Google service account, you may need to upgrade your account to the paid account if this free quota is not sufficient for your business.

In addition to the service account, we need another two libraries google-auth and gspread which are the core modules to be used for authentication and manipulating the Google Sheet file.

Below is the pip command to install the two libraries:

pip install gspread
pip install google-auth

Lastly, let’s create a Google Sheet file namely “spreadsheet1” with some sample data from US 2020 election result:

gspread write and read google sheet

Once you have all above ready, let’s dive into our code examples.

Read Google Sheet data into pandas

Let’s first import the necessary libraries at the top of our script:

import gspread
from google.oauth2.service_account import Credentials
import pandas as pd

To get the access to Google Sheet, we will need to define the scope (API endpoint). For our case, we specify the scope to read and write the Google Sheet file.  If you would like to restrict your program from updating any data, you can specify spreadsheets.readonly and drive.readonly in the scope.

scope = ['https://www.googleapis.com/auth/spreadsheets',
        'https://www.googleapis.com/auth/drive']

And then we can build a Credentials object with our JSON file and the above defined scope:

creds = Credentials.from_service_account_file("client_secret.json", scopes=scope)

Next, we call the authorize function from gspread library to pass in our credentials:

client = gspread.authorize(creds)

With this one line of code, it will be going through the authentication under the hood. Once authentication passed, it establishes the connection between your application and the Google cloud service. From there, you can send request to open your spreadsheet file by specifying the file name:

google_sh = client.open("spreadsheet1")

Besides opening file by name, you can also use open_by_key with the sheet ID or open_by_url with the URL of the sheet.

If the proper access has been given to your service account, you would be able to gain the control to your Google Sheet, and you can continue to request to open a particular spreadsheet tab.

For instance, below returns the first sheet of the file:

sheet1 = google_sh.get_worksheet(0)

With the above, you can simply read all records into a dictionary with get_all_records function, and pass into a pandas DataFrame:

df = pd.DataFrame(data=sheet1.get_all_records())

Now if you examine the df object, you shall see the below output:

gspread write and read google sheet

So that’s it! With a few lines of code, you’ve successfully downloaded your data from Google Sheet into pandas, and now you can do whatever you need in pandas.

If you have duplicate column names in your Google Sheet, you may consider to use get_all_values function to get all the values into a list, so that duplicate column remains:

df = pd.DataFrame(data=sheet1.get_all_values())

All the column and row labels will default to RangeIndex as per below:

gspread write and read google sheet

Now let’s take a further look at what else we can achieve with the this library.

Add/Delete work sheets

With gspread, you can easily add new sheets or duplicate from the existing sheets. Below is an example to create a new sheet namely “Sheet2” with max number of rows and columns specified. The index parameter tells Google Sheet where you want to insert your new sheet. index=0 indicates the new sheet to be inserted as the first sheet.

sheet2 = google_sh.add_worksheet(title="Sheet2", rows="10", cols="10", index=0)

Duplicating an existing sheet can be done by specifying the source sheet ID and the new sheet name:

google_sh.duplicate_sheet(source_sheet_id=google_sh.worksheet("Sheet1").id, 
    new_sheet_name="Votes Copy")

Similarly, you can delete an existing sheet by passing in the worksheet object as per below:

google_sh.del_worksheet(sheet2)

If you would like to re-order your worksheets, you can do it with reorder_worksheets function. Assuming you want the sheet2 to be shown before sheet1:

google_sh.reorder_worksheets([sheet2, sheet1])

Read/Write Google Sheet cells

The worksheet object has the row_count and col_count properties which indicate the max rows and columns in the sheet file. But it’s not that useful when you want to know how many rows and columns of actual data you have:

print(sheet1.row_count, sheet1.col_count)
#1000, 26

To have a quick view of the number of rows and columns of your data, you can use:

#Assuming the first row and first column have the full data
print("no. of columns:", len(sheet1.row_values(1)))
#no. of columns: 3
print("no. of rows:", len(sheet1.col_values(1)))
#no. of rows: 8

To access the individual cells, you can either specify the row and column indexes, or use the A1 notation. For instance:

#Access the row 1 and column 2
sheet1.cell(1, 2).value
# or using A1 notation
sheet1.acell('B1').value

Note: the row/column index and A1 notation are all one-based numbers which is similar to the MS excel

Similarly, you can update the value for each individual cell as per below:

sheet1.update_cell(1, 2, "BIDEN VOTES")
#or
sheet1.update_acell("B1", "BIDEN VOTES")

To update multiple cells, you shall use the worksheet update function with the list of cells and values to be updated. For instance, below code will replace the values in row 8:

sheet1.update("A8:C8", [["Texas", 5261485, 5261485]])

Or use batch_update to update multiple ranges:

sheet1.batch_update([{"range": "A8:C8", 
                    "values" : [["Texas", 5261485, 5261485]]},
                     {"range": "A9:C9", 
                    "values" : [["Wisconsin", 1630673, 1610065]]},
                    ])

or use append_rows to insert a row at the last row:

sheet1.append_rows(values=[["Pennsylvania", 3458312, 3376499]])

Besides updating cell values, you can also update the cell format such as the font, background etc. For instance, the below will update the background of the 6th row to red color:

sheet1.format("A6:C6", 
              {"backgroundColor": {
                  "red": 1.0,
                  "green": 0.0,
                  "blue": 0.0,
                  "alpha": 1.0
              }
    })

Note that Google is using RGBA color model, so the color values must be numbers between 0-1.

Below is how it looks like in Google Sheet:

gspread write and read google sheet , format google sheet

Sometimes, it might be difficult to locate the exact index of the cell to be updated. You can find the cell by it’s text with the find function. It will return the first item from the matches.

cell = sheet1.find("Michigan")
print(cell.row, cell.col, cell.value)
#6, 1, 'Michigan'

You can also use Python regular express to find all matches. For instance, to find all cells with text ending as “da”:

import re
query = re.compile(".*da")
cells = sheet1.findall(query)
print(cells)
#[<Cell R4C1 'Florida'>, <Cell R7C1 'Nevada'>]

Add/Remove permission for your Google Sheet

Adding or removing permission for a particular Google Sheet file can be also super easy with gspread. Before adding/removing permission, you shall check who are the users currently have access to your file. You can use list_permission function to retrieve the list of users with their roles:

google_sh.list_permissions()

To give access of your file to other users, you can use:

#perm_type can be : user, group or domain
#role can be : owner, writer or reader
google_sh.share('someone@gmail.com', perm_type='user', role='reader')

When you check your file again, you shall see the email address you shared is added into the list of authorized users.

To revert back the access for a particular user, you can use remove_permissions function. By default, it removes all the access that has been granted to the user:

google_sh.remove_permissions('someone@gmail.com', role="writer")

When the role you’ve specifying does not match with the roles the user currently has, the function returns without doing anything.

Conclusion

Google Sheet API provides comprehensive interfaces for manipulating the sheets from the normal operations like reading/writing of the data, to validation, formatting, building pivot tables and charts etc. Often you find that you may just need some simple APIs to read and write Google Sheet files.

In this article, we have reviewed though the gspread package which provides the high level APIs for working with Google Sheets to serve for this purpose. With gspread, you are able to open existing sheet or create new Google Sheet file, read/write the data as well as do simply formatting. In fact, there are a few other libraries such as gspread-formatting and gspread-pandas which offer extensive functionalities for sheet formatting and interacting sheets with pandas dataframe, you may take a look in case you need something more complicated than what we have covered here.