

Web Scraping From Scratch With 3 Simple Steps

Introduction

Web scraping, or crawling, refers to the technique of extracting information from a website and transforming it into structured data for later analysis. There are generally a few reasons why you may need to implement a web scraping script to automate the data collection process:

  • There isn’t any public API available for you to get data from the source sites
  • The information is updated from time to time (such as an exchange rate), so collecting it manually is not practical
  • The final data you need is scattered across multiple sites; and so on

Before you decide to implement a scraping script, you will also need to check the terms of use for the data you are going to scrape, to be sure you are not violating them; some sites explicitly forbid scraping robots. This article is intended for educational purposes to help you understand the overall process of web scraping, so we will assume you already understand the implications of web scraping and the possible legal issues around how the data is used.

Scraping a website can sometimes be difficult, depending on how the target website is designed and where the data resides. But generally you can split the process into 3 steps. Let’s walk through them one by one.

Understand the structure of your target website

As the first step, take a quick look at your target website to see how the front end interacts with the backend and how the data is populated onto the web page. To keep our example simple, let’s assume user authentication is not required, and our target is to extract the price changes for the top 20 cryptocurrencies from coindesk for further analysis.

The first thing we shall do is to understand how this information is organized on the website. Below is the screenshot of the data presented on the web page:


In the Chrome browser, if you right-click on the web page to inspect the HTML elements, you will see that the entire data table is under <section class="cex-table">…</section>. You can verify this by hovering your mouse over this element: a light blue overlay will appear on the data table.

Next, you may want to inspect each text field on the page to further understand how the table header and records are arranged. For instance, when you check the “Asset” text field, you would see the below HTML structure:

<section class="cex-table">
	<section class="thead">
		<div>...</div>
		<div class="tr-wrapper">
			<div class="tr-left">
				<div class="tr">
					<div>...</div>
					<div style="flex:7" class="th">
						<span class="cell">
						<i class="sorting-icon">
						</i>
						<span class="cell-text">Asset</span>
						</span>
					</div>
				</div>
			</div>
		</div>
		...
	</section>
</section>

And similarly you can find the structure of the first row in the table body as per below:

<section class="tbody">
	<section class="tr-section">
		<a href="/price/bitcoin">
			<div class="tr-wrapper">
				<div class="tr-left">
					<div class="tr">
						<div style="flex:2" class="td">
							<span class="cell cell-rank">
							<strong>01</strong>
							</span>
						</div>
						<div style="flex:7" class="td">
							<span class="cell cell-asset">
							<img>...</img>
							<strong class="cell-asset-title">Bitcoin</strong>
							<span class="cell-asset-iso">BTC</span>
							</span>
						</div>
					</div>
				</div>
			</div>
		</a>
	</section>
</section>

You may notice that the majority of these HTML elements do not have an id or name attribute as a unique identifier, but the style sheet class (the “class” attribute) is quite consistent for the same row of data. So in this case, we shall consider using the class names as a reference to find our data elements.

Locate and parse the target data element with XPath

With an initial understanding of the HTML structure of our target website, we can start to find a way to locate the data elements programmatically.

For this demonstration, we will use the requests and lxml libraries to send the HTTP requests and parse the results. There are other packages for parsing the DOM, such as beautifulsoup, but personally I find XPath expressions more straightforward for locating an element, although the syntax may not be as intuitive as beautifulsoup’s.

Below is the pip command if you do not have these two packages installed:

pip install requests
pip install lxml

Let’s import the packages and send a GET request to our target URL:

import requests
from lxml import html

target_url = "https://www.coindesk.com/coindesk20"
result = requests.get(target_url)

Our target URL does not require any parameters. In case you need to pass in parameters, you can do so via the params argument as per below:

payload = {"q" : "bitcoin", "s" : "relevant"}
result = requests.get("https://www.coindesk.com/search", params=payload)

The result is a response object, which has a status_code attribute to indicate whether a correct response has been returned from the target website. To simplify the code, let’s assume we always get a correct response, with the returned HTML available in string format from the text attribute.
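Rather than assuming success, real code would normally check the response before parsing. Below is a minimal sketch — the fetch_html helper and its injectable http parameter are my own additions, not part of requests — that raises on a non-OK status:

```python
def fetch_html(url, http=None):
    """Fetch a URL and return the body text, raising on a bad status.

    `http` defaults to the requests module; pass a stub for testing.
    """
    if http is None:
        import requests
        http = requests
    response = http.get(url)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

You could then call fetch_html(target_url) in place of the bare requests.get call.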

We then pass our HTML string to lxml and use it to parse the DOM tree as per below:

tree = html.fromstring(result.text)

Now we come to the most important step: using XPath syntax to locate the data elements we want and extract the data.

Since the id or name attributes are not available for these elements, we will need to use the class names to locate our data elements. To locate the table header, we need to perform the below:

  • Find the section tag with the class “cex-table” in the entire DOM
  • Find its child section node with the class “thead”
  • Further find its child div node with the class “tr-wrapper”

Below is how the syntax looks in XPath:

table_header = tree.xpath("//section[@class='cex-table']/section[@class='thead']/div[@class='tr-wrapper']")

It will scan through the entire DOM tree to find any elements matching this structure and return a list of matched nodes.

If everything goes well, the table_header list should contain only 1 element, which is the div with the “tr-wrapper” class. If it returns multiple nodes, you may need to recheck your path expression to see how you can fine-tune it to get only the unique node that you need.

From the wrapper div, there are still a few levels before we can reach the node with the text. But you may notice that all the data fields we need are under span tags with the class name “cell-text”. So we can locate all these span tags by their class and extract their text with the text() function. Below is how it works as an XPath expression:

headers = table_header[0].xpath(".//span[@class='cell']/span[@class='cell-text']/text()")

Note that “.” means to start from the current node, and “//” indicates that the following path expression is a relative path.
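To see this in isolation, here is a small self-contained example (using a trimmed-down copy of the markup) showing a node located from the document root, then a relative search with “.//” below that node only:

```python
from lxml import html

snippet = """
<section class="thead">
  <div class="tr-wrapper">
    <span class="cell"><span class="cell-text">Asset</span></span>
  </div>
</section>
"""
tree = html.fromstring(snippet)
# a leading "//" searches from the document root
wrapper = tree.xpath("//div[@class='tr-wrapper']")[0]
# ".//" anchors the search at the wrapper node instead
texts = wrapper.xpath(".//span[@class='cell-text']/text()")
print(texts)  # ['Asset']
```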

If you examine the headers now, you can see all the column headers are extracted into a list as per below:

['Asset',
 'Price',
 'Market Cap',
 'Total Exchange Volume',
 'Returns (24h)',
 'Total Supply',
 'Category',
 'Value Proposition',
 'Consensus Mechanism']

Let’s continue on to the table body. Following the same logic, we shall be able to locate the sections with the “tr-section” class using the below syntax:

table_body = tree.xpath("//section[@class='cex-table']/section[@class='tbody']/section[@class='tr-section']")

This gives us all the row nodes in the table body, and we can now loop through them to get the elements. We will use the class names to locate our elements, but the “Asset” column actually contains a few child nodes with different class names, so we need to handle them separately from the rest of the columns. Below is the code to extract the data row by row and add it into a records list:

records = []
for row in table_body:    
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    spans = row.xpath(".//div[contains(@class,'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    row_data = ranks + tokens + assets + rest_cols
    records.append(row_data)

Note that we are using “contains” in order to match nodes with classes like “cell cell-rank”, and use text_content() to extract all the text from the current node and its child nodes.

Occasionally you may find that the number of columns extracted does not tally with the original column headers because some header columns are merged or hidden, such as our ranking and token ticker columns above. So let’s also give them the column names “Rank” and “Token”:

column_header = ["Rank", "Token"] + headers

Save the scraping result

With both the header and data ready, we can easily load the data into pandas as per below:

import pandas as pd
df = pd.DataFrame(records, columns=column_header)

You can see the below result in the pandas dataframe, which looks pretty good except that some formatting remains to be done to convert the amounts into a proper number format.

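As a sketch of that clean-up step — the column name and the exact value formats here are hypothetical; adjust them to match what you actually scraped — you can strip currency symbols and thousand separators and cast to float:

```python
import pandas as pd

# hypothetical scraped values, e.g. "$19,273.36"
df = pd.DataFrame({"Price": ["$19,273.36", "$601.87"]})
df["Price"] = (
    df["Price"]
    .str.replace(r"[$,]", "", regex=True)  # drop "$" and thousand separators
    .astype(float)
)
print(df["Price"].tolist())  # [19273.36, 601.87]
```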

Or you can write the scraped data into a csv file with the csv module:

import csv
with open("token_price.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(column_header)
    for row in records:
        writer.writerow(row)

Limitations & Constraints

In your real scraping project, you may encounter more complicated scenarios than directly getting the data from a GET request, so it’s better to understand the constraints/limitations of the above-mentioned approach.

  • Going through an authentication process can be time-consuming with requests

If your target website requires authentication before you can retrieve the data, you may need to create a session and send multiple POST/GET requests to the server in order to get yourself authorized. Depending on how complicated the authentication process is, you will need to understand what parameters are to be supplied and how the requests are chained together. This process may take some time and effort.
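As a rough sketch only — the endpoint URL, the form field names, and any CSRF token handling are entirely site-specific assumptions here — a typical session-based flow looks like this:

```python
import requests

def login_and_fetch(base_url, username, password):
    """Hypothetical login flow: authenticate once, then reuse the
    session cookies for subsequent requests."""
    session = requests.Session()
    resp = session.post(
        f"{base_url}/login",  # hypothetical endpoint
        data={"username": username, "password": password},
    )
    resp.raise_for_status()
    # the session now carries the auth cookies automatically
    return session.get(f"{base_url}/data")
```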

  • You cannot trigger JavaScript code to get your data

If the response from your target website returns some JavaScript code to populate the data, or you need to trigger some JavaScript function in order to have the data populated on the web page, you may find that the requests package simply does not work.

For both scenarios, you may consider using selenium, which I have mentioned in one of my past posts. It has a headless mode where you can simulate user actions such as keying in user credentials or clicking buttons without actually showing the browser, and you can also execute JavaScript code to interact with the web page. The downside is that you will have to periodically upgrade your driver file to match the browser’s version.
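For reference, a minimal headless-Chrome sketch with selenium might look like the below. This assumes you have a matching chromedriver available; the import is kept inside the function so the rest of your script does not require selenium to be installed:

```python
def fetch_rendered_html(url):
    """Render a JavaScript-driven page in headless Chrome and
    return the resulting HTML."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # driver.execute_script(...) can trigger JavaScript if needed
        return driver.page_source
    finally:
        driver.quit()
```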

Conclusion

In this article, we have reviewed a very basic example of scraping data with the requests and lxml packages, and we have also discussed a few limitations where you may start looking for alternatives such as selenium or even the scrapy framework in case you have more complicated scenarios to handle. No matter which libraries you choose, the fundamentals remain the same. Hope this article gives you some hints on how to start your web scraping journey.

 


Read and write Google Sheet with 5 lines of Python code

Introduction

Google Sheet is a very powerful tool in terms of collaboration: it allows multiple users to work on the same rows of data simultaneously. It also provides fine-grained APIs in various programming languages for your application to connect and interact with Google Sheet. Sometimes, when you just need some simple operations like reading/writing data from a sheet, you may wonder if there are any higher-level APIs that can complete these simple tasks easily. The short answer is yes. In this article, we will discuss how we can read/write Google Sheet in 5 lines of Python code.

Prerequisites

As a prerequisite, you will need a Google service account in order to go through the Google cloud service authentication for your API calls. You can follow the guide from here for a free account setup. Once you have completed all the steps, you shall have a JSON file similar to the below, which contains your private key for accessing the Google cloud services. You may rename it to “client_secret.json” for our later use.

{
  "type": "service_account",
  "project_id": "new_project",
  "private_key_id": "xxxxxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\xxxxx\n-----END PRIVATE KEY-----\n",
  "client_email": "[email protected]",
  "client_id": "xxx",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxx%40developer.gserviceaccount.com"
}

From this JSON file, you can also find the email address of your newly created service account. If you need to access your existing Google Sheet files, you will need to grant access to those files to this email address.

Note: there is a limit of 100 requests per 100 seconds for the free Google service account; you may need to upgrade to a paid account if this free quota is not sufficient for your business.
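If you do hit this quota, one simple mitigation is to retry with exponential backoff. The helper below is a generic sketch of my own (gspread raises gspread.exceptions.APIError on an HTTP 429; the broad except here keeps the example free of a gspread import):

```python
import time

def call_with_backoff(func, *args, retries=5, delay=1.0, **kwargs):
    """Call func, retrying with exponentially growing pauses on failure."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2
```

You could then wrap a quota-sensitive call, e.g. call_with_backoff(sheet1.get_all_records), where sheet1 is a worksheet object as shown later.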

In addition to the service account, we need another two libraries, google-auth and gspread, which are the core modules used for authentication and for manipulating the Google Sheet file.

Below is the pip command to install the two libraries:

pip install gspread
pip install google-auth

Lastly, let’s create a Google Sheet file named “spreadsheet1” with some sample data from the US 2020 election results:


Once you have all of the above ready, let’s dive into our code examples.

Read Google Sheet data into pandas

Let’s first import the necessary libraries at the top of our script:

import gspread
from google.oauth2.service_account import Credentials
import pandas as pd

To get access to Google Sheet, we will need to define the scope (API endpoint). For our case, we specify the scope to read and write the Google Sheet file. If you would like to restrict your program from updating any data, you can specify spreadsheets.readonly and drive.readonly in the scope.

scope = ['https://www.googleapis.com/auth/spreadsheets',
        'https://www.googleapis.com/auth/drive']

And then we can build a Credentials object with our JSON file and the above defined scope:

creds = Credentials.from_service_account_file("client_secret.json", scopes=scope)

Next, we call the authorize function from gspread library to pass in our credentials:

client = gspread.authorize(creds)

With this one line of code, the authentication happens under the hood. Once authentication has passed, it establishes the connection between your application and the Google cloud service. From there, you can send a request to open your spreadsheet file by specifying the file name:

google_sh = client.open("spreadsheet1")

Besides opening file by name, you can also use open_by_key with the sheet ID or open_by_url with the URL of the sheet.

If the proper access has been given to your service account, you will be able to gain control of your Google Sheet file, and you can continue to open a particular spreadsheet tab.

For instance, below returns the first sheet of the file:

sheet1 = google_sh.get_worksheet(0)

With the above, you can simply read all records into a list of dictionaries with the get_all_records function, and pass it into a pandas DataFrame:

df = pd.DataFrame(data=sheet1.get_all_records())

Now if you examine the df object, you shall see the below output:


So that’s it! With a few lines of code, you’ve successfully downloaded your data from Google Sheet into pandas, and now you can do whatever you need in pandas.

If you have duplicate column names in your Google Sheet, you may consider using the get_all_values function to get all the values into a list of lists, so that the duplicate columns remain:

df = pd.DataFrame(data=sheet1.get_all_values())

All the column and row labels will default to a RangeIndex, as per below:

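If you do want the first row to serve as the header, you can promote it yourself. A small sketch (the sample rows below are made up for illustration, standing in for the output of get_all_values):

```python
import pandas as pd

# hypothetical output of get_all_values(): the first row holds the headers
values = [
    ["state", "biden_votes", "trump_votes"],
    ["Arizona", "1672143", "1661686"],
]
df = pd.DataFrame(values)
df.columns = df.iloc[0]               # promote the first row to header
df = df.iloc[1:].reset_index(drop=True)
print(list(df.columns))  # ['state', 'biden_votes', 'trump_votes']
```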

Now let’s take a further look at what else we can achieve with this library.

Add/Delete work sheets

With gspread, you can easily add new sheets or duplicate existing ones. Below is an example of creating a new sheet named “Sheet2” with the max number of rows and columns specified. The index parameter tells Google Sheet where you want to insert the new sheet; index=0 indicates the new sheet is to be inserted as the first sheet.

sheet2 = google_sh.add_worksheet(title="Sheet2", rows="10", cols="10", index=0)

Duplicating an existing sheet can be done by specifying the source sheet ID and the new sheet name:

google_sh.duplicate_sheet(source_sheet_id=google_sh.worksheet("Sheet1").id, 
    new_sheet_name="Votes Copy")

Similarly, you can delete an existing sheet by passing in the worksheet object as per below:

google_sh.del_worksheet(sheet2)

If you would like to re-order your worksheets, you can do it with reorder_worksheets function. Assuming you want the sheet2 to be shown before sheet1:

google_sh.reorder_worksheets([sheet2, sheet1])

Read/Write Google Sheet cells

The worksheet object has row_count and col_count properties which indicate the max rows and columns of the sheet. But they are not that useful when you want to know how many rows and columns of actual data you have:

print(sheet1.row_count, sheet1.col_count)
#1000, 26

To have a quick view of the number of rows and columns of your data, you can use:

#Assuming the first row and first column have the full data
print("no. of columns:", len(sheet1.row_values(1)))
#no. of columns: 3
print("no. of rows:", len(sheet1.col_values(1)))
#no. of rows: 8

To access the individual cells, you can either specify the row and column indexes, or use the A1 notation. For instance:

#Access the row 1 and column 2
sheet1.cell(1, 2).value
# or using A1 notation
sheet1.acell('B1').value

Note: the row/column indexes and the A1 notation are all one-based, similar to MS Excel.

Similarly, you can update the value for each individual cell as per below:

sheet1.update_cell(1, 2, "BIDEN VOTES")
#or
sheet1.update_acell("B1", "BIDEN VOTES")

To update multiple cells, use the worksheet update function with the range and the values to be updated. For instance, the below code will replace the values in row 8:

sheet1.update("A8:C8", [["Texas", 5261485, 5261485]])

Or use batch_update to update multiple ranges:

sheet1.batch_update([{"range": "A8:C8", 
                    "values" : [["Texas", 5261485, 5261485]]},
                     {"range": "A9:C9", 
                    "values" : [["Wisconsin", 1630673, 1610065]]},
                    ])

or use append_rows to insert a row at the last row:

sheet1.append_rows(values=[["Pennsylvania", 3458312, 3376499]])

Besides updating cell values, you can also update the cell format, such as the font, background, etc. For instance, the below will update the background of the 6th row to red:

sheet1.format("A6:C6", 
              {"backgroundColor": {
                  "red": 1.0,
                  "green": 0.0,
                  "blue": 0.0,
                  "alpha": 1.0
              }
    })

Note that Google uses the RGBA color model, so the color values must be numbers between 0 and 1.

Below is how it looks in Google Sheet:

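If you are more used to 0–255 RGB values, a tiny helper (my own, not part of gspread) can do the conversion:

```python
def rgb_to_sheet_color(r, g, b):
    """Convert 0-255 RGB components to the 0-1 floats Google expects."""
    return {"red": r / 255, "green": g / 255, "blue": b / 255}

print(rgb_to_sheet_color(255, 0, 0))  # {'red': 1.0, 'green': 0.0, 'blue': 0.0}
```

You could then pass the result as the backgroundColor value in the format call above.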

Sometimes it might be difficult to locate the exact index of the cell to be updated. You can find a cell by its text with the find function, which returns the first item from the matches.

cell = sheet1.find("Michigan")
print(cell.row, cell.col, cell.value)
#6, 1, 'Michigan'

You can also use Python regular expressions to find all matches. For instance, to find all cells with text ending in “da”:

import re
query = re.compile(".*da")
cells = sheet1.findall(query)
print(cells)
#[<Cell R4C1 'Florida'>, <Cell R7C1 'Nevada'>]

Add/Remove permission for your Google Sheet

Adding or removing permissions for a particular Google Sheet file can also be super easy with gspread. Before adding/removing permissions, you shall check which users currently have access to your file. You can use the list_permissions function to retrieve the list of users with their roles:

google_sh.list_permissions()

To give access of your file to other users, you can use:

#perm_type can be : user, group or domain
#role can be : owner, writer or reader
google_sh.share('[email protected]', perm_type='user', role='reader')

When you check your file again, you shall see that the email address you shared with has been added to the list of authorized users.

To revoke access for a particular user, you can use the remove_permissions function. By default, it removes all the access that has been granted to the user:

google_sh.remove_permissions('[email protected]', role="writer")

When the role you’ve specified does not match the roles the user currently has, the function returns without doing anything.

Conclusion

The Google Sheet API provides comprehensive interfaces for manipulating sheets, from normal operations like reading/writing data to validation, formatting, building pivot tables and charts, etc. But often you may just need some simple APIs to read and write Google Sheet files.

In this article, we have reviewed the gspread package, which provides high-level APIs for working with Google Sheets to serve this purpose. With gspread, you are able to open an existing sheet or create a new Google Sheet file, read/write the data, as well as do simple formatting. In fact, there are a few other libraries, such as gspread-formatting and gspread-pandas, which offer extensive functionality for sheet formatting and for interacting with sheets via pandas dataframes; you may take a look in case you need something more complicated than what we have covered here.

The key for understanding python positional and keyword arguments

In one of the previous articles, we summarized the different ways of passing arguments to a Python script. In this article, we will be reviewing the various approaches for defining and passing arguments into a function in Python.

First, let’s start with the basics.

Parameter vs argument

By definition, parameters are the identifiers you specify when you define your function, while arguments are the actual values you supply to the function when you make the function call. Sometimes you may see people mix up the words parameter and argument, but ultimately they are referring to the same thing.

Basically, a Python function supports two types of arguments: positional and keyword arguments. A positional argument is matched to a parameter by its position in the function definition, while for a keyword argument, you specify the identifier (keyword) followed by the value.

You are allowed to use a combination of positional and keyword arguments when you define your function parameters. Below are the 4 types of variations:

Positional or keyword parameter

By default, when you define a Python function, you can pass in the arguments either as positional or keyword. For instance, the below function requires two parameters – file_name and separator:

def read_file(file_name, separator):    
    print(f"file_name={file_name}, separator={separator}")
    file_text  = "Assuming this is the first paragraph from the file."
    return file_text.split(separator)

You can make the function call by supplying the arguments by parameter position or providing the parameter keywords:

read_file("text.txt", " ")
#or
read_file(file_name="text.txt", separator=" ")
#or
read_file("text.txt", separator=" ")
#or
read_file(separator=" ", file_name="text.txt")

All 4 calls above would give you the same results as per below:

file_name=text.txt, separator= 
['Assuming', 'this', 'is', 'the', 'first', 'paragraph', 'from', 'the', 'file.']

Python accepts these arguments regardless of whether they are provided in positional or keyword form. When all arguments are passed by keyword, you can provide them in any order. But take note that positional arguments must be placed before keyword arguments. For instance, the below throws a syntax error,

read_file(file_name="text.txt", " ")

which shows “positional argument follows keyword argument”.


Keyword only parameters

For clarity, you may want to implement functions that only accept keyword arguments, so that your callers are restricted to using keyword arguments when invoking the function. To achieve that, you can tweak your function definition a bit with an additional “*” to indicate that all parameters after it must be passed as keyword arguments. E.g.:

def write_file(file_name, *, separator, end, flush):
    print(f"file={file_name}, sep={separator}, end={end}, flush flag={flush}")

For the above function, the separator, end, flush parameters will only accept keyword arguments. You can call it as per below:

write_file("test.txt", separator=" ", end="\r\n", flush=True)

And you shall see output as per below:

file=test.txt, sep= , end=
, flush flag=True

But if you try to pass in all as positional arguments:

write_file("test.txt", " ", "\n", False)

You would see the below error message, which shows that the last 3 positional arguments were not accepted.


To further restrict all parameters to be keyword-only, you just need to shift the “*” to the beginning of the parameter list:

def write_file(*, file_name, separator, end, flush):
    print(f"file={file_name}, sep={separator}, end={end}, flush flag={flush}")

This makes all parameters only accept keyword arguments. And you can unpack an existing dictionary and pass its items as arguments into the function:

options = dict(file_name="test.txt", separator=",", end="\n", flush=True)
write_file(**options)

Positional only arguments

Many Python built-in functions only accept positional arguments, for instance pow, divmod, abs, etc. Prior to Python 3.8, there was no easy way to define your own function with positional-only parameters. But from version 3.8 onward, you can restrict your function to only accept positional arguments by specifying “/” (PEP 570) in the function definition. All the parameters that come before “/” will be positional-only. For example:

def read(file_name, separator, /):
    print(f"file_name={file_name}, separator={separator}")

For these parameters, if you try to pass in the keyword argument as per below:

read("test.txt", separator=",")

You would see an error message indicating that the particular parameter is a positional-only argument.

    read("test.txt", separator=",")
TypeError: read() got some positional-only arguments passed as keyword arguments: 'separator'

Arbitrary number of arguments

There are cases where you have a number of positional or keyword arguments and you do not want a long list of parameters in the function definition. For such cases, you can use *args and **kwargs to define an arbitrary number of arguments to be accepted by the function. For instance:

def log(*args,**kwargs):
    print(f"args={args}, kwargs={kwargs}")

The above function accepts any number of positional arguments and keyword arguments. All the positional arguments will be packed into a tuple, and keyword arguments are packed into a dictionary.

log("start", "debugging", program_name="python", version=3.7)

When you make a function call as per above, you can see the below output:

args=('start', 'debugging'), kwargs={'program_name': 'python', 'version': 3.7}

This is especially useful when you are just trying to capture a snapshot of all the input arguments (such as for logging) or implement a wrapper function for a decorator, where you do not need to know the exact arguments being passed in. Arbitrary arguments give callers more flexibility on what they want to pass into your function.
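For instance, a logging decorator — a classic use of *args and **kwargs — can forward whatever arguments the caller supplies without knowing them in advance:

```python
import functools

def log_calls(func):
    """Decorator that prints every call before forwarding the arguments."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} args={args} kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b=0):
    return a + b

add(1, b=2)  # prints: calling add args=(1,) kwargs={'b': 2}
```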

The disadvantages are also obvious: it’s unclear to a new reader what parameters need to be provided in order to get the correct result from the function call; and you cannot expect all the arguments to be present during the call, so you will have to write some logic to handle the various scenarios where particular parameters are missing or present.

The best practice for function arguments

When you only have 1 or 2 parameters, you won’t typically see any issue with code readability/clarity. Problems usually emerge when you have more parameters, some mandatory and some optional. Consider the below send_email example:

def send_email(subject, to, cc, 
               bcc, message, attachment, 
               onbehalf, important, read_receipt_requested):
    print(f"{subject}, {to}, {cc}, \
          {bcc} {onbehalf}, {message}, \
          {attachment}, {important}, {read_receipt_requested}")

When you try to include as many parameters as possible to make the function generic for everybody, you’ll have to maintain a very long list of parameters in the function definition. Calling this function by passing positional arguments can be very confusing, as you will have to provide the arguments in exactly the same sequence as in the function definition, without omitting any single optional argument. E.g.:

send_email("hello", "[email protected]", "[email protected]",
           None, None, 
           "hello", None, False, True)

Without referring back to the function definition, it would be very hard to know which argument represents which parameter. The best way to handle such a scenario is to always use keyword arguments and set the optional parameters with a default value. For example, the below improved version:

def send_email(subject, to, message, 
               cc=None, bcc=None, attachment=None, 
               onbehalf=None, important=False, read_receipt_requested=False):
    print(f"{subject}, {to}, {cc}, \
          {bcc} {onbehalf}, {message}, \
          {attachment}, {important}, {read_receipt_requested}")

#specify all parameters with keyword arguments
send_email(subject="hello", 
  to="[email protected]", 
  cc="[email protected]", 
  message="hello", 
  read_receipt_requested=True)

With keyword arguments, you improve the readability of your code and can specify default values for the optional arguments. Furthermore, it gives you the flexibility to extend your parameter list in the future without refactoring your existing code, so it provides backwards compatibility from the very beginning.

Conclusion

In this article, we have reviewed the different approaches for defining and passing arguments to a Python function, as well as their advantages and disadvantages. Based on your own scenario, you may need to evaluate whether to use positional-only, keyword-only, a mix of positional and keyword, or even arbitrary arguments. For code clarity, the general recommendation is to use keyword arguments as much as possible, so that all the arguments are understandable at first glance, which reduces the chance of errors.

Photo by Yeshi Kangrang on Unsplash

Python recipes- suppress stdout and stderr messages

Introduction

If you have worked on projects that require API calls to external parties or use 3rd-party libraries, you may sometimes run into the problem that you get the correct return results, but they come with a lot of noise in stdout and stderr. For instance, the developer may have left a lot of “for your info” messages in the standard output, or there may be warning or error messages due to version differences in some of the dependency libraries.

All these messages would flood your console, and since you have no control over the source code, you cannot change its behavior. To reduce this noise, one option is to suppress stdout and stderr messages while making the function call. In this article, we will discuss some recipes to suppress the messages for such scenarios.

Unexpected messages from stdout and stderr

To further illustrate the issue, let's take a look at the example below. Assume we have the following check_result function in a Python file externallib.py, which represents an external library.

import sys

def check_result():
    print("stdout message from externallib")
    print("stderr message from externallib", file=sys.stderr)
    return True

If you import the module and call the check_result function, you would definitely get the result as True, but you would also see both the stdout and stderr messages in your console.

import externallib

result = externallib.check_result()

Both stdout and stderr messages were printed out in the console:

Python suppress stdout and stderr

suppress stdout and stderr with context manager

To stop these messages from printing out, we need to suppress stdout and stderr by redirecting the output into a devnull file (similar to /dev/null in Linux, which is typically used for disposing of unwanted output streams) right before calling the function, and then redirecting the outputs back after the call completes.

To do that, the best approach is to use a context manager, so that it is automatically directed/redirected upon the entry and exit of the context manager.

So let’s implement a context manager to perform the below:

  • Use suppress_stdout and suppress_stderr flags to indicate which stream to be suppressed
  • Save the state of the sys.stdout and sys.stderr in the __enter__ function, and redirect them to devnull based on the suppress_stdout and suppress_stderr flags
  • Restore back the state for sys.stdout and sys.stderr in __exit__

Below is the code snippet:

import os, sys

class suppress_output:
    def __init__(self, suppress_stdout=False, suppress_stderr=False):
        self.suppress_stdout = suppress_stdout
        self.suppress_stderr = suppress_stderr
        self._stdout = None
        self._stderr = None
        self._devnull = None

    def __enter__(self):
        self._devnull = open(os.devnull, "w")
        if self.suppress_stdout:
            self._stdout = sys.stdout
            sys.stdout = self._devnull

        if self.suppress_stderr:
            self._stderr = sys.stderr
            sys.stderr = self._devnull

    def __exit__(self, *args):
        if self.suppress_stdout:
            sys.stdout = self._stdout
        if self.suppress_stderr:
            sys.stderr = self._stderr
        # close the devnull handle so we do not leak a file descriptor
        self._devnull.close()

And if you call the check_result again within this context manager as per below:

with suppress_output(suppress_stdout=True, suppress_stderr=True):
    result = externallib.check_result()
print(result)

You would not see any messages printed out from check_result function, and the return result would remain as True. This is exactly what we are expecting!

Since we are using a context manager, you may wonder whether contextlib can make our code more concise. So let's make use of the contextlib package and re-implement the above context manager with the contextmanager decorator as per below:

from contextlib import contextmanager

@contextmanager
def nullify_output(suppress_stdout=True, suppress_stderr=True):
    stdout = sys.stdout
    stderr = sys.stderr
    devnull = open(os.devnull, "w")
    try:
        if suppress_stdout:
            sys.stdout = devnull
        if suppress_stderr:
            sys.stderr = devnull
        yield
    finally:
        if suppress_stdout:
            sys.stdout = stdout
        if suppress_stderr:
            sys.stderr = stderr
        # close the devnull handle so we do not leak a file descriptor
        devnull.close()

With the above decorator implementation, you shall be able to get the same result when you call the function:

with nullify_output(suppress_stdout=True, suppress_stderr=True):
    result = externallib.check_result()
print(result)

Everything seems to be good so far, so shall we conclude here? Wait, there is something else we can still improve: instead of totally discarding the messages, can we collect them into a log file?

Suppress stdout and stderr with redirect_stdout and redirect_stderr

If you scroll down the Python contextlib documentation further, you will notice two context managers related to stdout and stderr: redirect_stdout and redirect_stderr. They are quite self-explanatory by their names, and both accept a file-like object as the redirect target.

With these two functions, we shall be able to make our code even more concise, meanwhile we can easily collect back the output message into our log file.

from contextlib import redirect_stdout, redirect_stderr
import io, logging
logging.basicConfig(filename='error.log', level=logging.DEBUG)

f = io.StringIO()
with redirect_stdout(f), redirect_stderr(f):
    result = externallib.check_result()
logging.info(f.getvalue())
print(result)

If you check the log file, you shall see the stdout and stderr messages were collected correctly.

suppress stdout and stderr with redirect_stdout or redirect_stderr

Of course, if you wish to continue disposing of these messages, you can still specify the target file as devnull, so that nothing will be collected.
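For instance, a minimal sketch of that variant, using os.devnull as the redirect target instead of a StringIO buffer:

```python
import os
from contextlib import redirect_stdout, redirect_stderr

# route both streams into the devnull file so nothing is collected
with open(os.devnull, "w") as devnull, \
        redirect_stdout(devnull), redirect_stderr(devnull):
    print("this message is discarded")
```

The with statement also closes the devnull handle automatically on exit.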

Conclusion

With all the above examples and explanations, hopefully you are able to use the code snippets and customize them to meet the objectives of your own project. Directly disposing of stderr may not always be a good idea, in case it contains useful information for later troubleshooting, so I would recommend collecting it into a log file as much as possible and doing proper housekeeping to ensure the logs do not grow too fast.

If you are looking for a solution to suppress certain known Python exceptions, you may check out the suppress function from the contextlib package.

Python comprehension Photo by Karsten Würth on Unsplash

Python comprehensions for list, set and dictionary

Introduction

A Python comprehension is a set of looping and filtering instructions for evaluating expressions and producing sequence output. It is commonly used to construct list, set or dictionary objects, in which case it is known as a list comprehension, set comprehension or dictionary comprehension respectively. Compared to using map or filter or nested loops to generate the same result, a comprehension has more concise syntax and improved readability. In this article, I will be explaining these three types of comprehensions with some practical examples.

Python comprehension basic syntax

You may have seen people using list/set/dict comprehension in their code, but if you have not yet checked the Python documentation, below is the syntax for Python comprehension.

Assignment Expression for target in iterable [if condition]

It requires a single assignment expression with at least one for statement, followed by zero or more for or if statements.
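As a minimal illustration of how the pieces map onto that syntax (the variable names here are just for the example):

```python
# expression: n * n, loop: for n in range(6), filter: if n % 2 == 0
squares = [n * n for n in range(6) if n % 2 == 0]
print(squares)  # [0, 4, 16]
```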

With this basic understanding, let’s dive into the examples.

List comprehension

List comprehension uses for and if statements to construct a list literal. The new list can be derived from any sequence such as a string, list, set or dictionary. For instance, if we have a list like below:

words = [
    "Serendipity",
    "Petrichor",
    "Supine",
    "Solitude",
    "Aurora",
    "Idyllic",
    "Clinomania",
    "Pluviophile",
    "Euphoria",
    "Sequoia"]

Single loop/if statement

You can use list comprehension to derive a new list which only keeps the elements with length less than 8 from the original list:

short_words = [word for word in words if len(word) < 8 ]

If you examine the short_words, you shall see only the short words (length less than 8) were selected to form a new list:

['Supine', 'Aurora', 'Idyllic', 'Sequoia']

Multiple if statements

As described earlier in the syntax, you can have multiple if conditions to filter the elements:

short_s_words = [word for word in words if len(word) < 8 if word.startswith("S") ]
#short_s_words = [word for word in words if len(word) < 8 and word.startswith("S") ]

The above two would generate the same result as per below:

['Supine', 'Sequoia']

Similarly, you can also use or in the if statement:

short_or_s_words = [word for word in words if len(word) < 8 or word.startswith("S") ]

You shall see the below result for the short_or_s_words variable:

['Serendipity', 'Supine', 'Solitude', 'Aurora', 'Idyllic', 'Sequoia']

Multiple loop/if statements

Sometimes you may have nested data structure and you would like to make it a flatten structure. For instance, to transform a nested list into a flatten list, you can make use of the list comprehension with multiple loops as per below:

lat_long = [[1.291173,103.810535], [1.285387,103.846082], [1.285803,103.845392]]
[x for pos in lat_long for x in pos]

Python will evaluate these expressions from left to right until the innermost block is reached. You shall see the nested list has been transformed into a flatten list:

[1.291173, 103.810535, 1.285387, 103.846082, 1.285803, 103.845392]

And similarly, if you have multiple sequences to iterate through, you can have multiple for statements in your comprehension, or use the zip function, depending on what kind of result you want to achieve:

[(word, num) for word in words if word.startswith("S") for num in range(4) if num%2 == 0]

The above code would generate the output as per below:

[('Serendipity', 0),
 ('Serendipity', 2),
 ('Supine', 0),
 ('Supine', 2),
 ('Solitude', 0),
 ('Solitude', 2),
 ('Sequoia', 0),
 ('Sequoia', 2)]

If you use zip as per below, it will generate some different results.

[(word, num) for word, num in zip(words, range(len(words))) if word.startswith("S") and num%2 == 0]
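To see the difference, here is a quick sketch with the same words list, written with enumerate (which is equivalent to zipping with the index range): zip pairs each word with its own index, so the filter keeps only the "S" words sitting at an even position, rather than producing the full cross product.

```python
words = ["Serendipity", "Petrichor", "Supine", "Solitude", "Aurora",
         "Idyllic", "Clinomania", "Pluviophile", "Euphoria", "Sequoia"]

# each word is paired with its index, then filtered in a single pass
pairs = [(word, num) for num, word in enumerate(words)
         if word.startswith("S") and num % 2 == 0]
print(pairs)  # [('Serendipity', 0), ('Supine', 2)]
```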

Another practical example would be using a list comprehension to return files of a particular type from the current folder and its sub-folders. For instance, the below code would list out all the ipynb files from the current folder and its sub-folders, excluding the checkpoints folder:

import os

[os.path.join(d[0], f) for d in os.walk(".") if not ".ipynb_checkpoints" in d[0]
             for f in d[2] if f.endswith(".ipynb")]

Generate tuples from list comprehension

As you can see from the above examples, list comprehension supports generating a list of tuples, but do take note that you have to use parentheses, e.g. (word, len(word)), in the expression to indicate that the expected output is a tuple, otherwise there will be a syntax error:

[(word, len(word)) for word in words]

Set comprehension

Similar to list comprehension, the set comprehension uses the same syntax but constructs a set literal. For instance:

words_set = set(words)
short_words_set = {word for word in words_set if len(word) < 8}

The only difference between list comprehension and set comprehension is that the square brackets "[]" change to curly braces "{}". And you shall see the same result as the previous example, except that the data type is now a set:

{'Aurora', 'Idyllic', 'Sequoia', 'Supine'}

And same as list comprehension, any iterable can be used in a set comprehension to derive a new set. So using the list directly as below will also produce the same result as the above example.

short_words_set = {word for word in words if len(word) < 8}

Due to the nature of the set data structure, you shall expect the duplicate values to be removed when forming a new set with set comprehension.
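A quick sketch of that de-duplication behavior, using the word lengths from the same list (several words share a length, so the set ends up smaller than the list):

```python
words = ["Serendipity", "Petrichor", "Supine", "Solitude", "Aurora",
         "Idyllic", "Clinomania", "Pluviophile", "Euphoria", "Sequoia"]

lengths = [len(word) for word in words]         # 10 items, with repeats
unique_lengths = {len(word) for word in words}  # repeats collapse in a set
print(sorted(unique_lengths))  # [6, 7, 8, 9, 10, 11]
```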

Dictionary comprehension

With enough explanation above, I think we can jump directly into the examples, since everything is the same as list and set comprehension except the data type.

Below is an example:

dict_words = {word: len(word) for word in words}

It produces a new dictionary as per below:

{'Serendipity': 11,
 'Petrichor': 9,
 'Supine': 6,
 'Solitude': 8,
 'Aurora': 6,
 'Idyllic': 7,
 'Clinomania': 10,
 'Pluviophile': 11,
 'Euphoria': 8,
 'Sequoia': 7}

And similarly, you can do some filtering with if statements:

s_words_dict = {word: length for word, length in dict_words.items() if word.startswith("S")}

You can see that only the keys starting with "S" were selected to form a new dictionary:

{'Serendipity': 11, 'Supine': 6, 'Solitude': 8, 'Sequoia': 7}

You can check another usage of dictionary comprehension from this post – How to swap key and value in a python dictionary
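The idea from that post can be sketched in one line; note it works cleanly here only because all the values in s_words_dict happen to be unique, since duplicate values would overwrite each other when they become keys:

```python
s_words_dict = {'Serendipity': 11, 'Supine': 6, 'Solitude': 8, 'Sequoia': 7}

# swap keys and values with a dictionary comprehension
swapped = {length: word for word, length in s_words_dict.items()}
print(swapped)  # {11: 'Serendipity', 6: 'Supine', 8: 'Solitude', 7: 'Sequoia'}
```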

Limitations and constraints

With all the above examples, you may find that comprehension makes our code more concise and clearer compared to using map and filter:

list(map(lambda x: x, filter(lambda word: len(word) < 8, words)))

But do bear in mind not to overuse it. Especially if you have more than two loop/if statements, you shall consider moving the logic into a function, rather than putting everything into a single line, which causes readability issues.
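A minimal sketch of that refactoring (the helper name is just for illustration): pulling the combined condition into a named function keeps the comprehension itself down to a single readable clause.

```python
words = ["Serendipity", "Petrichor", "Supine", "Solitude", "Aurora",
         "Idyllic", "Clinomania", "Pluviophile", "Euphoria", "Sequoia"]

def is_short_s_word(word):
    # the multi-part condition now lives in one well-named place
    return len(word) < 8 and word.startswith("S")

short_s_words = [word for word in words if is_short_s_word(word)]
print(short_s_words)  # ['Supine', 'Sequoia']
```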

The Python comprehension is designed to be simple and only supports the for and if statements, so you will not be able to put more complicated logic inside a single comprehension.

Finally, if you have a large data set, you shall avoid using a comprehension, as it may exhaust all the memory and cause your program to crash. An alternative is to use a generator expression, which has similar syntax but produces a generator for later use. For instance:

w_generator = ((word, length) for word, length in dict_words.items() if word.startswith("S"))

It returns a generator and you can consume the items one by one:

for x in w_generator:
    print(x)

You can see the same result would be produced:

('Serendipity', 11)
('Supine', 6)
('Solitude', 8)
('Sequoia', 7)

Conclusion

In this article, we have reviewed the basic syntax of Python comprehension with some practical examples for list comprehension, set comprehension and dictionary comprehension. Although it is very convenient to use Python comprehension, you still need to think about the readability/clarity of your code to other people, as well as the potential memory issues when you are working on a huge data set.