Resources

Best Tips for Python, Data Science and Automation


15 Most Powerful Python One-liners You Can't Skip

Introduction A one-liner in Python is a short code snippet that performs a powerful operation in a single line. One-liners are popular and widely used in the Python community because they make code more concise and, when used well, easier to understand. In this article, I will be sharing some of the most commonly used Python one-liners that will speed up your coding without […]


Web Scraping From Scratch With 3 Simple Steps

Introduction Web scraping or crawling refers to the technique of extracting information from a website and transforming it into structured data for later analysis. There are generally a few reasons why you may need to write a web scraping script to automate the data collection process: There isn’t any public API available for you to […]


Read and write Google Sheet with 5 lines of Python code

Introduction Google Sheets is a very powerful collaboration tool that allows multiple users to work on the same rows of data simultaneously. It also provides fine-grained APIs in various programming languages for your application to connect to and interact with a sheet. Sometimes when you just need some simple operations like reading/writing data […]


Python recipes- suppress stdout and stderr messages

Introduction If you have worked on projects that require API calls to external parties or use 3rd party libraries, you may sometimes run into the problem that you get the correct return results, but they come back with a lot of noise in stdout and stderr. For instance, […]


How to calculate date difference between rows in pandas

Problem: You have some data with date (or numeric) columns, and you already know that you can use the – operator to calculate the difference between columns, but now you would like to calculate the date difference between two consecutive rows. For instance, you have some sample data for GPS tracking, and it has the […]


Master python closure with 3 real-world examples

Introduction

A Python closure is a technique for binding a function to an environment, so that the function keeps access to the variables defined in its enclosing scope. Closures typically appear in programming languages with first-class functions, which means functions can be passed as arguments, returned as values, or assigned to variables.

This definition can sound confusing to Python beginners, and the examples found online are often not intuitive either, because most of them only illustrate the idea with a few print statements, so readers may not get the whole picture of why and how closures should be used. In this article, I will use some real-world examples to explain how to use closures in your code.

Nested function in Python

To understand closures, we must first know that Python supports nested functions, where one function is defined inside another. For instance, inner_func below is the nested function, and outer_func returns its nested function as the return value.

def outer_func():    
    print("starting outer func")
    def inner_func():
        pi = 3.1415926
        print(f"pi is : {pi}")
    return inner_func

When you invoke outer_func, it returns a reference to inner_func, which you can then call. Below is the output when you run it in Jupyter Notebook:

[Screenshot: python closure nested function example]

Now that you have a feel for nested functions, let’s explore how they relate to closures. If we modify the previous function and move the pi variable into the outer function, it surprisingly generates the same result as before.

def outer_func():    
    print("starting outer func")
    #move pi variable definition to outer function
    pi = 3.1415926
    def inner_func():
        print(f"pi is : {pi}")
    return inner_func

You may wonder: the pi variable is defined in the outer function and is local to outer_func, so why is inner_func able to access it when it is not in the global scope? This is exactly where the closure happens: inner_func has full access to the environment (variables) of its enclosing scope. Since inner_func has no local variable called pi, it looks pi up as a nonlocal variable in outer_func.
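To see what the closure actually captured, a minimal sketch like the one below can help. It returns the string instead of printing it, and inspects the standard function attributes __closure__ and __code__.co_freevars; the names show_pi, message, captured and free_vars are only for illustration.

```python
def outer_func():
    pi = 3.1415926

    def inner_func():
        # pi is not defined here, so Python looks it up in the enclosing scope
        return f"pi is : {pi}"

    return inner_func


show_pi = outer_func()          # outer_func has finished running at this point
message = show_pi()             # ...but inner_func can still read pi

# The captured variable lives in a "cell" attached to the returned function.
free_vars = show_pi.__code__.co_freevars
captured = show_pi.__closure__[0].cell_contents

print(message)
print(free_vars, captured)
```

The value of pi survives the call to outer_func because the returned function keeps a reference to it.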

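If you want to modify the value of pi inside inner_func, you have to explicitly declare “nonlocal pi” before you assign to it. A float is immutable, so it cannot be changed in place, and without the declaration the assignment would create a new local variable named pi instead of rebinding the one in outer_func.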
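As a small sketch of the difference the nonlocal declaration makes, the two factory functions below (make_doubler and make_broken_doubler, both hypothetical names) are identical except for that one line.

```python
def make_doubler():
    factor = 1

    def double():
        nonlocal factor          # rebind the variable that lives in make_doubler
        factor = factor * 2
        return factor

    return double


def make_broken_doubler():
    factor = 1

    def double():
        # No nonlocal declaration: the assignment makes factor local to double,
        # so reading it on the right-hand side fails at call time.
        factor = factor * 2
        return factor

    return double


doubler = make_doubler()
results = [doubler(), doubler(), doubler()]

try:
    make_broken_doubler()()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(results, error_name)
```

Each call to doubler updates the same captured variable, while the version without nonlocal raises an error as soon as it is called.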

With the above understanding, now let’s walk through some real-world examples to see how we can use closure in our code.

Hide data with Python closure

Let’s say we want to implement a counter to record how many times a word has been repeated. The first thing you may want to do is define a dictionary in the global scope, and then create a function that adds each word as a key to this dictionary and updates the number of times it has appeared. Below is the sample code:

counter = {}

def count_word(word):    
    global counter
    counter[word] = counter.get(word, 0) + 1
    return counter[word]

The global keyword makes it explicit that count_word uses the “counter” defined in the global scope, not a variable with the same name accidentally defined in the local scope (within this function). Strictly speaking, global is only required when the function assigns a new object to the name; updating the existing dictionary in place, as we do here, works without it.

Sample output:

[Screenshot: python closure word counter sample output]

The above code works as expected, but there are two potential issues. Firstly, the global variable is accessible to every other function, so you cannot guarantee that your data won’t be modified by others. Secondly, the global variable stays in memory for as long as the program is running, so you may not want to create many global variables unless necessary.

To address these two issues, let’s re-implement it with closure:

def word_counter():
    counter = {}
    def count(word):
        counter[word] = counter.get(word, 0) + 1
        return counter[word]
    return count

If you run it from Jupyter Notebook, you will see the below output:

[Screenshot: python closure word counter example output]

With this implementation, the counter dictionary is hidden from public access and the functionality remains the same. (You may notice that it works even after the word_counter function is deleted.)
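A short usage sketch of word_counter follows. The variable names (count_a, count_b and so on) are made up for the demo; it shows that each call to word_counter creates its own private dictionary, and that an existing counter keeps working after the factory function is deleted.

```python
def word_counter():
    counter = {}

    def count(word):
        counter[word] = counter.get(word, 0) + 1
        return counter[word]

    return count


count_a = word_counter()
count_b = word_counter()        # a second, completely independent counter

first = count_a("python")
second = count_a("python")
other = count_b("python")       # count_b has never seen "python" before

del word_counter                # remove the factory function itself
third = count_a("python")       # the closure still holds its own dictionary

print(first, second, other, third)
```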

Convert small class to function with Python closure

Occasionally in your project, you may want to implement a small utility class to do some simple task. Let’s take a look at the below example:

import requests

class RequestMaker:
    def __init__(self, base_url):
        self.url = base_url
    def request(self, **kwargs):
        return requests.get(self.url.format_map(kwargs))

You can see the below output when you call the request method from an instance of RequestMaker:

[Screenshot: python closure small class example]

As you’ve already seen in the word counter example, a closure can also hold data for later use, so the above class can be converted into a function with a closure:

import requests

def request_maker(url):
    def make_request(**kwargs):
        return requests.get(url.format_map(kwargs))
    return make_request

The code becomes more concise and achieves the same result. Take note that in the above code, we are able to pass arguments into the nested function with **kwargs (or *args).

[Screenshot: python closure convert small class to closure]
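To try the same pattern without making a network call, here is a minimal sketch that only builds the URL. The url_builder function and the example.com template are assumptions for illustration, not part of the original code; the format_map call is the same one used in request_maker.

```python
def url_builder(url_template):
    def build(**kwargs):
        # format_map fills the {placeholders} in the template from the keyword arguments
        return url_template.format_map(kwargs)

    return build


search = url_builder("https://example.com/search?q={query}&page={page}")

first_page = search(query="closure", page=1)
second_page = search(query="closure", page=2)

print(first_page)
print(second_page)
```

The template is passed in once and remembered by the closure, so each later call only needs the values that change.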

Replace text with case matching

When you use a regular expression to find and replace some text, you may realize that if you match the text in case-insensitive mode, you are not able to replace it with the proper case. For instance:

import re

paragraph = 'To start Python programming, you need to install python and configure PYTHON env.'
re.sub("python", "java", paragraph, flags=re.I)

Output from above:

[Screenshot: python closure replace with case]

It indeed replaced all the occurrences of “python”, but the case does not match the original text. To solve this problem, let’s implement the replace function with a closure:

def replace_case(word):
    def replace(m):
        text = m.group()
        if text.islower():
            return word.lower()
        elif text.isupper():
            return word.upper()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

In the above code, the replace function has access to the replacement word we passed to replace_case, so once we detect the case of the matched text, we can convert the replacement word to the same case and return it.

So in our original substitute call, let’s pass the function returned by replace_case(“java”) as the second argument. (You may refer to the official Python documentation if you want to know how re.sub behaves when a function is passed in.)

re.sub("python", replace_case("java"), paragraph, flags=re.IGNORECASE)

If we run the above again, you should see that the case has been retained during the replacement, as shown below:

[Screenshot: python closure replace with case]
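For readers without the screenshots, the self-contained sketch below runs both substitutions on the sample paragraph so the two results can be compared side by side. The variable names plain and case_aware are only for illustration.

```python
import re


def replace_case(word):
    def replace(m):
        text = m.group()
        if text.islower():
            return word.lower()
        elif text.isupper():
            return word.upper()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word

    return replace


paragraph = 'To start Python programming, you need to install python and configure PYTHON env.'

# Plain string replacement: every match becomes lowercase "java".
plain = re.sub("python", "java", paragraph, flags=re.IGNORECASE)

# Function replacement: re.sub calls replace(match) for every match.
case_aware = re.sub("python", replace_case("java"), paragraph, flags=re.IGNORECASE)

print(plain)
print(case_aware)
```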

Conclusion

In this article, we have discussed the general reasons why Python closures are used and demonstrated how they can be used in your code with 3 real-world examples. In fact, the Python decorator is also a use case of closures, and I will be discussing this topic in the next article.

 

Pandas tricks – pass multiple columns to lambda

Pandas is one of the most powerful tools for analyzing and manipulating data. In this article, I will share the solutions to a very common issue you might have faced with pandas when dealing with your data: how to pass multiple columns to a lambda or a self-defined function.

Prerequisite

You will have to install pandas on your working environment:

pip install pandas

When dealing with data, you will often want to calculate something based on the values of a few columns, and you may need a lambda or a self-defined function to write the calculation logic. But how do you pass multiple columns to a lambda function as parameters?

Let me use a real-world example to make the issue easier to understand. The table below shows part of the e-commerce delivery charges offered by some company, where the charges are determined by the package size (H+L+W), the package weight and the delivery mode you choose.

Size (cm/kg)                | 3 hours express | Next Day Delivery | Same Day Delivery
<60 CM (H+L+W) & MAX 1KG    | 12              | 8                 | 10
<80 CM (H+L+W) & MAX 5KG    | 15              | 9                 | 11
<100 CM (H+L+W) & MAX 8KG   | 17              | 11                | 13
<120 CM (H+L+W) & MAX 10KG  | 19              | 14                | 16

And assuming we have the below order data, and we want to simulate the delivery charges. Let’s create the data in a pandas dataframe.

import pandas as pd

df = pd.DataFrame({
    "Order#" : ["1", "2", "3", "4"], 
    "Weight" : [5.0, 2.1, 8.1, 7.5], 
    "Package Size" : [80, 45, 110, 90],
    "Delivery Mode": ["Same Day", "Next Day", "Express", "Next Day"]})

If you view the dataframe in Jupyter Notebook (you can sign up here to use it for free), you should see the data as shown below.

[Screenshot: Pandas pass multiple columns to lambda same data]

Let’s also implement a calculate_rate function where we need to pass in the weight, package size, and delivery mode in order to calculate the delivery charges:

def calculate_rate(weight, package_size, delivery_mode):
    #set the charges as $20 since we do not have the complete rate card
    charges = 20
    if weight <=1 and package_size <60:
        if delivery_mode == "Express":
            charges = 12  # 3 hours express rate for this tier in the table above
        elif delivery_mode == "Next Day":
            charges = 8
        else:
            charges = 10
    elif weight <=5 and package_size <80:
        if delivery_mode == "Express":
            charges = 15
        elif delivery_mode == "Next Day":
            charges = 9
        else:
            charges = 11
    elif weight <=8 and package_size <100:
        if delivery_mode == "Express":
            charges = 17
        elif delivery_mode == "Next Day":
            charges = 11
        else:
            charges = 13
    return charges

Pass multiple columns to lambda

Here comes the most important part. You probably already know that a data frame has the apply function, which lets you apply a lambda function to the selected data. We will also use the apply function, and there are a few ways to pass the columns to our calculate_rate function.

Option 1:

We can select the columns involved in our calculation as a subset of the original data frame, and call the apply function on it.

In the apply function, we pass the parameter axis=1 to indicate that the x in the lambda represents a row, so we can unpack x with *x and pass its values to calculate_rate. The order of the selected columns must match the order of the parameters of calculate_rate.

df["Delivery Charges"] = df[["Weight", "Package Size", "Delivery Mode"]].apply(lambda x : calculate_rate(*x), axis=1)

If we check the df again in Jupyter Notebook, you should see the new column “Delivery Charges” with the figures calculated based on the logic we defined in the calculate_rate function.

[Screenshot: Pandas pass multiple columns to lambda]

Option 2:

If you do not want to take a subset of the data frame before applying the lambda, you can also call the apply function directly on the original data frame. In this case, you need to select the columns when passing them to the calculate_rate function. Same as above, we need to specify axis=1 to indicate that the function is applied to each row.

df["Delivery Charges"] = df.apply(lambda x : calculate_rate(x["Weight"], x["Package Size"], x["Delivery Mode"]), axis=1)

This will produce the same result as option 1. You can also use x.Weight instead of x[“Weight”] when passing in the parameter, but attribute access only works for column names that are valid Python identifiers, so columns such as “Package Size” still need the bracket syntax.
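The sketch below puts both options side by side and adds a third variant that is not from the article: a plain list comprehension over zip of the columns, which avoids apply altogether. To keep it short it uses simple_rate, a hypothetical stand-in for calculate_rate with made-up charges.

```python
import pandas as pd

df = pd.DataFrame({
    "Weight": [5.0, 2.1, 8.1, 7.5],
    "Package Size": [80, 45, 110, 90],
    "Delivery Mode": ["Same Day", "Next Day", "Express", "Next Day"],
})


def simple_rate(weight, package_size, delivery_mode):
    # Hypothetical stand-in for calculate_rate; the figures are invented for the demo.
    charge = 10 if delivery_mode == "Express" else 5
    if package_size > 80:
        charge += 3
    if weight > 5:
        charge += 2
    return charge


columns = ["Weight", "Package Size", "Delivery Mode"]

# Option 1: apply on the subset and unpack each row.
option_1 = df[columns].apply(lambda x: simple_rate(*x), axis=1)

# Option 2: apply on the full data frame and pick the columns by name.
option_2 = df.apply(
    lambda x: simple_rate(x["Weight"], x["Package Size"], x["Delivery Mode"]), axis=1
)

# Alternative (an assumption, not from the article): iterate over the columns with zip.
with_zip = [
    simple_rate(w, s, m)
    for w, s, m in zip(df["Weight"], df["Package Size"], df["Delivery Mode"])
]

print(option_1.tolist(), option_2.tolist(), with_zip)
```

All three produce the same charges, so the choice is mostly about readability and, on larger data, speed.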

 

Conclusion

The two options we discussed for passing multiple columns to a lambda are basically the same; the only difference is whether apply is called on the subset or on the original data frame. I have not yet tested them with a large set of data, so there might be some differences in performance that you should keep in mind if you are dealing with a lot of data.

You may also be interested in reading some other articles related to pandas.

 

Pandas Tricks – Calculate Percentage Within Group

Pandas groupby is probably the most frequently used function whenever you need to analyse your data, as it is so powerful for summarizing and aggregating data. Often you still need to do some calculation on the summarized data, e.g. calculating each value as a percentage of the total within a certain category. In this article, I will share some tricks to calculate percentages within groups of your data.

Prerequisite

You will need to install pandas if you have not yet installed it:

pip install pandas
#or conda install pandas

I am going to use a real-world example to demonstrate the kind of problems we are trying to solve. The sample data I am using is from this link, and you can also download it and try it yourself.

Let’s first read the data from this sample file:

import pandas as pd

# You can also replace the below file path with the URL of the file
df = pd.read_excel(r"C:\Sample Sales Data.xlsx", sheet_name="Sheet")

The data will be loaded into a pandas dataframe, and you will see something like the below:

[Screenshot: pandas tricks - calculate percentage within group]

Let’s first calculate the sales amount for each transaction by multiplying the quantity and unit price columns.

df["Total Amount"] = df["Quantity"] * df["Price Per Unit"]

You can see the calculated result like below:

[Screenshot: pandas tricks - calculate percentage within group]

Calculate percentage within group

With the above details, you may want to group the data by sales person and the items they sold, so that you have an overall view of the performance of each person. You can do this with the below:

#df.groupby(["Salesman","Item Desc"])["Total Amount"].sum()
df.groupby(["Salesman", "Item Desc"]).agg({"Total Amount" : "sum"})

And you will be able to see the total amount per item for each sales person:

[Screenshot: pandas tricks - calculate percentage within group]

This is good, as you can see the total sales for each person and product within the given period.

Calculate the best performer

Now let’s see how we can get the % of contribution to total revenue for each sales person, so that we can immediately see who the best performer is.

To achieve that, we first need to group and sum up the “Total Amount” by “Salesman”, similar to what we did previously.

df.groupby(["Salesman"]).agg({"Total Amount" : "sum"})

And then we calculate each sales amount against the total of the entire group. Here we select “Total Amount” as a subset of the aggregated dataframe, and then use the apply function to calculate the current value vs the total. Take note that the default value of axis for the apply function is 0, so x here is the whole “Total Amount” column.

[["Total Amount"]].apply(lambda x: 100*x/x.sum())

With the above, we should be able to get the % of contribution to total sales for each sales person. Let’s also sort the % from largest to smallest:

sort_values(by="Total Amount", ascending=False)

Let’s put it all together and run the below in Jupyter Notebook:

df.groupby(["Salesman"])\
.agg({"Total Amount" : "sum"})[["Total Amount"]]\
.apply(lambda x: 100*x/x.sum())\
.sort_values(by="Total Amount", ascending=False)

You should see the below result with the sales contribution in descending order. (Do not be confused by the column name “Total Amount”; pandas keeps the original column name for the aggregated data. You can rename it to whatever name you want later.)

[Screenshot: pandas tricks - calculate percentage within group for salesman]

 

Calculate the most popular products

Similarly, we can follow the same logic to find the most popular products. This time we want to summarize the sales by product, and calculate the % vs total for both “Quantity” and “Total Amount”. We also want to sort the data in descending order for both fields, e.g.:

df.groupby(["Item Desc"])\
.agg({"Quantity": "sum", "Total Amount" : "sum"})[["Quantity", "Total Amount"]]\
.apply(lambda x: 100*x/x.sum())\
.sort_values(by=["Quantity","Total Amount"], ascending=[False,False])

This will produce the below result, which shows that “Whisky” is the most popular product in terms of quantity sold, but “Red Wine” contributes the most to total revenue, probably because of its higher unit price.

[Screenshot: pandas tricks - calculate percentage within group for products]

 

Calculate best sales by product for each sales person

What if we also want to understand, for each sales person, the % of sales of each product vs his/her total sales amount?

In this case, we first group by “Salesman” and “Item Desc” to get the total sales amount for each group. On top of that, we calculate the % within each “Salesman” group, which is achieved with groupby(level=0, group_keys=False).apply(lambda x: 100*x/x.sum()).

Note: After grouping, the original dataframe becomes a multi-index dataframe, hence level=0 here refers to the top-level index, which is “Salesman” in our case. Passing group_keys=False keeps this index unchanged; without it, recent pandas versions prepend the group key again as an extra index level.

df.groupby(["Salesman", "Item Desc"])\
.agg({"Total Amount" : "sum"})\
.groupby(level=0, group_keys=False).apply(lambda x: 100*x/x.sum())\
.sort_values(by=["Salesman", "Total Amount"], ascending=[True, False])

You will see the below result, which is sorted by % of sales contribution within each sales person.

[Screenshot: pandas tricks - calculate percentage within group - for salesman and product]
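If you do not have the sample file at hand, the sketch below reproduces the two calculations on a tiny made-up data set (the names and amounts are invented). For the within-group percentage it uses groupby(...).transform("sum") instead of apply; this is an alternative I am assuming you may prefer, since it always returns a result with the same index as its input.

```python
import pandas as pd

# Invented sample data with the same column names as the article.
sales = pd.DataFrame({
    "Salesman": ["Ann", "Ann", "Ann", "Bob", "Bob"],
    "Item Desc": ["Whisky", "Red Wine", "Whisky", "Whisky", "Beer"],
    "Total Amount": [100.0, 300.0, 100.0, 150.0, 50.0],
})

# % of total revenue contributed by each sales person.
per_person = sales.groupby("Salesman")[["Total Amount"]].sum()
share_of_total = 100 * per_person / per_person.sum()

# % of each product within the sales of each person.
per_person_item = sales.groupby(["Salesman", "Item Desc"])[["Total Amount"]].sum()
group_totals = per_person_item.groupby(level="Salesman").transform("sum")
share_within_person = 100 * per_person_item / group_totals

print(share_of_total)
print(share_within_person)
```

Within each sales person the percentages add up to 100, which is a quick way to check that the grouping level is the one you intended.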

 

Conclusion

These are just some simple use cases where we calculate percentages within groups with the pandas apply function. You may also be interested to see what else the apply function can do from here.

 

How to send email with attachment via python smtplib

In one of my previous articles, I discussed how to send email from the Outlook application. That assumed you had already installed Outlook and configured your email account on the machine where you want to run your script. In this article, I will share how to automatically send email with attachments via a lower-level API, to be more specific, by using Python smtplib, where you do not need to set up anything in your environment to make it work.

For this article, I will demonstrate how to send an HTML format email with an attachment from a Gmail account. So besides the smtplib module, we will need two other modules: ssl and email.

Let’s get started!

First, you will need to find out the SMTP server and port info to send email via a Google account. You can find this information from this link. For easy reading, I have captured it in the below screenshot.

[Screenshot: codeforests - google smtp server configuration info]

So we are going to use the server smtp.gmail.com and port 587 in our case. (You may search online to find out more about SSL & TLS; we will not discuss them much in this article.)

Let’s start by importing all the modules we need:

import smtplib, ssl
from email.mime.multipart import MIMEMultipart 
from email.mime.text import MIMEText 
from email.mime.application import MIMEApplication

As we are going to send the email in HTML format (which unlocks a lot of features such as adding styles, drawing tables etc.), we will need to use MIMEText. We also need MIMEMultipart and MIMEApplication for the attachment.

Build up the email message

To build up our email message, we need to create a mixed type MIMEMultipart object so that we can send both text and attachments. Next, we specify the from, to, cc and subject attributes.

smtp_server = 'smtp.gmail.com'
smtp_port = 587 
#Replace with your own gmail account
gmail = 'yourmail@gmail.com'
password = 'your password'

message = MIMEMultipart('mixed')
message['From'] = 'Contact <{sender}>'.format(sender = gmail)
message['To'] = 'contact@codeforests.com'
message['CC'] = 'contact@codeforests.com'
message['Subject'] = 'Hello'

You probably do not want anybody to see your hard-coded password here, so you may consider putting the email account info into a separate configuration file. Check my other post on reading/writing configuration files.

For the HTML message content, we will wrap it into the MIMEText, and then attach it to our MIMEMultipart message:

msg_content = '<h4>Hi There,<br> This is a testing message.</h4>\n'
body = MIMEText(msg_content, 'html')
message.attach(body)

Let’s assume you want to attach a pdf file from your c drive. You can read it in binary mode and pass it into MIMEApplication with the MIME subtype set to pdf. Take note of the additional header, where you need to specify the name of your attachment file.

attachmentPath = "c:\\sample.pdf"
try:
	with open(attachmentPath, "rb") as attachment:
		p = MIMEApplication(attachment.read(), _subtype="pdf")
		# pass the file name as a header parameter so it is quoted correctly
		p.add_header('Content-Disposition', 'attachment', filename=attachmentPath.split("\\")[-1])
		message.attach(p)
except Exception as e:
	print(str(e))

If you have a list of the attachments, you can loop through the list and attach them one by one with the above code.
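The loop mentioned above could look roughly like the sketch below. The helper name attach_files and the throwaway files it is demonstrated with are assumptions for illustration; it also uses os.path.basename to get the file name, which works for paths built on the machine where the script runs.

```python
import os
import tempfile
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def attach_files(message, paths):
    """Attach every pdf file in paths to the given MIMEMultipart message."""
    for path in paths:
        with open(path, "rb") as attachment:
            part = MIMEApplication(attachment.read(), _subtype="pdf")
        part.add_header("Content-Disposition", "attachment",
                        filename=os.path.basename(path))
        message.attach(part)
    return message


# Create two placeholder files so the sketch can run anywhere.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for name in ("report.pdf", "invoice.pdf"):
    path = os.path.join(demo_dir, name)
    with open(path, "wb") as fh:
        fh.write(b"%PDF-1.4 placeholder content")
    demo_paths.append(path)

message = MIMEMultipart("mixed")
message.attach(MIMEText("<p>Please see the attached files.</p>", "html"))
attach_files(message, demo_paths)

# The first payload part is the HTML body, the rest are the attachments.
attached_names = [part.get_filename() for part in message.get_payload()[1:]]
print(attached_names)
```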

Once everything is set properly, we can convert the message object into a string:

msg_full = message.as_string()

Send email

Here comes the most important part: we need to create the TLS context and use it to communicate with the SMTP server.

context = ssl.create_default_context()

Then we initialize the connection with the SMTP server, set the TLS context and start the handshaking process.

Next, we authenticate our Gmail account. In the sendmail method, you can specify the sender, the to and cc recipients (as a list), as well as the message string. (cc is optional)

# recipients are kept as ";"-separated strings, matching the headers set earlier
to = message['To']
cc = message['CC']

with smtplib.SMTP(smtp_server, smtp_port) as server:
	server.ehlo()
	server.starttls(context=context)
	server.ehlo()
	server.login(gmail, password)
	server.sendmail(gmail,
				to.split(";") + (cc.split(";") if cc else []),
				msg_full)

print("email sent out successfully")

Once sendmail has completed, leaving the with block sends the QUIT command and disconnects from the server, so there is no need to call server.quit() explicitly.

With all of the above, you should be able to receive the email triggered from your code. You may want to wrap this code into a class, so that you can reuse it as a service library across multiple projects.

 

As always, please share if you have any questions or comments.

How to print colored message on command line terminal window

When you are developing a Python script that prints output messages to the terminal window, you may find it a little boring that all the messages are printed in black and white, especially if some messages are meant as warnings and some are just for information. You may wonder how to print colored messages to make them look different, so that your users pay special attention to the warning or error messages.

In this article, I will share a library that allows you to print colored messages in your terminal.

Let’s get started!

The library I am going to introduce is called colorama, a small and clean library for styling your messages on Windows, Linux and macOS.

Prerequisite :

You will need to install this library, so that you will be able to run the following code in this article.

pip install colorama

To start using this library, you will need to import the modules, and call the init() method at the beginning of your script or your class initialization method.

import colorama
from colorama import Fore, Back, Style
colorama.init()

Print colored message with colorama

The init method also accepts some **kwargs to override its default behaviors. E.g. by default, the style is not reset after printing out a message, so the subsequent messages will follow the same style. You can pass autoreset=True to the init method, so that the style is reset after each print statement.

Below are the options you can use when formatting the font, background and style.

Fore: BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE, RESET.
Back: BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE, RESET.
Style: DIM, NORMAL, BRIGHT, RESET_ALL

To use it in your message, you can do as per below to wrap your messages with the styles:

print(Fore.CYAN + "Cyan messages will be printed out just for info only" + Style.RESET_ALL)
print(Fore.RED + "Red messages are meant to be warnings or errors" + Style.RESET_ALL)
print(Fore.YELLOW + Back.GREEN +  "Yellow messages are debugging info" + Style.RESET_ALL)

This is how it would look in your terminal:

[Screenshot: Python printed colored message with colorama]

As I mentioned earlier, if you don’t set autoreset to True, you need to reset the style at the end of each message, so that different messages can use different styles.

What if you want to apply the styles when asking for the user’s input? Let’s see an example:

print(Fore.YELLOW)
choice = input("Enter YES to confirm:")
print(Style.RESET_ALL)
if choice.upper() in ["YES", "Y"]:
    print(Fore.GREEN + "You have just confirmed to proceed." + Style.RESET_ALL)
else:
    print(Fore.RED + "You did not enter yes, let's stop here" + Style.RESET_ALL)

By wrapping the input call between Fore.YELLOW and Style.RESET_ALL, the same style is applied to both the prompt printed by your script and the text entered by the user.

Let’s put all of the above into a script and run it in the terminal to check how it looks.

[Screenshot: Python printed colored message with colorama]

Yes, that’s exactly what we want to achieve! Now you can wrap your print statements into a function, e.g. print_colored_message, so that you do not need to repeat the code everywhere.
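One possible shape for such a helper is sketched below. The LEVEL_COLORS mapping, the level names and the separate colorize function are my own assumptions; only Fore, Style and init come from colorama.

```python
import colorama
from colorama import Fore, Style

colorama.init()

# Hypothetical mapping from message level to foreground color.
LEVEL_COLORS = {
    "info": Fore.CYAN,
    "warning": Fore.YELLOW,
    "error": Fore.RED,
}


def colorize(text, level="info"):
    """Return text wrapped in the color for level, followed by a style reset."""
    return LEVEL_COLORS.get(level, "") + text + Style.RESET_ALL


def print_colored_message(text, level="info"):
    print(colorize(text, level))


print_colored_message("Backup finished", "info")
print_colored_message("Disk almost full", "warning")
print_colored_message("Backup failed", "error")
```

Keeping the string building in colorize separate from the printing makes the helper easy to reuse, for example when writing the same text to a log.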

As always, please share if you have any comments or questions.

 

Python how to unpack tuple, list and dictionary

There are various cases where you want to unpack Python objects such as a tuple, list or dictionary into individual variables, so that you can easily access the individual items. In this article, I will share how to unpack these different Python objects and how this can be useful when working with *args and **kwargs in functions.

Let’s get started.

Unpack python tuple objects

Let’s say we have a tuple object called shape which describes the height, width and channel of an image. We can unpack it into 3 separate variables as below:

shape = (500, 300, 3)
height, width, channel = shape
print(height, width, channel)

You can see that each item inside the tuple has been assigned to an individual variable with a meaningful name, which increases the readability of your code. Below is the output:

500 300 3

It’s definitely more elegant than accessing each item by index, e.g. shape[0], shape[1], shape[2].

What if we just need to access a few items in a big tuple which has many items? Here we need to introduce _ (a conventional name for a throwaway variable) and * (which collects an arbitrary number of items).

For example, if we just want to extract the first and the last item from the below tuple, we can let the rest of the items go into the throwaway variable.

toto_result = (4,11,14,23,28,47,24)
first, *_, last = toto_result
print(first, last)

So the above will give the below output:

4 24

If you are curious about what is inside “_”, you can print it out, and you will see that it is actually a list of the items between the first and the last item.

[11, 14, 23, 28, 47]
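The starred name does not have to sit in the middle. As a brief sketch (the variable names are only for illustration), the same tuple can be split from the front or from the back:

```python
toto_result = (4, 11, 14, 23, 28, 47, 24)

# Take the first two items and collect everything else.
first, second, *others = toto_result

# Collect everything except the last item.
*initial, last = toto_result

# Keep only the first item and throw the rest away.
first_only, *_ = toto_result

print(first, second, others)
print(initial, last)
print(first_only)
```

Whatever the position, the starred name always receives a list, even when the source is a tuple.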

The most popular use case of packing and unpacking is passing parameters to a function that accepts an arbitrary number of arguments (*args). Let’s look at an example:

def sum(*numbers):
    total = 0
    for n in numbers:
        total += n
    return total

The above sum function accepts any number of arguments and sums up the values. (Note that defining it shadows Python’s built-in sum function for the rest of the script.) The * here packs all the arguments passed to this function into a tuple called numbers. If you want to sum up the values of all the items in toto_result, directly passing in toto_result would not work.

toto_result = (4,11,14,23,28,47,24)
#sum(toto_result) would raise TypeError

So what we can do is unpack the items from the tuple and then pass them to the sum function:

total = sum(*toto_result)
print(total)
#output should be 151

Unpack python list objects

Unpacking a list object is similar to unpacking a tuple object. If we replace the tuple with a list in the above example, it works in exactly the same way.

shape = [500, 300, 3]
height, width, channel = shape
print(height, width, channel)
#output shall be 500 300 3

toto_result = [4,11,14,23,28,47,24]
first, *_, last = toto_result
print(first, last)
#output shall be 4 24

total = sum(*toto_result)
print(total)
#output should be also 151

Unpack python dictionary objects

Unlike a list or tuple, unpacking a dictionary is probably only useful when you want to pass the dictionary as keyword arguments into a function (**kwargs).

For instance, with the below function, you can pass in all your keyword arguments one by one.

def print_header(**headers):
    for header in headers:
        print(header, headers[header])

print_header(Host="www.codeforests.com", referer="https://www.codeforests.com")

Or if you have a dictionary like below, you can just unpack it and pass to the function:

headers = {'Host': 'www.codeforests.com', 'referer' : 'https://www.codeforests.com'}
print_header(**headers)

It will generate the same result as previously, but the code is more concise.

Host www.codeforests.com
referer https://www.codeforests.com

With this unpacking operator, you can also combine multiple dictionaries as per below:

headers = {'Host': 'www.codeforests.com', 'referer' : 'https://www.codeforests.com'}
extra_header = {'user-agent': 'Mozilla/5.0'}

new_header = {**headers, **extra_header}

The output of the new_header will be like below:

{'Host': 'www.codeforests.com',
 'referer': 'https://www.codeforests.com',
 'user-agent': 'Mozilla/5.0'}
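
One detail worth knowing when merging this way: if the same key appears in more than one dictionary, the value from the right-most dictionary wins. Below is a minimal sketch of that behaviour; the override dictionary is made up purely for illustration:

```python
headers = {'Host': 'www.codeforests.com', 'referer': 'https://www.codeforests.com'}
# hypothetical override values, not taken from the example above
override = {'referer': 'https://www.google.com', 'user-agent': 'Mozilla/5.0'}

merged = {**headers, **override}
print(merged['referer'])
# https://www.google.com

# the original dictionaries are left untouched
print(headers['referer'])
# https://www.codeforests.com
```

So when combining default settings with user-supplied ones, it probably makes sense to put the defaults first and the overrides last.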

Conclusion

The unpacking operation is very useful, especially when dealing with *args and **kwargs. One thing worth noting is the unnamed variable (_) which I mentioned earlier. Please use it with caution: as you may have noticed, the Python interactive interpreter also uses _ to store the result of the last evaluated expression, so do take note of this potential conflict. See the below example:

codeforests interactive interpreter conflicts

As always, any comments or questions are welcome.

common python mistakes for beginners

8 Common Python Mistakes You Shall Avoid

Introduction

Python is a very powerful programming language with easily understandable syntax, which allows you to learn it by yourself even if you do not come from a computer science background. Throughout the learning journey, you may still make lots of mistakes due to a lack of understanding of certain concepts. Learning how to fix these mistakes will further enhance your understanding of the fundamentals as well as your programming skills.

In this article, I will summarize a few common Python mistakes that many people encounter when they start their learning journey, and how they can be fixed or avoided.

Reload Modules after Modification

Have you ever wasted hours debugging an issue, only to realize that you were not running your modified source code? This usually happens to beginners who do not realize that a module is loaded into memory only once, when the import statement is first executed. So if you modify some code in a separate module that your current session has already imported, you have to reload the module to pick up the latest changes.

To reload a module, you can use the reload function from the importlib module:

from importlib import reload

# some module which you have made changes
import externallib

reload(externallib)
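
To see the effect end to end, here is a small self-contained sketch. It writes a throwaway module (demo_config is a made-up name) into a temporary folder, imports it, edits the file and then reloads it. Bytecode caching is switched off only to keep this quick demo deterministic:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# write a throwaway module named demo_config (hypothetical) to a temp folder
module_dir = tempfile.mkdtemp()
module_file = Path(module_dir) / "demo_config.py"
module_file.write_text("VERSION = 1\n")
sys.path.insert(0, module_dir)
sys.dont_write_bytecode = True  # avoid a stale cached .pyc in this quick demo

import demo_config
print(demo_config.VERSION)
# 1

# simulate editing the source file after it has been imported
module_file.write_text("VERSION = 2  # edited\n")

import demo_config  # importing again does nothing, the loaded module is reused
print(demo_config.VERSION)
# 1

importlib.reload(demo_config)
print(demo_config.VERSION)
# 2
```

Keep in mind that objects created before the reload (for example instances of classes from the old module) still refer to the old code.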

Naming Conflict for Global and Local Variables

Imagine you have defined a global variable named app_config, and you would like to use it inside the init_config function as per below:

app_config = "app.ini"

def init_config():
    app_config = app_config or "default.ini"
    print(app_config)

You may expect it to print out “app.ini” since it’s already defined globally, but surprisingly you would get an “UnboundLocalError” exception because the variable app_config is referenced before assignment. If you comment out the assignment statement and just print out the variable, you would see the value printed out correctly. So what is going on here?

The exception occurs because Python treats a name as local to a function whenever the function assigns to it anywhere in its body. Since the local variable and the global variable have the same name, the global variable is shadowed in the local scope. Thus Python throws an error saying your local variable app_config is used before it’s initialized.
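
The sketch below reproduces the behaviour in a way that can be run as a script. The read_only function is an extra helper added here only for contrast:

```python
app_config = "app.ini"

def init_config():
    # the assignment makes app_config local to the whole function body,
    # so the right-hand side reads a local name that has no value yet
    app_config = app_config or "default.ini"
    return app_config

def read_only():
    # no assignment here, so the lookup falls through to the global scope
    return app_config

try:
    init_config()
    error_name = None
except UnboundLocalError as err:
    error_name = type(err).__name__

print(error_name)
# UnboundLocalError
print(read_only())
# app.ini

# the compiler has recorded app_config as a local variable of init_config
print("app_config" in init_config.__code__.co_varnames)
# True
```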

To solve this naming conflict, use different names for your global and local variables to avoid any confusion, e.g.:

app_config = "app.ini"

def init_config():
    config = app_config or "default.ini"
    print(config)

Checking Falsy Values

Examining the truthiness of a variable in an if or while statement can also go wrong sometimes. It’s common for Python beginners to mix up None and other falsy values and eventually write some buggy code. E.g., assume you want to trigger a selling alert when price is not None and is below 5:

def selling_alert(price):
    if price and price < 5:
        print("selling signal!!!")

Everything looks fine, but when you test with price = 0, you would not get any alert:

selling_alert(0)
# Nothing has been printed out

This is because both None and 0 are evaluated as False by Python, so the print statement is skipped although price < 5 is true.

In Python, empty containers such as “” (empty string), list, set, dict, tuple etc. are all evaluated as False, and so is zero in any numeric format like 0 and 0.0. So to avoid such issues, you should be very clear about whether your logic needs to differentiate None from other falsy values, and split the logic if necessary, e.g.:

if price is None:
   print("invalid input")
elif price < 5:
   print("selling signal!!!")
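
As a quick check, the following sketch lists the common falsy values and wraps the corrected logic into a function. Returning strings (including the extra "no signal" case) is my own simplification to make the result easy to inspect:

```python
# values that Python treats as False in an if or while condition
falsy_values = [None, False, 0, 0.0, "", [], (), {}, set()]
print(all(not value for value in falsy_values))
# True

def selling_alert(price):
    # only None is rejected, 0 is handled as a valid price
    if price is None:
        return "invalid input"
    if price < 5:
        return "selling signal!!!"
    return "no signal"  # hypothetical fallback, not in the original example

print(selling_alert(0))
# selling signal!!!
print(selling_alert(None))
# invalid input
```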

Default Value and Variable Binding

A default value can be used when you want to make a function parameter optional but still flexible to change. Imagine you need to implement a logging function with an event_time parameter, which should default to the current timestamp when it is not given. You may happily write some code as per below:

from datetime import datetime

def log_event_time(event, event_time=datetime.now()):
    print(f"log this event - {event} at {event_time}")

And you would expect that, as long as event_time is not provided in a log_event_time call, the event is logged with the timestamp of when the function is invoked. But if you test it with the below:

log_event_time("check-in")

# log this event - check-in at 2021-02-21 14:00:56.046938

log_event_time("check-out")

# log this event - check-out at 2021-02-21 14:00:56.046938

You will see that all the events were logged with the same timestamp. So why did the default value for event_time not work?

To answer this question, you should know that default values are bound at function definition time. In the above example, the default value of event_time was evaluated once when the function was defined, and the same value is used each time the function is called.
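
One way to convince yourself is to look at the function’s __defaults__ attribute, where Python keeps the already evaluated default values. A minimal sketch (the function returns the message instead of printing it, only to make it easy to compare):

```python
from datetime import datetime

def log_event_time(event, event_time=datetime.now()):
    return f"log this event - {event} at {event_time}"

# the default was evaluated once and stored on the function object
stored_default = log_event_time.__defaults__[0]
print(isinstance(stored_default, datetime))
# True

# every call without event_time reuses that same stored object
first = log_event_time("check-in")
second = log_event_time("check-out")
print(first.endswith(str(stored_default)), second.endswith(str(stored_default)))
# True True
```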

To fix the issue, you can assign None as the default value and overwrite event_time inside the function when it is None. For instance:

def log_event_time(event, event_time=None):
    event_time = event_time or datetime.now()
    print(f"log this event - {event} at {event_time}")

Similar variable binding mistakes can happen when you implement lambda functions. For further reading, you may check my previous post why your lambda function does not work for more examples.

Default Value for Mutable Objects

Another mistake Python beginners tend to make is to use a mutable object as the default value of a function parameter. For instance, the user_list parameter in the add_white_list function below:

def add_white_list(user, user_list=[]):
    user_list.append(user)
    return user_list

You may expect that when user_list is not given, an empty list will be created, the new user added to it and the list returned. It works as expected for the below:

my_list = add_white_list('Jack')

# ['Jack']

my_list = add_white_list('Jill', my_list)

#['Jack', 'Jill']

But when you want to start with an empty list again, you would see an unexpected result:

my_new_list = add_white_list('Joe')
# ['Jack', 'Jill', 'Joe']

From the previous variable binding example, we know that the default value for user_list is created only once, at function definition time. And since a list is mutable, the changes made to that list object are visible to all subsequent function calls.
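
The shared list can be observed directly through the function’s __defaults__ attribute, as in this small sketch:

```python
def add_white_list(user, user_list=[]):
    user_list.append(user)
    return user_list

print(add_white_list.__defaults__)
# ([],)

first = add_white_list('Jack')
second = add_white_list('Joe')

# both calls returned the very same list object stored as the default
print(first is second)
# True
print(add_white_list.__defaults__)
# (['Jack', 'Joe'],)
```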

To solve this problem, we give None as the default value for user_list and create a new list inside the function when user_list is not given during the call, e.g.:

def add_white_list(user, user_list=None):
    if user_list is None:
        user_list = []
    user_list.append(user)
    return user_list

You may wonder why the same problem did not show up with datetime.now(), since it returns a class instance, which one might assume is mutable as well. If you check the Python documentation, you will see that datetime objects are actually immutable.

Misunderstanding of Python Built-in Functions

Python has a lot of powerful built-in functions and some of them have similar names. If you do not spend some time reading through the documentation, you may end up using them in the wrong way.

For instance, you know that both the built-in sorted function and the list sort method can be used to sort a sequence. But occasionally, you may make the below mistake:

random_ints = [80, 53, 7, 92, 30, 31, 42, 10, 42, 18]

# The sorting is done in-place and the method returns None,
# so even_integers_first ends up being None
even_integers_first = random_ints.sort(key=lambda x: x%2)

# Sorting is not done in-place, function returns a new list
sorted(random_ints)

Similarly for the reverse method and the reversed function:

# The reversing is done in-place and the method returns None,
# so assigning the result back to random_ints would replace the list with None
result = random_ints.reverse()
# Reversing is not done in-place, function returns a new iterator
reversed(random_ints)

And for the list append and extend methods:

crypto = ["BTC", "ETH"]

# the new list will be added as 1 element to crypto list
crypto.append(["XRP", "BNB"])

print(crypto)
#['BTC', 'ETH', ['XRP', 'BNB']]

# the elements of the new list will be added one by one to crypto list
crypto.extend(["UNI"])

print(crypto)
# ['BTC', 'ETH', ['XRP', 'BNB'], 'UNI']

Modifying Elements While Iterating

When iterating over a sequence, you may want to filter out some elements based on certain conditions.

For instance, say you want to iterate over the below list of integers and remove any element that is below 5. You would probably write the below code:

a = [1, 2, 3, 4, 5, 6, 2]

for b in a:
    if b < 5:
        a.remove(b)

But when checking the list a afterwards, you would see the result is not what you expected:

print(a)
# [4, 5, 6, 2]

This is because the for statement creates an iterator over the list, which simply keeps track of the index of the next element. Since we are deleting elements from the original list, the remaining elements shift to the left and the iterator skips some of them, which causes the unexpected result. To fix this issue, you can make use of a list comprehension as per below if your filter condition is not complex:

[b for b in a if b >= 5]

Or if you wish, you can use filterfalse together with a lambda function:

from itertools import filterfalse
list(filterfalse(lambda x: x < 5, a))
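
Both options above build a new list. If other parts of your code hold a reference to the original list and it has to be updated in place, two possible approaches (my additions, not the only ways) are slice assignment and looping over a shallow copy:

```python
a = [1, 2, 3, 4, 5, 6, 2]
alias = a  # another reference to the same list object

# build the filtered list first, then replace the contents of a in place
a[:] = [b for b in a if b >= 5]
print(a)
# [5, 6]
print(alias is a, alias)
# True [5, 6]

# alternative: loop over a shallow copy while removing from the original
c = [1, 2, 3, 4, 5, 6, 2]
for b in c[:]:
    if b < 5:
        c.remove(b)
print(c)
# [5, 6]
```

The slice assignment is usually the cheaper of the two, since remove has to search the list on every call.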

Re-iterate An Exhausted Generator

Many Python learners start writing code without understanding the difference between a generator and a re-iterable object such as a list. This can lead to the mistake of re-iterating an exhausted generator. For instance, with the below generator, you can print out the values as a list:

some_gen = (i for i in range(10))
print(list(some_gen))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

And sometimes you may forget that you have already iterated over the generator once, and when you try to execute the below:

for x in some_gen:
    print(x)

You would not be able to see anything printed out.

To fix this issue, you can save the results into a list first if you’re not dealing with a lot of data, or you can use the tee function from the itertools module to split the generator into multiple independent iterators before consuming it, so that you can iterate multiple times:

from itertools import tee

# create 3 independent iterators from a fresh (not yet consumed) generator
some_gen = (i for i in range(10))
x, y, z = tee(some_gen, 3)
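
The order of operations matters here: tee has to be called before anything is consumed from the generator. The sketch below uses a shorter generator to show both the working case and what happens when tee is applied too late:

```python
from itertools import tee

some_gen = (i for i in range(5))

# split the generator before anything has been consumed from it
x, y = tee(some_gen, 2)
print(list(x))
# [0, 1, 2, 3, 4]
print(list(y))
# [0, 1, 2, 3, 4]

# the original generator is exhausted by now, so calling tee on it comes too late
late_copy, = tee(some_gen, 1)
print(list(late_copy))
# []
```

Note that tee buffers the values internally, so if one iterator is fully consumed before the others start, saving the results into a list is usually the simpler choice.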

Conclusion

In this article, we have reviewed some common Python mistakes that you may encounter when you start writing Python code. There are definitely more mistakes you could make if you simply jump into coding without understanding the fundamentals. But as the old saying goes, no pain no gain; ultimately you will get there when you drill down into all the mistakes and clear all the roadblocks.

Python one-liners with list comprehension and ternary operation

15 Most Powerful Python One-liners You Can’t Skip

Introduction

A one-liner in Python refers to a short code snippet that achieves some powerful operation. One-liners are popular and widely used in the Python community as they make the code more concise and easier to understand. In this article, I will share some of the most commonly used Python one-liners that would definitely speed up your coding without compromising clarity.

Let’s start with the basics.

Ternary operations

A ternary operation allows you to choose a value based on a condition being true or false. Instead of writing a few lines of if/else statements, you can simply do it with one line of code:

x = 1
y = 2

result = 1 if x > 0 and y > x else -1
print(result)
# 1

#re-assign x to 6 if it is evaluated as False
x = x or 6

Assign values for multiple variables

You can assign values for multiple variables simultaneously as per below. (You may want to check this article to understand what is going on under the hood)

key, value = "user", "password"

print(key, value)
#user password

Swap variables

To swap the values of two variables, simply perform the below, without the temp variable which is usually required in other programming languages like Java or C.

key, value = value, key

print(key, value) 
#password user

Swap elements in a list

Imagine you have a list of users as per below, and you would like to swap the first element with last element:

users = ["admin", "anonymous1", "anonymous2"]

Since a list is mutable, you can re-assign the values of the first and last elements by swapping them as per below:

users[0], users[2] = users[2], users[0]
# or users[0], users[-1] = users[-1], users[0]

print(users)
#['anonymous2', 'anonymous1', 'admin']

Furthermore, if the requirement is to swap the elements at the odd and even positions in a list, e.g. in the below list:

numbers = list(range(10))
#[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

We would like to swap the elements of the 2nd and 1st, 4th and 3rd, and so on. It can be achieved by performing the below list slicing with assignment operation:

numbers[::2], numbers[1::2] = numbers[1::2], numbers[0::2]

print(numbers)
#[1, 0, 3, 2, 5, 4, 7, 6, 9, 8]

Replace elements in a list

To further extend the above example, if we want to replace the element at every odd (or even) index of a list, for instance with 0, we can do the re-assignment as per below (starting again from a fresh list):

numbers = list(range(10))
numbers[1::2] = [0]*len(numbers[1::2])

print(numbers)
#[0, 0, 2, 0, 4, 0, 6, 0, 8, 0]

Of course there is an alternative way with list comprehension, which we shall touch on later.

Generate list with list comprehension

By using a list comprehension, you can easily generate a new list with certain filtering conditions from an existing sequence. For instance, the below will generate a list of the even numbers between 1 and 20 (exclusive):

even_nums = [i for i in range(1, 20) if i%2 == 0]

print(even_nums)
#[2, 4, 6, 8, 10, 12, 14, 16, 18]

Create sub list from a list

Similarly, you can get a sub list from the existing list with the list comprehension as per below:

[i for i in even_nums if i < 5]
#[2, 4]

Manipulating elements in the list

With a list comprehension, you can also transform the elements of your list into another format. For instance, to convert the integers to letters:

alphabets = [chr(65+i) for i in even_nums]
# ['C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S']

Or convert the upper case into lower case:

[i.lower() for i in alphabets]
#['c', 'e', 'g', 'i', 'k', 'm', 'o', 'q', 's']

And all the above can be done without list comprehension as well:

list(map(lambda x : chr(65+x), even_nums))
#['C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S']

list(map(str.lower, alphabets))
#['c', 'e', 'g', 'i', 'k', 'm', 'o', 'q', 's']

Another real-world example would be to use a list comprehension to list all the .ipynb files in the current folder and its sub folders (excluding the checkpoint files):

import os

# os.walk yields (folder path, sub folder names, file names) for each folder
[f for d in os.walk(".") if ".ipynb_checkpoints" not in d[0]
             for f in d[2] if f.endswith(".ipynb")]

Flatten a list of sequences

If you have a list of sequence objects as per below, and you would like to flatten them into a 1-dimensional list:

a = [[1,2], [3,4], [5,6,7]]

You can use multiple for expressions in list comprehension to flatten it:

b = [y for x in a for y in x]

print(b)
#[1, 2, 3, 4, 5, 6, 7]

Alternatively, you can make use of the itertools module to get the same result:

import itertools

list(itertools.chain.from_iterable(a))

Ternary operation with list comprehension

In the earlier “Replace elements in a list” example, we discussed how to replace the elements at the odd indexes of a list with 0. Here is the alternative using a list comprehension in conjunction with a ternary expression:

numbers = list(range(10))
[y if i % 2 == 0 else 0 for i, y in enumerate(numbers)]
#[0, 0, 2, 0, 4, 0, 6, 0, 8, 0]

Generate a dictionary with dictionary comprehension

To derive a dictionary from a list, you can use dictionary comprehension as per below:

even_nums_dict = {chr(65+i):v for i, v in enumerate(even_nums)}

#{'A': 2, 'B': 4, 'C': 6, 'D': 8, 'E': 10, 'F': 12, 'G': 14, 'H': 16, 'I': 18}

Generate a set with set comprehension

A similar operation is also available for the set data type when you want to derive a set from the elements of a list:

even_nums_set = {chr(65+i) for i in even_nums}
#{'C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S'}

When using the built-in set data type, you can expect that it only keeps unique values. For instance, you can use a set to remove duplicate values:

a = [1,2,2,4,6,7]
unique = set(a)

print(unique)
#{1, 2, 4, 6, 7}
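
A set does not promise to keep the original order of the elements. If the order matters, one commonly used alternative (an addition to the example above) is dict.fromkeys, since dictionary keys are unique and keep their insertion order in Python 3.7 and later:

```python
a = [7, 2, 2, 6, 1, 4, 1]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(a))
print(unique_in_order)
# [7, 2, 6, 1, 4]
```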

More Python comprehension examples can be found here.

Read file into generator

Reading a file can be done in a one-liner as per below:

text = (line.strip() for line in open('response.html', 'r'))

Take note that parentheses are used in the above generator expression rather than []; when [] is used, it returns a list. Also note that a file opened this way is never closed explicitly.
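
If you want to keep the lazy behaviour but also make sure the file gets closed, one option is a small generator function built around a with block. This is only a sketch: read_lines is a made-up helper name, and a temporary sample file is created so that the snippet can run on its own:

```python
import os
import tempfile

def read_lines(path):
    # hypothetical helper: yields stripped lines and closes the file when done
    with open(path, 'r') as f:
        for line in f:
            yield line.strip()

# create a small sample file so the sketch can run on its own
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as tmp:
    tmp.write("<html>\n  <body>hello</body>\n</html>\n")
    sample_path = tmp.name

lines = list(read_lines(sample_path))
print(lines)
# ['<html>', '<body>hello</body>', '</html>']

os.remove(sample_path)
```

It is no longer a one-liner, but the file handle is released as soon as the generator is exhausted.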

One-liner with Python -c command

Sometimes you may want to run code snippets without entering the Python interactive mode. You can execute the code with the python -c option in a command line window. For instance, to check the current Python version:

python -c "import sys; print(sys.version.split()[0])"

#3.7.2

Or check the value of an environment variable (os.pathsep is “;” on Windows and “:” on Linux or macOS):

python -c "import os;print(os.getenv('PATH').split(os.pathsep))"

Conclusion

In this article, we have reviewed some commonly used Python one-liners which can greatly improve your code readability and coding productivity. There are definitely more to be covered, and every Pythonista has his/her own list of favorite one-liners. Sometimes you will also need to consider the code performance before you use a one-liner, rather than simply pursuing conciseness.

As a general rule of thumb, you should not use or invent something that is confusing, difficult to read, or of no benefit to either readability or productivity.

python visualize google trends data with word cloud

Python – Visualize Google Trends Data in Word Cloud

Christmas is just around the corner, and the snowfall, beautiful festive lights and joyful songs from last year are still floating in your mind. But this year, things are unusual due to Covid-19. A lot of celebration events have been cancelled or suspended, and people are advised to avoid gatherings and stay at home as much as possible. Although staying at home has become the new norm, there is still a way to find out what people are thinking about during this festive season, since nowadays most of us search a lot on Google every day. With a few lines of Python code, we will be able to extract and visualize the data from Google Trends.

Let’s dive into the code examples.

Python to get Google trends data

To get the search trends from Google, we will need to use a Python package – pytrends. It’s not an official API for Google Trends, but it provides a convenient way to automatically download the same Google Trends data that we can get manually from the Google Trends website.

You can use the pip command to install the package:

pip install --upgrade pytrends

And import the necessary modules at the beginning of our code:

from pytrends.request import TrendReq

To use it, we initiate the request object by providing the language for searching as well as the time zone information. For instance, below I specify English as the language and -480 as the time zone offset, which stands for UTC+8. The default value for this time zone offset is 360 (US CST, i.e. UTC-6), so you can roughly see how this offset is calculated: it is the number of minutes behind UTC.

pytrend = TrendReq(hl='en-US', tz=-480)
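
The sign of this offset is easy to get wrong. Assuming the convention described above holds (minutes behind UTC), a tiny helper like the one below can do the conversion. Note that tz_offset_minutes is a made-up name and not part of pytrends:

```python
def tz_offset_minutes(utc_offset_hours):
    # hypothetical helper: UTC+8 -> -480, UTC-6 (US CST) -> 360
    return int(-utc_offset_hours * 60)

print(tz_offset_minutes(8))
# -480
print(tz_offset_minutes(-6))
# 360
```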

To get the search trends for a particular keyword, we specify it in a keyword list. For example, we use “christmas” to see what people have searched for on Google related to this keyword. There are a few more parameters you need to specify in the build_payload function in order to narrow down the results:

cat – The category you are interested in, you can see the full list here.

timeframe – The date range when the searches happened. You can specify the range as the past X hours/days/months/years (the list of available options can be seen on the Google Trends web page) or even as a specific start date and end date. For our case, we use “now 7-d” for the past 7 days.

geo – The geolocation, which can be a two-letter country code, or left empty to see the global results

gprop – The source, which you can leave empty for web search; other options are images, news, youtube, or froogle

Let’s build up our query as per below:

kw_list = ["christmas"]
pytrend.build_payload(kw_list, cat=0, timeframe='now 7-d', geo='SG', gprop='')

With all these criteria, we can check which related topics people in Singapore searched for on Google. The related_queries function gives you a dictionary of both top and rising queries related to the keywords:

trends = pytrend.related_queries()

If you examine the trends variable, you shall see something similar to below:

python visualize google trends data with word cloud

 

The dictionary holds both the “top” and “rising” results as pandas dataframe objects, and you can access the top queries as per below:

df_sg = trends["christmas"]["top"]

Examining the first few records in df_sg, you can see that people in Singapore are still in a celebration mood, as most of the records are related to greetings, light shows, gifts etc.

python visualize google trends data with word cloud

On the other hand, let’s also take a look at the search trends for the UK, since it has just announced some new travel restrictions.

pytrend.build_payload(kw_list, cat=0, timeframe='now 7-d', geo='GB', gprop='')
trends = pytrend.related_queries()
df_gb = trends["christmas"]["top"]

Examining the df_gb variable, you can see that some people have started worrying about the new rules and restrictions for this Christmas, although the majority of the search results are still about the festival celebration.

python visualize google trends data with word cloud

 

Visualize the results in word cloud

Since we have all the keywords that people searched for and their popularity, the most straightforward way to visualize them is to use a word cloud to generate a picture. To do so, we will need another Python package – wordcloud, a library for generating word cloud images. You also need some supporting packages like PIL and numpy for manipulating the images.

You can use pip command to install these packages if you do not have them yet:

pip install --upgrade wordcloud
pip install --upgrade Pillow
pip install --upgrade numpy

Let’s import all the necessary modules into our code:

from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from PIL import Image
import os
import numpy as np

From the previous section, we already have the search keywords in a dataframe. wordcloud supports both text strings and word frequencies; for simplicity, let’s convert only the keywords into a space-separated string and ignore the value (popularity).

text = ' '.join(df_sg["query"].to_list())

As all the keywords contain “christmas”, we should filter out this word before generating the word cloud. The wordcloud package comes with a predefined set of words to be excluded, and you can add more words as per the below:

stopwords = set(STOPWORDS) 
stopwords.add("christmas")

Now let’s use this featured image as our background for generating the word cloud. We load it as a 3-dimensional array to serve as the background mask later:

bg_mask = np.array(Image.open(os.path.join(os.getcwd(), "christmas tree.jpg")))

With all these ready, we can initiate a word cloud object with the below parameters. The names of the parameters are quite self-explanatory, so I will not go through them one by one. You can check the official documentation here.

wc = WordCloud(
    width = 600, 
    height = 1000,    
    background_color = 'white',
    colormap = 'rainbow',
    mask = bg_mask,
    stopwords = stopwords,
    max_words = 1000,
    max_font_size = 150,
    min_font_size = 15,
    contour_width = 2, 
    contour_color = 'dodgerblue'
)

Then we can supply our words to the generate_from_text function, which processes the text and generates the image. Next we save the output into an image file as per the below code:

wc.generate_from_text(text)
wc.to_file("SG_christmas_cloud.jpg")

When opening the output image file, you shall see something like the below. Isn’t that cool?

python visualize google trends data with word cloud

Similarly, when you pass in the UK search results and generate the word cloud, you would see that “covid” and “rules” are what UK people are most concerned about.

python visualize google trends data with word cloud

Note: since we are passing in a text string, the frequency is based on how many times each word is repeated in the text rather than on the popularity from Google.
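
To make the note above more concrete, the sketch below counts words the simple way with collections.Counter. The sample queries are made up, and the counting only approximates what wordcloud does internally:

```python
from collections import Counter

# hypothetical sample queries, not real Google Trends results
queries = ["christmas lights", "christmas gifts", "christmas tree lights"]
text = ' '.join(queries)

# count how often each word appears, ignoring the excluded word
stopwords = {"christmas"}
counts = Counter(word for word in text.split() if word not in stopwords)
print(counts.most_common())
# [('lights', 2), ('gifts', 1), ('tree', 1)]
```

Here “lights” gets the biggest weight simply because it appears in two queries, regardless of how popular each query is. If you would rather weight the words by the value column, wordcloud also offers a generate_from_frequencies function which accepts a dictionary of words and weights.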

Conclusion

In this article, we have discussed how to use pytrends to automatically get the Google search data for any particular keyword and then use wordcloud to visualize the information. It only covers some basic usage of these two packages, so you may check their documentation further to understand what else they provide. One thing to take note of is that pytrends uses some scraping techniques to get the data from Google Trends, so it may break whenever there is a structural change in the way Google makes the requests or sends the responses. Frequent code upgrades are therefore required from the project team. By the way, they are looking for maintainers, just in case you are interested.

 

web scraping with python requests and lxml

Web Scraping From Scratch With 3 Simple Steps

Introduction

Web scraping or crawling refers to the technique of extracting information from a website and transforming it into structured data for later analysis. There are generally a few reasons why you may need to implement a web scraping script to automate the data collection process:

  • There isn’t any public API available for you to get data from the source sites
  • The information, such as an exchange rate, is updated from time to time and cannot be managed manually
  • The final data you need is pieced together from multiple sites; and so on

Before you decide to implement a scraping script, you also need to make sure that you are not violating the terms of use of the data you are going to scrape. Some sites do not allow scraping robots. This article is intended for educational purposes, to help you understand the overall process of web scraping, so we will assume you already know the implications of web scraping and the possible legal issues around how the data is used.

Scraping a website can sometimes be difficult, depending on how the target website is designed and where the data resides. But generally you can split the process into 3 steps. Let’s walk through them one by one.

Understand the structure of your target website

As the first step, you shall take a quick look at your target website to see how the front end interacts with the backend, and how the data is populated to the web page. To keep our example simple, let’s assume user authentication is not required and our target is to extract the price change for the top 20 cryptocurrencies from coindesk for further analysis.

The first thing we shall do is to understand how this information is organized on the website. Below is the screenshot of the data presented on the web page:

web scraping with python requests and lxml

In the Chrome browser, if you right click on the web page and inspect the HTML elements, you will see that the entire data table is under <section class="cex-table">…</section>. You can verify this by hovering your mouse over this element: a light blue overlay appears on the data table as per below:

web scraping in python with requests and lxml

Next, you may want to inspect each text field on the page to further understand how the table header and records are arranged. For instance, when you check the “Asset” text field, you would see the below HTML structure:

<section class="cex-table">
	<section class="thead">
		<div>...</div>
		<div class="tr-wrapper">
			<div class="tr-left">
				<div class="tr">
					<div>...</div>
					<div style="flex:7" class="th">
						<span class="cell">
						<i class="sorting-icon">
						</i>
						<span class="cell-text">Asset</span>
						</span>
					</div>
				</div>
			</div>
		</div>
		...
	</section>
</section>

And similarly you can find the structure of the first row in the table body as per below:

<section class="tbody">
	<section class="tr-section">
		<a href="/price/bitcoin">
			<div class="tr-wrapper">
				<div class="tr-left">
					<div class="tr">
						<div style="flex:2" class="td">
							<span class="cell cell-rank">
							<strong>01</strong>
							</span>
						</div>
						<div style="flex:7" class="td">
							<span class="cell cell-asset">
							<img>...</img>
							<strong class="cell-asset-title">Bitcoin</strong>
							<span class="cell-asset-iso">BTC</span>
							</span>
						</div>
					</div>
				</div>
			</div>
		</a>
	</section>
</section>

You may notice that the majority of these HTML elements do not have an id or name attribute as a unique identifier, but the CSS class (the “class” attribute) is quite consistent for the same row of data. So in this case, we shall consider using the CSS class as a reference to find our data elements.

Locate and parse the target data element with XPath

With this initial understanding of the HTML structure of our target website, we can start to find a way to locate the data elements programmatically.

For this demonstration, we will use the requests and lxml libraries to send the HTTP requests and parse the results. There are other packages for parsing the DOM, such as beautifulsoup, but personally I find XPath expressions more straightforward for locating an element, although the syntax may not be as intuitive as the beautifulsoup way.

Below is the pip command if you do not have these two packages installed:

pip install requests
pip install lxml

Let’s import the packages and send a GET request to our target URL:

import requests
from lxml import html

target_url = "https://www.coindesk.com/coindesk20"
result = requests.get(target_url)

Our target URL does not require any parameters. In case you need to pass in parameters, you can pass them via the params argument as per below:

payload = {"q" : "bitcoin", "s" : "relevant"}
result = requests.get("https://www.coindesk.com/search", params=payload)

The result is a response object, which has a status_code attribute to indicate whether a correct response has been returned from the target website. To simplify the code, let’s assume we always get the correct response, with the returned HTML available in string format from the text attribute.

We then pass our HTML string to lxml and use it to parse the DOM tree as per below:

tree = html.fromstring(result.text)

Now we come to the most important step: we need to use XPath syntax to locate the data elements we want and extract the data.

Since the id or name attributes are not available for these elements, we will need to use the CSS class to locate our data elements. To locate the table header, we need to perform the below:

  • Find the section tag with the class “cex-table” in the entire DOM
  • Find its child section node with the class “thead”
  • Further find its child div node with the class “tr-wrapper”

Below is how it looks in XPath syntax:

table_header = tree.xpath("//section[@class='cex-table']/section[@class='thead']/div[@class='tr-wrapper']")

It scans through the entire DOM tree for elements matching this structure and returns a list of the matched nodes.

If everything goes well, the table_header list should contain only 1 element, which is the div with the “tr-wrapper” class. If it returns multiple nodes, you may need to recheck your path expression and fine-tune it until it returns only the node you need.

From the wrapper div, there are still a few levels before we reach the nodes with the text. But you may notice that all the data fields we need are under span tags with the class “cell-text”. So we can locate all these span tags by their CSS class and extract their text with the text() function. Below is how it works in an XPath expression:

headers = table_header[0].xpath(".//span[@class='cell']/span[@class='cell-text']/text()")

Note that “.” means to start from the current node, and “//” selects matching descendants at any depth below it, so “.//” searches only within the current node rather than the whole document.
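To see the effect of starting a path from the current node, here is a small offline sketch. It deliberately swaps lxml for the standard library's xml.etree.ElementTree, which supports only a subset of XPath but is enough for this point, and the HTML snippet is a made-up miniature of the table, not the real page.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the table structure (assumption, not the real page)
snippet = """
<section class="cex-table">
  <section class="thead">
    <div class="tr-wrapper">
      <span class="cell"><span class="cell-text">Asset</span></span>
      <span class="cell"><span class="cell-text">Price</span></span>
    </div>
  </section>
  <section class="tbody">
    <span class="cell"><span class="cell-text">Bitcoin</span></span>
  </section>
</section>
"""

root = ET.fromstring(snippet)

# Step into the header section first, then search below that node only
thead = root.find("./section[@class='thead']")
headers = [node.text for node in thead.findall(".//span[@class='cell-text']")]

# The same relative path from the root also picks up the body cell
all_cells = [node.text for node in root.findall(".//span[@class='cell-text']")]

print(headers)
print(all_cells)
```

The same path returns different results depending on the node it starts from, which is why the header and body rows are located first and searched separately.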

If you examine the headers now, you can see all the column headers are extracted into a list as per below:

['Asset',
 'Price',
 'Market Cap',
 'Total Exchange Volume',
 'Returns (24h)',
 'Total Supply',
 'Category',
 'Value Proposition',
 'Consensus Mechanism']

Let’s move on to the table body. Following the same logic, we can locate the sections with the “tr-section” class using the below syntax:

table_body = tree.xpath("//section[@class='cex-table']/section[@class='tbody']/section[@class='tr-section']")

This means that we have collected all the row nodes in the table body, and we can now loop through the rows to get the elements. We will again use the CSS class to locate our elements, but the “Asset” column actually contains a few child nodes with different classes, so we need to handle it separately from the rest of the columns. Below is the code to extract the data row by row and add it into a record list:

records = []
for row in table_body:    
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    spans = row.xpath(".//div[contains(@class,'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    row_data = ranks + tokens + assets + rest_cols
    records.append(row_data)

Note that we are using “contains” in order to match nodes with a class like “cell cell-rank”, and text_content() to extract all the text from the current node and its child nodes.

Occasionally you may find that the number of columns extracted does not tally with the original column headers because some header columns are merged or hidden, such as the ranking and token ticker columns above. So let’s also give them the column names “Rank” and “Token”:

column_header = ["Rank", "Token"] + headers

Save the scraping result

With both the header and data ready, we can easily load the data into pandas as per below:

import pandas as pd
df = pd.DataFrame(records, columns=column_header)

You can see the below result in the pandas dataframe, which looks pretty good except that some formatting is still needed to convert all the amounts into a proper number format.

web scraping to get cryptocurrency price
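The number formatting mentioned above could be handled with a small cleaning helper. The sketch below assumes the scraped amounts are strings such as "$13,021.50" or "-1.25%". The exact formats on the page are not shown in this article, so treat both the sample values and the to_number helper as illustrations.

```python
def to_number(text):
    """Strip currency/percent decorations and return a float, or None if not numeric."""
    cleaned = text.strip().replace("$", "").replace(",", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical sample values, not taken from the scraped table
samples = ["$13,021.50", "-1.25%", "N/A"]
print([to_number(s) for s in samples])
```

With pandas, the same helper could be applied to a column with df["Price"].map(to_number).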

Or you can write the scraped data into a csv file with the csv module:

import csv
with open("token_price.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(column_header)
    for row in records:
        writer.writerow(row)

Limitations & Constraints

In a real scraping project, you may encounter more complicated scenarios than directly getting the data from a GET request. So it’s better to understand the constraints and limitations of the approach described above.

  • Going through the authentication process can be time-consuming with requests

If your target website requires authentication before you can retrieve the data, you may need to create a session and send multiple POST/GET requests to the server in order to get yourself authorized. Depending on how complicated the authentication process is, you will need to understand which parameters have to be supplied and how the requests are chained together. This process may take some time and effort.

  • You cannot trigger JavaScript code to get your data

If the response from your target website returns some JavaScript code to populate the data, or you need to trigger some JavaScript function in order to have the data populated on the web page, the requests package simply will not work, because it does not execute JavaScript.

For both scenarios, you may consider using selenium, which I have mentioned in one of my past posts. It has a headless mode where you can simulate user actions such as keying in user credentials or clicking buttons without actually showing the browser, and you can also execute JavaScript code to interact with the web page. The downside is that you will have to periodically upgrade your driver file to match the browser’s version.

Conclusion

In this article, we have gone through a very basic example of scraping data with the requests and lxml packages, and we have also discussed a few limitations where you may start looking for alternatives such as selenium or even the scrapy framework to handle more complicated scenarios. No matter which libraries you choose, the fundamentals remain the same. Hope this article gives you some hints on how to start your web scraping journey.

 

gspread read and write google sheet

Read and write Google Sheet with 5 lines of Python code

Introduction

Google Sheet is a very powerful tool for collaboration: it allows multiple users to work on the same rows of data simultaneously. It also provides fine-grained APIs in various programming languages for your application to connect and interact with Google Sheet. Sometimes, when you just need some simple operations like reading/writing data from a sheet, you may wonder if there are any higher level APIs that can complete these simple tasks easily. The short answer is yes. In this article, we will discuss how to read/write Google Sheet in 5 lines of Python code.

Prerequisites

As a prerequisite, you will need a Google service account in order to go through the Google cloud service authentication for your API calls. You can follow the guide from here for a free account setup. Once you have completed all the steps, you shall have a JSON file similar to the below, which contains your private key for accessing the Google cloud services. You may rename it to “client_secret.json” for our later use.

{
  "type": "service_account",
  "project_id": "new_project",
  "private_key_id": "xxxxxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\xxxxx\n-----END PRIVATE KEY-----\n",
  "client_email": "xxx@developer.gserviceaccount.com",
  "client_id": "xxx",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxx%40developer.gserviceaccount.com"
}

From this JSON file, you can also find the email address of your newly created service account. If you need to access your existing Google Sheet files, you will need to grant this email address access to those files.

Note: There is a limit of 100 requests per 100 seconds for the free Google service account. You may need to upgrade to a paid account if this free quota is not sufficient for your business.

In addition to the service account, we need two libraries, google-auth and gspread, which are the core modules for authentication and for manipulating the Google Sheet file.

Below is the pip command to install the two libraries:

pip install gspread
pip install google-auth

Lastly, let’s create a Google Sheet file named “spreadsheet1” with some sample data from the US 2020 election results:

gspread write and read google sheet

Once you have all above ready, let’s dive into our code examples.

Read Google Sheet data into pandas

Let’s first import the necessary libraries at the top of our script:

import gspread
from google.oauth2.service_account import Credentials
import pandas as pd

To get the access to Google Sheet, we will need to define the scope (API endpoint). For our case, we specify the scope to read and write the Google Sheet file.  If you would like to restrict your program from updating any data, you can specify spreadsheets.readonly and drive.readonly in the scope.

scope = ['https://www.googleapis.com/auth/spreadsheets',
        'https://www.googleapis.com/auth/drive']

And then we can build a Credentials object with our JSON file and the above defined scope:

creds = Credentials.from_service_account_file("client_secret.json", scopes=scope)

Next, we call the authorize function from gspread library to pass in our credentials:

client = gspread.authorize(creds)

This one line of code takes care of the authentication under the hood. Once authenticated, it establishes the connection between your application and the Google cloud service. From there, you can send a request to open your spreadsheet file by specifying the file name:

google_sh = client.open("spreadsheet1")

Besides opening file by name, you can also use open_by_key with the sheet ID or open_by_url with the URL of the sheet.

If proper access has been given to your service account, you gain control of your Google Sheet and can go on to open a particular worksheet tab.

For instance, below returns the first sheet of the file:

sheet1 = google_sh.get_worksheet(0)

With the above, you can simply read all records into a list of dictionaries with the get_all_records function, and pass it into a pandas DataFrame:

df = pd.DataFrame(data=sheet1.get_all_records())

Now if you examine the df object, you shall see the below output:

gspread write and read google sheet

So that’s it! With a few lines of code, you’ve successfully downloaded your data from Google Sheet into pandas, and now you can do whatever you need in pandas.

If you have duplicate column names in your Google Sheet, you may consider using the get_all_values function to get all the values as a list of lists, so that the duplicate columns remain:

df = pd.DataFrame(data=sheet1.get_all_values())

All the column and row labels will default to RangeIndex as per below:

gspread write and read google sheet
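Since the first row ends up as ordinary data in this case, you may want to promote it to column labels yourself. The sketch below works on a plain list of lists shaped like the return value of get_all_values. The sample rows and the "_2" suffix convention are assumptions made for illustration.

```python
# Made-up rows shaped like the list of lists returned by get_all_values()
values = [
    ["State", "Votes", "Votes"],
    ["Wisconsin", "1630673", "1610065"],
    ["Pennsylvania", "3458312", "3376499"],
]

header, *data = values

# Make duplicate header names unique by appending a counter suffix
seen = {}
unique_header = []
for name in header:
    seen[name] = seen.get(name, 0) + 1
    unique_header.append(name if seen[name] == 1 else f"{name}_{seen[name]}")

records = [dict(zip(unique_header, row)) for row in data]
print(unique_header)
print(records[0])
```

The cleaned pieces can then go into pandas with pd.DataFrame(data, columns=unique_header).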

Now let’s take a further look at what else we can achieve with this library.

Add/Delete work sheets

With gspread, you can easily add new sheets or duplicate existing sheets. Below is an example that creates a new sheet named “Sheet2” with the max number of rows and columns specified. The index parameter tells Google Sheet where you want to insert your new sheet. index=0 indicates that the new sheet is to be inserted as the first sheet.

sheet2 = google_sh.add_worksheet(title="Sheet2", rows="10", cols="10", index=0)

Duplicating an existing sheet can be done by specifying the source sheet ID and the new sheet name:

google_sh.duplicate_sheet(source_sheet_id=google_sh.worksheet("Sheet1").id, 
    new_sheet_name="Votes Copy")

Similarly, you can delete an existing sheet by passing in the worksheet object as per below:

google_sh.del_worksheet(sheet2)

If you would like to re-order your worksheets, you can do it with reorder_worksheets function. Assuming you want the sheet2 to be shown before sheet1:

google_sh.reorder_worksheets([sheet2, sheet1])

Read/Write Google Sheet cells

The worksheet object has the row_count and col_count properties which indicate the max rows and columns in the sheet file. But it’s not that useful when you want to know how many rows and columns of actual data you have:

print(sheet1.row_count, sheet1.col_count)
#1000 26

To have a quick view of the number of rows and columns of your data, you can use:

#Assuming the first row and first column have the full data
print("no. of columns:", len(sheet1.row_values(1)))
#no. of columns: 3
print("no. of rows:", len(sheet1.col_values(1)))
#no. of rows: 8

To access the individual cells, you can either specify the row and column indexes, or use the A1 notation. For instance:

#Access the row 1 and column 2
sheet1.cell(1, 2).value
# or using A1 notation
sheet1.acell('B1').value

Note: both the row/column indexes and the A1 notation are one-based, similar to MS Excel.
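To make the relationship between the two addressing styles concrete, here is a minimal pure-Python sketch that turns a one-based (row, column) pair into A1 notation. The to_a1 helper is written for illustration only. gspread ships its own rowcol_to_a1 utility in gspread.utils, which is the better choice in real code.

```python
def to_a1(row, col):
    """Convert one-based row/column indexes into an A1-style label."""
    if row < 1 or col < 1:
        raise ValueError("row and col are one-based and must be >= 1")
    letters = ""
    while col > 0:
        # Columns behave like a base-26 number without a zero digit
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return f"{letters}{row}"


print(to_a1(1, 2))
print(to_a1(3, 27))
```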

Similarly, you can update the value for each individual cell as per below:

sheet1.update_cell(1, 2, "BIDEN VOTES")
#or
sheet1.update_acell("B1", "BIDEN VOTES")

To update multiple cells, you can use the worksheet update function with the range and the values to be updated. For instance, the below code replaces the values in row 8:

#Keyword arguments are used because the positional order differs between gspread versions
sheet1.update(range_name="A8:C8", values=[["Texas", 5261485, 5261485]])

Or use batch_update to update multiple ranges:

sheet1.batch_update([{"range": "A8:C8", 
                    "values" : [["Texas", 5261485, 5261485]]},
                     {"range": "A9:C9", 
                    "values" : [["Wisconsin", 1630673, 1610065]]},
                    ])

Or use append_rows to add rows after the last row of data:

sheet1.append_rows(values=[["Pennsylvania", 3458312, 3376499]])

Besides updating cell values, you can also update the cell format such as the font, background etc. For instance, the below will update the background of the 6th row to red color:

sheet1.format("A6:C6", 
              {"backgroundColor": {
                  "red": 1.0,
                  "green": 0.0,
                  "blue": 0.0,
                  "alpha": 1.0
              }
    })

Note that Google uses an RGBA color model in which each channel is a number between 0 and 1.

Below is how it looks in Google Sheet:

gspread write and read google sheet , format google sheet
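Colors are often written as hex codes such as #ff0000, while the format call above expects channel values between 0 and 1. The helper below is a hypothetical convenience, not part of gspread, that converts a hex string into a dictionary shaped like the backgroundColor value used above.

```python
def hex_to_sheet_color(hex_color):
    """Convert '#RRGGBB' into red/green/blue floats between 0 and 1."""
    hex_color = hex_color.lstrip("#")
    if len(hex_color) != 6:
        raise ValueError("expected a color like '#RRGGBB'")
    # Each pair of hex digits is one channel in the 0-255 range
    red, green, blue = (int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return {"red": red, "green": green, "blue": blue}


print(hex_to_sheet_color("#ff0000"))
```

The result could then be passed as {"backgroundColor": hex_to_sheet_color("#ff0000")}.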

Sometimes it might be difficult to locate the exact index of the cell to be updated. You can find a cell by its text with the find function, which returns the first match.

cell = sheet1.find("Michigan")
print(cell.row, cell.col, cell.value)
#6 1 Michigan

You can also use a Python regular expression to find all matches. For instance, to find all cells with text ending in “da”:

import re
#The trailing $ anchors the match to the end of the cell text
query = re.compile(".*da$")
cells = sheet1.findall(query)
print(cells)
#[<Cell R4C1 'Florida'>, <Cell R7C1 'Nevada'>]
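How a pattern is anchored matters when the goal is "ends with". The sketch below uses plain re on a made-up list of cell values. How gspread applies the compiled pattern internally is not covered in this article, so the example only demonstrates the regular expression behaviour itself.

```python
import re

# Made-up cell values; "Idaho" contains "da" but does not end with it
states = ["Florida", "Nevada", "Idaho", "Michigan"]

loose = re.compile(".*da")    # matches any text that contains "da"
strict = re.compile(".*da$")  # matches only text that ends with "da"

loose_hits = [s for s in states if loose.search(s)]
strict_hits = [s for s in states if strict.search(s)]

print(loose_hits)
print(strict_hits)
```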

Add/Remove permission for your Google Sheet

Adding or removing permission for a particular Google Sheet file is also super easy with gspread. Before adding/removing permission, you shall check which users currently have access to your file. You can use the list_permissions function to retrieve the list of users with their roles:

google_sh.list_permissions()

To give access of your file to other users, you can use:

#perm_type can be : user, group or domain
#role can be : owner, writer or reader
google_sh.share('someone@gmail.com', perm_type='user', role='reader')

When you check your file again, you shall see the email address you shared is added into the list of authorized users.

To revoke the access of a particular user, you can use the remove_permissions function. By default, it removes all the access that has been granted to the user. Passing a role limits the removal to that role:

google_sh.remove_permissions('someone@gmail.com', role="writer")

When the role you’ve specified does not match any of the roles the user currently has, the function returns without doing anything.

Conclusion

The Google Sheet API provides comprehensive interfaces for manipulating sheets, from normal operations like reading/writing data to validation, formatting, building pivot tables and charts etc. Often, though, you just need some simple APIs to read and write Google Sheet files.

In this article, we have gone through the gspread package, which provides high level APIs for working with Google Sheets for exactly this purpose. With gspread, you are able to open an existing sheet or create a new Google Sheet file, read/write the data, and do simple formatting. There are also a few other libraries, such as gspread-formatting and gspread-pandas, which offer extensive functionality for sheet formatting and for exchanging data between sheets and pandas dataframes. You may take a look at them in case you need something more complicated than what we have covered here.

python suppress stdout and stderr Photo by Yeshi Kangrang on Unsplash

Python recipes- suppress stdout and stderr messages

Introduction

If you have worked on projects that require API calls to external parties or use 3rd party libraries, you may sometimes run into the problem that you get the correct return results, but they come with a lot of noise in stdout and stderr. For instance, the developer may have left a lot of “for your info” messages in the standard output, or there may be warning or error messages due to version differences in some of the dependency libraries.

All these messages flood your console, and since you have no control over the source code, you cannot change its behavior. To reduce the noise, one option is to suppress the stdout and stderr messages while making the function call. In this article, we will discuss some recipes to suppress the messages in such scenarios.

Unexpected messages from stdout and stderr

To further illustrate the issue, let’s take a look at the below example. Assuming we have below check_result function in a python file externallib.py, and this represents an external library.

import sys

def check_result():
    print("stdout message from externallib")
    print("stderr message from externallib", file=sys.stderr)
    return True

If you import the module and call the check_result function, you will definitely get True as the result, but you will also see both the stdout and stderr messages in your console.

import externallib

result = externallib.check_result()

Both stdout and stderr messages were printed out in the console:

Python suppress stdout and stderr

Suppress stdout and stderr with context manager

To stop these messages from printing out, we need to suppress stdout and stderr by redirecting the output into a devnull file (similar to /dev/null in Linux, which is typically used for disposing of unwanted output streams) right before calling the function, and then restoring the original streams after the call has completed.

To do that, the best approach is to use a context manager, so that the streams are automatically redirected on entry and restored on exit.

So let’s implement a context manager to perform the below:

  • Use suppress_stdout and suppress_stderr flags to indicate which stream to be suppressed
  • Save the state of the sys.stdout and sys.stderr in the __enter__ function, and redirect them to devnull based on the suppress_stdout and suppress_stderr flags
  • Restore back the state for sys.stdout and sys.stderr in __exit__

Below is the code snippet:

import os, sys

class suppress_output:
    def __init__(self, suppress_stdout=False, suppress_stderr=False):
        self.suppress_stdout = suppress_stdout
        self.suppress_stderr = suppress_stderr
        self._stdout = None
        self._stderr = None
        self._devnull = None

    def __enter__(self):
        # Keep a handle on devnull so that it can be closed on exit
        self._devnull = open(os.devnull, "w")
        if self.suppress_stdout:
            self._stdout = sys.stdout
            sys.stdout = self._devnull

        if self.suppress_stderr:
            self._stderr = sys.stderr
            sys.stderr = self._devnull
        return self

    def __exit__(self, *args):
        # Restore the original streams, then release the devnull handle
        if self.suppress_stdout:
            sys.stdout = self._stdout
        if self.suppress_stderr:
            sys.stderr = self._stderr
        self._devnull.close()

And if you call the check_result again within this context manager as per below:

with suppress_output(suppress_stdout=True, suppress_stderr=True):
    result = externallib.check_result()
print(result)

You would not see any messages printed out from check_result function, and the return result would remain as True. This is exactly what we are expecting!

Since we are using a context manager, you may wonder whether contextlib can make our code more concise. So let’s make use of the contextlib package, and re-implement the above context manager with the contextmanager decorator as per below:

import os, sys
from contextlib import contextmanager

@contextmanager
def nullify_output(suppress_stdout=True, suppress_stderr=True):
    stdout = sys.stdout
    stderr = sys.stderr
    # The with block closes devnull once the streams are restored
    with open(os.devnull, "w") as devnull:
        try:
            if suppress_stdout:
                sys.stdout = devnull
            if suppress_stderr:
                sys.stderr = devnull
            yield
        finally:
            if suppress_stdout:
                sys.stdout = stdout
            if suppress_stderr:
                sys.stderr = stderr

With the above decorator-based implementation, you shall get the same result when you call the function:

with nullify_output(suppress_stdout=True, suppress_stderr=True):
    result = externallib.check_result()
print(result)

Everything seems to be good as of now, so are we going to conclude here? Wait, there is something we can still improve: instead of totally discarding the messages, can we collect them into a log file?

Suppress stdout and stderr with redirect_stdout and redirect_stderr

If you scroll further down the Python contextlib documentation, you will notice two context managers related to stdout and stderr: redirect_stdout and redirect_stderr. Their names are quite self-explanatory, and each accepts a file-like object as the redirect target.

With these two functions, we can make our code even more concise, and at the same time easily collect the output messages into our log file.

from contextlib import redirect_stdout, redirect_stderr
import io, logging
logging.basicConfig(filename='error.log', level=logging.DEBUG)

f = io.StringIO()
with redirect_stdout(f), redirect_stderr(f):
    result = externallib.check_result()
logging.info(f.getvalue())
print(result)

If you check the log file, you shall see the stdout and stderr messages were collected correctly.

suppress stdout and stderr with redirect_stdout or redirect_stderr

Of course, if you wish to continue disposing these messages, you can still specify the target file as devnull, so that nothing will be collected.
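The devnull variant mentioned above could look like the sketch below. The noisy function is a stand-in for an external library call such as check_result, and call_quietly is a hypothetical wrapper name.

```python
import os
import sys
from contextlib import redirect_stdout, redirect_stderr


def noisy():
    """Stand-in for an external function that prints to both streams."""
    print("stdout noise")
    print("stderr noise", file=sys.stderr)
    return True


def call_quietly(func, *args, **kwargs):
    """Run func with stdout and stderr sent to devnull, then return its result."""
    with open(os.devnull, "w") as devnull:
        with redirect_stdout(devnull), redirect_stderr(devnull):
            return func(*args, **kwargs)


stdout_before = sys.stdout
result = call_quietly(noisy)
print(result)
```

Both streams are restored automatically when the with block exits, even if the wrapped function raises an exception.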

Conclusion

With all the above examples and explanations, hopefully you are able to use the code snippets and customize them to meet the objectives of your own project. Directly disposing of stderr may not always be a good idea, since it may contain information that is useful for later troubleshooting, so I would recommend collecting it into a log file as much as possible and doing proper housekeeping to ensure the logs do not grow too fast.

If you are looking for a solution to suppress certain known Python exceptions, you may check out the suppress context manager from the contextlib package.
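As a quick illustration of suppress, the sketch below ignores the FileNotFoundError raised when deleting a file that does not exist. The file name is a made-up placeholder chosen only for this example.

```python
import os
import tempfile
from contextlib import suppress

# Placeholder path that is not expected to exist (assumption)
missing_path = os.path.join(tempfile.gettempdir(), "file_that_should_not_exist_12345.tmp")

with suppress(FileNotFoundError):
    os.remove(missing_path)
    # Only reached when the file existed and was removed
    print("file removed")

# Execution continues here whether or not the file was there
still_running = True
print(still_running)
```

Only the listed exception types are swallowed, and any other exception still propagates as usual.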

python datetime

Python datetime – the 9 tips you shall know

Introduction

Dealing with date and time is quite common whenever you are writing Python scripts. The simplest use cases would be logging some events with a timestamp, or saving a file with date and time info in the file name. It can be challenging when you have more complicated scenarios, such as handling time zones, daylight saving and recurrences. The built-in Python datetime module is capable of handling most date and time operations, and there are third party libraries that help you manage the time zone and daylight saving challenges. In this article, we will discuss some tips for using the Python datetime module as well as the third party package dateutil.

Prerequisite

If you do not have dateutil installed yet, you shall install the latest version into your working environment. Below is the pip command to install the package:

pip install python-dateutil

Let’s get started!

Various ways to get current date and time

The most common reason for needing a Python datetime object is to get the current date or time. There are plenty of ways to get the current date and time from the Python datetime module, for instance:

>>>from datetime import datetime, timedelta, timezone
>>>import time

#Local timezone
>>>datetime.now()
datetime.datetime(2020, 10, 24, 21, 31, 11, 761666)
>>>datetime.today()
datetime.datetime(2020, 10, 24, 21, 31, 12, 139719)

>>>datetime.fromtimestamp(time.time())
datetime.datetime(2020, 10, 24, 21, 31, 12, 559183)

#Not suggested
>>>datetime.fromtimestamp(time.mktime(time.localtime()))
datetime.datetime(2020, 10, 24, 21, 33, 5)

#UTC timezone
>>>datetime.now(timezone.utc)
datetime.datetime(2020, 10, 24, 13, 31, 13, 443442, tzinfo=datetime.timezone.utc)
#Returns a naive datetime without tzinfo; deprecated since Python 3.12
>>>datetime.utcnow()
datetime.datetime(2020, 10, 24, 13, 31, 14, 240517)

Most of the above methods return a datetime object in local machine time, and the last two methods get the date and time in the UTC time zone. Note that only datetime.now(timezone.utc) attaches the time zone info to the result.

If you only need the date info, you can discard the time portion by using the date() method as per below:

>>>datetime.now().date()
datetime.date(2020, 10, 24)

Get year, month, day and time from Python datetime

From the datetime object, you can easily get each individual component such as year, month, day, hour etc. The below examples show you how to extract the date and time components from the datetime object, as well as the weekday or week number information:

>>>TODAY = datetime.today()
>>>TODAY.year, TODAY.month, TODAY.day, TODAY.hour, TODAY.minute, TODAY.second, TODAY.microsecond
(2020, 10, 24, 21, 36, 35, 842689)

#Monday is 0 and Sunday is 6
>>>TODAY.weekday()
5
#Monday is 1 and Sunday is 7
>>>TODAY.isoweekday()
6
#Return year, weekno, and weekday
>>>TODAY.isocalendar()
(2020, 43, 6)

Take note of the start day when you get the weekday as a number: weekday() returns 0 for Monday, while isoweekday() returns 1 for Monday. Some programming languages use 0 for Sunday. In that case, you can use the %w format code to get a weekday number that starts from 0 on Sunday.
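The three numbering schemes can be compared side by side. The short sketch below uses two fixed dates, a Saturday and a Sunday, so that the output does not depend on the day you run it.

```python
from datetime import datetime

saturday = datetime(2020, 10, 24)
sunday = datetime(2020, 10, 25)

for day in (saturday, sunday):
    # weekday(): Monday=0, isoweekday(): Monday=1, %w: Sunday=0
    print(day.date(), day.weekday(), day.isoweekday(), day.strftime("%w"))

# %w returns a string, so convert it when a number is needed
sunday_first = int(sunday.strftime("%w"))
```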

Date plus or minus X days

Very often you will need to do some arithmetic calculation on the dates such as calculating number of days backward or forward from the current date. To do that, you will need to use the timedelta class. Below is the syntax to create a timedelta object, you can specify number of weeks, days, hours, minutes etc. for initialization:

>>>timedelta(days=1, seconds=50, microseconds=1000, milliseconds=1000, minutes=10, hours=6, weeks=1)
datetime.timedelta(days=8, seconds=22251, microseconds=1000)

All the arguments passed to timedelta will be eventually converted into days, seconds and microseconds.

So to calculate today plus 1 day, you can specify the timedelta with 1 day and add it up to the current date:

>>>tomorrow = datetime.today().date() + timedelta(days=1)
>>>tomorrow
datetime.date(2020, 10, 25)

Similarly, calculating the date backwards can be achieved by specifying the arguments as negative numbers:

>>>yesterday = datetime.today().date() + timedelta(days=-1)
>>>yesterday
datetime.date(2020, 10, 23)

When calculating the difference between two dates, it will also return a timedelta object:

>>>tomorrow - yesterday
datetime.timedelta(days=2)

Get the first day of the month

With the replace() method, you can replace the year, month or day of the current date and return a new date. The most commonly used scenario would be getting the first day of the month based on current date, e.g.:

>>>datetime.today().date().replace(day=1)
datetime.date(2020, 10, 1)
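The last day of the month needs one extra step, because the number of days varies by month and by leap year. A minimal sketch, combining replace() with calendar.monthrange from the standard library, with last_day_of_month as a hypothetical helper name:

```python
import calendar
from datetime import date


def last_day_of_month(d):
    """Return the last calendar day of the month that d falls in."""
    # monthrange returns (weekday of the first day, number of days in the month)
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=days_in_month)


print(last_day_of_month(date(2020, 2, 10)))
print(last_day_of_month(date(2020, 10, 24)))
```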

Format date with strftime and strptime

There are many scenarios where you need to convert a string into a date, or format a date object into a string. You can easily convert a date into string format with the strftime method, for instance:

>>>datetime.now().strftime("%Y-%b-%d %H:%M:%S")
'2020-Oct-25 20:35:54'

And similarly, from string you can use strptime to convert a string object into a date object:

>>>datetime.strptime("Oct 25 2020 08:10:00", "%b %d %Y %H:%M:%S")
datetime.datetime(2020, 10, 25, 8, 10)

You can check here for the full list of the format codes supported by strftime and strptime. And do take note that strptime can be much slower than you expect if you are using it in a loop over a large set of data. In such a case, you may consider directly using datetime.datetime(year, month, day) to form the datetime object.
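When the input has a fixed layout, the direct construction mentioned above can be sketched as follows. The example assumes strings in the form YYYY-MM-DD, and parse_ymd is a hypothetical helper name. For ISO formatted strings, datetime.fromisoformat from the standard library is another option.

```python
from datetime import datetime


def parse_ymd(text):
    """Build a datetime from a 'YYYY-MM-DD' string without using strptime."""
    year, month, day = (int(part) for part in text.split("-"))
    return datetime(year, month, day)


dates = ["2020-10-24", "2020-10-25"]
parsed = [parse_ymd(d) for d in dates]
print(parsed)
```

The trade-off is flexibility: this only works for the one layout it was written for, whereas strptime handles any format code.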

Create time zone aware date

Most of the methods in the Python datetime module return time zone naive objects, which means they do not include any time zone info. In case you need time zone aware objects, you can specify the time zone info when initializing a date/time object, for instance:

>>>singapore_tz = timezone(timedelta(hours=8), name="SGT")
>>>sg_time_now = datetime.now(tz=singapore_tz)
datetime.datetime(2020, 10, 24, 22, 31, 6, 554991, tzinfo=datetime.timezone(datetime.timedelta(seconds=28800), 'SGT'))
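Once a datetime is time zone aware, it can be converted to another zone with astimezone. The sketch below uses a fixed timestamp instead of the current time so that the result is reproducible.

```python
from datetime import datetime, timedelta, timezone

singapore_tz = timezone(timedelta(hours=8), name="SGT")

# A fixed, aware timestamp: 22:31 in Singapore
sg_time = datetime(2020, 10, 24, 22, 31, tzinfo=singapore_tz)

# Same instant expressed in UTC, 8 hours earlier on the clock
utc_time = sg_time.astimezone(timezone.utc)
print(utc_time)

# A naive datetime carries no tzinfo at all
naive = datetime(2020, 10, 24, 22, 31)
print(naive.tzinfo)
```

The two aware objects compare as equal because they represent the same moment.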

If you use 3rd party libraries like pytz or dateutil, you can easily get the time zone info by supplying an IANA time zone database name or a Windows time zone name. Below is the example for dateutil:

>>>from dateutil import tz

#time zone database name from IANA
>>>sh_tz = tz.gettz('Asia/Shanghai')
>>>datetime(2020, 10, 24, 22, tzinfo = sh_tz)
datetime.datetime(2020, 10, 24, 22, 0, tzinfo=tzfile('PRC'))

#windows time zone names (resolved on Windows only)
>>>cn_tz = tz.gettz('China Standard Time')
>>>datetime(2020, 10, 24, 22, tzinfo = cn_tz)
datetime.datetime(2020, 10, 24, 22, 0, tzinfo=tzwin('China Standard Time'))

With the time zone database, you do not need to worry about the offset hours, and you only need to provide the name to get the correct date and time in the respective time zone.

Get a date by relative period

If you use timedelta to get a date from the current date plus a relative period such as 1 year or 1 month, you may sometimes run into problems during the leap years. For instance, the below returns Apr 30 as year 2020 is leap year, and the number of days shall be 366 rather than 365.

>>>datetime(2019, 5, 1) + timedelta(days=365)
datetime.datetime(2020, 4, 30, 0, 0)

The simple way to get the correct result is to use relativedelta from the dateutil package, e.g.:

>>>from dateutil.relativedelta import relativedelta

>>>datetime(2019, 5, 1) + relativedelta(years=1)
datetime.datetime(2020, 5, 1, 0, 0)

You can also specify the other arguments such as the months, days and hours:

>>>datetime.today() + relativedelta(years=1, months=1, days=10, hours=10)
datetime.datetime(2021, 12, 5, 8, 49, 31, 386813)

To get the date of the next Saturday from the current date, you can use:

>>>import calendar
>>>datetime.today() + relativedelta(weekday=calendar.SATURDAY)
datetime.datetime(2020, 10, 24, 15, 16, 10, 502191)

Take note that if you are running it on Saturday before 23:59:59, it will just return the current date, so it is actually returning the nearest Saturday from your current date.
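If you need the Saturday strictly after the current date, even when today is already a Saturday, one option is a small calculation with the standard library. The helper below is an illustrative sketch with a hypothetical name. With dateutil, combining days=+1 with the weekday argument in relativedelta is a common way to get a similar effect.

```python
from datetime import date, timedelta

SATURDAY = 5  # same numbering as date.weekday(), where Monday is 0


def next_weekday_after(d, weekday):
    """Return the first date strictly after d that falls on the given weekday."""
    # The result is always between 1 and 7 days ahead, never 0
    days_ahead = (weekday - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=days_ahead)


print(next_weekday_after(date(2020, 10, 24), SATURDAY))
print(next_weekday_after(date(2020, 10, 23), SATURDAY))
```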

List out all the weekdays

In case you need to get all the weekdays starting from a particular date, you can make use of the recurrence rules from dateutil package.

For instance, the below rrule specifies to recur on daily basis for Mon to Fri with a start and end date:

>>>from dateutil.rrule import rrule, DAILY, MO, TU, WE, TH, FR
>>>from dateutil.parser import parse

>>>list(rrule(DAILY, interval=1, byweekday=[MO, TU, WE, TH, FR], dtstart=datetime.now().date(), until=datetime(2020, 11, 2)))

[datetime.datetime(2020, 10, 26, 0, 0),
 datetime.datetime(2020, 10, 27, 0, 0),
 datetime.datetime(2020, 10, 28, 0, 0),
 datetime.datetime(2020, 10, 29, 0, 0),
 datetime.datetime(2020, 10, 30, 0, 0),
 datetime.datetime(2020, 11, 2, 0, 0)]

The frequency and interval arguments determine how often the rule recurs, while byweekday, dtstart and until further constrain which dates are selected.

Besides the weekday argument, you can also specify by year, month, hour, minute etc. You can check here for all the available arguments supported for instantiating the rrule object.

As another example, the below code returns a list of dates recurring at 9:15 am every other day:

>>>list(rrule(DAILY, interval=2, byminute=15, count=4, dtstart=parse("20201024T090000")))
[datetime.datetime(2020, 10, 24, 9, 15),
 datetime.datetime(2020, 10, 26, 9, 15),
 datetime.datetime(2020, 10, 28, 9, 15),
 datetime.datetime(2020, 10, 30, 9, 15)]

Get a list of business days

Sometimes you need to exclude the public holidays to get only the business days. To do so, you may first get a list of holidays from a 3rd party library like holidays, or simply put all the holidays into a config file, and then exclude these dates from the rrule. For instance:

from dateutil.rrule import rruleset

# public holidays to be excluded
holidays = [
    datetime(2020, 7, 10),
    datetime(2020, 7, 31),
    datetime(2020, 8, 10)
]
r = rrule(DAILY, interval=1, byweekday=[MO, TU, WE, TH, FR],
   dtstart=datetime(2020, 7, 10), until=datetime(2020, 8, 1))

# rruleset is a separate class in dateutil.rrule, not an attribute of rrule
rs = rruleset()
rs.rrule(r)

for d in holidays:
    rs.exdate(d)

print(list(rs))

You can see the public holidays have been excluded from the returned results:

python datetime - dateutil output

Conclusion

Working with dates can sometimes be tough, especially when you need to manipulate dates in different time zones or take daylight saving time into account. Luckily, with Python datetime and 3rd party libraries like dateutil, things are getting easier. But you still need to be very careful when handling dates with time zones or setting up recurrence rules in local time.

Thanks for reading, and you can find other Python related topics from here.

argparse pass argument to python script

10 tips for passing arguments to Python script

When writing Python utility scripts, you may wish to make your script as flexible as possible so that people can use it in different ways. One approach is to parameterize some of the variables originally hard-coded in the script and pass them in as arguments. If you have only 1 or 2 arguments, you may find sys.argv good enough, since you can easily access the arguments by index from the argv list. The limitation is also obvious: it becomes difficult to manage when there are more arguments, some mandatory and some optional, and you cannot specify the acceptable data type or add a proper description for each argument.

In this article, we will be discussing some tips for the argparse package, which provides an easier way to manage your input arguments.

To get started, import this package into your script and try running some sample code like the below:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--foo', help='foo help')
args = parser.parse_args()
print(args)
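While experimenting, it can be convenient to pass the arguments as a list instead of typing them on the command line. The sketch below relies on the fact that parse_args accepts an optional list of strings; the value 'bar' is just an illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--foo', help='foo help')

# Passing a list bypasses sys.argv, which is handy for quick experiments.
args = parser.parse_args(['--foo', 'bar'])
print(args)      # the parsed Namespace object
print(args.foo)  # each argument is available as an attribute
```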

Customize your prefix_chars

Most of the time you would see people use “-” before the argument name. You can change this default behavior to support more prefix characters, such as + or /. To do that, specify them in prefix_chars when initializing the argument parser, for instance:

parser = argparse.ArgumentParser(prefix_chars='-+/', description="This is to demonstrate multiple prefix characters")
parser.add_argument("+a", "++add")
parser.add_argument("-s", "--sub")
parser.add_argument("/d", "//dir")
args = parser.parse_args()
print(args)

When you save the above as argumentparser.py and call it with the below input arguments, you shall see that all the arguments are parsed as expected:

>>python argumentparser.py +a 1 -s 2 /d python
Namespace(add='1', dir='python', sub='2')

Do take note that if your argument name contains the character “-”, it will be replaced with “_” in the attribute name. For example, an argument named read-only is stored as read_only, and you shall use args.read_only to reference the value.
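A small sketch of this naming rule follows. The --dry-run option and its dest name simulate are made up for illustration; dest is the add_argument keyword that overrides the generated attribute name.

```python
import argparse

parser = argparse.ArgumentParser()

# "--read-only" is stored under the attribute name "read_only".
parser.add_argument('--read-only', action='store_true')

# dest overrides the generated name; "simulate" is an illustrative choice.
parser.add_argument('--dry-run', dest='simulate', action='store_true')

args = parser.parse_args(['--read-only', '--dry-run'])
print(args.read_only, args.simulate)
```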

Argument data type

When you add a new argument, the default data type is always string, so whatever value follows the argument name is stored as a string. The type keyword accepts any callable that takes a single string and returns the converted value, such as int or float, so you can specify one and let the argument parser validate that a value of the correct type has been passed in. E.g.:

parser.add_argument("-c", "--count", type=int)

You shall see the below validation error if incorrect data type has been passed in:

>>python argumentparser.py -c 1.5
usage: argumentparser.py [-h] [-c COUNT]
argumentparser.py: error: argument -c/--count: invalid int value: '1.5'
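Because type accepts any callable, you can also plug in your own validation. The sketch below uses a hypothetical positive_int helper (not part of argparse) that rejects zero and negative numbers by raising argparse.ArgumentTypeError.

```python
import argparse

def positive_int(text):
    """Hypothetical converter: accept only integers greater than zero."""
    value = int(text)  # a ValueError here is reported by argparse as an invalid value
    if value <= 0:
        raise argparse.ArgumentTypeError(f"{text!r} is not a positive integer")
    return value

parser = argparse.ArgumentParser()
parser.add_argument("-c", "--count", type=positive_int)

args = parser.parse_args(["-c", "3"])
print(args.count)
```

Passing -c 0 or -c abc to this parser makes argparse print a usage error and exit, in the same way as the built-in int check above.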

Various argument actions

The action keyword in add_argument allows you to specify how you want to handle the arguments when they are passed into the script. Some of the commonly used actions are:

  • store – default behavior
  • store_const – work together with const keyword to store its value
  • store_true or store_false – set the argument value to True or False
  • append – allows the same argument to appear multiple times and store the argument values into a list
  • append_const – same as append, but it will store the value from const keyword
  • count – count how many times the argument appears

Below are some examples:

parser.add_argument('-a', '--auto', action="store_true", help="to run automatically")
parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        help="The constant to convert Celsius to Kelvin temperature")

parser.add_argument("-t", "--temperature",
                        type=float,
                        action="append",
                        default=[],
                        help="Celsius temperature to be used in %(prog)s")

parser.add_argument('--age', dest='criteria', action='append_const', const=18)
parser.add_argument('--gender',dest='criteria', action='append_const', const="male")
parser.add_argument("-c", "--count", action="count")

When you run in the command line, you shall see all these arguments are parsed correctly and stored into the respective variables:

>>python argumentparser.py -k -t 35.1 -t 37.5 --age --gender -cc -a
Namespace(auto=True, count=2, criteria=[18, 'male'], kelvin=273.15, temperature=[35.1, 37.5])

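You can also write your own class that extends argparse.Action and pass it to the action keyword. In addition, Python 3.8 and later provide a built-in extend action, which adds all the values of a multi-value argument to a list.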
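A minimal sketch of a custom action is shown below. The class name UpperCaseAction and the --env option are made up for illustration; the __call__ signature is the one argparse invokes when the argument is found.

```python
import argparse

class UpperCaseAction(argparse.Action):
    """Hypothetical action that stores the supplied value in upper case."""

    def __call__(self, parser, namespace, values, option_string=None):
        # values is a single string here because nargs is not specified
        setattr(namespace, self.dest, values.upper())

parser = argparse.ArgumentParser()
parser.add_argument("--env", action=UpperCaseAction)

args = parser.parse_args(["--env", "prod"])
print(args.env)
```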

Use action=”append” or nargs=”+” ?

If you want to collect a list of values from a particular input argument, you have two options:

  • specify action = “append”
  • specify the nargs=”+”

For the below code, both “amount” and “nums” will be able to store a list of values from the input:

parser.add_argument("-a", "--amount",
                        type=float,
                        action="append")
parser.add_argument("-n", "--nums", nargs="+")

The only difference is that, for the “append” action, you need to repeat the argument name for every extra value, while for “nargs” you just put all the space-separated values after the argument name. E.g.:

>>python argumentparser.py -a 1 -a 2 -n 3 4
Namespace(amount=[1.0, 2.0], nums=['3', '4'])

You may notice that if you have an optional argument with nargs=”+”, it is better to put it after all the positional arguments on the command line, as the argument parser would otherwise take your positional argument as one more value of the preceding argument (see the example in the next tip).

Mixing of positional and optional arguments

When no prefix character is used in the argument name, the argument parser will treat it as a positional argument. For instance:

parser.add_argument("caller", help="The process that invoke this script")
parser.add_argument("-c", "--count")

When you check the help for this script, you shall see caller is taken as positional argument.

>>python argumentparser.py -h
usage: argumentparser.py [-h] [-c COUNT] caller

positional arguments:
  caller                The process that invoke this script

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT

Positional arguments are considered mandatory, so argparse will exit with an error if they are not specified when calling the script. You can put a positional argument at any place in your input argument stream. E.g.:

>>python argumentparser.py -c 2 "cmd.exe"
>>python argumentparser.py "cmd.exe" -c 2
Namespace(caller='cmd.exe', count='2')

argparse is smart enough to interpret and assign the values to the correct variables unless the input arguments are ambiguous, e.g. if you use nargs to indicate that multiple argument values can be passed in:

parser.add_argument("-c", "--count", nargs='+')

Putting your positional argument behind this argument will then cause an error, because all the values behind “-c” are taken as values for “count”:

>>python argumentparser.py -c 1 3 "cmd.exe"
usage: argumentparser.py [-h] [-c COUNT [COUNT ...]] caller
argumentparser.py: error: the following arguments are required: caller
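Two ways around this ambiguity are sketched below, using the same caller and count arguments: either supply the positional argument first, or insert the “--” pseudo-argument, which tells argparse that everything after it is positional.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("caller", help="The process that invokes this script")
parser.add_argument("-c", "--count", nargs="+")

# Positional first: nothing after "-c" can be mistaken for the caller.
first = parser.parse_args(["cmd.exe", "-c", "1", "3"])

# "--" ends the value list of "-c"; what follows is treated as positional.
separated = parser.parse_args(["-c", "1", "3", "--", "cmd.exe"])

print(first)
print(separated)
```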

Difference between const vs default

The const keyword usually works together with the store_const or append_const action to store the value from the const keyword when the argument appears. If the argument is not supplied, the argument variable is set to None. Consider the below two arguments:

parser.add_argument("-k", "--kelvin",
                        action="store_const",
                        const=273.15,
                        help="The constant to convert celsius to Kelvin temperature")
parser.add_argument("-c", "--count", default=0)

If you run with the below input arguments, you shall see results similar to the below:

>>python argumentparser.py -k
Namespace(count=0, kelvin=273.15)
>>python argumentparser.py -c 1
Namespace(count='1', kelvin=None)
>>python argumentparser.py -k 270
usage: argumentparser.py [-h] [-k] [-c COUNT]
argumentparser.py: error: unrecognized arguments: 270

So with the const keyword, you basically cannot specify any other value on the command line. But you can still add a default value, so that when the argument is not supplied, the variable is set to the default value rather than None.
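To illustrate how the two keywords combine, here is a minimal sketch in which the kelvin argument gets both a const and a default. The default of 0.0 is an arbitrary value chosen for the example.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-k", "--kelvin",
                    action="store_const",
                    const=273.15,   # stored when -k appears
                    default=0.0)    # stored when -k is absent (illustrative value)

absent = parser.parse_args([])
present = parser.parse_args(["-k"])
print(absent.kelvin, present.kelvin)
```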

Mandatory optional argument?

If you would like your optional argument to be mandatory (although it sounds a bit weird), you can specify the required option as True in the add_argument method, e.g.:

parser.add_argument("--data-type", required=True)

With required set to True, even if you have specified the default option, argparse will still exit with an error saying the argument --data-type is required.

Ignore case in choice option

Imagine you are implementing some automation scripts to be triggered in various modes, and you would like to limit the options accepted for this mode argument. You can specify a list of values in the choices keyword when adding the argument:

parser.add_argument('-m','--mode', choices=['AUTO','SCHEDULER','SEMI-AUTO'])

But you may notice a problem: when you specify “auto” or “Auto”, you would see the below error message:

>>python argumentparser.py -m "auto"
usage: argumentparser.py [-h] [-m {AUTO,SCHEDULER,SEMI-AUTO}]
argumentparser.py: error: argument -m/--mode: invalid choice: 'auto' (choose from 'AUTO', 'SCHEDULER', 'SEMI-AUTO')

By default, the argument parser will compare the values in case sensitive manner. To ignore the cases, you can specify a type keyword and transform the input values into upper or lower case:

parser.add_argument('-m','--mode', choices=['AUTO','SCHEDULER','SEMI-AUTO'], type=str.upper)
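The conversion happens before the value is compared against choices, which is why this works. A short sketch to confirm it with a few differently cased inputs:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-m', '--mode',
                    choices=['AUTO', 'SCHEDULER', 'SEMI-AUTO'],
                    type=str.upper)  # normalise the case before the choices check

for raw in ("auto", "Auto", "semi-auto"):
    print(raw, "->", parser.parse_args(["-m", raw]).mode)
```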

Conflicting options

Defining mutually exclusive arguments can be very useful when you do not want two or more options to be used at the same time. The argparse package also provides an easy way to group these options, with the necessary validation on the input arguments. For instance, you can group the “auto” and “on-demand” modes into a mutually exclusive group, so that only one mode can be activated at a time:

mode_group = parser.add_mutually_exclusive_group()
mode_group.add_argument('-a', '--auto', action="store_true", help="to run automatically")
mode_group.add_argument('-d', '--on-demand', action="store_true", help="to run on demand")

If both arguments are supplied, you would see the below error message:

>>python argumentparser.py -d -a
usage: argumentparser.py [-h] [-a | -d]
argumentparser.py: error: argument -a/--auto: not allowed with argument -d/--on-demand
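If exactly one of the modes must always be chosen, add_mutually_exclusive_group also accepts required=True. The sketch below is a variation on the example above, not part of the original script:

```python
import argparse

parser = argparse.ArgumentParser()

# required=True means one of the options in the group has to be supplied.
mode_group = parser.add_mutually_exclusive_group(required=True)
mode_group.add_argument('-a', '--auto', action="store_true", help="to run automatically")
mode_group.add_argument('-d', '--on-demand', action="store_true", help="to run on demand")

args = parser.parse_args(['-d'])
print(args.auto, args.on_demand)
```

With this variation, calling the script without -a or -d, or with both, ends with a usage error.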

Conclusion

The argparse package is super useful when you need to write scripts to be executed from the command line. In this article, we have reviewed some tips that might help extend your understanding of the different use cases for the options argparse provides. If you have a more complicated use case, you may want to read further in the official documentation on topics such as sub-commands and file types.

pandas split one row of data into multiple rows

Pandas tricks – split one row of data into multiple rows

As a data scientist or analyst, you will need to spend a lot of time wrangling data from various sources so that you have a standard data structure for further analysis. There are cases where you get the raw data in some sort of summary view, and you need to split one row of data into multiple rows based on certain conditions in order to do grouping and matching from different perspectives. In this article, we will be discussing a solution to this particular issue.

Prerequisites:

You will need to get pandas installed if you have not yet. Below is the pip command to install pandas:

pip install pandas

Let’s import the necessary modules and use this sample data for our demonstration. You can download it into your local folder, or just supply this URL link to the pandas read_excel method:

import pandas as pd
import numpy as np

df = pd.read_excel("eShop-Delivery-Record.xlsx", sheet_name=0)

So if we do a quick view of the first 5 rows of the data with df.head(5), you would see the below output:

pandas split one row of data into multiple rows

Assume this is the data extracted from an eCommerce system where someone is running an online shop for footwear and apparel products, and the shop provides free 7-day returns for the items it sells. You can see that each row has the order information, when and who performed the delivery service, and, if the customer requested a return, when the item was returned and by which courier service provider. The data is presented more from the shop owner’s view, and you may find it difficult to analyse from the courier service providers’ perspective with the current data format. So we shall do some transformation to make the format simpler for analysis.

Split one row of data into multiple rows

Now let’s say we would like to split this one row of data into 2 rows if there is a return happening, so that each row has the order info as well as the courier service info and we can easily do some analysis such as calculating the return rate for each product, courier service cost for each month by courier companies, etc.

The output format we would like to have is transaction based, so let’s format our date columns and rename the delivery related columns, so that they won’t confuse us later when splitting the data.

df["Delivery Date"] = pd.to_datetime(df["Delivery Date"]).dt.date
df["Return Date"] = pd.to_datetime(df["Return Date"]).dt.date

df.rename(columns={"Delivery Date" : "Transaction Date",
"Delivery Courier" : "Courier",
"Delivery Charges" : "Charges"}, inplace=True)

And we add one more column as transaction type to indicate whether the record is for delivery or return. For now, we just assign it as “DELIVERY” for all records:

df["Transaction Type"] = "DELIVERY"

The rows we need to split are the ones with return info, so let’s create a filter by checking that the return date is not empty:

flt_returned = ~df["Return Date"].isna()

If you verify the filter with df[flt_returned], you shall see all rows with return info are selected as per below:

pandas split one row of data into multiple rows

To split out the delivery and return info for these rows, we will need to perform the below steps:

  • Duplicate the current 1 row into 2 rows
  • Change the transaction type to “RETURN” for the second duplicated row
  • Copy values of the Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively

To duplicate these records, we use the data frame’s index.repeat() to repeat each index twice, and then use the loc function to get the data for these repeated indexes. Below is the code to create the duplicate records for the rows with return info:

d = df[flt_returned].loc[df[flt_returned].index.repeat(2),:].reset_index(drop=True)

Next, let’s save a mask of the duplicated rows into a variable, so that we can refer to it multiple times even after some data in the duplicated rows has changed. We use the data frame duplicated function, which returns a boolean Series marking the duplicated rows. For this function, the keep=”first” argument marks the 1st row as non-duplicate and the subsequent rows as duplicates, while keep=”last” marks the last row as non-duplicate and the earlier rows as duplicates.

idx_duplicate = d.duplicated(keep="first")
#the default value for keep argument is "first", so you can just use d.duplicated()

With this idx_duplicate variable, we can directly update the transaction type for these rows to RETURN:

d.loc[idx_duplicate,"Transaction Type"] = "RETURN"

And next, we shall copy the return info into the Transaction Date, Courier and Charges fields for these return records. You can either select the rows based on the transaction type value, or continue to use idx_duplicate to identify the return records.

Below will copy values from Return Date, Return Courier, Return Charges to Transaction Date, Courier, Charges respectively:

d.loc[idx_duplicate, ["Transaction Date", "Courier", "Charges"]] = d.loc[idx_duplicate, 
                                                     ["Return Date", "Return Courier","Return Charges"]].to_numpy()

If you check the data now, you shall see for the return rows, the return info is already copied over:

pandas split one row of data into multiple rows

(Note: you may want to check here to understand why to_numpy() is needed for swapping columns)
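The reason is that pandas aligns on row and column labels when a DataFrame is assigned through loc. The effect can be seen on a tiny made-up frame (the values are illustrative, only the column names follow the article):

```python
import pandas as pd

frame = pd.DataFrame({"Charges": [5.0, 6.0], "Return Charges": [4.0, 4.5]})
mask = pd.Series([False, True])

# Assigning a DataFrame: pandas looks for a "Charges" column on the right-hand
# side, finds none, and fills the target cell with NaN.
aligned = frame.copy()
aligned.loc[mask, ["Charges"]] = aligned.loc[mask, ["Return Charges"]]

# Assigning a plain array: values are copied by position, labels are ignored.
by_position = frame.copy()
by_position.loc[mask, ["Charges"]] = by_position.loc[mask, ["Return Charges"]].to_numpy()

print(aligned)
print(by_position)
```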

Finally, we need to combine the original rows which only have delivery info with the above processed data. Let’s also sort the values by order number and reset the index:

new_df = pd.concat([d, df[~flt_returned]]).sort_values("Order#").reset_index(drop=True)

Since the return related columns are redundant now, we shall drop these columns to avoid the confusion, so let’s use the below code to drop them by the “Return” keywords in the column labels:

new_df.drop(new_df.filter(regex="Return", axis=1).columns, axis=1, inplace=True)
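To make the whole flow easy to replay without the Excel file, here is a condensed sketch on three made-up orders. The column names follow the article, while the values and courier names are invented. It also sorts by transaction type within each order so that the row order is deterministic, which is an addition to the steps above.

```python
import pandas as pd

# Made-up sample rows, already renamed to the transaction-style columns.
df = pd.DataFrame({
    "Order#": [1001, 1002, 1003],
    "Transaction Date": ["2020-07-01", "2020-07-02", "2020-07-03"],
    "Courier": ["SpeedyEx", "QuickPost", "SpeedyEx"],
    "Charges": [5.0, 6.0, 5.5],
    "Return Date": [None, "2020-07-06", None],
    "Return Courier": [None, "SpeedyEx", None],
    "Return Charges": [None, 4.5, None],
})
df["Transaction Type"] = "DELIVERY"
flt_returned = df["Return Date"].notna()

# Duplicate every returned order, then mark the second copy of each pair.
returned = df[flt_returned]
d = returned.loc[returned.index.repeat(2)].reset_index(drop=True)
idx_duplicate = d.duplicated()
d.loc[idx_duplicate, "Transaction Type"] = "RETURN"

# Copy the return details over by position (to_numpy() avoids label alignment).
d.loc[idx_duplicate, ["Transaction Date", "Courier", "Charges"]] = d.loc[
    idx_duplicate, ["Return Date", "Return Courier", "Return Charges"]
].to_numpy()

# Recombine with the delivery-only rows and drop the Return columns.
new_df = (
    pd.concat([d, df[~flt_returned]])
    .sort_values(["Order#", "Transaction Type"])
    .reset_index(drop=True)
)
new_df = new_df.drop(columns=new_df.filter(regex="^Return").columns)
print(new_df)
```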

(To understand how df.filter works, check this article of mine)

Once the redundant columns are deleted, you shall see the below final result in new_df:

pandas split one row of data into multiple rows

So we have successfully transformed our data from the shop owner’s view to the courier companies’ view, and each delivery and return record is now an individual row.

Conclusion

Data wrangling can sometimes be tough, depending on what kind of source data you get. In this article, we have gone through a solution to split one row of data into multiple rows by using the pandas index.repeat to duplicate the rows and the loc function to swap the values. There are other possible ways to handle this, so please do share your comments in case you have a better idea.

Photo by Luther Bottrill on Unsplash

Why your lambda function does not work

Introduction

A lambda function in Python is designed to be a one-line, throwaway function that does not even need a function name, so it is also known as an anonymous function. Compared to normal Python functions, you do not need to write the def and return keywords for a lambda function, and it can be defined right at the place where you need it, so it makes your code more concise and look a bit special. In this article, we will be discussing some unexpected results you may have encountered when using lambda functions.

Basic usage of lambda

Let’s cover some basics of the lambda function before we dive into the problems we are going to solve in this article.

Below is the syntax to define lambda function:

lambda [arguments] : expression

As you can see, a lambda function can be defined with or without arguments. Take note that it only accepts a single expression, not Python statements. An expression can also be used as a statement; the difference is that you can evaluate an expression into a value (or object), e.g. 2**2, but you cannot evaluate a statement like while True: into a value. You can think of it as having an implicit “return” keyword before the expression, so your expression must eventually evaluate to a value.

And here are some basic usages of the lambda function:

square = lambda x: x**2
print(square(4))
#Output: 16
cryptocurrencies = [('Bitcoin', 10948.52),('Ethereum', 381.41),('Tether', 1.00),
('XRP', 0.249940),
('Bitcoin Cash', 231.86),
('Polkadot', 4.91),
('Binance Coin', 27.02),
('Chainlink', 10.47),
('Litecoin', 48.20),
('EOS', 2.69),
('TRON', 0.027157),
('Neo', 24.29),
('Stellar', 0.077903),
('Huobi Token', 4.91)]

top5_by_name = sorted(cryptocurrencies, key=lambda token: token[0].lower())[0:5]
print(top5_by_name)
#Output: [('Binance Coin', 27.02), ('Bitcoin', 10948.52), ('Bitcoin Cash', 231.86), ('Chainlink', 10.47), ('EOS', 2.69)]

lowest = min(cryptocurrencies, key=lambda token: token[1])
print(lowest)
#Output: ('TRON', 0.027157)

highest = max(cryptocurrencies, key=lambda token: token[1])
print(highest)
#Output: ('Bitcoin', 10948.52)

highest_in_local_currency = lambda exchange_rate: highest[1] * exchange_rate
highest_sgd = highest_in_local_currency(1.38)
print(highest_sgd)
#Output: 15108.9576

You can see that it is quite convenient when you just need a very short function to supply to another function that accepts an argument like key=keyfunc, such as sorted, list.sort, min, max, heapq.nlargest, heapq.nsmallest, itertools.groupby and so on. What these use cases have in common is that the keyfunc does not need very complicated logic (it can be written in one line) and you will probably not reuse it anywhere else. So it is the ideal scenario for a lambda function.

Now let’s expand further on our previous example. Assume the bitcoin price fluctuated a lot on Mon & Tue although it still dominated the market, and you would like to convert the price into SGD in the below way:

highest = ('Bitcoin', 10948.52)
mon_highest = lambda exchange_rate: highest[1] * exchange_rate

highest = ('Bitcoin', 10000)
tue_highest = lambda exchange_rate: highest[1] * exchange_rate

print("Mon:", mon_highest(1.36))
print("Tue:", tue_highest(1.36))

You want to assign different values to the highest variable to calculate the price in another currency, but you would be surprised when checking the result:

python lambda variable binding

Instead of scratching your head to figure out why it does not work, let’s try another approach. I am going to create a list of converter functions, one per cryptocurrency pair, to calculate the price based on the exchange rate supplied. Later I loop through these functions and print out the converted values:

converters = [lambda exchange_rate: crypto[1] * exchange_rate for crypto in cryptocurrencies]
for c in converters:
    print(c(1.36))

I am expecting to see all the prices are converted into local currency based on the exchange rate 1.36, but when running the above code, it gives below result:

python lambda variable binding

python lambda variable binding output

Same as the previous behaviour, only the last value was used in the lambda function. So why does it not work as intended when I use the lambda in this way?

Runtime data binding

When people run into this issue, it is usually due to a fundamental misunderstanding of how variables are bound in a Python function. For any Python function, whether a normal function or a lambda, the outside variables used in the function body are bound at runtime, not at definition time. So in our first example, each lambda function looked up the highest variable in the enclosing scope (the module's global namespace here) only at the moment it was executed.

With this concept cleared, you shall be able to understand the output from the above two examples: only the latest values at execution time were used in the lambda function.
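The same behaviour can be reproduced with a stripped-down sketch. The names rate, scale and getters are made up for this illustration, and integers are used so the results are easy to check by eye.

```python
rate = 10
scale = lambda amount: amount * rate  # "rate" is not captured at this point

rate = 20
print(scale(1))  # the lambda reads rate only now, at call time

# The loop case works the same way: every lambda shares the one variable i,
# which holds its final value by the time the lambdas are called.
getters = [lambda: i for i in range(3)]
print([g() for g in getters])
```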

To fix this issue, we just need a minor change to our original code to pass in the variable in the function definition as default value to the argument. For instance, below is the fix for the first example:

mon_highest = lambda exchange_rate, highest = highest: highest[1] * exchange_rate
tue_highest = lambda exchange_rate, highest = highest: highest[1] * exchange_rate

Below is the fix for the second example:

converters = [lambda exchange_rate, crypto = crypto: crypto[1] * exchange_rate for crypto in cryptocurrencies]
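To see the effect of the default argument, the sketch below runs the late-bound and the fixed version side by side on a shortened list. An exchange rate of 2 is used purely to keep the arithmetic easy to follow.

```python
cryptocurrencies = [('Bitcoin', 10948.52), ('Ethereum', 381.41), ('Tether', 1.00)]

# Late-bound version: every lambda reads crypto when called, i.e. the last pair.
late = [lambda exchange_rate: crypto[1] * exchange_rate
        for crypto in cryptocurrencies]

# Default-argument version: crypto is evaluated once per lambda, at definition time.
bound = [lambda exchange_rate, crypto=crypto: crypto[1] * exchange_rate
         for crypto in cryptocurrencies]

print([c(2) for c in late])
print([c(2) for c in bound])
```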

You may wonder why a lambda must be used in the above two examples. Indeed, they do not necessarily require a lambda function. For the first example, since you need to call the function more than once, you should just use a normal function, and be careful whenever it needs a variable from outside the function.
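One way to write the first example with normal functions is a small factory that receives the price as a parameter, so each converter keeps its own value. The name make_converter is hypothetical, and the exchange rate of 2 is again just for easy arithmetic.

```python
def make_converter(price):
    """Return a converter bound to the price passed in at creation time."""
    def convert(exchange_rate):
        return price * exchange_rate
    return convert

# Each call creates a separate scope, so later changes cannot leak in.
mon_highest = make_converter(10948.52)
tue_highest = make_converter(10000)

print("Mon:", mon_highest(2))
print("Tue:", tue_highest(2))
```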

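And for the second example, the list of converters can simply be replaced with map and a single lambda, or with the equivalent list comprehension [crypto[1] * 1.36 for crypto in cryptocurrencies]: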

list(map(lambda crypto: crypto[1] * 1.36, cryptocurrencies))

Conclusion:

A lambda function is convenient for writing tiny one-off functions and makes your code concise. But it is also highly restricted by the single-expression rule, as you cannot use multiple statements, exception handling or if blocks (only conditional expressions). Whatever a lambda does, you can always use a normal function instead. What matters is readability, so you need to evaluate whether a lambda is the best fit for the scenario, and bear the variable binding behaviour in mind.