Resources

Best Tips for Python, Data Science and Automation

Resources

pandas convert columns to rows, convert wide to long, pandas melt

Pandas Tips - Convert Columns To Rows

  Introduction In one of my previous posts – Pandas tricks to split one row of data into multiple rows, we have discussed a solution to split the summary data from one row into multiple rows in order to standardize the data for further analysis. Similarly, there are many scenarios that we have the aggregated […]

Read More
Manipulate Audio File in Python, pydub, download youtube,cut video python

Manipulate Audio File in Python With 6 Powerful Tips

Introduction Dealing with audio files may not be that common to a Python enthusiast, but sometimes you may wonder if you are able to manipulate audio files in Python for your personal interest. For instance, if you really like some music, and you want to edit some parts of it and save into your phone, […]

Read More
Python generate QR code, Python read QR code, Photo by Lukas on Unsplash

Read and Generate QR Code With 5 Lines of Python Code

 Introduction QR Code is the most popular 2 dimensional barcodes that widely used for document management, track and trace in supply chain and logistics industry, mobile payment,  and even the “touchless” health declaration and contact tracing during the COVID-19 pandemic. Comparing to 1D barcode, QR code can be very small in size but hold more […]

Read More
20 Useful Tips for Using Python Pip

20 Tips for Using Python Pip

Introduction Python has become one of the most popular programming languages due to the easy to use syntax as well as the thousands of open-source libraries developed by the Python community. Almost every problem you want to solve, you can find a solution with these third-party libraries, so that you do not need to reinvent […]

Read More
reading email from outlook with python pywin32

5 Useful Tips for Reading Email From Outlook In Python

Introduction Pywin32 is one of the most popular packages for automating your daily work for Microsoft outlook/excel etc. In my previous post, we discussed about how to use this package to read emails and save attachments from outlook. As there were quite many questions raised in the comments which were not covered in the original […]

Read More
common python mistakes for beginners

8 Common Python Mistakes You Shall Avoid

Introduction Python is a very powerful programming language with easily understandable syntax which allows you to learn by yourself even you are not coming from a computer science background. Through out the learning journey, you may still make lots mistakes due to the lack of understanding on certain concepts. Learning how to fix these mistakes […]

Read More
Photo by Ali Yahya on Unsplash

Master python closure with 3 real-world examples

Introduction

Python closure is a technique for binding function with an environment where the function gets access to all the variables defined in the enclosing scope. Closure typically appears in the programming language with first class function, which means functions are allowed to be passed as arguments, return value or assigned to a variable.

This definition sounds confusing to the python beginners, and sometimes the examples found from online also not intuitive enough in the way that most of the examples are trying to illustrate with some printing statement, so the readers may not get the whole idea of why and how the closure should be used. In this article, I will be using some real-world example to explain how to use closure in your code.

Nested function in Python

To understand closure, we must first know that Python has nested function where one function can be defined inside another. For instance, the below inner_func is the nested function and the outer_func returns it’s nested function as return value.

def outer_func():    
    print("starting outer func")
    def inner_func():
        pi = 3.1415926
        print(f"pi is : {pi}")
    return inner_func

When you invoke the outer_func, it returns the reference to the inner_func, and subsequently you can call the inner_func. Below is the output when you run in Jupyter Notebook:

python closure nested function example

After you have got some feeling about the nested function, let’s continue to explore how nested function is related to closure. If we modify our previous function and move the pi variable into outer function, surprisedly it generates the same result as previously.

def outer_func():    
    print("starting outer func")
    #move pi variable definition to outer function
    pi = 3.1415926
    def inner_func():
        print(f"pi is : {pi}")
    return inner_func

You may wonder the pi variable is defined in outer function which is a local variable to outer_func, why inner_func is able access it since it’s not a global scope? This is exactly where closure happens, the inner_func has the full access to the environment (variables) in it’s enclosing scope. The inner_func refers to pi variable as nonlocal variable since there is no other local variable called pi.

If you want to modify the value of the pi inside the inner_func, you will have to explicitly specify “nonlocal pi” before you modify it since it’s immutable data type.

With the above understanding, now let’s walk through some real-world examples to see how we can use closure in our code.

Hide data with Python closure

Let’s say we want to implement a counter to record how many time the word has been repeated. The first thing you may want to do is to define a dictionary in global scope, and then create a function to add in the words as key into this dictionary and also update the number of times it repeated. Below is the sample code:

counter = {}

def count_word(word):    
    global counter
    counter[word] = counter.get(word, 0) + 1
    return counter[word]

To make sure the count_word function updates the correct “counter”, we need to put the global keyword to explicitly tell Python interpreter to use the “counter” defined in global scope, not any variable we accidentally defined with the same name in the local scope (within this function).

Sample output:

python closure word counter sample output

The above code works as expected, but there are two potential issues: Firstly, the global variable is accessible to any of the other functions and you cannot guarantee your data won’t be modified by others. Secondly, the global variable exists in the memory as long as the program is still running, so you may not want to create so many global variables if not necessary.

To address these two issues, let’s re-implement it with closure:

def word_counter():
    counter = {}
    def count(word):
        counter[word] = counter.get(word, 0) + 1
        return counter[word]
    return count

If we run it from Jupyter Notebook, you will see the below output:

python closure word counter example output

With this implementation, the counter dictionary is hidden from the public access and the functionality remains the same. (you may notice it works even after the word_counter function is deleted)

Convert small class to function with Python closure

Occasionally in your project, you may want to implement a small utility class to do some simple task. Let’s take a look at the below example:

import requests

class RequestMaker:
    def __init__(self, base_url):
        self.url = base_url
    def request(self, **kwargs):
        return requests.get(self.url.format_map(kwargs))

You can see the below output when you call the make_request from an instance of RequestMaker:

python closure small class example

Since you’ve already seen in the word counter example, the closure can also hold the data for your later use, the above class can be converted into a function with closure:

import requests

def request_maker(url):
    def make_request(**kwargs):
        return requests.get(url.format_map(kwargs))
    return make_request

The code becomes more concise and achieves the same result. Take note that in the above code, we are able to pass in the arguments into the nested function with **kwargs (or *args).

python closure convert small class to closure

Replace text with case matching

When you use regular express to find and replace some text, you may realize if you are trying to match text in case insensitive mode, you will not able to replace the text with proper case. For instance:

import re

paragraph = 'To start Python programming, you need to install python and configure PYTHON env.'
re.sub("python", "java", paragraph, flags=re.I)

Output from above:

python closure replace with case

It indeed replaced all the occurrence of the “python”, but the case does not match with the original text. To solve this problem, let’s implement the replace function with closure:

def replace_case(word):
    def replace(m):
        text = m.group()
        if text.islower():
            return word.lower()
        elif text.isupper():
            return word.upper()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

In the above code, the replace function has the access to the original text we intend to replace with, and when we detect the case of the matched text, we can convert the case of original text and return it back.

So in our original substitute function, let’s pass in a function replace_case(“java”) as the second argument. (You may refer to Python official doc in case you want to know what is the behavior when passing in function to re.sub)

re.sub("python", replace_case("java"), paragraph, flags=re.IGNORECASE)

If we run the above again, you should be able to see the case has been retained during the replacement as per below:

python closure replace with case

Conclusion

In this article, we have discussed about the general reasons why Python closure is used and also demonstrated how it can be used in your code with 3 real-world examples. In fact, Python decorator is also a use case of closure, I will be discussing this topic in the next article.

 

pandas tricks pass multiple columns to lambda

Pandas tricks – pass multiple columns to lambda

Pandas is one of the most powerful tool for analyzing and manipulating data. In this article, I will be sharing with you the solutions for a very common issues you might have been facing with pandas when dealing with your data – how to pass multiple columns to lambda or self-defined functions.

Prerequisite

You will have to install pandas on your working environment:

pip install pandas

When dealing with data, you will always have the scenario that you want to calculate something based on the value of a few columns, and you may need to use lambda or self-defined function to write the calculation logic, but how to pass multiple columns to lambda function as parameters?

Let me use some real world example, so that easier for you to understand the issue that I am talking about. Below table shows partial of the e-com delivery charges offered by some company, so the delivery charges are determined by package size (H+L+W), package weight and the delivery mode you are choosing.

Size (cm/kg) 3 hours express Next Day Delivery Same Day Delivery
<60 CM (H+L+W) & MAX 1KG 12 8 10
<80 CM (H+L+W) & MAX 5KG 15 9 11
<100 CM (H+L+W) & MAX 8KG 17 11 13
<120 CM (H+L+W) & MAX 10KG 19 14 16

And assuming we have the below order data, and we want to simulate the delivery charges. Let’s create the data in a pandas dataframe.

import pandas as pd

df = pd.DataFrame({
    "Order#" : ["1", "2", "3", "4"], 
    "Weight" : [5.0, 2.1, 8.1, 7.5], 
    "Package Size" : [80, 45, 110, 90],
    "Delivery Mode": ["Same Day", "Next Day", "Express", "Next Day"]})

If you view dataframe from Jupyter Notebook (you can sign up here to use it for free), you shall be able to see the data as per below.

Pandas pass multiple columns to lambda same data

Let’s also implement a calculate_rate function where we need to pass in the weight, package size, and delivery mode in order to calculate the delivery charges:

def calculate_rate(weight, package_size, delivery_mode):
    #set the charges as $20 since we do not have the complete rate card
    charges = 20
    if weight <=1 and package_size <60:
        if delivery_mode == "Express":
            charges = 13
        elif delivery_mode == "Next Day":
            charges = 8
        else:
            charges = 10
    elif weight <=5 and package_size <80:
        if delivery_mode == "Express":
            charges = 15
        elif delivery_mode == "Next Day":
            charges = 9
        else:
            charges = 11
    elif weight <=8 and package_size <100:
        if delivery_mode == "Express":
            charges = 17
        elif delivery_mode == "Next Day":
            charges = 11
        else:
            charges = 13
    return charges

Pass multiple columns to lambda

Here comes to the most important part. You probably already know data frame has the apply function where you can apply the lambda function to the selected dataframe. We will also use the apply function, and we have a few ways to pass the columns to our calculate_rate function.

 Option 1

We can select the columns that involved in our calculation as a subset of the original data frame, and use the apply function to it.

And in the apply function, we have the parameter axis=1 to indicate that the x in the lambda represents a row, so we can unpack the x with *x and pass it to calculate_rate.

df["Delivery Charges"] = df[["Weight", "Package Size", "Delivery Mode"]].apply(lambda x : calculate_rate(*x), axis=1)

If we check the df again in Jupyter Notebook, you should see the new column “Delivery Charges” with the figures calculated based on the logic we defined in calculate_rate function.

Pandas pass multiple columns to lambda

Option 2:

If you do not want to get a subset of the data frame and then apply the lambda, you can also directly use the apply function to the original data frame. In this case, you will need to select the columns before passing to the calculate_rate function. Same as above, we will need to specify the axis=1 to indicate it’s applying to each row.

df["Delivery Charges"] = df.apply(lambda x : calculate_rate(x["Weight"], x["Package Size"], x["Delivery Mode"]), axis=1)

This will produce the same result as option 1. And you can also use x.Weight instead of x[“Weight”] when passing in the parameter.

 

Conclusion

The two options we discussed to pass multiple columns to lambda are basically the same, and it’s either applying to the subset or the original data frame. I have not yet tested with a large set of data, so there might be some differences in terms of the performance, you may need to take a note if you are dealing with a lot of data.

You may also interested to read some other articles related to pandas.

 

pandas tricks calculate percentage within group

Pandas Tricks – Calculate Percentage Within Group

Pandas groupby probably is the most frequently used function whenever you need to analyse your data, as it is so powerful for summarizing and aggregating data. Often you still need to do some calculation on your summarized data, e.g. calculating the % of vs total within certain category. In this article, I will be sharing with you some tricks to calculate percentage within groups of your data.

Prerequisite

You will need to install pandas if you have not yet installed:

pip install pandas
#or conda install pandas

I am going to use some real world example to demonstrate what kind of problems we are trying to solve. The sample data I am using is from this link , and you can also download it and try by yourself.

Let’s first read the data from this sample file:

import pandas as pd

# You can also replace the below file path to the URL of the file
df = pd.read_excel(r"C:\Sample Sales Data.xlsx", sheet_name="Sheet")

The data will be loaded into pandas dataframe, you will be able to see something as per below:

pandas tricks - calculate percentage within group

Let’s first calculate the sales amount for each transaction by multiplying the quantity and unit price columns.

df["Total Amount"] = df["Quantity"] * df["Price Per Unit"]

You can see the calculated result like below:

pandas tricks - calculate percentage within group

Calculate percentage within group

With the above details, you may want to group the data by sales person and the items they sold, so that you have a overall view of their performance for each person. You can do with the below :

#df.groupby(["Salesman","Item Desc"])["Total Amount"].sum()
df.groupby(["Salesman", "Item Desc"]).agg({"Total Amount" : "sum"})

And you will be able to see the total amount per each sales person:

pandas tricks - calculate percentage within group

This is good as you can see the total of the sales for each person and products within the given period.

Calculate the best performer

Now let’s see how we can get the % of the contribution to total revenue for each of the sales person, so that we can immediately see who is the best performer.

To achieve that, firstly we will need to group and sum up the “Total Amount” by “Salemans”, which we have already done previously.

df.groupby(["Salesman"]).agg({"Total Amount" : "sum"})

And then we calculate the sales amount against the total of the entire group. Here we can get the “Total Amount” as the subset of the original dataframe, and then use the apply function to calculate the current value vs the total. Take note, here the default value of axis is 0 for apply function.

[["Total Amount"]].apply(lambda x: 100*x/x.sum())

With the above, we should be able get the % of contribution to total sales for each sales person. And let’s also sort the % from largest to smallest:

sort_values(by="Total Amount", ascending=False)

Let’s put all together and run the below in Jupyter Notebook:

df.groupby(["Salesman"])\
.agg({"Total Amount" : "sum"})[["Total Amount"]]\
.apply(lambda x: 100*x/x.sum())\
.sort_values(by="Total Amount", ascending=False)

You shall be able to see the below result with the sales contribution in descending order. (Do not confuse with the column name “Total Amount”, pandas uses the original column name for the aggregated data. You can rename it to whatever name you want later)

pandas tricks - calculate percentage within group for salesman

 

Calculate the most popular products

Similarly, we can follow the same logic to calculate what is the most popular products. This time we want to summarize the sales amount by product, and calculate the % vs total for both “Quantity” and “Total Amount”. And also we want to sort the data in descending order for both fields. e.g.:

df.groupby(["Item Desc"])\
.agg({"Quantity": "sum", "Total Amount" : "sum"})[["Quantity", "Total Amount"]]\
.apply(lambda x: 100*x/x.sum())\
.sort_values(by=["Quantity","Total Amount"], ascending=[False,False])

This will produce the below result, which shows “Whisky” is the most popular product in terms of number of quantity sold. But “Red Wine” contributes the most in terms of the total revenue probably because of the higher unit price.

pandas tricks - calculate percentage within group for products

 

Calculate best sales by product for each sales person

What if we still wants to understand within each sales person, what is the % of sales for each product vs his/her total sales amount?

In this case, we shall first group the “Salesman” and “Item Desc” to get the total sales amount for each group. And on top of it, we calculate the % within each “Salesman” group which is achieved with groupby(level=0).apply(lambda x: 100*x/x.sum()).

Note: After grouping, the original datafram becomes multiple index dataframe, hence the level = 0 here refers to the top level index which is “Salesman” in our case.

df.groupby(["Salesman", "Item Desc"])\
.agg({"Total Amount" : "sum"})\
.groupby(level=0).apply(lambda x: 100*x/x.sum())\
.sort_values(by=["Salesman", "Item Desc","Total Amount"], ascending=[True, True, False])

You will be able see the below result which already sorted by % of sales contribution for each sales person.

pandas tricks - calculate percentage within group - for salesman and product

 

Conclusion

This is just some simple use cases where we want to calculate percentage within group with the pandas apply function, you may also be interested to see what else the apply function can do from here.

 

python send email with attachment via smtplib

How to send email with attachment via python smtplib

In one of my previous article, I have discussed about how to send email from outlook application. That has assumed you have already installed outlook and configured your email account on the machine where you want to run your script. In this article, I will be sharing with you how to automatically send email with attachments via lower level API, to be more specific, by using python smtplib where you do not need to set up anything in your environment to make it work.

For this article, I will demonstrate to you to send a HTML format email from a gmail account with some attachment. So besides the smtplib module, we will need to use another two modules – ssl and email.

Let’s get started!

First, you will need to find out the SMTP server and port info to send email via google account. You can find this information from this link. For your easy reading, I have captured in the below screenshot.

codeforests - google smtp server configuration info

So we are going to use the server: smtp.gmail.com and port 587 for our case. (you may search online to find out more info about the SSL & TLS, we will not discuss much about it in this article)

Let’s start to import all the modules we need:

import smtplib, ssl
from email.mime.multipart import MIMEMultipart 
from email.mime.text import MIMEText 
from email.mime.application import MIMEApplication

As we are going to send the email in HTML format (which are you able to unlock a lot features such as adding in styles, drawing tables etc.), we will need to use the MIMEText. And also the MIMEMultipart and MIMEApplication for the attachment.

Build up the email message

To build up our email message, we need to create mixed type MIMEMultipart object so that we can send both text and attachment. And next, we shall specify the from, to, cc and subject attributes.

smtp_server = 'smtp.gmail.com'
smtp_port = 587 
#Replace with your own gmail account
gmail = 'yourmail@gmail.com'
password = 'your password'

message = MIMEMultipart('mixed')
message['From'] = 'Contact <{sender}>'.format(sender = gmail)
message['To'] = 'contact@company.com'
message['CC'] = 'contact@company.com'
message['Subject'] = 'Hello'

You probably do not want anybody can see your hard coded password here, you may consider to put this email account info into a separate configuration file. Check my another post on the read/write configuration files.

For the HTML message content, we will wrap it into the MIMEText, and then attach it to our MIMEMultipart message:

msg_content = '<h4>Hi There,<br> This is a testing message.</h4>\n'
body = MIMEText(msg_content, 'html')
message.attach(body)

Let’s assume you want to attach a pdf file from your c drive, you can read it in binary mode and pass it into MIMEApplication with MIME type as pdf. Take note on the additional header where you need to specify the name your attachment file.

attachmentPath = "c:\\sample.pdf"
try:
	with open(attachmentPath, "rb") as attachment:
		p = MIMEApplication(attachment.read(),_subtype="pdf")	
		p.add_header('Content-Disposition', "attachment; filename= %s" % attachmentPath.split("\\")[-1]) 
		message.attach(p)
except Exception as e:
	print(str(e))

If you have a list of the attachments, you can loop through the list and attach them one by one with the above code.

Once everything is set properly, we can convert the message object into to a string:

msg_full = message.as_string()

Send email

Here comes to the most important part, we will need to initiate the TLS context and use it to communicate with SMTP server.

context = ssl.create_default_context()

And we will initialize the connection with SMTP server and set the TLS context, then start the handshaking process.

Next it authenticate our gmail account, and in the send mail method, you can specify the sender, to and cc (as a list), as well as the message string. (cc is optional)

with smtplib.SMTP(smtp_server, smtp_port) as server:
	server.ehlo()  
	server.starttls(context=context)
	server.ehlo()
	server.login(gmail, password)
	server.sendmail(gmail, 
				to.split(";") + (cc.split(";") if cc else []),
				msg_full)
	server.quit()

print("email sent out successfully")

Once sendmail completed, you will disconnect with the server by server.quit().

With all above, you shall be able to receive the email triggered from your code. You may want to wrap these codes into a class, so that you can reuse it as service library in your multiple projects.

 

As per always, please share if you have any questions or comments.

python cache

How to print colored message on command line terminal window

When you are developing a python script with some output messages printed on the terminal window, you may find a little bit boring that all the messages are printed in black and white, especially if some messages are meant for warning, and some just for information only. You may wonder how to print colored message to make them look differently, so that your users are able to pay special attention to those warning or error messages.

In this article, I will be sharing with you a library which allows you to print colored message in your terminal.

Let’s get started!

The library I am going to introduce called colorama, which is a small and clean library for styling your messages in both Windows, Linux and Mac os.

Prerequisite :

You will need to install this library, so that you will be able to run the following code in this article.

pip install colorama

To start using this library, you will need to import the modules, and call the init() method at the beginning of your script or your class initialization method.

import colorama
from colorama import Fore, Back, Style
colorama.init()

Print colored message with colorama

The init method also accepts some **kwargs to overwrite it’s default behaviors. E.g. by default, the style will not be reset back after printing out a message,  and the subsequent messages will be following the same styles. You can pass in autoreset = true to the init method, so that the style will be reset after each printing statement.

Below are the options you can use when formatting the font, background and style.

Fore: BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE, RESET.
Back: BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE, RESET.
Style: DIM, NORMAL, BRIGHT, RESET_ALL

To use it in your message, you can do as per below to wrap your messages with the styles:

print(Fore.CYAN + "Cyan messages will be printed out just for info only" + Style.RESET_ALL)
print(Fore.RED + "Red messages are meant to be to warning or error" + Style.RESET_ALL)
print(Fore.YELLOW + Back.GREEN +  "Yellow messages are debugging info" + Style.RESET_ALL)

This is how it would look like in your terminal:

Python printed colored message with colorama

As I mentioned earlier, if you don’t set the autoreset to true, you will need to reset the style at the end of your each message, so that different message applies different styles.

What if you want to apply the styles when asking user’s input ? Let’s see an example:

print(Fore.YELLOW)
choice = input("Enter YES to confrim:")
print(Style.RESET_ALL)
if str.upper(choice) in ["YES",'Y']:
    print(Fore.GREEN + "You have just confirmed to proceed." + Style.RESET_ALL)
else:
    print(Fore.RED + "You did not enter yes, let's stop here" + Style.RESET_ALL)

By wrapping the input inside Fore.YELLOW and Style.RESET_ALL, whatever output messages from your script or user entry, the same style will be applied.

Let’s put all the above into a script and run it in the terminal to check how it looks like.

Python printed colored message with colorama

Yes, that’s exactly what we want to achieve! Now you can wrap your printing statement into a method e.g.: print_colored_message, so that you do not need to repeat the code everywhere.

As per always, please share if you have any comments or questions.

 

python unpack objects

Python how to unpack tuple, list and dictionary

There are various cases that you want to unpack your python objects such as tuple, list or dictionary into individual variables, so that you can easily access the individual items. In this article I will be sharing with you how to unpack these different python objects and how it can be useful when working with the *args and **kwargs in the function.

Let’s get started.

Unpack python tuple objects

Let’s say we have a tuple object called shape which describes the height, width and channel of an image, we shall be able to unpack it to 3 separate variables by doing below:

shape = (500, 300, 3)
height, width, channel = shape
print(height, width, channel)

And you can see each item inside the tuple has been assigned to the individual variables with a meaningful name, which increases the readability of your code. Below is the output:

500 300 3

It’s definitely more elegant than accessing each items by index, e.g. shape[0], shape[1], shape[2].

What if we just need to access a few items in a big tuple which has many items? Here we need to introduce the _ (unnamed variable) and * (unpack arbitrary number of items)

For example,  if we just want to extract the first and the last item from the below tuple, we can let the rest of the items go into a unnamed variable.

toto_result = (4,11,14,23,28,47,24)
first, *_, last = toto_result
print(first, last)

So the above will give the below output:

4 24

If you are curious what is inside the “_”, you can try to print it out. and you would see it’s actually a list of the rest of items between the first and last item.

[11, 14, 23, 28, 47]

The most popular use case of the packing and unpacking is to pass around as parameters to function which accepts arbitrary number of arguments (*args). Let’s look at an example:

def sum(*numbers):
    total = 0
    for n in numbers:
        total += n
    return total

For the above sum function, it accepts any number of arguments and sum up the values. The * here is trying to pack all the arguments passed to this function and put it into a tuple called numbers. If you are going to sum up the values for all the items in toto_result, directly pass in the toto_result would not work.

toto_resut = (4,11,14,23,28,47,24)
#sum(toto_result) would raise TypeError

So what we can do is to unpack the items from the tuple then pass it the sum function:

total = sum(*toto_resut)
print(total)
#output should be 151

Unpack python list objects

Unpacking the list object is similar to the unpacking operations on tuple object. If we replace the tuple to list in the above example, it should be working perfectly.

shape = [500, 300, 3]
height, width, channel = shape
print(height, width, channel)
#output shall be 500 300 3

toto_result = [4,11,14,23,28,47,24]
first, *_, last = toto_result
print(first, last)
#output shall be 4 24

total = sum(*toto_resut) 
print(total) 
#output should be also 151

Unpack python dictionary objects

Unlike the list or tuple, unpacking the dictionary probably only useful when you wants to pass the dictionary as the keyword arguments into a function (**kwargs).

For instance, in the below function, you can pass in all your keyword arguments one by one.

def print_header(**headers):
    for header in headers:
        print(header, headers[header])

print_header(Host="Mozilla/5.0", referer = "https://www.codeforests.com")

Or if you have a dictionary like below, you can just unpack it and pass to the function:

headers = {'Host': 'www.codeforests.com', 'referer' : 'https://www.codeforests.com'}
print_header(**headers)

It will generate the same result as previously, but the code is more concise.

Host www.codeforests.com
referer https://www.codeforests.com

With this unpacking operator, you can also combine multiple dictionaries as per below:

headers = {'Host': 'www.codeforests.com', 'referer' : 'https://www.codeforests.com'}
extra_header = {'user-agent': 'Mozilla/5.0'}

new_header = {**headers, **extra_header}

The output of the new_header will be like below:

{'Host': 'www.codeforests.com',
 'referer': 'https://www.codeforests.com',
 'user-agent': 'Mozilla/5.0'}

Conclusion

The unpacking operation is very usefully especially when dealing with the *args and **kwargs. There is one thing worth noting on the unamed variable (_) which I mentioned in the previous paragraph. Please use it with caution, as if you notice, the python interactive interpreter also uses _ to store the last executed expression. So do take note on this potential conflict. See the below example:

codeforests interactive interpreter conflicts

As per always, welcome any comments or questions.

Manipulate Audio File in Python, pydub, download youtube,cut video python

Manipulate Audio File in Python With 6 Powerful Tips

Introduction

Dealing with audio files may not be that common to a Python enthusiast, but sometimes you may wonder if you are able to manipulate audio files in Python for your personal interest. For instance, if you really like some music, and you want to edit some parts of it and save into your phone, so that you can listen it during your study or outdoor exercise without skipping those annoying ads.

In this post, I would be introducing you a simple yet useful library for you to manipulate audio file in Python code.

Prerequisites:

You need to install Pydub in your working environment, below is the installation command via pip (click for more tips):

pip install pydub

The library has the dependency to ffmpeg in order to support most of the audio file formats, so you may use the below command to install if you do not have it yet:

pip install ffmpeg

Download Video from YouTube

As I am going to use a funny video from YouTube for the later demonstration, I would need install another library – youtube_dl to download the video into my local folder:

pip install youtube_dl

Below is the command to download the video from YouTube with the given URL and the output file name. You can also use -f to specify the file format if the original video has multiple format:

youtube-dl "https://www.youtube.com/watch?v=Zo6F_qtQCCc" -o "hongshaorou.mp4"

You may see the below output messages from your terminal, and the final output file will be saved to your current directory:

Manipulate Audio File in Python, pydub, download youtube,cut video python

Now let’s import pydub and use this video to explore what we can do with this library.

from pydub import AudioSegment
import os

Extract Sound From A Video File

To load a video file, we can use the from_file function from the AudioSegment module:

base_dir = r"c:\sounds"
sound = AudioSegment.from_file(os.path.join(base_dir, "hongshaorou.mp4"))

There are also other functions such as from_mp3, from_wav or from_ogg etc., depending on what type of audio files you want to read. With the export function, you can easily convert the video file into another format:

sound.export(os.path.join(base_dir, "hsr.mp3"), format="mp3")

There are some more parameters you can use to specify the metadata when you save the file, e.g.:

sound.export(os.path.join(base_dir, "hsr.mp3"),
                           format="mp3",
                           bitrate="192k",
                           tags={"album": "chinese cuisine", "artist": "not sure"},
                           cover= os.path.join(base_dir,"hongshaorou.jpeg"))

And you can also retrieve the meta info as per below:

from pydub.utils import mediainfo
mediainfo('hsr3.mp3')

Split/Cut Audio Clips

With the AudioSegment object, you can cut the audio file like slicing a list by specifying the starting point and ending point in milliseconds. For instance, to cut our audio file from 1:18 to 1:33 and save it to mp3:

first_cut_point = (1*60 + 18) * 1000
last_cut_point = (1*60 + 33) * 1000

sound_clip = sound[first_cut_point:last_cut_point]

sound_clip.export(os.path.join(base_dir, "hsr.mp3"), format="mp3")

Increase/Reduce Sound Volume

You can make the sound louder or quieter by adding/subtracting the decibels as per below:

#increase volume by 10dB for the first 2 seconds
sound_clip_1 = sound_clip[:2000] + 10

#reduce volume by 5dB for the last 3 seconds
sound_clip_2 = sound_clip[-3000:] - 5

#combine multiple sound clips
final_clip = sound_clip_1 + sound_clip[2000:-3000] + sound_clip_2

Play Sound In Python

If you are running the code in Jupyter Lab, you can simply execute the final_clip and see how the result sounds like:

Otherwise you use the playback module to play the sound as per below:

from pydub.playback import play
play(final_clip)

Adding Silence In The Sound

Silence can be added to your sound clip as per below:

#Adding 1 second silence before the sound clip
AudioSegment.silent(duration=1000) + sound_clip[:5000]

Overlay Audio Onto Another Audio

The overlay function allows you to overlay one AudioSegment to another AudioSegment object. For instance:

sound_clip[5000:10000].overlay(final_clip[:5000])

There are some more useful functions for editing audio files, you can see full API document from here.

Conclusion

In this article, we have reviewed through a few very useful functions in the pydub library which allows you to manipulate audio file such as converting audio formats, combining, splitting or editing sound clips. With these tips, you shall be able to create your own sound clips in a few lines of Python code. In this post, we have also used the youtube-dl library which allows you to download the video from YouTube and some other video streaming website. You may refer to this reddit discussion if you are wondering whether this is legal. But I believe it should be alright if you just use for your personal exploration on the Python programming.

Python generate QR code, Python read QR code, Photo by Lukas on Unsplash

Read and Generate QR Code With 5 Lines of Python Code

 Introduction

QR Code is the most popular 2 dimensional barcodes that widely used for document management, track and trace in supply chain and logistics industry, mobile payment,  and even the “touchless” health declaration and contact tracing during the COVID-19 pandemic. Comparing to 1D barcode, QR code can be very small in size but hold more information, and also easier for scanning as you can scan it from any direction.

In this article, I would be sharing with you how to use some pure Python packages to generate QR code and read QR code from images.

Generate QR code with Python

To generate QR code, we will use a Python package called qrcode. Below is the pip command to install this package:

#install qrcode together with pillow
pip install qrcode[pil]

#or install qrcode if you already have pillow installed
pip install qrcode

As it has dependency to Pillow package, you will need to have this package installed as well. Once you have these packages ready, let’s import the modules at the beginning of our code:

import qrcode
from PIL import Image

Generating a QR code with this qrcode library can be easily done with 1 line of code:

img = qrcode.make('QR Code')

If you check the “img” object from Jupyter Notebook, you can see the below image:

Generate QR code Python, Read QR code Python

This make function provides a quick way to generate QR code with all the default parameters. To specify the parameters like the size, style or border of the boxes, you can use the QRCode class. For instance:

qr = qrcode.QRCode(
    version=1,
    error_correction=qrcode.constants.ERROR_CORRECT_L,
    box_size=10,
    border=4,
)

Here is the explanations for these parameters:

version – QR code has 40 different sizes which indicated as the version parameter in above, version 1 represents a 21×21 matrix.  You can use (v-1)*4 + 21 to calculate the size of the matrix for each of the version number.

error_correctionspecifies error correction level which controls how many error correction code blocks to be inserted in order to achieve the error correction capability. In another words, if you want your barcode to be readable even when it’s damaged (or have a logo/image onto it) , you may increase the error correction level, but this would also make your barcode more compact.

box_size the number of pixels of the square box

border – the thickness of the square box border

Once you have a QRCode instance, you can use the below code to specify the barcode data, color and generate a image:

#barcode content
qr.add_data('codeforests.com')

#auto adjust the size
qr.make(fit=True)

#specifying barcode color
img = qr.make_image(fill_color="#040359", back_color="#f7f7fa")

If you check the “img” object from Jupyter Notebook again, you shall see something similar to below:

 

Generate QR code Python, Read QR code Python

 

To use the same barcode style to generate new barcode, you can just clear the data and then re-generate a new image object:

 

qr.clear()
qr.add_data('Python Tutorials')

img2 = qr.make_image(fill_color="#015b82", back_color="TransParent")

When inspecting the “img2” in Jupyter Notebook, you shall see below:

Generate QR code Python, Read QR code Python, Python QR code with different color

 

You can simply use the “save” method to save it into an image file since it is a Pillow Image object:

img2.save("qr_code.png")

The qrcode package cannot directly generate multiple QR codes into one image, if you need that, you may use the Pillow package to combine the images. For instance:

#create a blank image
new_img = Image.new("RGBA", (600, 350), "#fcfcfc")

new_img.paste(img, (0, 0))

new_img.paste(img2, (300, 0))

new_img.save("multi-QR-code.png")

The above will create a new image and combine the two barcode images into one. If you check the saved image file, you shall see:

 

Generate QR code Python, Read QR code Python, generate multiple QR codes on one page

With this package, you can also generate styled QR code e.g.: rounded corners, radial gradient, embedded image or different color masks. You can take a look at the samples from it’s office site.

Read QR Code in Python

To read QR code, we will use another Python package called pyzbar. You can use below pip command to install it:

pip install pyzbar

This library is also a very easy to use, you can directly pass in a Pillow Image object, numpy.ndarray or raw bytes to the decode method to detect the barcode. For instance:

import pyzbar.pyzbar as pyzbar
from pyzbar.pyzbar import ZBarSymbol

input_image = Image.open("multi-QR-code.png")

decoded_objects = pyzbar.decode(input_image, symbols=[ZBarSymbol.QRCODE])

The decode method returns a list of barcode objects detected from the image with their position info. You can use the symbols parameter to restrict what type of barcodes you want to detect. When this parameter is not specified, all its supported barcode types will be checked.

From the above, you can further loop through the list to get the actual content data of the barcodes:

for obj in decoded_objects:
    zbarData = obj.data.decode("utf-8")
    print(zbarData)

You shall see the below result:

Generate QR code Python, Read QR code Python

In your real-world project, if you need to read one barcode among the multiple barcodes from a document, you may try to use the symbols to restrict the barcode types, or use regular expression to validate the detected barcode data in order to find the correct one you need.

If you need to do a lot of image pre-processing or even read barcode from video or webcam, you may install OpenCV and use the detectAndDecodeMulti method to read the QR code.

Conclusion

In this article, we have reviewed through two simple but useful packages – qrcode for generating QR code, and pyzbar for reading the content from a QR code. There are quite many other Python packages for generating all sorts of one or two dimensional barcodes, some are in pure Python packages and some are Python wrappers, you may take a look at the summary table from this blog if any specific barcode type you need is not supported by these two packages.

 

20 Useful Tips for Using Python Pip

20 Tips for Using Python Pip

Introduction

Python has become one of the most popular programming languages due to the easy to use syntax as well as the thousands of open-source libraries developed by the Python community. Almost every problem you want to solve, you can find a solution with these third-party libraries, so that you do not need to reinvent the wheels. Majority of these libraries are hosted in the repository called Pypi and you can install these libraries with the Python pip command.

Python pip module helps you to manage the downloading, installation of the packages, and solving the dependency requirements. Although you probably have used pip for some time, you may not spend much time to read through it’s user guide for some of the useful operations. In this article, we have summarize the 20 useful tips for managing Python third party packages with Python pip.

Check the current pip version

Since Python version 3.4, the pip module has been included by default within the Python binary installer, so you do not need to install it separately once you have Python program installed. To check the version of the pip package, you can use the below:

pip --version

Sample output:

Python Pip version

Install package from Pypi

Installing package is very simple with pip command, you can use “install” option followed by one or multiple package names:

pip install requests

By default, pip looks for the latest release and install the latest version for you together with the dependency packages. Sample output as per below:

Python Pip install package

You can also specify the version number of the package to be installed:

py -m pip install pip==21.1.1

Sample output:

Python Pip install package with version number

Pip also supports a list of version specifier such as >=1.2, <2.0, ~=2.0, !=2.0 or ==1.9.* for matching the correct version of the package to be installed.

When you are not in a virtual environment, the package will be installed into the global folder (system-site) by default, you can use the “–user” option to specify the installation folder in user-site in case of any permission issue. E.g.:

pip install --user requests

Output as per below:

Python Pip install package to user-site

Although you can specify your own customized installation path for your different projects, using virtual environment is still the best way to manage dependencies and conflicts.

Show package version and installation location

To check the basic information such as version number or installation location for an existing package, you can use the “show” option:

pip show colorama

You can see the below information about the package:

Python Pip show package version and installation location

And you can also use the “–verbose” mode to display the additional meta info.

List all the packages installed

To list out all the packages installed, you can use the “list” option:

py -m pip list

You shall see the output format similar to below:

Python Pip list packages

You can add a “–user” option to list all packages installed in your user-site, e.g.:

py -m pip list --user

When you are using virtual environment with “–system-site-packages” (allowing virtual environment to access system-site packages), you can use the “list –local” option to show only the packages installed in your virtual environment:

py -m pip list --local

List all the outdated packages

To check if any installed packages are outdated, you can use the “–outdated” option:

py -m pip list -o
# or
py -m pip list --outdated

Below is the sample output:

Python Pip list outdated packages

Upgrade package to the latest version

Once identified the outdated packages, you can manually use the “–upgrade” option to upgrade the package to the latest version. Multiple package name can be specified with whitespaces:

py -m pip install --upgrade pip
#or 
py -m pip install --U pip setuptools

Sample output as per below:

Python Pip upgrade package

Auto upgrade packages to the latest version

Pip does not have an option to auto upgrade the outdated packages, but you can make use of the result from “list -o” and create a simple script to achieve it, e.g.:

#in Windows command line
for /F "skip=2 delims= " %i in ('pip list --o --local') do pip install -U %i

#in linux
pip list --o --local | grep -v '^\-e' | cut -d = -f 1  | xargs -n1 pip install -U

Export installed packages

You can use “freeze” option to export all your installed package names into a text file, so that you can re-create exactly the same project environment in another PC. For instance:

py -m pip freeze -l > requirements_demo.txt

Result in the output text file:

Python Pip install requirement file

Install multiple packages from requirement file

For the packages you’ve exported with “freeze” option, you can re-install all the packages in another environment with the below “-r” option:

py -m pip install -r requirements.txt

You may see the below output when you have package name “numpy” in your requirements.txt file:

Python Pip install package with requirement file

The requirements.txt also allows to include other requirement files. This may be useful when you have a sub module requires extra packages and can be run independently as a separate application. So you may put the common packages in the requirements.txt and the additional packages in the requirements_module1.txt file,  the include the requirements.txt file in your module file.

E.g. the content in the requirements_module1.txt:

#opencv-python
#comment out some packages
python-dateutil

-r requirements.txt

When you run the “install” command:

py -m pip install -r requirements_module1.txt

You shall the sample output as per below:

Python Pip install package with multiple requirement files

Uninstall packages

Uninstalling an existing package can be done with below command:

pip uninstall numpy

Output as per below:

Python Pip uninstall package

Install package from wheel file

When you have a binary wheel file downloaded in your local folder, you can also use the “install” option to install the wheel file directly:

py -m pip install --force-reinstall C:\Users\codef\Downloads\python_dateutil-2.8.2-py2.py3-none-any.whl

Output as per below:

pip install whl

Install package from non-Pypi index

If the package is not hosted in Pypi index, you can manually specify the index url with “–index-url” or simply “-i” :

py -m pip install -i https://mirrors.huaweicloud.com/repository/pypi/simple/ colorama

Above command would download and install the package from huawei cloud repository (a PyPi mirror):

Python Pip install package from Pypi mirrors

This would be also helpful when you are not able to access the Pypi directly due to the firewall or proxy issue in your network, you can find a Pypi mirror repository and download the packages from there. Usually these mirrors synchronize with Pypi in a few minutes interval which should not cause any issue for your development work.

Configure global index url

To permanently save the index url so that you do not have to key in the url for every package installation, you can use the “config” option to set the url globally. e.g:

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple

With the above setting, you can install package from the mirror repository as per normal without specifying the url option.

Check package compatibility

When you manually install the packages, sometimes you may encounter issues that some dependency packages

having incompatible version installed. To check if you have any such issue, you can use the “check” option :

python -m pip check

You may see something similar to below when there is any conflict:

Python Pip check package compatibility

Download package into local folder

You can download the package wheel files into your local folder when you need:

pip download requests -d .\requests

The “-d” option allows you specify the target folder where you want to save the wheel files. You may get multiple wheel files if the package has any dependency packages. (you can use “–no-deps” when you do not want to download the dependency files)

Below is the sample result:

 

Python Pip download wheels file

Install package into local folder

To install the package from a folder, you can use the “-f” with the file path:

pip install requests -f .\requests

This is the same as installing the package from Pypi:

Python Pip install package offline

Conclusion

In this article we have summarized some useful tips for using Python pip to manage the installation and upgrading of the third party packages for your Python projects. For more advanced usage of this module, you may refer to it’s official document.

tips for handle python exception, raise and catch multiple exception, user defined exception

7 Tips for Handling Python Exception

Introduction

Exception handling is commonly seen in Java, C++ or any other modern programming languages. Similarly, in Python, it has two types of errors – syntax errors and exceptions. Syntax errors refer to the syntactic problems which are detected by interpreter when translating the source code into byte code, while the exceptions are only detected during the runtime when the particular line is evaluated and executed. Python exceptions can be handled by application code but not the syntax errors.

In this article, we will discuss about the different ways to handle Python exceptions and when shall we use them.

Raise Exception

Python has a list of built-in exceptions which can be generated by interpreter or the built-in functions. Before we touch on the exception handling, let’s first look at how to raise a built-in exception.

Without implementing any new exception, you can use the raise keyword to throw out a built-in exception from your code. For instance, the below code checks the data type and value for mode variable, and raises built-in exceptions when wrong data type or value specified:

if not isinstance(mode, int):
    raise TypeError("mode must be numeric values")
        
if not mode in [0, 1]:
    raise ValueError("mode has to be 0 or 1")

When the mode variable is passed as a string “1”, you shall see exception raised as per below:

tips for handle python exception, raise and catch multiple exception, user defined exception

Catch Exception

To catch an exception, Python uses try-except which is similar to the try-catch syntax in Java/C++. In the except block, you can either write your own logic to handle the exception, or re-raise the exception to the callers.

For example, the below code would catch the KeyError exception and re-raise it as ValueError to the caller:

def read_config(config):
    
    try:
        mode = config["mode"]
    except KeyError as err:
        raise ValueError(f"No mode has been specified.")        
                
    return mode

When calling the read_config function with an empty dictionary:

read_config(config={})

You shall see the below exception raised:

tips for handle python exception, raise and catch multiple exception, user defined exception

The traceback shows both the cause of the exception and the last exception raised. If you want to hide the details of the cause (e.g. KeyError exception), you can use the raise exception from None to disable the chaining of the exceptions:

raise ValueError(f"No mode has been specified.") from None

When re-run the code, you shall see that only the last exception was showing in the trackback:

tips for handle python exception, raise and catch multiple exception, user defined exception

You can also specify a different exception when you think a explicit cause is needed rather than the original exception, e.g. instead of raise from KeyError, you can raise from AttributeError for better clarity to the caller.

Catch Multiple Exceptions

Catching multiple exceptions are as simple as adding multiple else if statements. Be cautious about the sequence of the exception classes as the derived exception classes must be placed before the base class, otherwise all your exceptions would be caught by the base exception class.

Below is an example to catch all the file operation exceptions, since the OSError is the base class, it shall be put as the last except block:

file_path = r"c://config.txt"

try:
    with open(file_path, "w") as f:
        config = f.write("password")
except FileNotFoundError:
    print(f"file {file_path} is not found")
except PermissionError:
    print(f"you have no permission to write to {file_path}")
except OSError as e:
    print("os error {e}")

When you run the above code, you shall see the output similar to below:

you have no permission to write to c://config.txt

You can also put multiple exceptions with parentheses to handle them within the same except block when there is no need to distinguish them:

except (OSError, FileNotFoundError, PermissionError) as e: 
    print(e)

This is similar to an IN operator, the exception will be caught as as long as it is an instance of any of these classes.

Suppress Exception with Pass Keyword

There are cases that you cannot do anything when the exception happens, but you do not wish your program to halt due to these exceptions. In this case, you shall just suppress the exception with pass keyword in the except block.

Below is an example to perform some clean-up in a class destructor method for a selenium automation project. When exception happens, you cannot do anything but letting the program exit peacefully:

def __del__(self):

    if self.driver:
        try:
            self.driver.quit()
        except:
            #suppress any exceptions
            pass

Use of the else block

The Python try-except statement also has a optional else clause, which allows you to execute some code when no exception is raised. This probably useful when you do not want some exceptions to be caught by the try-except which is not intended for them. For instance, when you put a few lines of codes in the try block, and intend to catch the exception for the first line, you may accidently catch some exceptions raised in the subsequent lines. To avoid that, you can split your code and put the rest of the code in the else block where it only get executed when no exception raised for the first line.

Another useful example would be, if you have a for-loop inside the try-except block, and you need to check if all iterations are successful without exception, instead of having a flag to keep track it, you can use the else block:

import math
from random import randint

def check_all_successful(num):
       
    while(True):
        result = {}
        try:
            #some dummy logic for demo purpose
            #exception would not raise if all random numbers generated are positive
            for i in [randint(-num, num) for _ in range(6)]:
                result[i] = math.sqrt(i)
                
        except Exception as e:
            print(f"failed to process: {i}, error: {e}")
        else:
            #when all iterations are successful
            print(result)
            break

When calling this function:

check_all_successful(10)

You may see the function ended when no exception raised, and below is the output:

tips for handle python exception, raise and catch multiple exception, user defined exception

Perform Clean-up action with Finally Keyword

If there is any operation you must perform regardless of whether the exception is raised or not, such as closing an IO object or network connection, you shall consider to use the finally statement. It is designed for any clean-up logic after the exception is raised, and make sure the code is executed before the exception raised to the caller.

For instance, below is some code snippet to connect to SQLite and update the data for a user table. When exception happens, the finally block shall close the connection properly before program exits:

import sqlite3
con = sqlite3.connect('example.db')

try:
    cur = con.cursor()
    cur.execute("UPDATE USER SET PASSWORD = 'password' WHERE NAME = 'admin'")
    con.commit()
finally:
    con.close()

One thing to be noted, if you have return statements in both try/except and finally block, only the result from finally block will be returned to the caller no matter the exception happens or not. For instance, the below will always return 9:

def test():
    try:
        return 1/0
    except:
        return 2
    finally:
        return 9

Create User Defined Exceptions

When writing your own application, you may want to define your own exceptions based on your business scenario. To create custom exceptions, you shall first create your base exception class which is derived from the Python base exception – Exception, then you can create your other exceptions deriving from this base exception class.

For instance, you may create below exception classes:

# exceptions.py

class AppException(Exception):
    pass

class CurrencyNotSupported(AppException):
    pass

And in your application code, you can raise your own exceptions whenever appropriate. E.g. :

def convert_to_usd(local_currency, local_amount):
    
    if local_currency.lower() not in ["sgd"]:
        raise CurrencyNotSupported(f"The currency {local_currency} is not supported")
        
    return local_amount * 1.32

When you passed in some invalid input to the function:

convert_to_usd("eur", 100)

You shall see the exception raised in the traceback:

tips for handle python exception, raise and catch multiple exception, user defined exception

Conclusion

In this article, we have reviewed through the different scenarios for raising and handling Python exceptions. As a general rule, you shall not suppress any exception when you are not 100% sure what to do with it. Because this may need to other system bugs or troubleshooting issues later on. The exceptions shall be handled by parties who has the best knowledge to handle it, so raise rather than suppress it when you are not sure how to handle it.

reading email from outlook with python pywin32

5 Useful Tips for Reading Email From Outlook In Python

Introduction

Pywin32 is one of the most popular packages for automating your daily work for Microsoft outlook/excel etc. In my previous post, we discussed about how to use this package to read emails and save attachments from outlook. As there were quite many questions raised in the comments which were not covered in the original post, this article is intended to review through some of the advanced topic for reading emails from outlook via Python Pywin32 package.

If you have not yet read through the previous post, you may check it out from here.

Prerequisites:

Assuming you have already installed the latest Pywin32 package and imported below necessary packages in your script, and you shall not encounter any error after executing the GetNamespace method to establish the outlook connection:

import win32com.client

#other libraries to be used in this script 
import os 
from datetime import datetime, timedelta

outlook = win32com.client.Dispatch('outlook.application') 
mapi = outlook.GetNamespace('MAPI')

When using below code to iterate the Accounts property, you shall see whichever accounts you have configured in your outlook:

for account in mapi.Accounts: 
    print(account.DeliveryStore.DisplayName)

#Assuming below accounts have been configured:
#abc@hotmail.com
#def@gmail.com

Now let’s move on to the topics we are going to discuss in this article.

Reading Email from Multiple Outlook Accounts

If you have multiple accounts configured in your outlook application, to access one of the accounts, you can use the Folders method and specify the account name or index of the account, e.g.:

for idx, folder in enumerate(mapi.Folders):
    #index starts from 1
    print(idx+1, folder)

#Assuming below output:
# 1  abc@hotmail.com
# 2  def@gmail.com

And to access the sub folders under a particular email account, you can continue to use the folders method to specify the sub folder name or index of the folder. Before that, you may want to check what are the available sub folders and it’s index value as per below:

for idx, folder in enumerate(mapi.Folders("abc@hotmail.com").Folders):
    print(idx+1, folder)
# or using index to access the folder
for idx, folder in enumerate(mapi.Folders(1).Folders): 
    print(idx+1, folder)

You shall see something similar to the below:

reading email from outlook with Python pywin32

With the above folder index and name, you shall be able to access the email messages as per below:

messages = mapi.Folders("abc@company.com").Folders("Inbox").Items
# or
messages = mapi.Folders(1).Folders(2).Items
for msg in list(messages):
    print(msg.Subject)

Although the index would not get changed when you move up/down of your folders in outlook, obviously using folder name still is much better than index in terms of readability of the code.

Filter Email Based on Receiving Time Window

When reading emails from outlook inbox, you may want to zoom into the emails within a specific receiving time window rather than scanning through thousands of emails you have received in the inbox. To filter emails based on certain conditions, you can use restrict method together with the logical operators.

For instance, to filter the emails received from 1st day of the current month until today 12am:

today = datetime.today()

# first day of the month
start_time = today.replace(month=1, hour=0, minute=0, second=0).strftime('%Y-%m-%d %H:%M %p')

#today 12am
end_time = today.replace(hour=0, minute=0, second=0).strftime('%Y-%m-%d %H:%M %p')

messages = messages.Restrict("[ReceivedTime] >= '" + start_time
+ "' And [ReceivedTime] <= '" + end_time + "'")

With logical operators like AND, OR and NOT, you are able to combine multiple criteria together. For instance, to check the email with certain subject but not from a particular sender email:

messages = messages.Restrict("[Subject] = 'Sample Report'" 
                             + " And Not ([SenderEmailAddress] = 'abc@company.com')")

And you can also use the Restrict method as many times as you wish if it makes your code more readable than combining all conditions in one filter, e.g.:

messages = messages.Restrict("[Subject] = 'Sample Report'")
messages = messages.Restrict("Not ([SenderEmailAddress] = 'abc@company.com')")

Getting First N emails

When using Restrict method for filtering email messages, you would not be able to specify max number of emails you want to read. If you wish to get the first/last N emails based on the receiving time, you can use the Sort method to sort the messages based on certain email properties before you slice the list. Below is the sample code to get the latest 10 email messages based on the receiving time:

messages.Sort("[ReceivedTime]", Descending=True)

#read only the first 10 messages
for message in list(messages)[:10]:
    print(message.Subject, message.ReceivedTime, message.SenderEmailAddress)

Wildcard Matching for Filtering

With the Restrict method, you cannot do wildcard matching such as searching whether the email subject or body contains certain keywords. To be able to achieve that, you will need to use the DASL query.

For instance, with the below DASL query syntax, you can filter email subject which contains “Sample Report” keyword:

messages = messages.Restrict("@SQL=(urn:schemas:httpmail:subject LIKE '%Sample Report%')")

You may want to check here to see what are the fields supported in ADSL query and the correct namespace to be used.

Include/Exclude Multiple Email Domains

To filter the emails only from a particular domain, you can use the ADSL query similar to the previous example:

messages = messages.Restrict("@SQL=(urn:schemas:httpmail:SenderEmailAddress LIKE '%company.com')")

And to exclude the emails from a few domains, you can use multiple conditions with logical operators:

messages = messages.Restrict("@SQL=(Not(urn:schemas:httpmail:senderemail LIKE '%@abc%') \
And Not(urn:schemas:httpmail:senderemail LIKE '%@123%') \
And Not(urn:schemas:httpmail:senderemail LIKE '%@xyz%'))")

Conclusion

In this article, we have reviewed through some advanced usage of the Pywin32 package for filtering emails. You may not find many Python tutorials for this package from online directly, but you shall be able to see the equivalent VBA code from its official website for most of the code you have seen in this article. In the event that you cannot find a solution for you problem, you may check and see whether there is something implemented in VBA code that you can convert it into Python syntax.

Link for the previous post Reading Email From Outlook In PythonFollow me on twitter for more updates.

common python mistakes for beginners

8 Common Python Mistakes You Shall Avoid

Introduction

Python is a very powerful programming language with easily understandable syntax which allows you to learn by yourself even you are not coming from a computer science background. Through out the learning journey, you may still make lots mistakes due to the lack of understanding on certain concepts. Learning how to fix these mistakes will further enhance your understanding on the fundamentals as well as the programming skills.

In this article, I will be summarizing a few common Python mistakes that many people may have encountered when they started the learning journey and how they can be fixed or avoided.

Reload Modules after Modification

Have you ever wasted hours to debug and fix an issue and eventually realized you were not debugging on your modified source code? This usually happens to the beginners as they did not realize the entire module was only loaded into memory once when import statement was executed. So if you are modifying some code in separate module and import to your current code, you will have to reload the module to reflect the latest changes.

To reload a module, you can use the reload function from the importlib module:

from importlib import reload

# some module which you have made changes
import externallib

reload(externallib)

Naming Conflict for Global and Local Variables

Imagine you have defined a global variable named app_config, and you would like to use it inside the init_config function as per below:

app_config = "app.ini"

def init_config():
    app_config = app_config or "default.ini"
    print(app_config)

You may expect to print out “app.ini” since it’s already defined globally, but surprisedly you would get the “UnboundLocalError” exception due to the variable app_config is referenced before assignment. If you comment out the assignment statement and just print out the variable, you would see the value printed out correctly. So what is going on here?

The above exception is due to Python tries to create a variable in local scope whenever there is an assignment expression, and since the local variable and global variable have the same name, the global variable being shadowed in local scope. Thus Python throws an error saying your local variable app_config is used before it’s initialized.

To solve this naming conflict, you shall use different name for your global variable and local variables to avoid any confusion, e.g.:

app_config = "app.ini"

def init_config():
    config = app_config or "default.ini"
    print(config)

Checking Falsy Values

Examining true or false of a variable in if or while statement sometimes can also go wrong. It’s common for Python beginners to mix None value and other falsy values and eventually write some buggy code. E.g.:  assuming you want to check when price is not None and below 5, trigger some selling alert:

def selling_alert(price):
    if price and price < 5:
        print("selling signal!!!")

Everything looks fine, but when you test with price = 0, you would not get any alert:

selling_alert(0)
# Nothing has been printed out

This is due to both None and 0 are evaluated as False by Python, so the printing statement would be skipped although price < 5 is true.

In python, empty sequence objects such as “” (empty string), list, set, dict, tuple etc are all evaluated as False, and also zero in any numeric format like 0 and 0.0. So to avoid such issue, you shall be very clear whether your logic need to differentiate the None and other False values and then split the logic if necessary, e.g.:

if price is None:
   print("invalid input")
elif price < 5:
   print("selling signal!!!")

Default Value and Variable Binding

Default value can be used when you want to make your function parameter optional but still flexible to change. Imagine you need to implement a logging function with an event_time parameter, which you would like to give a default value as current timestamp when it is not given. You can happily write some code as per below:

from datetime import datetime

def log_event_time(event, event_time=datetime.now()):
    print(f"log this event - {event} at {event_time}")

And you would expect as long as the event_time is not provided during log_event_time function call, it shall log an event with the timestamp when the function is invoked. But if you test it with below:

log_event_time("check-in")

# log this event - check-in at 2021-02-21 14:00:56.046938

log_event_time("check-out")

# log this event - check-out at 2021-02-21 14:00:56.046938

You shall see that all the events were logged with same timestamp. So why the default value for event_time did not work?

To answer this question, you shall know the variable binding happens during the function definition time. For the above example, the default value of the event_time was assigned when the function is initially defined. And the same value will be used each time when the function is called.

To fix the issue, you can assign a None as default value and check to overwrite the event_time inside your function call when it is None. For instance:

def log_event_time(event, event_time=None):
    event_time = event_time or datetime.now()
    print(f"log this event - {event} at {event_time}")

Similar variable binding mistakes can happens when you implement your lambda functions. For your further reading, you may check my previous post why your lambda function does not work for more examples.

Default Value for Mutable Objects

Another mistake Python beginners trend to make is to set a default value for a mutable function parameter. For instance, the below user_list parameter in the add_white_list function:

def add_white_list(user, user_list=[]):
    user_list.append(user)
    return user_list

You may expect when user_list is not given, a empty list will be created and then new user will be added into this list and return back. It is working as expected for below:

my_list = add_white_list('Jack')

# ['Jack']

my_list = add_white_list('Jill', my_list)

#['Jack', 'Jill']

But when you want to start with a empty list again, you would see some unexpected result:

my_new_list = add_white_list('Joe')
# ['Jack', 'Jill', 'Joe']

From the previous variable binding example, we know that the default value for user_list is created only once at the function definition time. And since list is mutable, the changes made to the list object will be referred by the subsequent function calls.

To solve this problem, we shall give None as the default value for user_list and use a local variable to create a new list when user_list is not given during the call. e.g.:

def add_white_list(user, user_list=None):
    if user_list is None:
        user_list = []
    user_list.append(user)
    return user_list

Someone may get confused that datetime.now() shall create a Python class instance, which supposed to be mutable also. If you checked Python documentation, you would see the implementation of datetime is actually immutable.

Misunderstanding of Python Built-in Functions

Python has a lot of powerful built-in functions and some of them look similar by names, and if you do not spend some time to read through the documentation, you may end up using them in the wrong way.

For instance, you know built-in sorted function or list sort function both can be used to sort sequence object. But occasionally, you may make below mistake:

random_ints = [80, 53, 7, 92, 30, 31, 42, 10, 42, 18]

# The sorting is done in-place, and function returns None
even_integers_first = random_ints.sort(key=lambda x: x%2)

# Sorting is not done in-place, function returns a new list
sorted(random_ints)

Similarly for reverse and reversed function:

# The reversing is done in-place, and function returns None
random_ints = random_ints.reverse()
# reversing is not done in-place, function returns a new generator
reversed(random_ints)

And for list append and extend function:

crypto = ["BTC", "ETH"]

# the new list will be added as 1 element to crypto list
crypto.append(["XRP", "BNB"])

print(crypto)
#['BTC', 'ETH', ['XRP', 'BNB']]

# the new list will be flattened when adding to crypto list
crypto.extend(["UNI"])

print(crypto)
# ['BTC', 'ETH', ['XRP', 'BNB'], 'UNI']

Modifying Elements While Iterating

When iterating a sequence object, you may want to filter out some elements based on certain conditions.

For instance, if you want to iterate below list of integers and remove any elements if it is below 5. You probably would write the below code:

a = [1, 2, 3, 4, 5, 6, 2]

for b in a:
    if b < 5:
        a.remove(b)

But when checking the output of the list a, you would see the result is not as per you expected:

print(a)
# [4, 5, 6, 2]

This is because the for statement will evaluate the expression and create a generator for iterating the elements. Since we are deleting elements from the original list, it will also change the state of the generator, and then further cause the unexpected result. To fix this issue, you can make use of the list comprehension as per below if your filter condition is not complex:

[b for b in a if b >= 5]

Or if you wish, you can use the the filterfalse together with the lambda function:

from itertools import filterfalse
list(filterfalse(lambda x: x < 5, a))

Re-iterate An Exhausted Generator

Many Python learners started writing code without understanding the difference between generator and iterator. This would cause the error that re-iterating an exhausted generator. For instance the below generator, you can print out the values in a list:

some_gen = (i for i in range(10))
print(list(some_gen))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

And sometimes you may forget you have already iterated the generator once, and when you try to execute the below:

for x in some_gen:
    print(x)

You would not be able to see anything printed out.

To fix this issue, you shall save your result into a list first if you’re not dealing with a lot of data, or you can use the tee function from itertools module to create multiple copies of the generator so that you can iterate multiple times:

from itertools import tee

# create 3 copies of generators from the original iterable
x, y, z = tee(some_gen, n=3)

Conclusion

In this article, we have reviewed through some common Python mistakes that you may encounter when you start writing Python codes. There are definitely more mistakes you probably would make if you simply jump into the coding without understanding of the fundamentals. But as the old saying, no pain no gain, ultimately you will get there when you drill down all the mistakes and clear all the roadblocks.

Python one-liners with list comprehension and ternary operation

15 Most Powerful Python One-liners You Can’t Skip

Introduction

One-liner in Python refers to a short code snippet that achieves some powerful operations. It’s popular and widely used in Python community as it makes the code more concise and easier to understand. In this article, I will be sharing some most commonly used Python one-liners that would definitely speed up your coding without compromising any clarity.

Let’s start from the basis.

Ternary operations

Ternary operation allows you to evaluate a value based on the condition being true or false. Instead of writing a few lines of if/else statements, you can simply do it with one line of code:

x = 1
y = 2

result = 1 if x > 0 and y > x else -1
print(result)
# 1

#re-assign x to 6 if it is evaluated as False
x = x or 6

Assign values for multiple variables

You can assign values for multiple variables simultaneously as per below. (You may want to check this article to understand what is going on under the hood)

key, value = "user", "password"

print(key, value)
#('user', 'password')

Swap variables

To swap the values of the variables, simply perform the below without having a temp variable which is usually required by other programming languages like Java or C.

key, value = value, key

print(key, value) 
#('password', 'user')

Swap elements in a list

Imagine you have a list of users as per below, and you would like to swap the first element with last element:

users = ["admin", "anonymous1", "anonymous2"]

Since list is mutable, you can re-assign the values of first and last elements by swapping their sequence as per below:

users[0], users[2] = users[2], users[0]
# or users[0], users[-1] = users[-1], users[0]

print(users)
#['anonymous2', 'anonymous1', 'admin']

Further more, if the requirement is to swap the elements at the odd and even positions in a list, e.g. in the below list:

numbers = list(range(10))
#[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

We would like to swap the elements of the 2nd and 1st, 4th and 3rd, and so on. It can be achieved by performing the below list slicing with assignment operation:

numbers[::2], numbers[1::2] = numbers[1::2], numbers[0::2]

print(numbers)
#[1, 0, 3, 2, 5, 4, 7, 6, 9, 8]

Replace elements in a list

To further expend the above example, if we want to replace the elements on every odd/even position in a list, for instance to 0, we can do the re-assignment with below:

numbers[1::2] = [0]*len(numbers[1::2])

print(numbers)
#[0, 0, 2, 0, 4, 0, 6, 0, 8, 0]

Of course there is an alternative way with list comprehension, we shall touch on it later.

Generate list with list comprehension

By using list comprehension, you can easily generate new a list with certain filtering conditions from the current sequence object. For instance, the below will generate a list of even numbers between 1 to 20:

even_nums = [i for i in range(1, 20) if i%2 == 0]

print(even_nums)
#[2, 4, 6, 8, 10, 12, 14, 16, 18]

Create sub list from a list

Similarly, you can get a sub list from the existing list with the list comprehension as per below:

[i for i in even_nums if i <5]
# 2, 4

Manipulating elements in the list

With list comprehension, you can also transform your list of elements into another format. For instance, to convert the integers to alphabets:

alphabets = [chr(65+i) for i in even_nums]
# ['C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S']

Or convert the upper case into lower case:

[i.lower() for i in alphabets]
#['c', 'e', 'g', 'i', 'k', 'm', 'o', 'q', 's']

And all the above can be done without list comprehension as well:

list(map(lambda x : chr(65+x), even_nums))
#['C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S']

list(map(str.lower, alphabets))
#['c', 'e', 'g', 'i', 'k', 'm', 'o', 'q', 's']

Another real world example would be to use list comprehension to list out all the .ipynb files from current folder and its sub folders (excluding the checkpoint files):

import os

[f for d in os.walk(".") if not ".ipynb_checkpoints" in d[0]
             for f in d[2] if f.endswith(".ipynb")]

Flatten a list of sequences

If you have a list of sequence objects as per below, and you would like to flatten them into 1 dimensional:

a = [[1,2], [3,4], [5,6,7]]

You can use multiple for expressions in list comprehension to flatten it:

b = [y for x in a for y in x]

print(b)
#[1, 2, 3, 4, 5, 6, 7]

Alternatively, you can make use of the itertools module to get the same result:

import itertools

list(itertools.chain.from_iterable(a))

Ternary operation with list comprehension

In the previous ternary operation example, we have discussed how to replace the elements in the even position of a list. Here is the alternative using list comprehension in conjunction with a ternary expression:

numbers = list(range(10))
[y if i % 2 == 0 else 0 for i, y in enumerate(numbers)]
#[0, 0, 2, 0, 4, 0, 6, 0, 8, 0]

Generate a dictionary with dictionary comprehension

To derive a dictionary from a list, you can use dictionary comprehension as per below:

even_nums_dict = {chr(65+i):v for i, v in enumerate(even_nums)}

#{'A': 2, 'B': 4, 'C': 6, 'D': 8, 'E': 10, 'F': 12, 'G': 14, 'H': 16, 'I': 18}

Generate a set with set comprehension

Similar operation is also available for set data type when you want to derive elements from a list to set:

even_nums_set = {chr(65+i) for i in even_nums}
#{'C', 'E', 'G', 'I', 'K', 'M', 'O', 'Q', 'S'}

When using the built-in data type set, you shall expect that it only keeps the unique values. For instance, you can use set to remove duplicate values:

a = [1,2,2,4,6,7]
unique = set(a)

print(unique)
#{1, 2, 4, 6, 7}

More Python comprehension examples can be found here.

Read file into generator

Reading files can be done in one-liner as per below:

text = (line.strip() for line in open('response.html', 'r'))

Take note the parentheses are used in above generator expression rather than [], when [] is used, it returns a list.

One-liner with Python -c command

Sometimes you may want to run code snippets without entering into the Python interactive mode. You can execute the code with Python -c option in command line window. For instance, check the current Python version:

python -c "import sys; print(sys.version.split()[0])"

#3.7.2

Or check the value of the environment variable:

python -c "import os;print(os.getenv('PATH').split(';'))"

Conclusion

In this article, we have reviewed through some commonly used Python one-liners which would greatly improve your code readability and coding productivity. There are definitely more to be covered, and every Pythonista would have his/her own list of favorite one-liners. Sometimes you will also need to consider the code performance before you use it rather simply pursuing the conciseness of the code.

As the general rule of thumb, you shall not use/innovate something that confusing ,difficult to read or totally not benefiting either in readability or productivity.

python visualize google trends data with word cloud

Python – Visualize Google Trends Data in Word Cloud

Christmas is just around the corner, the snowfall, beautiful festive lights and joyful songs from the last year still floating in your mind. But this year, things are getting unusual due to the Covid-19. A lot of celebration events are cancelled or suspended and people are advised to avoid gathering and stay at home as much as possible. Although staying at home became new norm, there is still a way that we can get to know what people are thinking about during this festive season since nowadays most of us search a lot from Google every day. With a few lines of Python code, we will be able to extract and visualize the data from Google Trends.

Let’s dive into the code examples.

Python to get Google trends data

To get the search trends from Google, we will need to use a Python package – pytrends. It’s not an official API for Google trends but It provides a convenience way to automatically download Google trends data same as what we can do manually from Google Trends website.

You can use the pip command to install the package:

pip install --upgrade pytrends

And import the necessary modules at the beginning of our code:

from pytrends.request import TrendReq

To use it, we can initiate the request object by providing the language for searching as well as the time zone information. For instance, I am specifying English as the language and time zone offset as -480 which is UCT + 8 in the below. The default value for this time zone offset is 360 (CST), so you can roughly see how this offset is calculated based on the UCT time zone.

pytrend = TrendReq(hl='en-US', tz=-480)

To get the search trends for a particular keyword, we shall specify it in a keyword list. For example, we use “christmas” to see what people have searched in Google related to this keyword. There are a few more parameters you need to specify in the build_payload function in order to narrow down the results:

cat – The category you are interested in, you can see the full list here.

timeframe – The date range when the search happened. You can specify the range as past X hours/days/months/years (the list of available options you can see from Google Trends web page) or even a specific start date and end date. For our case, we use “now 7-d” for the past 7 days.

geo –  The geolocation which can be two characters country code or leave it empty to see the results from globally

gprop – The source which you can leave it as empty for web search, other options can be images, news, youtube, or froogle

Let’s build up our query as per below:

kw_list = ["christmas"]
pytrend.build_payload(kw_list, cat=0, timeframe='now 7-d', geo='SG', gprop='')

With all these criteria, we can check what are the related topics people searched in Google from Singapore. The related_queries function will give you a dictionary of both top & rising queries related to the keywords:

trends = pytrend.related_queries()

If you examine the trends variable, you shall see something similar to below:

python visualize google trends data with word cloud

 

The dictionary consists of results for both “top” and “rising” results in pandas dataframe objects, and you can access the top queries as per below:

df_sg = trends["christmas"]["top"]

Examine the first a few records in df_sg, you can see that people in Singapore are still in celebration mood as most of records are related to greetings, light shows or gifts etc.

python visualize google trends data with word cloud

On the other hand, let’s also take a look at the search trends for UK since It has just announced some new restrictions on travelling recently.

pytrend.build_payload(kw_list, cat=0, timeframe='now 7-d', geo='GB', gprop='')
trends = pytrend.related_queries()
df_gb = trends["christmas"]["top"]

Examining the df_gb variable, you can see some people started worrying about the new rules and restrictions for this Christmas although majority of the searching results are still around of the festival celebration.

python visualize google trends data with word cloud

 

Visualize the results in word cloud

Since we have all the keywords and popularity that people used for search, the most straightforward to visualize them would be using word cloud to generate a picture. To do so, we will need use another python package – wordcloud which is a pure Python library for generating word cloud image. And you also need to use some supporting packages like PIL and numpy for manipulating the images.

You can use pip command to install these packages if you do not have them yet:

pip install --upgrade wordcloud
pip install Pillow==2.2.2
pip install --upgrade numpy

Let’s import all the necessary modules into our code:

from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS
from PIL import Image
import os
import numpy as np

From previous section, we have already got the search keywords in dataframe. wordcloud supports both text string and words frequencies, for simplicity, let’s convert only keywords into a space separated string and forget about the value (popularity).

text = ' '.join(df_sg["query"].to_list())

And as all the keywords contain “christmas”, we shall filter out this word before generating the word cloud. In wordcloud package, it has a list of predefined words to be excluded,  and you can append more words to be excluded as per the below:

stopwords = set(STOPWORDS) 
stopwords.add("christmas")

Now let’s use this featured image as our background for generating word cloud. We shall load it as a 3-demensional array as the background mask for later use:

bg_mask = np.array(Image.open(os.path.join(os.getcwd(), "christmas tree.jpg")))

With all these ready, we can initiate a word cloud object with below parameters. The name of the parameters are quite self-explanatory, so I will not go through them one by one. You can check the official document from here.

wc = WordCloud(
    width = 600, 
    height = 1000,    
    background_color = 'white',
    colormap = 'rainbow',
    mask = bg_mask,
    stopwords = stopwords,
    max_words = 1000,
    max_font_size = 150,
    min_font_size = 15,
    contour_width = 2, 
    contour_color = 'dodgerblue'
)

Then we can supply our words to the generate_from_text function which will process the text and generate the image. Next we can save the output into an image file as per below code:

wc.generate_from_text(text)
wc.to_file("SG_christmas_cloud.jpg")

When opening the output image file, you shall see something like the below. Isn’t that cool?

python visualize google trends data with word cloud

Similarly, when you pass the UK searching result and generate the word cloud, you would see “covid” and “rules” are most concerned by UK people.

python visualize google trends data with word cloud

Note: since we are passing through a text string, the frequency is based on how many times the words repeated rather than the popularity from Google.

Conclusion

In this article, we have discussed how to use pytrends to automatically get the Google search data for any particular keyword and then use wordcloud to visualize the information. It only covers some basic usage of these two packages, you may check further on their documents to understand what else are provided in these packages. One thing to take note is that pytrends is using some scrapping techniques to get the data from Google Trends, it may break when there is any structural change in the way that Google makes the requests or sends the response. So frequent code upgrade is required by the project team. By the way, they are looking for maintainers, just in case you are interested.

 

web scraping with python requests and lxml

Web Scraping From Scratch With 3 Simple Steps

Introduction

Web scraping or crawling refers to the technique to extract the information from a website and transform into structured data for later analysis. There are generally a few reasons that you may need to implement a web scraping scripts to automate the data collection process:

  • There isn’t any public API available for you to get data from the source sites
  • The information is updated from time to time, such as the exchange rate, you cannot manage it in a manual way
  • The final data you need is piecemeal from multiple sites; and so on

Before you decide to implement a scraping script, you will also need to check to be sure that you are not violating the term of use for the data you are going to scrape. Some sites are against the scraping robot. This article is intended for education purpose to help you to understand the overall processes of web scraping, so we will assume you already know the implication of the web scraping and possible legal issues on how the data is used.

Scraping a website sometimes can be difficult depends on how the target website is designed and where the data is resided. But generally you can split the process into 3 steps. Let’s walk through them one by one.

Understand the structure of your target website

As the first step, you shall take a quick look at your target website to see how the front end interacts with the backend, and how the data is populated to the web page. To keep our example simple, let’s assume user authentication is not required and our target is to extract the price change for the top 20 cryptocurrencies from coindesk for further analysis.

The first thing we shall do is to understand how this information is organized on the website. Below is the screenshot of the data presented on the web page:

web scraping with python requests and lxml

In Chrome browser, if you right click on the web page to inspect the HTML elements, you shall see that the entire data table is under <section class=”cex-table”>…</section>. You can verify this by hovering your mouse to this element, you would see there is a light blue overlay on the data table as per below:web scraping in python with requests and lxml

Next, you may want to inspect each text field on the page to further understand how the table header and records are arranged. For instance, when you check the “Asset” text field, you would see the below HTML structure:

<section class="cex-table">
	<section class="thead">
		<div>...</div>
		<div class="tr-wrapper">
			<div class="tr-left">
				<div class="tr">
					<div>...</div>
					<div style="flex:7" class="th">
						<span class="cell">
						<i class="sorting-icon">
						</i>
						<span class="cell-text">Asset</span>
						</span>
					</div>
				</div>
			</div>
		</div>
		...
	</section>
</section>

And similarly you can find the structure of the first row in the table body as per below:

<section class="tbody">
	<section class="tr-section">
		<a href="/price/bitcoin">
			<div class="tr-wrapper">
				<div class="tr-left">
					<div class="tr">
						<div style="flex:2" class="td">
							<span class="cell cell-rank">
							<strong>01</strong>
							</span>
						</div>
						<div style="flex:7" class="td">
							<span class="cell cell-asset">
							<img>...</img>
							<strong class="cell-asset-title">Bitcoin</strong>
							<span class="cell-asset-iso">BTC</span>
							</span>
						</div>
					</div>
				</div>
			</div>
		</a>
	</section>
</section>

You may notice that majority of these HTML elements does not have a id or name attribute as the unique identifier, but the style sheet (“class” attribute) is quite consistent for the same row of data. So in this case, we shall consider to use the style sheet as a reference to find our data elements.

Locate and parse the target data element with XPath

With the initial understanding on HTML structure of our target website, we shall start to find a way to locate the data elements programmatically.

For this demonstration, we will use requests and lxml libraries to send the http requests and parse the results. There are other package for parsing DOM such as beautifulsoup, but personally I find using XPath expression is more straightforward when locating an element although the syntax may not as intuitive as the way beautifulsoup does.

Below is the pip command if you do not have these two packages installed:

pip install requests
pip install lxml

Let’s import the packages and send a GET request to our target URL:

import requests
from lxml import html

target_url = "https://www.coindesk.com/coindesk20"
result = requests.get(target_url)

Our target URL does not require any parameters, in case you need to pass in parameters, you can pass via the params argument as per below:

payload = {"q" : "bitcoin", "s" : "relevant"}
result = requests.get("https://www.coindesk.com/search", params=payload)

The result is a response object which has a status_code attribute to indicate if correct response has been returned from the target website. To simplify the code, let’s assume we can always get the correct response with the return HTML in string format from the text attribute.

We then pass our HTML string to lxml and use it to parse the DOM tree as per below:

tree = html.fromstring(result.text)

Now we come to the most important step, we will need to use XPath syntax to locate the data elements we want and extract the data out.

Since the id or name attributes are not available for these elements, we will need to use the style sheet to locate our data elements. To locate the table header, we need to perform the below:

  • Find the section tag with style sheet class as “cex-table” from the entire DOM
  • Find its child section node with style sheet class as “thead
  • Further find its child div node with style sheet as “tr-wrapper

Below is how the syntax looks like in XPath:

table_header = tree.xpath("//section[@class='cex-table']/section[@class='thead']/div[@class='tr-wrapper']")

It will scan through the entire DOM tree to find if any element matches this structure and return a list of nodes matched.

If everything goes well, the table_header list should only contain 1 element which is the div with “tr-wrapper” style sheet. Sometimes if it returns multiple nodes, you may need recheck your path expression to see how you can fine-tune it to get only the unique node that you need.

From the wrapper div, there are still a few levels before we can reach to the node with the text. But you may notice that all the data fields we need are under the span tag which has a style name “cell-text“. So we can actually locate all these span tags with CSS class and extract its text with text() function. Below is how it works in XPath expression:

headers = table_header[0].xpath(".//span[@class='cell']/span[@class='cell-text']/text()")

Note that “.” means to start from current node, and “//” is to indicate the following path expression is relative path

If you examine the headers now, you can see all the column headers are extracted into a list as per below:

['Asset',
 'Price',
 'Market Cap',
 'Total Exchange Volume',
 'Returns (24h)',
 'Total Supply',
 'Category',
 'Value Proposition',
 'Consensus Mechanism']

Let’s continue to move the table body. Following the same logic, we shall be able to locate to the section with “tr-section” in below syntax:

table_body = tree.xpath("//section[@class='cex-table']/section[@class='tbody']/section[@class='tr-section']")

This means that we have already collected all the nodes for rows in the table body. We can now loop through the rows to get the elements. We will use the style sheet to locate our elements, but for the “Asset” column, it actually contains a few child nodes with different style sheet, so we need to handle them separately from the rest of the columns. Below is the code to extract the data row by row and add it into a record list:

records = []
for row in table_body:    
    tokens = row.xpath(".//span[contains(@class, 'cell-asset-iso')]/text()")
    ranks = row.xpath(".//span[contains(@class, 'cell-rank')]/strong/text()")
    assets = row.xpath(".//span[contains(@class, 'cell-asset')]/strong/text()")
    spans = row.xpath(".//div[contains(@class,'tr-right-wrapper')]/div/span[contains(@class, 'cell')]")
    rest_cols = [span.text_content().strip() for span in spans]
    row_data = ranks + tokens + assets + rest_cols
    records.append(row_data)

Note that we are using “contains” in order to match the node with class like cell cell-rank“, and use text_content() to extract all the text from its current nodes and child nodes.

Occasionally you may find that the number of columns we extracted does not tally with the original column header due to header column merged or hidden, such as our above ranking and token ticker column. So let’s also give them column name as “Rank” and “Token”:

column_header = ["Rank", "Token"] + headers

Save the scraping result

With both the header and data ready, we can easily load the data into pandas as per below:

import pandas as pd
df = pd.DataFrame(records, columns=column_header)

You can see the below result in pandas dataframe, which looks pretty good except some formatting to be done to convert all the amount into proper number format.

web scraping to get cryptocurrency price

Or you can also write the scrapped data into a csv file with the csv module:

import csv
with open("token_price.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(column_header)
    for row in records:
        writer.writerow(row)

Limitations & Constraints

In your real scraping project, you may encounter more complicated scenarios rather than directly getting the data from a GET request. So it’s better to understand how are the constraints/limitations for our above mentioned approach.

  • Go through the authentication process can be time-consuming with requests

If your target website requires authentication before you can retrieve the data, you may need to create a session and send multiple POST/GET requests to the server in order to get yourself authorized. Depends on how complicated the authentication process is, you will need to understand what are the parameters to be supplied and how the requests are chained together. This process may take some time and effort.

  • You cannot trigger JavaScript code to get your data

If the response from your target website returns some JavaScript code to populate the data, or you need to trigger some JavaScript function in order to have the data populated on the web page, you may find requests package simply would not work.

For both scenarios, you may consider to use selenium which I have mentioned in one of my past post. It has a headless mode where you can simulate user’s action such as key in user credentials or click buttons without actually showing the browser, and you can also execute JavaScript code to interact with the web page. The downside is that you will have to periodically upgrade your driver file to match with the browser’s version.

Conclusion

In this article, we have reviewed through a very basic example to scrape data with requests and lxml packages, and we have also discussed a few limitations where you may start looking for alternatives such as selenium or even the scrapy framework in case you have more complicated scenarios to be handled. No matter which libraries you choose to use, the fundamental remains the same. Hope this article gives you some hints on how to start your web scraping journey.

 

gspread read and write google sheet

Read and write Google Sheet with 5 lines of Python code

Introduction

Google Sheet is a very powerful tool in terms of collaboration, it allows multiple users to work on the same rows of data simultaneously. It also provides fine-grained APIs in various programming languages for your application to connect and interact with Google Sheet. Sometimes when you just need some simple operations like reading/writing data from a sheet, you may wonder if there is any higher level APIs that can complete these simple tasks easily. The short answer is yes. In this article, we will be discussing how can we read/write Google Sheet in 5 lines of Python code.

Prerequisites

As the prerequisite, you will need to have a Google service account in order for you to go through the Google cloud service authentication for your API calls. You can follow the guide from here for a free account setup. Once you have completed all the steps, you shall have a JSON file similar to below which contains your private key for accessing the Google cloud services. You may rename it to “client_secret.json” for our later use.

{
  "type": "service_account",
  "project_id": "new_project",
  "private_key_id": "xxxxxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\xxxxx\n-----END PRIVATE KEY-----\n",
  "client_email": "xxx@developer.gserviceaccount.com",
  "client_id": "xxx",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxx%40developer.gserviceaccount.com"
}

From this JSON file, you can also find the email address for your newly created service account, if you need to access your existing Google Sheet files, you will need to grant access of your files to this email address.

Note: There is a limit of 100 requests per every 100 seconds for the free Google service account, you may need to upgrade your account to the paid account if this free quota is not sufficient for your business.

In addition to the service account, we need another two libraries google-auth and gspread which are the core modules to be used for authentication and manipulating the Google Sheet file.

Below is the pip command to install the two libraries:

pip install gspread
pip install google-auth

Lastly, let’s create a Google Sheet file namely “spreadsheet1” with some sample data from US 2020 election result:

gspread write and read google sheet

Once you have all above ready, let’s dive into our code examples.

Read Google Sheet data into pandas

Let’s first import the necessary libraries at the top of our script:

import gspread
from google.oauth2.service_account import Credentials
import pandas as pd

To get the access to Google Sheet, we will need to define the scope (API endpoint). For our case, we specify the scope to read and write the Google Sheet file.  If you would like to restrict your program from updating any data, you can specify spreadsheets.readonly and drive.readonly in the scope.

scope = ['https://www.googleapis.com/auth/spreadsheets',
        'https://www.googleapis.com/auth/drive']

And then we can build a Credentials object with our JSON file and the above defined scope:

creds = Credentials.from_service_account_file("client_secret.json", scopes=scope)

Next, we call the authorize function from gspread library to pass in our credentials:

client = gspread.authorize(creds)

With this one line of code, it will be going through the authentication under the hood. Once authentication passed, it establishes the connection between your application and the Google cloud service. From there, you can send request to open your spreadsheet file by specifying the file name:

google_sh = client.open("spreadsheet1")

Besides opening file by name, you can also use open_by_key with the sheet ID or open_by_url with the URL of the sheet.

If the proper access has been given to your service account, you would be able to gain the control to your Google Sheet, and you can continue to request to open a particular spreadsheet tab.

For instance, below returns the first sheet of the file:

sheet1 = google_sh.get_worksheet(0)

With the above, you can simply read all records into a dictionary with get_all_records function, and pass into a pandas DataFrame:

df = pd.DataFrame(data=sheet1.get_all_records())

Now if you examine the df object, you shall see the below output:

gspread write and read google sheet

So that’s it! With a few lines of code, you’ve successfully downloaded your data from Google Sheet into pandas, and now you can do whatever you need in pandas.

If you have duplicate column names in your Google Sheet, you may consider to use get_all_values function to get all the values into a list, so that duplicate column remains:

df = pd.DataFrame(data=sheet1.get_all_values())

All the column and row labels will default to RangeIndex as per below:

gspread write and read google sheet

Now let’s take a further look at what else we can achieve with the this library.

Add/Delete work sheets

With gspread, you can easily add new sheets or duplicate from the existing sheets. Below is an example to create a new sheet namely “Sheet2” with max number of rows and columns specified. The index parameter tells Google Sheet where you want to insert your new sheet. index=0 indicates the new sheet to be inserted as the first sheet.

sheet2 = google_sh.add_worksheet(title="Sheet2", rows="10", cols="10", index=0)

Duplicating an existing sheet can be done by specifying the source sheet ID and the new sheet name:

google_sh.duplicate_sheet(source_sheet_id=google_sh.worksheet("Sheet1").id, 
    new_sheet_name="Votes Copy")

Similarly, you can delete an existing sheet by passing in the worksheet object as per below:

google_sh.del_worksheet(sheet2)

If you would like to re-order your worksheets, you can do it with reorder_worksheets function. Assuming you want the sheet2 to be shown before sheet1:

google_sh.reorder_worksheets([sheet2, sheet1])

Read/Write Google Sheet cells

The worksheet object has the row_count and col_count properties which indicate the max rows and columns in the sheet file. But it’s not that useful when you want to know how many rows and columns of actual data you have:

print(sheet1.row_count, sheet1.col_count)
#1000, 26

To have a quick view of the number of rows and columns of your data, you can use:

#Assuming the first row and first column have the full data
print("no. of columns:", len(sheet1.row_values(1)))
#no. of columns: 3
print("no. of rows:", len(sheet1.col_values(1)))
#no. of rows: 8

To access the individual cells, you can either specify the row and column indexes, or use the A1 notation. For instance:

#Access the row 1 and column 2
sheet1.cell(1, 2).value
# or using A1 notation
sheet1.acell('B1').value

Note: the row/column index and A1 notation are all one-based numbers which is similar to the MS excel

Similarly, you can update the value for each individual cell as per below:

sheet1.update_cell(1, 2, "BIDEN VOTES")
#or
sheet1.update_acell("B1", "BIDEN VOTES")

To update multiple cells, you shall use the worksheet update function with the list of cells and values to be updated. For instance, below code will replace the values in row 8:

sheet1.update("A8:C8", [["Texas", 5261485, 5261485]])

Or use batch_update to update multiple ranges:

sheet1.batch_update([{"range": "A8:C8", 
                    "values" : [["Texas", 5261485, 5261485]]},
                     {"range": "A9:C9", 
                    "values" : [["Wisconsin", 1630673, 1610065]]},
                    ])

or use append_rows to insert a row at the last row:

sheet1.append_rows(values=[["Pennsylvania", 3458312, 3376499]])

Besides updating cell values, you can also update the cell format such as the font, background etc. For instance, the below will update the background of the 6th row to red color:

sheet1.format("A6:C6", 
              {"backgroundColor": {
                  "red": 1.0,
                  "green": 0.0,
                  "blue": 0.0,
                  "alpha": 1.0
              }
    })

Note that Google is using RGBA color model, so the color values must be numbers between 0-1.

Below is how it looks like in Google Sheet:

gspread write and read google sheet , format google sheet

Sometimes, it might be difficult to locate the exact index of the cell to be updated. You can find the cell by it’s text with the find function. It will return the first item from the matches.

cell = sheet1.find("Michigan")
print(cell.row, cell.col, cell.value)
#6, 1, 'Michigan'

You can also use Python regular express to find all matches. For instance, to find all cells with text ending as “da”:

import re
query = re.compile(".*da")
cells = sheet1.findall(query)
print(cells)
#[<Cell R4C1 'Florida'>, <Cell R7C1 'Nevada'>]

Add/Remove permission for your Google Sheet

Adding or removing permission for a particular Google Sheet file can be also super easy with gspread. Before adding/removing permission, you shall check who are the users currently have access to your file. You can use list_permission function to retrieve the list of users with their roles:

google_sh.list_permissions()

To give access of your file to other users, you can use:

#perm_type can be : user, group or domain
#role can be : owner, writer or reader
google_sh.share('someone@gmail.com', perm_type='user', role='reader')

When you check your file again, you shall see the email address you shared is added into the list of authorized users.

To revert back the access for a particular user, you can use remove_permissions function. By default, it removes all the access that has been granted to the user:

google_sh.remove_permissions('someone@gmail.com', role="writer")

When the role you’ve specifying does not match with the roles the user currently has, the function returns without doing anything.

Conclusion

Google Sheet API provides comprehensive interfaces for manipulating the sheets from the normal operations like reading/writing of the data, to validation, formatting, building pivot tables and charts etc. Often you find that you may just need some simple APIs to read and write Google Sheet files.

In this article, we have reviewed though the gspread package which provides the high level APIs for working with Google Sheets to serve for this purpose. With gspread, you are able to open existing sheet or create new Google Sheet file, read/write the data as well as do simply formatting. In fact, there are a few other libraries such as gspread-formatting and gspread-pandas which offer extensive functionalities for sheet formatting and interacting sheets with pandas dataframe, you may take a look in case you need something more complicated than what we have covered here.