Tutorials

Supreme Tutorials for Python Learning

Tutorial

python virtual environment, isolated environment

3 Ways for Managing Python Virtual Environment

 Introduction Python virtual environment refers to an isolated execution environment for managing Python versions, dependencies, and indirectly permissions. When you have multiple projects working on and there are potential conflicting requirements such as different Python versions or libraries to be used in these projects, you need to consider using a virtual environment so that installing […]

Read More
combine data in pandas with merge vs join

Pandas Tricks - Combine Data in Different Ways

Introduction If you have used pandas for your data analysis work, you may already get some idea on how powerful and flexible it is in terms of data processing. Many times there are more than one way to solve your problem, and choosing the best approach become another tough decision. For instance, in one of […]

Read More
group consecutive rows of same values in pandas

How to group consecutive rows of same values in pandas

Problem Statement You have a data set which you would like to group consecutive rows if one of the columns has the same values. If there is a different value in between the rows, these records shall be split into separate groups. To better elaborate the issue, let’s use an example. Assuming you have the […]

Read More
python logging, queuehandler

8 Tips for Using Python Logging

Python Logging is a built-in module typically used for capturing runtime events for diagnostic purposes. If you are using Python to build a tool for yourself or just for proof of concept, you may not necessarily need any logging. But if the code is to be used in production environment, it would be a better […]

Read More
plot route on Google Maps with Python Google Maps API, optimize route with Google Maps API, Singapore

Plot Route on Google Maps with Python

Introduction You probably use Google Maps a lot in your daily life, such as locate a popular restaurant, check the distance between one place to another, or find the nearest driving route and shortest travelling time etc. If you have been thrown with a big set of location data where you need check and validate […]

Read More
create animated charts and gif in python with pandas-alive

Create Animated Charts In Python

Introduction If you are working as a data analyst or data scientist for some time, you may have already known how to use matplotlib to visualize and present data in various charts. The matplotlib library provides an animation module to generate dynamic charts to make your data more engaging, however it still takes you a […]

Read More
Photo by Ali Yahya on Unsplash

Master python closure with 3 real-world examples

Introduction

Python closure is a technique for binding function with an environment where the function gets access to all the variables defined in the enclosing scope. Closure typically appears in the programming language with first class function, which means functions are allowed to be passed as arguments, return value or assigned to a variable.

This definition sounds confusing to the python beginners, and sometimes the examples found from online also not intuitive enough in the way that most of the examples are trying to illustrate with some printing statement, so the readers may not get the whole idea of why and how the closure should be used. In this article, I will be using some real-world example to explain how to use closure in your code.

Nested function in Python

To understand closure, we must first know that Python has nested function where one function can be defined inside another. For instance, the below inner_func is the nested function and the outer_func returns it’s nested function as return value.

def outer_func():    
    print("starting outer func")
    def inner_func():
        pi = 3.1415926
        print(f"pi is : {pi}")
    return inner_func

When you invoke the outer_func, it returns the reference to the inner_func, and subsequently you can call the inner_func. Below is the output when you run in Jupyter Notebook:

python closure nested function example

After you have got some feeling about the nested function, let’s continue to explore how nested function is related to closure. If we modify our previous function and move the pi variable into outer function, surprisedly it generates the same result as previously.

def outer_func():    
    print("starting outer func")
    #move pi variable definition to outer function
    pi = 3.1415926
    def inner_func():
        print(f"pi is : {pi}")
    return inner_func

You may wonder the pi variable is defined in outer function which is a local variable to outer_func, why inner_func is able access it since it’s not a global scope? This is exactly where closure happens, the inner_func has the full access to the environment (variables) in it’s enclosing scope. The inner_func refers to pi variable as nonlocal variable since there is no other local variable called pi.

If you want to modify the value of the pi inside the inner_func, you will have to explicitly specify “nonlocal pi” before you modify it since it’s immutable data type.

With the above understanding, now let’s walk through some real-world examples to see how we can use closure in our code.

Hide data with Python closure

Let’s say we want to implement a counter to record how many time the word has been repeated. The first thing you may want to do is to define a dictionary in global scope, and then create a function to add in the words as key into this dictionary and also update the number of times it repeated. Below is the sample code:

counter = {}

def count_word(word):    
    global counter
    counter[word] = counter.get(word, 0) + 1
    return counter[word]

To make sure the count_word function updates the correct “counter”, we need to put the global keyword to explicitly tell Python interpreter to use the “counter” defined in global scope, not any variable we accidentally defined with the same name in the local scope (within this function).

Sample output:

python closure word counter sample output

The above code works as expected, but there are two potential issues: Firstly, the global variable is accessible to any of the other functions and you cannot guarantee your data won’t be modified by others. Secondly, the global variable exists in the memory as long as the program is still running, so you may not want to create so many global variables if not necessary.

To address these two issues, let’s re-implement it with closure:

def word_counter():
    counter = {}
    def count(word):
        counter[word] = counter.get(word, 0) + 1
        return counter[word]
    return count

If we run it from Jupyter Notebook, you will see the below output:

python closure word counter example output

With this implementation, the counter dictionary is hidden from the public access and the functionality remains the same. (you may notice it works even after the word_counter function is deleted)

Convert small class to function with Python closure

Occasionally in your project, you may want to implement a small utility class to do some simple task. Let’s take a look at the below example:

import requests

class RequestMaker:
    def __init__(self, base_url):
        self.url = base_url
    def request(self, **kwargs):
        return requests.get(self.url.format_map(kwargs))

You can see the below output when you call the make_request from an instance of RequestMaker:

python closure small class example

Since you’ve already seen in the word counter example, the closure can also hold the data for your later use, the above class can be converted into a function with closure:

import requests

def request_maker(url):
    def make_request(**kwargs):
        return requests.get(url.format_map(kwargs))
    return make_request

The code becomes more concise and achieves the same result. Take note that in the above code, we are able to pass in the arguments into the nested function with **kwargs (or *args).

python closure convert small class to closure

Replace text with case matching

When you use regular express to find and replace some text, you may realize if you are trying to match text in case insensitive mode, you will not able to replace the text with proper case. For instance:

import re

paragraph = 'To start Python programming, you need to install python and configure PYTHON env.'
re.sub("python", "java", paragraph, flags=re.I)

Output from above:

python closure replace with case

It indeed replaced all the occurrence of the “python”, but the case does not match with the original text. To solve this problem, let’s implement the replace function with closure:

def replace_case(word):
    def replace(m):
        text = m.group()
        if text.islower():
            return word.lower()
        elif text.isupper():
            return word.upper()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

In the above code, the replace function has the access to the original text we intend to replace with, and when we detect the case of the matched text, we can convert the case of original text and return it back.

So in our original substitute function, let’s pass in a function replace_case(“java”) as the second argument. (You may refer to Python official doc in case you want to know what is the behavior when passing in function to re.sub)

re.sub("python", replace_case("java"), paragraph, flags=re.IGNORECASE)

If we run the above again, you should be able to see the case has been retained during the replacement as per below:

python closure replace with case

Conclusion

In this article, we have discussed about the general reasons why Python closure is used and also demonstrated how it can be used in your code with 3 real-world examples. In fact, Python decorator is also a use case of closure, I will be discussing this topic in the next article.

 

pandas tricks pass multiple columns to lambda

Pandas tricks – pass multiple columns to lambda

Pandas is one of the most powerful tool for analyzing and manipulating data. In this article, I will be sharing with you the solutions for a very common issues you might have been facing with pandas when dealing with your data – how to pass multiple columns to lambda or self-defined functions.

Prerequisite

You will have to install pandas on your working environment:

pip install pandas

When dealing with data, you will always have the scenario that you want to calculate something based on the value of a few columns, and you may need to use lambda or self-defined function to write the calculation logic, but how to pass multiple columns to lambda function as parameters?

Let me use some real world example, so that easier for you to understand the issue that I am talking about. Below table shows partial of the e-com delivery charges offered by some company, so the delivery charges are determined by package size (H+L+W), package weight and the delivery mode you are choosing.

Size (cm/kg) 3 hours express Next Day Delivery Same Day Delivery
<60 CM (H+L+W) & MAX 1KG 12 8 10
<80 CM (H+L+W) & MAX 5KG 15 9 11
<100 CM (H+L+W) & MAX 8KG 17 11 13
<120 CM (H+L+W) & MAX 10KG 19 14 16

And assuming we have the below order data, and we want to simulate the delivery charges. Let’s create the data in a pandas dataframe.

import pandas as pd

df = pd.DataFrame({
    "Order#" : ["1", "2", "3", "4"], 
    "Weight" : [5.0, 2.1, 8.1, 7.5], 
    "Package Size" : [80, 45, 110, 90],
    "Delivery Mode": ["Same Day", "Next Day", "Express", "Next Day"]})

If you view dataframe from Jupyter Notebook (you can sign up here to use it for free), you shall be able to see the data as per below.

Pandas pass multiple columns to lambda same data

Let’s also implement a calculate_rate function where we need to pass in the weight, package size, and delivery mode in order to calculate the delivery charges:

def calculate_rate(weight, package_size, delivery_mode):
    #set the charges as $20 since we do not have the complete rate card
    charges = 20
    if weight <=1 and package_size <60:
        if delivery_mode == "Express":
            charges = 13
        elif delivery_mode == "Next Day":
            charges = 8
        else:
            charges = 10
    elif weight <=5 and package_size <80:
        if delivery_mode == "Express":
            charges = 15
        elif delivery_mode == "Next Day":
            charges = 9
        else:
            charges = 11
    elif weight <=8 and package_size <100:
        if delivery_mode == "Express":
            charges = 17
        elif delivery_mode == "Next Day":
            charges = 11
        else:
            charges = 13
    return charges

Pass multiple columns to lambda

Here comes to the most important part. You probably already know data frame has the apply function where you can apply the lambda function to the selected dataframe. We will also use the apply function, and we have a few ways to pass the columns to our calculate_rate function.

 Option 1

We can select the columns that involved in our calculation as a subset of the original data frame, and use the apply function to it.

And in the apply function, we have the parameter axis=1 to indicate that the x in the lambda represents a row, so we can unpack the x with *x and pass it to calculate_rate.

df["Delivery Charges"] = df[["Weight", "Package Size", "Delivery Mode"]].apply(lambda x : calculate_rate(*x), axis=1)

If we check the df again in Jupyter Notebook, you should see the new column “Delivery Charges” with the figures calculated based on the logic we defined in calculate_rate function.

Pandas pass multiple columns to lambda

Option 2:

If you do not want to get a subset of the data frame and then apply the lambda, you can also directly use the apply function to the original data frame. In this case, you will need to select the columns before passing to the calculate_rate function. Same as above, we will need to specify the axis=1 to indicate it’s applying to each row.

df["Delivery Charges"] = df.apply(lambda x : calculate_rate(x["Weight"], x["Package Size"], x["Delivery Mode"]), axis=1)

This will produce the same result as option 1. And you can also use x.Weight instead of x[“Weight”] when passing in the parameter.

 

Conclusion

The two options we discussed to pass multiple columns to lambda are basically the same, and it’s either applying to the subset or the original data frame. I have not yet tested with a large set of data, so there might be some differences in terms of the performance, you may need to take a note if you are dealing with a lot of data.

You may also interested to read some other articles related to pandas.

 

pandas tricks calculate percentage within group

Pandas Tricks – Calculate Percentage Within Group

Pandas groupby probably is the most frequently used function whenever you need to analyse your data, as it is so powerful for summarizing and aggregating data. Often you still need to do some calculation on your summarized data, e.g. calculating the % of vs total within certain category. In this article, I will be sharing with you some tricks to calculate percentage within groups of your data.

Prerequisite

You will need to install pandas if you have not yet installed:

pip install pandas
#or conda install pandas

I am going to use some real world example to demonstrate what kind of problems we are trying to solve. The sample data I am using is from this link , and you can also download it and try by yourself.

Let’s first read the data from this sample file:

import pandas as pd

# You can also replace the below file path to the URL of the file
df = pd.read_excel(r"C:\Sample Sales Data.xlsx", sheet_name="Sheet")

The data will be loaded into pandas dataframe, you will be able to see something as per below:

pandas tricks - calculate percentage within group

Let’s first calculate the sales amount for each transaction by multiplying the quantity and unit price columns.

df["Total Amount"] = df["Quantity"] * df["Price Per Unit"]

You can see the calculated result like below:

pandas tricks - calculate percentage within group

Calculate percentage within group

With the above details, you may want to group the data by sales person and the items they sold, so that you have a overall view of their performance for each person. You can do with the below :

#df.groupby(["Salesman","Item Desc"])["Total Amount"].sum()
df.groupby(["Salesman", "Item Desc"]).agg({"Total Amount" : "sum"})

And you will be able to see the total amount per each sales person:

pandas tricks - calculate percentage within group

This is good as you can see the total of the sales for each person and products within the given period.

Calculate the best performer

Now let’s see how we can get the % of the contribution to total revenue for each of the sales person, so that we can immediately see who is the best performer.

To achieve that, firstly we will need to group and sum up the “Total Amount” by “Salemans”, which we have already done previously.

df.groupby(["Salesman"]).agg({"Total Amount" : "sum"})

And then we calculate the sales amount against the total of the entire group. Here we can get the “Total Amount” as the subset of the original dataframe, and then use the apply function to calculate the current value vs the total. Take note, here the default value of axis is 0 for apply function.

[["Total Amount"]].apply(lambda x: 100*x/x.sum())

With the above, we should be able get the % of contribution to total sales for each sales person. And let’s also sort the % from largest to smallest:

sort_values(by="Total Amount", ascending=False)

Let’s put all together and run the below in Jupyter Notebook:

df.groupby(["Salesman"])\
.agg({"Total Amount" : "sum"})[["Total Amount"]]\
.apply(lambda x: 100*x/x.sum())\
.sort_values(by="Total Amount", ascending=False)

You shall be able to see the below result with the sales contribution in descending order. (Do not confuse with the column name “Total Amount”, pandas uses the original column name for the aggregated data. You can rename it to whatever name you want later)

pandas tricks - calculate percentage within group for salesman

 

Calculate the most popular products

Similarly, we can follow the same logic to calculate what is the most popular products. This time we want to summarize the sales amount by product, and calculate the % vs total for both “Quantity” and “Total Amount”. And also we want to sort the data in descending order for both fields. e.g.:

df.groupby(["Item Desc"])\
.agg({"Quantity": "sum", "Total Amount" : "sum"})[["Quantity", "Total Amount"]]\
.apply(lambda x: 100*x/x.sum())\
.sort_values(by=["Quantity","Total Amount"], ascending=[False,False])

This will produce the below result, which shows “Whisky” is the most popular product in terms of number of quantity sold. But “Red Wine” contributes the most in terms of the total revenue probably because of the higher unit price.

pandas tricks - calculate percentage within group for products

 

Calculate best sales by product for each sales person

What if we still wants to understand within each sales person, what is the % of sales for each product vs his/her total sales amount?

In this case, we shall first group the “Salesman” and “Item Desc” to get the total sales amount for each group. And on top of it, we calculate the % within each “Salesman” group which is achieved with groupby(level=0).apply(lambda x: 100*x/x.sum()).

Note: After grouping, the original datafram becomes multiple index dataframe, hence the level = 0 here refers to the top level index which is “Salesman” in our case.

df.groupby(["Salesman", "Item Desc"])\
.agg({"Total Amount" : "sum"})\
.groupby(level=0).apply(lambda x: 100*x/x.sum())\
.sort_values(by=["Salesman", "Item Desc","Total Amount"], ascending=[True, True, False])

You will be able see the below result which already sorted by % of sales contribution for each sales person.

pandas tricks - calculate percentage within group - for salesman and product

 

Conclusion

This is just some simple use cases where we want to calculate percentage within group with the pandas apply function, you may also be interested to see what else the apply function can do from here.

 

python send email with attachment via smtplib

How to send email with attachment via python smtplib

In one of my previous article, I have discussed about how to send email from outlook application. That has assumed you have already installed outlook and configured your email account on the machine where you want to run your script. In this article, I will be sharing with you how to automatically send email with attachments via lower level API, to be more specific, by using python smtplib where you do not need to set up anything in your environment to make it work.

For this article, I will demonstrate to you to send a HTML format email from a gmail account with some attachment. So besides the smtplib module, we will need to use another two modules – ssl and email.

Let’s get started!

First, you will need to find out the SMTP server and port info to send email via google account. You can find this information from this link. For your easy reading, I have captured in the below screenshot.

codeforests - google smtp server configuration info

So we are going to use the server: smtp.gmail.com and port 587 for our case. (you may search online to find out more info about the SSL & TLS, we will not discuss much about it in this article)

Let’s start to import all the modules we need:

import smtplib, ssl
from email.mime.multipart import MIMEMultipart 
from email.mime.text import MIMEText 
from email.mime.application import MIMEApplication

As we are going to send the email in HTML format (which are you able to unlock a lot features such as adding in styles, drawing tables etc.), we will need to use the MIMEText. And also the MIMEMultipart and MIMEApplication for the attachment.

Build up the email message

To build up our email message, we need to create mixed type MIMEMultipart object so that we can send both text and attachment. And next, we shall specify the from, to, cc and subject attributes.

smtp_server = 'smtp.gmail.com'
smtp_port = 587 
#Replace with your own gmail account
gmail = '[email protected]'
password = 'your password'

message = MIMEMultipart('mixed')
message['From'] = 'Contact <{sender}>'.format(sender = gmail)
message['To'] = '[email protected]'
message['CC'] = '[email protected]'
message['Subject'] = 'Hello'

You probably do not want anybody can see your hard coded password here, you may consider to put this email account info into a separate configuration file. Check my another post on the read/write configuration files.

For the HTML message content, we will wrap it into the MIMEText, and then attach it to our MIMEMultipart message:

msg_content = '<h4>Hi There,<br> This is a testing message.</h4>\n'
body = MIMEText(msg_content, 'html')
message.attach(body)

Let’s assume you want to attach a pdf file from your c drive, you can read it in binary mode and pass it into MIMEApplication with MIME type as pdf. Take note on the additional header where you need to specify the name your attachment file.

attachmentPath = "c:\\sample.pdf"
try:
	with open(attachmentPath, "rb") as attachment:
		p = MIMEApplication(attachment.read(),_subtype="pdf")	
		p.add_header('Content-Disposition', "attachment; filename= %s" % attachmentPath.split("\\")[-1]) 
		message.attach(p)
except Exception as e:
	print(str(e))

If you have a list of the attachments, you can loop through the list and attach them one by one with the above code.

Once everything is set properly, we can convert the message object into to a string:

msg_full = message.as_string()

Send email

Here comes to the most important part, we will need to initiate the TLS context and use it to communicate with SMTP server.

context = ssl.create_default_context()

And we will initialize the connection with SMTP server and set the TLS context, then start the handshaking process.

Next it authenticate our gmail account, and in the send mail method, you can specify the sender, to and cc (as a list), as well as the message string. (cc is optional)

with smtplib.SMTP(smtp_server, smtp_port) as server:
	server.ehlo()  
	server.starttls(context=context)
	server.ehlo()
	server.login(gmail, password)
	server.sendmail(gmail, 
				to.split(";") + (cc.split(";") if cc else []),
				msg_full)
	server.quit()

print("email sent out successfully")

Once sendmail completed, you will disconnect with the server by server.quit().

With all above, you shall be able to receive the email triggered from your code. You may want to wrap these codes into a class, so that you can reuse it as service library in your multiple projects.

 

As per always, please share if you have any questions or comments.

python cache

How to print colored message on command line terminal window

When you are developing a python script with some output messages printed on the terminal window, you may find a little bit boring that all the messages are printed in black and white, especially if some messages are meant for warning, and some just for information only. You may wonder how to print colored message to make them look differently, so that your users are able to pay special attention to those warning or error messages.

In this article, I will be sharing with you a library which allows you to print colored message in your terminal.

Let’s get started!

The library I am going to introduce called colorama, which is a small and clean library for styling your messages in both Windows, Linux and Mac os.

Prerequisite :

You will need to install this library, so that you will be able to run the following code in this article.

pip install colorama

To start using this library, you will need to import the modules, and call the init() method at the beginning of your script or your class initialization method.

import colorama
from colorama import Fore, Back, Style
colorama.init()

Print colored message with colorama

The init method also accepts some **kwargs to overwrite it’s default behaviors. E.g. by default, the style will not be reset back after printing out a message,  and the subsequent messages will be following the same styles. You can pass in autoreset = true to the init method, so that the style will be reset after each printing statement.

Below are the options you can use when formatting the font, background and style.

Fore: BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE, RESET.
Back: BLACK, RED, GREEN, YELLOW, BLUE, MAGENTA, CYAN, WHITE, RESET.
Style: DIM, NORMAL, BRIGHT, RESET_ALL

To use it in your message, you can do as per below to wrap your messages with the styles:

print(Fore.CYAN + "Cyan messages will be printed out just for info only" + Style.RESET_ALL)
print(Fore.RED + "Red messages are meant to be to warning or error" + Style.RESET_ALL)
print(Fore.YELLOW + Back.GREEN +  "Yellow messages are debugging info" + Style.RESET_ALL)

This is how it would look like in your terminal:

Python printed colored message with colorama

As I mentioned earlier, if you don’t set the autoreset to true, you will need to reset the style at the end of your each message, so that different message applies different styles.

What if you want to apply the styles when asking user’s input ? Let’s see an example:

print(Fore.YELLOW)
choice = input("Enter YES to confrim:")
print(Style.RESET_ALL)
if str.upper(choice) in ["YES",'Y']:
    print(Fore.GREEN + "You have just confirmed to proceed." + Style.RESET_ALL)
else:
    print(Fore.RED + "You did not enter yes, let's stop here" + Style.RESET_ALL)

By wrapping the input inside Fore.YELLOW and Style.RESET_ALL, whatever output messages from your script or user entry, the same style will be applied.

Let’s put all the above into a script and run it in the terminal to check how it looks like.

Python printed colored message with colorama

Yes, that’s exactly what we want to achieve! Now you can wrap your printing statement into a method e.g.: print_colored_message, so that you do not need to repeat the code everywhere.

As per always, please share if you have any comments or questions.

 

python unpack objects

Python how to unpack tuple, list and dictionary

There are various cases that you want to unpack your python objects such as tuple, list or dictionary into individual variables, so that you can easily access the individual items. In this article I will be sharing with you how to unpack these different python objects and how it can be useful when working with the *args and **kwargs in the function.

Let’s get started.

Unpack python tuple objects

Let’s say we have a tuple object called shape which describes the height, width and channel of an image, we shall be able to unpack it to 3 separate variables by doing below:

shape = (500, 300, 3)
height, width, channel = shape
print(height, width, channel)

And you can see each item inside the tuple has been assigned to the individual variables with a meaningful name, which increases the readability of your code. Below is the output:

500 300 3

It’s definitely more elegant than accessing each items by index, e.g. shape[0], shape[1], shape[2].

What if we just need to access a few items in a big tuple which has many items? Here we need to introduce the _ (unnamed variable) and * (unpack arbitrary number of items)

For example,  if we just want to extract the first and the last item from the below tuple, we can let the rest of the items go into a unnamed variable.

toto_result = (4,11,14,23,28,47,24)
first, *_, last = toto_result
print(first, last)

So the above will give the below output:

4 24

If you are curious what is inside the “_”, you can try to print it out. and you would see it’s actually a list of the rest of items between the first and last item.

[11, 14, 23, 28, 47]

The most popular use case of the packing and unpacking is to pass around as parameters to function which accepts arbitrary number of arguments (*args). Let’s look at an example:

def sum(*numbers):
    total = 0
    for n in numbers:
        total += n
    return total

For the above sum function, it accepts any number of arguments and sum up the values. The * here is trying to pack all the arguments passed to this function and put it into a tuple called numbers. If you are going to sum up the values for all the items in toto_result, directly pass in the toto_result would not work.

toto_resut = (4,11,14,23,28,47,24)
#sum(toto_result) would raise TypeError

So what we can do is to unpack the items from the tuple then pass it the sum function:

total = sum(*toto_resut)
print(total)
#output should be 151

Unpack python list objects

Unpacking the list object is similar to the unpacking operations on tuple object. If we replace the tuple to list in the above example, it should be working perfectly.

shape = [500, 300, 3]
height, width, channel = shape
print(height, width, channel)
#output shall be 500 300 3

toto_result = [4,11,14,23,28,47,24]
first, *_, last = toto_result
print(first, last)
#output shall be 4 24

total = sum(*toto_resut) 
print(total) 
#output should be also 151

Unpack python dictionary objects

Unlike the list or tuple, unpacking the dictionary probably only useful when you wants to pass the dictionary as the keyword arguments into a function (**kwargs).

For instance, in the below function, you can pass in all your keyword arguments one by one.

def print_header(**headers):
    for header in headers:
        print(header, headers[header])

print_header(Host="Mozilla/5.0", referer = "https://www.codeforests.com")

Or if you have a dictionary like below, you can just unpack it and pass to the function:

headers = {'Host': 'www.codeforests.com', 'referer' : 'https://www.codeforests.com'}
print_header(**headers)

It will generate the same result as previously, but the code is more concise.

Host www.codeforests.com
referer https://www.codeforests.com

With this unpacking operator, you can also combine multiple dictionaries as per below:

headers = {'Host': 'www.codeforests.com', 'referer' : 'https://www.codeforests.com'}
extra_header = {'user-agent': 'Mozilla/5.0'}

new_header = {**headers, **extra_header}

The output of the new_header will be like below:

{'Host': 'www.codeforests.com',
 'referer': 'https://www.codeforests.com',
 'user-agent': 'Mozilla/5.0'}

Conclusion

The unpacking operation is very usefully especially when dealing with the *args and **kwargs. There is one thing worth noting on the unamed variable (_) which I mentioned in the previous paragraph. Please use it with caution, as if you notice, the python interactive interpreter also uses _ to store the last executed expression. So do take note on this potential conflict. See the below example:

codeforests interactive interpreter conflicts

As per always, welcome any comments or questions.

web scraping and automate twitter post with selenium

Automate Your Tweets with Selenium

Introduction

In the previous post, we have discussed about how to start web scraping with requests and lxml libraries, and we also summarized two limitations with this approach:

  • Time & effort required to chain all the requests for some complicated operations such as user authentication
  • Triggering a button click or calling JavaScript code is not possible from the HTML response

To solve these two issues, I recommended to use selenium package. In fact you have checked this post, you may still remember that we can use selenium to simulate human actions such as open URL on browser or trigger a button click on the web page and so on.

In this post, I will demonstrate how to use selenium to automatically login to tweeter account, view and post tweets, where the same approach can be used for your web scraping project.

Prerequisites

In order to use selenium to launch browser, you will need to download a web driver for the browser you are using. You can check all the supported browsers as well as the download links from here.

For the below code example, I will use Chrome version 86 and download the driver with this version supported. For simplicity, I will save the chromedriver.exe into my current code directory.

Besides the driver file, you will also need to install selenium in your working environment. Below is the pip command for installation of the latest version:

pip install --upgrade selenium

Let’s also import all the modules at the beginning of our code. Explanation will be given later where these modules are used:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as ec

With the above ready, let’s dive into our code example.

Login to twitter account with Selenium

Similar to a human behavior on the browser, Selenium does not allow you to interact with the invisible elements, and you would encounter ElementNotVisibleException when trying to access the element if it is not fully loaded or not in the view. So the best practice is to always maximize your browser window, so that majority of the information you need are visible and interactable.

To maximize the browser upon launching, you can set –start-maximized in the chrome operations as per below:

chromeOptions = Options()
chromeOptions.add_argument("--start-maximized")

(You can also launch the browser first and later call the maximize_window function to maximize it)

This Chrome options shall be passed into the web driver constructor when it is initiated. We also need to specify the full path of driver exe file, for our case, it’s under the current directory.

driver = webdriver.Chrome(executable_path="chromedriver.exe", options=chromeOptions)

With the above code, a new Chrome browser will be launched. The web driver object has a get method which accepts a URL parameter and opens it from the browser. Below will open the twitter login page on your browser:

tweeter_url = "https://twitter.com/login"
driver.get(tweeter_url)

As there are many factors impact how fast the web page can be fully loaded, you may need to add in some delays at certain steps to make sure that the current action has been completed successfully before moving to the next step.

In Selenium, there are two types of waiting approaches: implicit wait and explicit wait. The implicit wait will just instruct web driver to wait for maximum of certain time when polling the DOM, while explicit wait will check the presence/visibility of the element periodically until the condition is met or the maximum waiting time reached. As implicit wait applies to the entire lifecycle of the web driver, the explicit wait is relatively more flexible. Let’s define our explicit wait for a max of 10 seconds in our example:

wait = WebDriverWait(driver, 10)

Now, we shall follow what we have discussed in the previous post to find a unique identifier of the login username and password fields. By inspecting the web page HTML, you can easily find out the name attribute of the username and password field. Below is the screenshot of the HTML structure for username field:

web scraping and automating twitter post with selenium

 

To locate the username element, we can use the XPath with its element name. And let’s also use the explicit wait to locate it until the element is fully loaded and visible on the page:

username_input = wait.until(ec.visibility_of_element_located((By.NAME, "session[username_or_email]")))

Once we located the username input field, we can send our login ID to this field with send_keys function as per below:

username_input.send_keys(username)

Note: you will need to replace this username/password variable with your twitter login credentials

Similarly, we can locate our password field by its name and send in our password:

password_input = wait.until(ec.visibility_of_element_located((By.NAME, "session[password]")))
password_input.send_keys(password)

Once we have successfully set the values into these two fields, we can simulate the button click on the login button:

  • Firstly we shall locate to the login button by its attribute data-testid=’LoginForm_Login_Button’
  • Then call the WebElement click function to simulate how user clicks on the button

With the below code, you shall be able to login into your tweeter account and view the tweets on your home screen:

login_button = wait.until(ec.visibility_of_element_located((By.XPATH, "//div[@data-testid='LoginForm_Login_Button']")))
login_button.click()

To showcase how to interact with your web page like a normal user, let’s move to the next example to search a tweeter posts with some keywords.

Search tweeter posts by keywords

Same as previously, we shall first locate our search input box by its data-testid attribute as per below:

search_input = wait.until(ec.visibility_of_element_located((By.XPATH, 
"//div/input[@data-testid='SearchBox_Search_Input']")))

As a normal user, I can key in some keywords in the search box and hit ENTER for a search. We can do the same from Selenium via the send_keys function. Let’s first clear the input box and then send a keyword “ethereum” together with a ENTER key:

search_input.clear()
search_input.send_keys("ethereum" + Keys.ENTER)

Upon receiving the ENTER key event, you shall see the search results are loading on the page. The next is to extract the tweeter posts from the searching results.

Below is the sample code that I extracted all the text from the tweets and printed as output:

tweet_divs = driver.find_elements_by_xpath("//div[@data-testid='tweet']")
for div in tweet_divs:
    spans = div.find_elements_by_xpath(".//div/span")
    tweets = ''.join([span.text for span in spans])
    print(tweets)

You shall see the output similar to below:

web scraping and automating twitter post with selenium

With this plain text results, you may use some text processing tools to further analyze what people are discussing around to this keyword.

Automatically post new tweets

Since we are able to search within tweeter, we shall also be able to post a new tweet with Selenium.

Let’s first locate the below text area by the data-testid attribute:

web scraping and automating twitter post with selenium

Below is the code to locate to the span of the text area by it’s ancestor div:

tweet_text_span = driver.find_element_by_xpath("//div[@data-testid='tweetTextarea_0']/div/div/div/span")

Then we can send whatever text we want to tweet:

tweet_text_span.send_keys("Do you know we can tweet with selenium?")

Once the text is written into the span, the tweet button will be enabled. You can locate the button and click to submit the post:

tweet_button = wait.until(ec.visibility_of_element_located((By.XPATH, 
                                                           "//div[@data-testid='tweetButtonInline']")))
tweet_button.click()

Upon submission, you shall see a new post added to your timeline as per below:

web scraping and automating twitter post with selenium

 

Move invisible element into visible view

There are always cases that you need to scroll up and down or left and right to view some information on the web page. You will also need to make sure your elements are in the view before you can do any operation such as getting its attributes or performing clicks.

To move the elements into the view, you can execute some JavaScript code to scroll to the element as per below:

who_to_follow = driver.find_element_by_xpath("//div/span[text() = 'Who to follow']")
driver.execute_script("arguments[0].scrollIntoView(true);", who_to_follow)

Hide your browser with headless mode

When you use Selenium for some automation or scraping job, you may not wish to see the web pages jumping around in front of you. To make it running peacefully in the background, you can set the headless parameter in the Chrome option before the initialization of the web driver:

chromeOptions.add_argument('--headless')

With this parameter, we would not see the browser launched and everything will be running quietly in the background. It’s good that you always test your code properly before you enable this headless mode.

Conclusion

In this article, we have demonstrated how to use Selenium to automatically login to tweeter account, and read or post tweets. And we have also reviewed through how to trigger the JavaScript code with Selenium web driver and run everything totally in the background. In your real project, you may not want to use the same approach to scrap website like tweeter since it has already provided developer account with all the API access. So this article is more to showcase the capability of the Selenium package.

With Selenium, dealing with complicated operations such as user authentication become much simpler as everything is performed like a normal browser user, and it also provides action chains to support all sorts of mouse movement actions such hover over or drag and drop etc. You shall consider to use it in your automation project or web scraping project if your target website relies heavily on the front-end JavaScript code.

python regular expression match, search and findall

Python regular expression match, search and findall

Python beginners may sometimes get confused by this match and search functions in the regular expression module, since they are accepting the same parameters and return the same result in most of the simple use cases.  In this article, let’s discuss about the difference between these two functions.

match vs search in Python regular expression

Let’s start from an example. Let’s say if we want to get the words which ending with “ese” in the languages, both of the below match and search return the same result in match objects.

import re
languages = "Japanese,English"
m = re.match("\w+(?=ese)",languages)
#m returns : <re.Match object; span=(0, 5), match='Japan'>

m = re.search("\w+(?=ese)",languages)
#m returns : <re.Match object; span=(0, 5), match='Japan'>

But if the sequence of your languages changed, e.g. languages = “English, Japanese”, then you will see some different results:

languages = "English,Japanese" 
m = re.match("\w+(?=ese)",languages) 
#m returns empty
m = re.search("\w+(?=ese)",languages) 
#m returns : <re.Match object; span=(8, 13), match='Japan'>

The reason is that match function only starts the matching from the beginning of your string, while search function will start matching from anywhere in your string. Hence if the pattern you want to match may not start from the beginning, you shall always use search function.

In this case, if you want to restrict the matching only start from the beginning, you can also achieve it with search function by specifying “^” in your pattern:

languages = "English,Japanese,Chinese" 
m = re.search("^\w+(?=ese)",languages) 
#m returns empty
m = re.search("\w+(?=ese)",languages)
#m returns: <re.Match object; span=(8, 13), match='Japan'>

findall in Python regular expression

You may also notice when there are multiple occurrences of the pattern, search function only returns the first matched. This sometimes may not be desired when you actually want to see the full list of matched patterns. To return all the occurrences, you can use the findall function:

languages = "English,Japanese,Chinese,Burmese"
m = re.findall("\w+(?=ese)", languages)
#m returns: ['Japan', 'Chin', 'Burm']

 

 

python read and write json file

Read and write json file in python

Json file format is commonly used in most of the programming languages to store data or exchange the data between back end and front end, or between different applications and systems. In this article, I will be explaining how to read and write json file in python programming language.

Read from a JSON file

Python has a json module which makes the read and write json pretty easy. First, let’s assume we have the below example.json file to be read.

{
"link": "www.codeforests.com",
"name": "ken", 
"member": true, 
"hobbies": ["jogging", "watching movie"]
}

To read the file, we can simply use the load method and pass in the file descriptor.

example = json.load(open("example.json"))

Now you can access the example dictionary for the data, e.g.

print(config["hobbies"])

The output would be :

['jogging', 'watching movie']

Write into JSON file

Let’s continue to use the previous example, and try to add one more hobby into the hobbies. Then save the json object into a file.

This time, you can use the json.dump and pass in the file descriptor to be written to:

example["hobbies"].append("badminton")
with open("example.json", "w") as f:
    json.dump(example, f)

If you look at the json documentation, there are two more methods : json.loads and json.dumps. The main difference of this two methods vs json.load & json.dumps is that the loads and dumps take the str representation of the json object. e.g.:

obj = json.loads('{"json":"obj"}')
print(obj)
print(json.dumps({"json":"obj"}))

 

openpyxl write excel with styles

Openpyxl library to write excel file with styles

openpyxl to write excel files with styles

Openpyxl probably the most popular python library for read/write Excel xlsx files. In this article, I will be sharing with you how to write data into excel file with some formatting.

Let’s get started with openpyxl.

If you have not yet installed this openpyxl library in your working environment, you may use the below command to install it.

pip install openpyxl

And we shall import the library and modules at the beginning of the script:

import openpyxl
from openpyxl.styles import Alignment, Border, Side, Font

Now I am going to create a new excel with the sheet name as “Demo”:

workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.title = "Demo"

Assuming if you have the below data that you want to write into the excel file:

raw_data = [["University Name", "No. of Students", "Address", "Contact"],
 ["National University of Singapore", "35908", "21 Lower Kent Ridge Rd, Singapore 119077", "68741616"],
 ["Nanyang Technological University", "31687", "50 Nanyang Ave, 639798", "67911744"],
 ["Singapore Management University", "8182", "81 Victoria St, Singapore 188065", "68280100"]]

You can loop through the list to get the value and assign it to a particular excel cell. Note that the excel row and columns always starts from 1.

for row_idx, rec in enumerate(raw_data):
    for col_idx, val in enumerate(rec):
        sheet.cell(row=row_idx+1, column=col_idx+1).value = val

if you save your data now via the below code, you will see that the saved excel does not come with any formatting (default formatting)

workbook.save("Demo.xlsx")

openpyxl write excel file with styles

As you can see, the format does not look good and some of the column width needs to be adjusted in order to see the full content. Let’s apply some styling to the cells.

Let’s draw the borders for each of the cells, you can specify the color of the border as well as the border style. for more border styles, you can refer to this openpyxl document. you can also use different style for different side of the borders.

thin = Side(border_style="thin", color="303030")
black_border = Border(top=thin, left=thin, right=thin, bottom=thin)

you can also give different width for the different columns as per below :

sheet.column_dimensions["A"].width = 27
sheet.column_dimensions["B"].width = 12
sheet.column_dimensions["C"].width = 33
sheet.column_dimensions["D"].width = 8

Define your own font style:

font = Font(name='Calibri', size=9, bold=True, color='07101c')

Define the alignment style, and you can definitely use different alignment style for different columns. Here I just defined 1 style for all cells.

align = Alignment(horizontal="center", wrap_text= True, vertical="center")

Next, Let’s apply the above styles to each of the cell and save the worksheet:

for label in ["A", "B", "C", "D"]:
    for col_idx in range(row_num):
        idx = label + str(col_idx + 1)
	sheet[idx].alignment = algin
	sheet[idx].font = font
	sheet[idx].border = black_border

workbook.save("Demo.xlsx")

The final output should be similar to the below, which looks much better with the styling.

openpyxl write excel with styles

 

As per always, welcome to any comments or questions.

python read excel with xlrd

How to use xlrd library to read excel file

Python xlrd is a very useful library when you are dealing with some older version of the excel files (.xls). In this tutorial, I will share with you how to use this library to read data from .xls file.

Let’s get started with xlrd

You will need to install this library with below pip command:

pip install xlrd

and import the library to your source code:

import xlrd
import os

To open your excel file, you will need to pass in the full path of your file into the open_workbook function.It returns the workbook object, and in the next line you will be able to access the sheet in the opened workbook.

workbook = xlrd.open_workbook(r"c:\test.xls")

There are multiple ways for doing it, you can access by sheet name, sheet index, or loop through all the sheets

sheet = workbook.sheet_by_name("Sheet")
#getting the first sheet
sheet_1 = workbook.sheet_by_index(0)

for sh in workbook.sheets():
    print(sh.name)

To get the number of rows and columns in the sheet, you can access the following attributes. By default,
all the rows are padded out with empty cells to make them same size, but in case you want to ignore the
empty columns at the end, you may consider ragged_rows parameter when you call the open_workbook function.

row_count = sheet.nrows
col_count = sheet.ncols
# use sheet.row_len() to get the effective column length when you set ragged_rows = True

With number of rows and columns, you will be able to access the data of each of the cells

for cur_row in range(0, row_count):
    for cur_col in range(0, col_count):
        cell = sheet.cell(cur_row, cur_col)
        print(cell.value, cell.ctype)

Instead of accessing the data cell by cell, you can also access it by row or by column, e.g. assume your first row is the column header, you can get all the headers into a list as below:

header = sheet.row_values(0, start_colx=0, end_colx=None)
# row_slice returns the cell object(both data type and value) in case you also need to check the data type
#row_1 = sheet.row_slice(1, start_colx=0, end_colx=None)

Get the whole column values into a list:

col_a = sheet.col_values(0, start_rowx=0, end_rowx=None)
# col_slice returns the cell object of the specified range
col_a = sheet.col_slice(0, start_rowx=0, end_rowx=None)

There is a quite common error when handling the xls files, please check this article for fixing the CompDocError.

Conclusion

xlrd is a clean and easy to use library for handling xls files, but unfortunately there is no active maintenance for this library as Excel already evolved to xlsx format. There are other libraries such as openpyxl which can handle xlsx files very well for both data and cell formatting. I would suggest you to use xlsx file format in your new project whenever possible, so that more active libraries are supported.

If you would like to understand more about openpyxl , please read my next article about this library.

As per always, welcome to any comments or questions. Thanks.