Automate Your Tweets with Selenium

Introduction

In the previous post, we discussed how to start web scraping with the requests and lxml libraries, and we summarized two limitations of that approach:

  • Time & effort required to chain all the requests for some complicated operations such as user authentication
  • Triggering a button click or calling JavaScript code is not possible from the HTML response

To solve these two issues, I recommended the selenium package. If you have read that post, you may remember that selenium can simulate human actions, such as opening a URL in the browser or clicking a button on the web page.

In this post, I will demonstrate how to use selenium to automatically log in to a Twitter account, then view and post tweets. The same approach can be used for your web scraping project.

Prerequisites

To launch a browser with selenium, you need to download the web driver for the browser you are using. You can find all the supported browsers as well as the download links here.

For the code examples below, I will use Chrome version 86 and download the driver that supports this version. For simplicity, I will save chromedriver.exe in my current code directory.

Besides the driver file, you also need to install selenium in your working environment. Below is the pip command to install the latest version:

pip install --upgrade selenium

Let’s also import all the modules at the beginning of our code. Each module is explained later, at the point where it is used:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as ec

With the above ready, let’s dive into our code example.

Log in to a Twitter account with Selenium

Just like a human user, Selenium cannot interact with elements that are not visible, and you would encounter ElementNotVisibleException when trying to access an element that is not fully loaded or not in view. A good practice is therefore to maximize your browser window, so that most of the information you need is visible and interactable.

To maximize the browser upon launching, you can set --start-maximized in the Chrome options as below:

chromeOptions = Options()
chromeOptions.add_argument("--start-maximized")

(You can also launch the browser first and call the maximize_window function later to maximize it.)

These Chrome options are passed to the web driver constructor when it is created. We also need to specify the path of the driver executable, which in our case is in the current directory.

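# Selenium 4 removed the executable_path argument; the driver path goes through a Service object
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service("chromedriver.exe"), options=chromeOptions)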

With the above code, a new Chrome browser will be launched. The web driver object has a get method which accepts a URL and opens it in the browser. The code below opens the Twitter login page:

tweeter_url = "https://twitter.com/login"
driver.get(tweeter_url)

Since many factors affect how fast a web page loads, you may need to add some delays at certain steps to make sure the current action has completed successfully before moving to the next step.

In Selenium, there are two waiting approaches: implicit wait and explicit wait. An implicit wait tells the web driver to keep polling the DOM for up to a given time whenever it looks up an element, while an explicit wait checks a specific condition, such as the presence or visibility of an element, periodically until the condition is met or the maximum waiting time is reached. An implicit wait applies to the entire lifecycle of the web driver, so the explicit wait is the more flexible of the two. Let’s define an explicit wait with a maximum of 10 seconds for our example:

wait = WebDriverWait(driver, 10)
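To make the idea of an explicit wait more concrete, below is a simplified pure-Python sketch of the polling loop. It is only an illustration and not Selenium’s actual implementation; the wait_until function, its parameters and the element_is_ready stand-in are hypothetical names made up for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page element that only becomes available on the third check
attempts = {"count": 0}


def element_is_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_is_ready, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])  # element 3
```

The wait returns as soon as the condition succeeds, so the 10 seconds is an upper limit rather than a fixed delay.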

Now, we shall follow what we discussed in the previous post to find a unique identifier for the login username and password fields. By inspecting the web page HTML, you can easily find the name attribute of the username and password fields. Below is the screenshot of the HTML structure for the username field:

[Screenshot: HTML structure of the username field]

To locate the username element, we can look it up by its name attribute. Let’s also use the explicit wait, so that the lookup only returns once the element is loaded and visible on the page:

username_input = wait.until(ec.visibility_of_element_located((By.NAME, "session[username_or_email]")))

Once we have located the username input field, we can send our login ID to it with the send_keys function as below:

username_input.send_keys(username)

Note: you will need to define the username and password variables with your own Twitter login credentials.

Similarly, we can locate our password field by its name and send in our password:

password_input = wait.until(ec.visibility_of_element_located((By.NAME, "session[password]")))
password_input.send_keys(password)

Once we have set the values in these two fields, we can simulate a click on the login button:

  • First, locate the login button by its attribute data-testid='LoginForm_Login_Button'
  • Then call the WebElement click function to simulate a user clicking the button

With the code below, you should be able to log in to your Twitter account and view the tweets on your home screen:

login_button = wait.until(ec.visibility_of_element_located((By.XPATH, "//div[@data-testid='LoginForm_Login_Button']")))
login_button.click()

To showcase how to interact with a web page like a normal user, let’s move on to the next example and search Twitter posts by keyword.

Search Twitter posts by keywords

As before, we first locate the search input box by its data-testid attribute:

search_input = wait.until(ec.visibility_of_element_located((By.XPATH, 
"//div/input[@data-testid='SearchBox_Search_Input']")))

As a normal user, I can key in some keywords in the search box and hit ENTER to search. We can do the same in Selenium via the send_keys function. Let’s first clear the input box and then send the keyword “ethereum” together with an ENTER key:

search_input.clear()
search_input.send_keys("ethereum" + Keys.ENTER)

Upon receiving the ENTER key event, the page starts loading the search results. The next step is to extract the Twitter posts from the search results.

Below is sample code that extracts all the text from the tweets and prints it as output:

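# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium releases removed
tweet_divs = driver.find_elements(By.XPATH, "//div[@data-testid='tweet']")
for div in tweet_divs:
    spans = div.find_elements(By.XPATH, ".//div/span")
    tweets = ''.join([span.text for span in spans])
    print(tweets)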

You should see output similar to below:

[Screenshot: tweet text printed by the code above]

With these plain text results, you can use text processing tools to further analyze what people are discussing around this keyword.

Automatically post new tweets

Since we are able to search within Twitter, we should also be able to post a new tweet with Selenium.

Let’s first locate the text area below by its data-testid attribute:

[Screenshot: the tweet text area]

Below is the code to locate the span of the text area through its ancestor div:

tweet_text_span = driver.find_element(By.XPATH, "//div[@data-testid='tweetTextarea_0']/div/div/div/span")

Then we can send whatever text we want to tweet:

tweet_text_span.send_keys("Do you know we can tweet with selenium?")

Once the text is written into the span, the tweet button will be enabled. You can locate the button and click to submit the post:

tweet_button = wait.until(ec.visibility_of_element_located((By.XPATH, 
                                                           "//div[@data-testid='tweetButtonInline']")))
tweet_button.click()

Upon submission, you should see a new post added to your timeline as below:

[Screenshot: the new tweet on the timeline]

Move an invisible element into view

There are always cases where you need to scroll up and down or left and right to view some information on the web page. You also need to make sure your elements are in view before you perform any operation on them, such as getting their attributes or clicking them.

To move an element into view, you can execute some JavaScript code to scroll to it as below:

who_to_follow = driver.find_element(By.XPATH, "//div/span[text() = 'Who to follow']")
driver.execute_script("arguments[0].scrollIntoView(true);", who_to_follow)

Hide your browser with headless mode

When you use Selenium for an automation or scraping job, you may not wish to see the web pages jumping around in front of you. To make it run quietly in the background, you can set the headless argument in the Chrome options before initializing the web driver:

chromeOptions.add_argument('--headless')

With this argument, the browser window is not shown and everything runs in the background. It is a good idea to test your code properly before you enable headless mode.

Conclusion

In this article, we have demonstrated how to use Selenium to automatically log in to a Twitter account and read or post tweets. We have also reviewed how to trigger JavaScript code with the Selenium web driver and how to run everything in the background. In a real project, you may not want to use this approach to scrape a website like Twitter, since it already provides developer accounts with API access. This article is more of a showcase of the capabilities of the Selenium package.

With Selenium, dealing with complicated operations such as user authentication becomes much simpler, as everything is performed like a normal browser user. It also provides action chains to support all sorts of mouse actions, such as hover over or drag and drop. You should consider using it in your automation or web scraping project if your target website relies heavily on front-end JavaScript code.

The key to understanding Python positional and keyword arguments

In one of the previous articles, we summarized the different ways of passing arguments to a Python script. In this article, we will review the various approaches for defining and passing arguments into a function in Python.

First, let’s start with the basics.

Parameter vs argument

By definition, parameters are the identifiers you specify when you define your function, while arguments are the actual values you supply to the function when you call it. You may see people use the words parameter and argument interchangeably, and in everyday usage they usually refer to the same thing.

Python functions support two basic ways of passing arguments: positional and keyword. A positional argument is matched to a parameter by its position in the function definition, while a keyword argument is passed by specifying the parameter name (the keyword) followed by the value.

You can use a combination of positional and keyword arguments when you define your function parameters. Below are the 4 variations:

Positional or keyword parameter

By default, when you define a Python function, callers can pass each argument either by position or by keyword. For instance, the function below requires two parameters – file_name and separator:

def read_file(file_name, separator):    
    print(f"file_name={file_name}, separator={separator}")
    file_text  = "Assuming this is the first paragraph from the file."
    return file_text.split(separator)

You can make the function call by supplying the arguments by parameter position or providing the parameter keywords:

read_file("text.txt", " ")
#or
read_file(file_name="text.txt", separator=" ")
#or
read_file("text.txt", separator=" ")
#or
read_file(separator=" ", file_name="text.txt")

All above 4 calls would give you the same results as per below:

file_name=text.txt, separator= 
['Assuming', 'this', 'is', 'the', 'first', 'paragraph', 'from', 'the', 'file.']

Python accepts these arguments regardless of whether they are provided in positional or keyword form. When all arguments are passed by keyword, you can provide them in any order. But take note that positional arguments must be placed before keyword arguments. For instance, the call below throws a syntax error,

read_file(file_name="text.txt", " ")

which shows “positional argument follows keyword argument”.
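A syntax error normally stops the script before any line runs, so one way to observe it safely is to compile the offending call from a string. This is a minimal sketch; the bad_call and error_message names are only for illustration.

```python
# The call is kept in a string so that this script itself remains valid Python
bad_call = 'read_file(file_name="text.txt", " ")'

error_message = None
try:
    compile(bad_call, "<example>", "eval")
except SyntaxError as err:
    error_message = err.msg

print(error_message)
```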

Keyword-only parameter

For clarity, you may want to implement functions that only accept keyword arguments, so that callers are restricted to using keyword arguments to invoke them. To achieve that, add a “*” to the function definition to indicate that all parameters after it must be passed as keyword arguments. E.g.:

def write_file(file_name, *, separator, end, flush):
    print(f"file={file_name}, sep={separator}, end={end}, flush flag={flush}")

For the above function, the separator, end and flush parameters only accept keyword arguments. You can call it as below:

write_file("test.txt", separator=" ", end="\r\n", flush=True)

And you shall see output as per below:

file=test.txt, sep= , end=
, flush flag=True

But if you try to pass in all as positional arguments:

write_file("test.txt", " ", "\n", False)

You would see the error message below, which shows that the last 3 positional arguments were not accepted:

TypeError: write_file() takes 1 positional argument but 4 were given

To further restrict all parameters to be keyword only, you just need to shift the “*” to the beginning of all parameters:

def write_file(*, file_name, separator, end, flush):
    print(f"file={file_name}, sep={separator}, end={end}, flush flag={flush}")

This makes all parameters accept keyword arguments only. You can also unpack an existing dictionary and pass its items as keyword arguments to the function:

options = dict(file_name="test.txt", separator=",", end="\n", flush=True)
write_file(**options)
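As a quick check of the keyword-only behaviour, the sketch below uses a variant of write_file that returns the formatted string instead of printing it (an assumption made so that the results can be compared), and shows that unpacking a dictionary is equivalent to spelling out the keywords.

```python
def write_file(*, file_name, separator, end, flush):
    # Returns the text instead of printing it so the result can be inspected
    return f"file={file_name}, sep={separator!r}, end={end!r}, flush flag={flush}"


options = dict(file_name="test.txt", separator=",", end="\n", flush=True)

unpacked = write_file(**options)
explicit = write_file(file_name="test.txt", separator=",", end="\n", flush=True)
print(unpacked == explicit)  # True

# Passing even one argument by position is rejected
positional_error = None
try:
    write_file("test.txt", separator=",", end="\n", flush=True)
except TypeError as err:
    positional_error = str(err)
print(positional_error)
```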

Positional-only arguments

Many Python built-in functions only accept positional arguments, for instance divmod and abs. Prior to Python 3.8, there was no easy way to define your own function with positional-only parameters. From version 3.8 onward, you can restrict your function to only accept positional arguments by specifying “/” (PEP 570) in the function definition. All the parameters that come before “/” are positional-only. For example:

def read(file_name, separator, /):
    print(f"file_name={file_name}, separator={separator}")

For these parameters, if you try to pass in a keyword argument as below:

read("test.txt", separator=",")

You would see an error message indicating that the particular parameter is positional-only:

    read("test.txt", separator=",")
TypeError: read() got some positional-only arguments passed as keyword arguments: 'separator'
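The “/” and “*” markers can also be combined in one definition. The sketch below is a hypothetical variant of read (the encoding parameter and the default values are assumptions for illustration) with one positional-only parameter, one that can be passed either way, and one keyword-only parameter. It requires Python 3.8 or later.

```python
def read(file_name, /, separator=",", *, encoding="utf-8"):
    # file_name: positional-only; separator: either form; encoding: keyword-only
    return (file_name, separator, encoding)


by_position = read("test.txt", ";")
by_keyword = read("test.txt", separator=";", encoding="latin-1")
print(by_position)  # ('test.txt', ';', 'utf-8')
print(by_keyword)   # ('test.txt', ';', 'latin-1')

# Passing the positional-only parameter by keyword raises a TypeError
keyword_error = None
try:
    read(file_name="test.txt")
except TypeError as err:
    keyword_error = str(err)
print(keyword_error)
```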

Arbitrary number of arguments

There are cases where you have many positional or keyword arguments and do not want a long list of parameters in the function definition. In such cases, you can use *args and **kwargs to let the function accept an arbitrary number of arguments. For instance:

def log(*args,**kwargs):
    print(f"args={args}, kwargs={kwargs}")

The above function accepts any number of positional arguments and keyword arguments. All the positional arguments will be packed into a tuple, and keyword arguments are packed into a dictionary.

log("start", "debugging", program_name="python", version=3.7)

When you make a function call as per above, you can see the below output:

args=('start', 'debugging'), kwargs={'program_name': 'python', 'version': 3.7}

This is especially useful when you just want to capture a snapshot of all the input arguments (such as for logging), or when you implement a wrapper function for a decorator and do not need to know the exact arguments being passed in. Arbitrary arguments give callers more flexibility in what they pass into your function.

The disadvantages are also obvious: it is unclear to a new reader which parameters must be provided to get the correct result from the function call, and you cannot expect all the arguments to be present during the call, so you have to write some logic to handle the cases where particular arguments are missing or present.
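The decorator use case mentioned above can be sketched as follows. The log_call decorator is a hypothetical name, and read_file here is a simplified stand-in for the earlier example; the wrapper forwards whatever it receives without knowing the signature of the wrapped function.

```python
import functools


def log_call(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__} with args={args}, kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper


@log_call
def read_file(file_name, separator=" "):
    return "first paragraph".split(separator)


result = read_file("text.txt", separator=" ")
print(result)  # ['first', 'paragraph']
```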

The best practice for function arguments

When you only have 1 or 2 parameters for the function, you won’t typically see any issue with the code readability/clarity. Problems usually emerge when you have more parameters and some are mandatory and some are optional. Consider the below send_email example:

def send_email(subject, to, cc, 
               bcc, message, attachment, 
               onbehalf, important, read_receipt_requested):
    print(f"{subject}, {to}, {cc}, \
          {bcc} {onbehalf}, {message}, \
          {attachment}, {important}, {read_receipt_requested}")

When you try to include as many parameters as possible to make the function generic for everybody, you have to maintain a very long parameter list in the function definition. Calling this function with positional arguments can be very confusing, as you have to provide the arguments in exactly the same sequence as in the function definition, without omitting any optional argument. E.g.:

send_email("hello", "abc@company.com", "bcd@company.com",
           None, "hello", None,
           None, False, True)

Without referring back to the function definition, it would be very hard to know which argument corresponds to which parameter. The best way to handle such a scenario is to always use keyword arguments and give the optional parameters a default value. For example, see the improved version below:

def send_email(subject, to, message, 
               cc=None, bcc=None, attachment=None, 
               onbehalf=None, important=False, read_receipt_requested=False):
    print(f"{subject}, {to}, {cc}, \
          {bcc} {onbehalf}, {message}, \
          {attachment}, {important}, {read_receipt_requested}")

#specify all parameters with keyword arguments
send_email(subject="hello", 
  to="abc@company.com", 
  cc="bcd@company.com", 
  message="hello", 
  read_receipt_requested=True)

Keyword arguments improve the readability of your code and also allow you to specify default values for the optional parameters. Furthermore, they give you the flexibility to extend your parameter list in the future without refactoring your existing code, so backward compatibility is built in from the very beginning.
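To illustrate the backward compatibility point, here is a small sketch of a trimmed-down send_email. The reply_to parameter is hypothetical and stands for an option added in a later version, and the function returns a dictionary only so that the values can be inspected.

```python
def send_email(subject, to, message, *, cc=None, bcc=None,
               important=False, reply_to=None):
    # reply_to is a hypothetical parameter added after the first release
    return {
        "subject": subject, "to": to, "message": message,
        "cc": cc, "bcc": bcc, "important": important, "reply_to": reply_to,
    }


# A call written before reply_to existed keeps working unchanged
old_style = send_email("hello", "abc@company.com", "hello", cc="bcd@company.com")

# New callers opt in by keyword
new_style = send_email("hello", "abc@company.com", "hello",
                       reply_to="noreply@company.com")

print(old_style["reply_to"], new_style["reply_to"])  # None noreply@company.com
```

The “*” in this sketch forces the optional parameters to be passed by keyword, so inserting a new one never shifts the meaning of an existing call.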

Conclusion

In this article, we have reviewed the different approaches for defining and passing arguments to a Python function, as well as their advantages and disadvantages. Based on your own scenario, you may need to evaluate whether to use positional-only, keyword-only, a mix of positional and keyword, or even arbitrary arguments. For code clarity, the general recommendation is to use keyword arguments as much as possible, so that all the arguments are understandable at first glance, which reduces the chance of errors.

Photo by Karsten Würth on Unsplash

Python comprehensions for list, set and dictionary

Introduction

Python comprehension is a set of looping and filtering instructions for evaluating expressions and producing sequence output. It is commonly used to construct list, set or dictionary objects, in which case it is known as a list comprehension, set comprehension or dictionary comprehension. Compared to using map, filter or nested loops to generate the same result, a comprehension has more concise syntax and better readability. In this article, I will explain these three types of comprehensions with some practical examples.

Python comprehension basic syntax

You may have seen people using list/set/dict comprehension in their code, but if you have not yet checked the Python documentation, below is the syntax for Python comprehension.

Assignment Expression for target in iterable [if condition]

It requires a single assignment expression with at least one for clause, followed by zero or more for or if clauses.
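One way to read this syntax is to compare a comprehension with the equivalent explicit loop. The numbers list and variable names below are made up for this sketch.

```python
numbers = [1, 2, 3, 4, 5, 6]

# expression: n * n, for clause: for n in numbers, if clause: if n % 2 == 0
squares_of_even = [n * n for n in numbers if n % 2 == 0]

# The same logic written as an ordinary loop
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

print(squares_of_even)                  # [4, 16, 36]
print(squares_of_even == squares_loop)  # True
```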

With this basic understanding, let’s dive into the examples.

List comprehension

List comprehension uses for and if clauses to construct a list. The new list can be derived from any iterable, such as a string, list, set or dictionary. For instance, if we have a list like the one below:

words = [
    "Serendipity",
    "Petrichor",
    "Supine",
    "Solitude",
    "Aurora",
    "Idyllic",
    "Clinomania",
    "Pluviophile",
    "Euphoria",
    "Sequoia"]

Single loop/if statement

You can use list comprehension to derive a new list which only keeps the elements with length less than 8 from the original list:

short_words = [word for word in words if len(word) < 8 ]

If you examine the short_words, you shall see only the short words (length less than 8) were selected to form a new list:

['Supine', 'Aurora', 'Idyllic', 'Sequoia']

Multiple if statements

As described earlier in the syntax, you can have multiple if conditions to filter the elements:

short_s_words = [word for word in words if len(word) < 8 if word.startswith("S") ]
#short_s_words = [word for word in words if len(word) < 8 and word.startswith("S") ]

The above two would generate the same result as per below:

['Supine', 'Sequoia']

Similarly, you can also use or in the if statement:

short_or_s_words = [word for word in words if len(word) < 8 or word.startswith("S") ]

You shall see the below result for the short_or_s_words variable:

['Serendipity', 'Supine', 'Solitude', 'Aurora', 'Idyllic', 'Sequoia']

Multiple loop/if statements

Sometimes you have a nested data structure and would like to flatten it. For instance, to transform a nested list into a flat list, you can use a list comprehension with multiple loops as below:

lat_long = [[1.291173,103.810535], [1.285387,103.846082], [1.285803,103.845392]]
[x for pos in lat_long for x in pos]

Python evaluates the for clauses from left to right, so the leftmost one is the outermost loop. You should see that the nested list has been transformed into a flat list:

[1.291173, 103.810535, 1.285387, 103.846082, 1.285803, 103.845392]

Similarly, if you have multiple sequences to iterate through, you can use multiple for clauses in your comprehension or use the zip function, depending on what kind of result you want to achieve:

[(word, num) for word in words if word.startswith("S") for num in range(4) if num%2 == 0]

The above code would generate the output as per below:

[('Serendipity', 0),
 ('Serendipity', 2),
 ('Supine', 0),
 ('Supine', 2),
 ('Solitude', 0),
 ('Solitude', 2),
 ('Sequoia', 0),
 ('Sequoia', 2)]

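If you use zip as below, it generates a different result, because zip pairs each word with the number at the same position instead of combining every word with every number: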

[(word, num) for word, num in zip(words, range(len(words))) if word.startswith("S") and num%2 == 0]
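The difference between the two forms is easier to see on a tiny input. The letters and numbers lists below are made up for this sketch.

```python
letters = ["a", "b", "c"]
numbers = [0, 1, 2]

# Two for clauses: every letter is combined with every number (3 x 3 pairs)
nested = [(letter, num) for letter in letters for num in numbers]

# zip: items are paired by position (3 pairs)
paired = [(letter, num) for letter, num in zip(letters, numbers)]

print(len(nested))  # 9
print(paired)       # [('a', 0), ('b', 1), ('c', 2)]
```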

Another practical example is using a list comprehension to return files of a particular type from the current folder and its sub folders. For instance, the code below lists all the ipynb files from the current folder and its sub folders, excluding the checkpoints folders:

import os

[os.path.join(d[0], f) for d in os.walk(".") if ".ipynb_checkpoints" not in d[0]
             for f in d[2] if f.endswith(".ipynb")]

Generate tuples from list comprehension

As you can see from the above examples, a list comprehension can generate a list of tuples, but take note that you have to use parentheses, e.g. (word, len(word)), in the expression to indicate that the expected output is a tuple; otherwise there will be a syntax error:

[(word, len(word)) for word in words]

Set comprehension

Similar to list comprehension, set comprehension uses the same syntax but constructs a set. For instance:

words_set = set(words)
short_words_set = {word for word in words_set if len(word) < 8}

The only difference between list comprehension and set comprehension is that the square brackets “[]” are changed to curly braces “{}”. You should see the same result as in the previous example, except that the data type is now a set:

{'Aurora', 'Idyllic', 'Sequoia', 'Supine'}

As with list comprehension, any iterable can be used in a set comprehension to derive a new set. So using the list directly as below produces the same result as the above example.

short_words_set = {word for word in words if len(word) < 8}

Due to the nature of the set data structure, you shall expect the duplicate values to be removed when forming a new set with set comprehension.
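A short sketch of that de-duplication, using a made-up list in which some words differ only by letter case:

```python
mixed_case_words = ["Supine", "supine", "Aurora", "AURORA", "Sequoia"]

# Lower-casing makes some items identical, and the set keeps only one copy of each
unique_lower = {word.lower() for word in mixed_case_words}

print(len(mixed_case_words))  # 5
print(len(unique_lower))      # 3
```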

Dictionary comprehension

With enough explanation above, I think we can jump directly into the examples, since everything is the same as list and set comprehension except the data type.

Below is an example:

dict_words = {word: len(word) for word in words}

It produces a new dictionary as per below:

{'Serendipity': 11,
 'Petrichor': 9,
 'Supine': 6,
 'Solitude': 8,
 'Aurora': 6,
 'Idyllic': 7,
 'Clinomania': 10,
 'Pluviophile': 11,
 'Euphoria': 8,
 'Sequoia': 7}

And similarly, you can do some filtering with if statements:

s_words_dict = {word: length for word, length in dict_words.items() if word.startswith("S")}

You can see that only the keys starting with “S” were selected to form a new dictionary:

{'Serendipity': 11, 'Supine': 6, 'Solitude': 8, 'Sequoia': 7}

You can check another usage of dictionary comprehension from this post – How to swap key and value in a python dictionary
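As a minimal sketch of that key and value swap (not necessarily the approach taken in the linked post), a dictionary comprehension can simply reverse each pair. The small dictionary below is made up so that every value is unique; when two keys share a value, the later pair overwrites the earlier one.

```python
word_lengths = {"Supine": 6, "Idyllic": 7, "Solitude": 8}

# Swap each key/value pair; this is only lossless when the values are unique
length_to_word = {length: word for word, length in word_lengths.items()}

print(length_to_word)  # {6: 'Supine', 7: 'Idyllic', 8: 'Solitude'}
```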

Limitations and constraints

With all the above examples, you may find that comprehension makes our code more concise and clearer compared to using map and filter:

list(map(lambda x: x, filter(lambda word: len(word) < 8, words)))

But bear in mind not to overuse it. Especially if you have more than two loop/if clauses, you should consider moving the logic into a function rather than putting everything into a single line, which hurts readability.

Python comprehension is designed to be simple and only supports for and if clauses, so you cannot put more complicated logic inside a single comprehension.

Finally, if you have a large set of data, you should avoid building it with a comprehension, as the result is held entirely in memory, which may exhaust the memory and crash your program. An alternative is to use a generator expression, which has similar syntax but produces a generator that yields the items lazily. For instance:

w_generator = ((word, length) for word, length in dict_words.items() if word.startswith("S"))

It returns a generator and you can consume the items one by one:

for x in w_generator:
    print(x)

You can see the same result would be produced:

('Serendipity', 11)
('Supine', 6)
('Solitude', 8)
('Sequoia', 7)
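The memory difference can be observed with a rough sketch like the one below. The range size is arbitrary, and sys.getsizeof only measures the container object itself, so treat the numbers as an indication rather than an exact measurement.

```python
import sys

squares_list = [n * n for n in range(100_000)]  # all items are built up front
squares_gen = (n * n for n in range(100_000))   # items are produced on demand

# The list object is far larger than the generator object
print(sys.getsizeof(squares_list) > sys.getsizeof(squares_gen))  # True

# A generator can be consumed only once
total = sum(squares_gen)
leftover = list(squares_gen)
print(total == sum(squares_list), leftover)  # True []
```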

Conclusion

In this article, we have reviewed the basic syntax of Python comprehension with some practical examples of list comprehension, set comprehension and dictionary comprehension. Although Python comprehension is very convenient, you still need to think about how readable your code is to other people, and about the potential memory issues when you are working on a huge data set.

Pandas – split data into buckets with cut and qcut

If you do a lot of data analysis in your daily job, you may have encountered problems where you want to split data into buckets or groups based on certain criteria and then analyse the data within each group. For instance, you may want to check the popularity of your products or website within each age group, or understand what percentage of the students fall under each score range. The most straightforward way might be to categorize your data based on conditions and then summarize the information, but this usually requires some additional effort to massage the data. In this article, I will share a simple way to bin your data with the pandas cut and qcut functions.

Prerequisite

You will need to install pandas package if you do not have it yet in your working environment. Below is the command to install pandas with pip:

pip install pandas

And let’s import the necessary packages and create some sample sales data for our later examples.

import pandas as pd
import numpy as np
df = pd.DataFrame({"Consignee" : ["Patrick", "Sara", "Randy", "John", "Patrick", "Joe"],
                   "Age" : [44, 51, 23, 30, 44, 39],
                  "Order Date" : pd.date_range(start='2020-08-01', end="2020-08-05", periods=6),
                  "Item Desc" : ["White Wine", "Whisky", "Red Wine", "Whisky", "Red Wine", "Champagne"],
                  "Price Per Unit": [10, 20, 30, 20, 30, 30], 
                  "Order Quantity" : [50, 60, 40, 20, 10, 50],
                  "Order Dimensions" : [0.52, 0.805, 0.48, 0.235,0.12, 0.58]})

With the above code, we can take a quick look at the data:

[Screenshot: the sample sales data frame]

And let’s also calculate the total sales amount by multiplying the price per unit and the order quantity:

df["Total Amount"] = df["Price Per Unit"] * df["Order Quantity"]

Once this data is ready, let’s dive into the problems we are going to solve today.

Split data into buckets by cut

If we would like to classify our customers into a few age groups and get an overall view of how much money each age group has spent on our products, how shall we do it? As I mentioned earlier, we are not going to apply a lambda function with conditions like “if the age is less than 30, classify the customer as young”, because this can easily drive you crazy when you have hundreds or thousands of groups to define. Instead, we will use the powerful pandas cut function to achieve this.

The cut function has two mandatory arguments:

  • x – an array of values to be binned
  • bins – indicate how you want to bin your values

For instance, if you supply df["Age"] as the first argument and set bins to 2, you are telling pandas to split your age data into 2 equal-width groups. In our case, the minimum age is 23 and the maximum age is 51, so the first group runs from 23 to 23 + (51-23)/2 = 37, and the second group from 37 to 51. When you run the code below:

pd.cut(df["Age"],2)

You shall see the output similar to below:

[Screenshot: output of pd.cut with 2 bins]

Pandas has classified our age data into these two groups, and the output shows that the data type is category. This is very useful, as you can assign this categorical column back to the original data frame and do further analysis based on the categories from there.
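To check the bin edges that pandas computed for the two equal-width groups, the retbins argument of cut can be used. This is a small sketch that repeats only the Age column from the sample data.

```python
import pandas as pd

ages = pd.Series([44, 51, 23, 30, 44, 39], name="Age")

# retbins=True returns the computed bin edges together with the binned data
binned, edges = pd.cut(ages, 2, retbins=True)

midpoint = ages.min() + (ages.max() - ages.min()) / 2
print(midpoint)  # 37.0
print(edges)     # the lowest edge sits slightly below 23 so that 23 itself is included

counts = binned.value_counts().sort_index()
print(counts.tolist())  # [2, 4]
```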

Since we don’t want decimal points for the age data, we can set precision=0, and we also want to label our age data in 3 groups as Young, Mid-Aged and Old.

Below is the code that we assign our binned age data into “Age Group” column:

df["Age Group"] = pd.cut(df["Age"],3, precision=0, labels=["Young","Mid-Aged","Old"])

If you examine the data again, you would see:

[Screenshot: the data frame with the Age Group column]

Pandas split our age data evenly into 3 groups based on the min and max of the age values. But you may have noticed that age 44 has been classified as “Old”, which does not sound right. In this case, we want to give our own definition of young, mid-aged and old in the bins argument. Let’s delete the “Age Group” column and redo it with the code below:

df["Age Group"] = pd.cut(df["Age"],[20, 30, 50, 60], precision=0, labels=["Young","Mid-Aged","Old"])

With this list of integer intervals, we are telling pandas to split our data into 3 groups (20, 30], (30, 50] and (50, 60], and label them as Young, Mid-Aged and Old respectively. (here “(” means exclusive, and “]” means inclusive)
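The effect of the exclusive left edge and inclusive right edge can be checked on a few boundary values. The ages below are made up for this sketch and are not part of the sample data.

```python
import pandas as pd

boundary_ages = pd.Series([20, 21, 30, 31, 50, 51, 60])
groups = pd.cut(boundary_ages, [20, 30, 50, 60],
                labels=["Young", "Mid-Aged", "Old"])

# 20 sits on the exclusive left edge of the first bin, so it gets no label (NaN)
for age, group in zip(boundary_ages, groups):
    print(age, group)
```

Ages 30 and 50 fall into the lower of their two neighbouring groups because the right edge of each interval is inclusive.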

If we check the data again:

df[["Age", "Age Group"]]

You shall see the result we expected:

pandas split data into buckets- age groups with custom intervals

Now with this additional column, you can easily find out how much each age group contributed to the total sales amount. For example:

df.groupby("Age Group").agg({"Total Amount": "sum"})[["Total Amount"]].apply(lambda x: 100*x/x.sum())

This would calculate each age group's percentage contribution to the total sales amount (more details from here):

pandas split data into buckets - cut age groups
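The same calculation can be sketched end to end on a tiny data frame. The ages and amounts below are invented for illustration, and observed=False is passed explicitly so that every category is kept in the result:

```python
import pandas as pd

# Hypothetical data, not the article's actual data set
df = pd.DataFrame({
    "Age": [23, 28, 39, 44, 48, 51],
    "Total Amount": [100, 200, 300, 150, 150, 100],
})
df["Age Group"] = pd.cut(df["Age"], [20, 30, 50, 60],
                         labels=["Young", "Mid-Aged", "Old"])

# Sum the sales per group, then express each sum as a share of the total
totals = df.groupby("Age Group", observed=False)["Total Amount"].sum()
share = 100 * totals / totals.sum()
print(share)
```

Selecting the single column before calling sum() is an alternative to the agg() form used above; both give the per-group totals.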

If you do not wish to have any intermediate data column (in our case, the “Age Group”) added to your data frame, you can directly pass the output of cut into the groupby function:

df.groupby(pd.cut(df["Age"],[20, 30, 50, 60], precision=0, labels=["Young","Mid-Aged","Old"])).agg({"Total Amount": "sum"})[["Total Amount"]].apply(lambda x: 100*x/x.sum())

The above code will produce the same result as previously.

There are times when you may want to define your bins with a start point and an end point at a fixed interval, for instance, to understand how much the total sales amount is for each 0.1 step of order dimensions.

For such cases, we can make use of the arange function from the numpy package, e.g.:

np.arange(0, 1, 0.1)

This would give us an array of values from 0 up to (but not including) 1 at an interval of 0.1, and we can supply it as the bins argument to the cut function:

df.groupby(pd.cut(df["Order Dimensions"],np.arange(0, 1, 0.1))).agg({"Total Amount": "sum"})

With the above code, we can see pandas split the order dimensions into small chunks of every 0.1 range, and then summarized the sales amount for each of these ranges:

pandas split data into buckets - order dimensions

Note that arange does not include the stop number 1, so if you wish to include 1, you may want to add an extra step to the stop number, e.g.: np.arange(0, 1 + 0.1, 0.1). The cut function also has two arguments, right and include_lowest, to control whether the left and right edges are included. E.g.:

df.groupby(pd.cut(df["Order Dimensions"],np.arange(0, 1 + 0.1, 0.1), right=False, include_lowest=True)).agg({"Total Amount": "sum"})

This will make the left edge inclusive and the right edge exclusive. The output will be similar to below:

pandas split data into buckets - order dimensions left inclusive
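As a hedged alternative to arange, the sketch below builds the edges with np.linspace, which takes the number of edges instead of a step (the NumPy documentation suggests it for non-integer steps). This is a swapped-in technique, not what the article uses, and the order dimensions are made-up values:

```python
import numpy as np
import pandas as pd

# 11 evenly spaced edges from 0 to 1, with both end points included
edges = np.linspace(0, 1, 11)

# Hypothetical order dimensions, including values that sit on bin edges
dims = pd.Series([0.0, 0.1, 0.95, 1.0])

# right=False makes every bin left-inclusive and right-exclusive: [a, b)
binned = pd.cut(dims, edges, right=False)

# cat.codes gives the bin position of each value; -1 means "no bin"
codes = binned.cat.codes.tolist()
print(binned)
print(codes)
```

With right=False the value 1.0 sits on the excluded right edge of the last bin, so it is left unassigned. If the top value must be kept, extend the edges one step further or keep the default right=True.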

cut vs qcut

Pandas also provides another function, qcut, which splits your data based on quantiles (cut points derived from the distribution of the data). For instance, if you use qcut for the “Age” column:

pd.qcut(df["Age"],2, duplicates="drop")

You would see the age data has been split into two groups: (22.999, 41.5] and (41.5, 51.0].

pandas split data into buckets - age groups qcut

If you examine the data inside each group:

pd.qcut(df["Age"],2, duplicates="drop").value_counts()

You would see qcut has split the total of 6 rows of age data equally into 2 groups, and the cut point is at 41.5:

pandas split data into buckets - age groups qcut - value_counts1
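The difference between the two functions is easiest to see side by side. The sketch below uses hypothetical ages (chosen so that the minimum, maximum and median match the values mentioned above) and counts the rows per bin for both:

```python
import pandas as pd

# Hypothetical ages with min=23, max=51 and median=41.5
ages = pd.Series([23, 28, 39, 44, 48, 51])

# cut: two bins of equal width, so the row counts can differ
cut_counts = pd.cut(ages, 2).value_counts().sort_index()

# qcut: two bins split at the median, so the row counts are balanced
qcut_counts = pd.qcut(ages, 2).value_counts().sort_index()

print(cut_counts)
print(qcut_counts)
```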

So if you would like to split your customers into 4 age groups of roughly equal size and see how much money each group spent on your product, you can do as below:

df.groupby(pd.qcut(df["Age"],4, duplicates="drop")).agg({"Total Amount" : "sum"})

And you would see that, for our data, the total sales amounts of these 4 groups are relatively close:

pandas split data into buckets - age groups qcut - sales amount
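Keep in mind that qcut balances the number of rows per bin, not the amounts, so similar sales totals depend on the data. A small sketch with invented ages and amounts shows both figures per quartile:

```python
import pandas as pd

# Hypothetical data, not the article's actual data set
df = pd.DataFrame({
    "Age": [23, 28, 39, 44, 48, 51],
    "Total Amount": [100, 200, 300, 150, 150, 100],
})

# Four quantile-based age bins
quartiles = pd.qcut(df["Age"], 4, duplicates="drop")

# Row count and sales total for each bin
summary = df.groupby(quartiles, observed=False)["Total Amount"].agg(["count", "sum"])
print(summary)
```

Looking at the count column next to the sum makes it clear whether a similar total comes from the binning or simply from how the amounts are distributed.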

Conclusion

In this article, we have reviewed the pandas cut and qcut functions, which we can use to split our data into buckets either by self-defined intervals or by cut points derived from the data distribution.

Hope this gives you some hints when you are solving problems similar to what we have discussed here.

 

python decorators

Why we should use Python decorator

Introduction

Decorators are one of the most important features in Python, and you may have seen them in many places in Python code, for instance, functions annotated with @classmethod, @staticmethod, @property etc. By definition, a decorator is a function that extends the functionality of another function without explicitly modifying it. It makes the code shorter and improves readability. In this article, I will share with you how to use Python decorators.

Basic Syntax

If you have read my article about Python closures, you may still remember that Python allows us to pass a function into another function as an argument. For example, if we have the below functions:

add_log – to add log to inspect all the positional and keyword arguments of a function before actually calling it

send_email – to accept some positional and keyword arguments for sending out emails

def add_log(func):
    def log(*args, **kwargs):
        for arg in args:
            print(f"{func.__name__} - args: {arg}")
        for key, val in kwargs.items():
            print(f"{func.__name__} - {key}, {val}")
        return func(*args, **kwargs)
    return log

def send_email(subject, to, **kwargs):  
    #send email logic 
    print(f"email sent to {to} with subject {subject}.")

We can pass the send_email function to add_log as an argument, and then trigger the sending of the email.

sender = add_log(send_email)
sender("hello", "contact@codeforests.com", attachment="debug.log", urgent_flag=True)

This code will generate the output as per below:

python decorator pass function as argument

You can see that the send_email function has been invoked successfully after all the arguments were printed out. This is exactly what a decorator does: extending the functionality of the send_email function without changing its original structure. When you call send_email directly again, you can still see its original behavior without any change.

python decorator pass function as argument

Python decorator as a function

Before Python 2.4, the classmethod() and staticmethod() functions were used to decorate functions by passing in the decorated function as an argument. Later, the @ symbol was introduced to make the code more concise and easier to read, especially when the functions are very long.
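The two styles are equivalent, which the following sketch illustrates. The shout decorator and the greet functions are hypothetical names invented for this example:

```python
def shout(func):
    """Hypothetical decorator that upper-cases the wrapped function's result."""
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper


def greet(name):
    return f"hello {name}"


# Manual style: pass the function in and keep the returned wrapper
greet_manual = shout(greet)


# @ style: equivalent to greet_decorated = shout(greet_decorated)
@shout
def greet_decorated(name):
    return f"hello {name}"


print(greet_manual("bob"))
print(greet_decorated("bob"))
```

The @ form keeps the decoration next to the function definition, so a reader does not have to scroll past a long function body to find out that it has been wrapped.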

So let's implement our own decorator with the @ syntax.

Assume we have the below decorator function, which checks whether a user is in the whitelist before allowing them to access certain resources. We follow the common convention of using wrapper as the name of the inner function (although you are free to choose any name).

class PermissionDenied(Exception):
    pass

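def permission_required(func):
    whitelist = ["John", "Jane", "Joe"]
    def wrapper(*args, **kwargs):
        # by convention, the user name is the first positional argument
        user = args[0]
        if user not in whitelist:
            raise PermissionDenied
        # return the result so the decorated function's return value is kept
        return func(*args, **kwargs)
    return wrapper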

Next, we decorate our function with permission_required as per below:

@permission_required
def read_file(user, file_path):
    with open(file_path, "r") as f:
        #print out the first line of the file
        print(f.readline())

When we call our function as usual, we expect the decorator function to run first and check whether the user is in the whitelist.

read_file("John", r"C:\pwd.txt")

You can see the below output has been printed out:

python decorator read file output -1

If we pass in some user name not in the whitelist:

read_file("Johnny", r"C:\pwd.txt")

You would see the PermissionDenied exception raised, which shows everything works as we expected.

python decorator read file permission denied

But if you look carefully, you may find something strange when you check the function's name and signature, as below.

python decorator read file output -3

So it seems there is a flaw in this implementation although the functional requirement has been met. The function's metadata, such as its name and signature, has been replaced by that of the wrapper, and this may cause some confusion for other people when they want to use your function.
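The problem can be reproduced with a short, self-contained sketch. The read_note function here is a hypothetical stand-in for read_file so that the example does not depend on a file on disk:

```python
class PermissionDenied(Exception):
    pass


def permission_required(func):
    whitelist = ["John", "Jane", "Joe"]

    def wrapper(*args, **kwargs):
        if args[0] not in whitelist:
            raise PermissionDenied
        return func(*args, **kwargs)
    return wrapper


@permission_required
def read_note(user, note):
    """Return the note text for an authorised user."""
    return note


# The decorated name now points at the inner wrapper function
print(read_note.__name__)  # wrapper
print(read_note.__doc__)   # None
```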

Use of the functools.wraps

To solve this problem, we need to introduce one more Python module, functools, where we can use the wraps function to restore the metadata of the original function.

Let's update our decorator function by adding @wraps(func) to the wrapper function:

from functools import wraps

def permission_required(func):
    ...
    @wraps(func)
    def wrapper(*args, **kwargs):
        ...
    return wrapper

Finally, when we check the function signature and name again, it shows the correct information now.

python decorator read file output -4

What happened is that @wraps(func) invokes the update_wrapper function, which copies the metadata of the original function onto the wrapper so that you no longer see the wrapper’s own metadata. You may want to check the update_wrapper function in the functools module to further understand how the metadata is updated.

Besides decorating normal functions, decorators can also be used to decorate the methods of a class; for instance, @staticmethod and @property are commonly seen in Python code.
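A minimal runnable version of the fixed decorator might look like the sketch below. As before, read_note is a hypothetical function used only so that the example runs without a file:

```python
from functools import wraps


class PermissionDenied(Exception):
    pass


def permission_required(func):
    whitelist = ["John", "Jane", "Joe"]

    @wraps(func)  # copy name, docstring and other metadata from func
    def wrapper(*args, **kwargs):
        if args[0] not in whitelist:
            raise PermissionDenied
        return func(*args, **kwargs)
    return wrapper


@permission_required
def read_note(user, note):
    """Return the note text for an authorised user."""
    return note


print(read_note.__name__)  # read_note
print(read_note.__doc__)   # Return the note text for an authorised user.

# wraps also stores the undecorated function on the wrapper
original = read_note.__wrapped__
print(original("Johnny", "no whitelist check here"))
```

The __wrapped__ attribute gives access to the undecorated function, which can be handy in tests, but note that calling it bypasses the permission check.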

Python decorator as a class

A decorator can also be implemented as a class in case you find your wrapper function has grown too big or is nested too deeply. To make this happen, the class needs to accept the decorated function in __init__ and implement a __call__ method so that its instances become callable in place of the decorated function.

Below is the code that implements our earlier example as a class:

from functools import update_wrapper
class PermissionRequired:
    def __init__(self, func):
        self._whitelist = ["John", "Jane", "Joe"]
        update_wrapper(self, func)
        self._func = func
        
    def __call__(self, *args, **kwargs):  
        user = args[0]
        if user not in self._whitelist:
            raise PermissionDenied
        return self._func(*args, **kwargs)

Take note that we need to call the update_wrapper function to manually copy the metadata of the decorated function onto the class instance. And same as before, we can continue using @ with the class name to decorate our function.

@PermissionRequired
def read_file(user, file_path):
    with open(file_path, "r") as f:
        #print out the first line of the file
        print(f.readline())
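To see the class-based decorator in action without touching the file system, here is a self-contained sketch that repeats the class and applies it to a hypothetical read_note function:

```python
from functools import update_wrapper


class PermissionDenied(Exception):
    pass


class PermissionRequired:
    def __init__(self, func):
        self._whitelist = ["John", "Jane", "Joe"]
        # copy the decorated function's metadata onto this instance
        update_wrapper(self, func)
        self._func = func

    def __call__(self, *args, **kwargs):
        if args[0] not in self._whitelist:
            raise PermissionDenied
        return self._func(*args, **kwargs)


@PermissionRequired
def read_note(user, note):
    """Return the note text for an authorised user."""
    return note


allowed = read_note("John", "hi")
print(allowed)
print(read_note.__name__)

try:
    read_note("Johnny", "hi")
    denied = False
except PermissionDenied:
    denied = True
print(denied)
```

After decoration, read_note is an instance of PermissionRequired rather than a plain function, and calling it goes through __call__.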

Conclusion

In this article, we have reviewed why Python decorators were introduced and the basic syntax for implementing our own decorators. We also discussed decorators written as functions and as classes, with some examples. Hopefully this article helps you enhance your understanding of Python decorators and guides you on how to use them in your project.